by Wayne Smith
Search Engine tasks
Index the sites people are most likely to be looking for. More often than not, priority is given to established sites known for providing helpful content. Update the index so that it appears fresh, with priority given to topics people are searching for.
Search Engine Bots, Crawling Priority
Crawling: This step provides the metadata (from an HTTP header request). If the request is for the full page, the bot may (or may not) also collect any schema from the page, along with navigation and other page links. With the full page, some ranking factors related to both crawling and indexing priority become available.
Google's freshness algo focuses on trending topics:
If the site is known in the database, the schema and metadata could be used to create a listing. (Being known is a factor created during ranking that can rank sites for indexing priority. It might be the same as trust ranking, but it can be a completely different data entry.) A great number of sites, however, provide metadata that is of poor quality or even spammy, and this can also be true of established and trusted sites. The factor can also be used to prioritize rendering.
It is efficient to use available information about the site when prioritizing newly discovered URLs.
Examples of how a priority value can be used by an internet spider:
0.0 URL is now known to Googlebot, but has no priority.
1.0 URL is on a priority site (e.g., New York Times, ESPN) and needs to be indexed now.
0.01 - 0.09 URL was already known but now has links from other sites.
0.1 URL is on a site with a priority of 0.1; the spider will get to it when it has time.
0.9 URL anchor text (or schema) suggests the URL is relevant to a priority topic, and links on this site are trustworthy.
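The priority values above could feed a crawl queue along these lines. This is a minimal Python sketch, not a description of how Googlebot actually schedules URLs; the `CrawlFrontier` class and the example.com URLs are hypothetical and only illustrate "highest priority first".

```python
import heapq
import itertools


class CrawlFrontier:
    """Hypothetical crawl queue: the highest priority value is fetched first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def add(self, url, priority):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)


# Assumed example URLs, one per priority value from the list above.
frontier = CrawlFrontier()
frontier.add("https://example.com/unknown-page", 0.0)
frontier.add("https://example.com/priority-site-story", 1.0)
frontier.add("https://example.com/newly-linked-page", 0.05)
frontier.add("https://example.com/low-priority-site-page", 0.1)
frontier.add("https://example.com/trending-topic-page", 0.9)

crawl_order = []
while frontier:
    crawl_order.append(frontier.pop())
```

With these values the priority-site story is popped first and the URL with no priority last. A real spider would also re-score URLs as new links and schema are discovered.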
Note: Some sites are search engine friendly and provide a Last-Modified HTTP header, and some provide an ETag. Either the Last-Modified value or the ETag can be used to check whether the page has been updated. These values are part of the HTTP header, so a request (HEAD) can be made to read the header without reading the whole page. When available, they allow a bot to quickly scan a site to see which pages were updated.
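A header-only check of this kind might look like the following Python sketch. It uses the standard library's `urllib.request` to send a HEAD request; the function names `fetch_validators` and `page_changed`, and the idea of storing the validators from the previous crawl in a dictionary, are assumptions made for illustration.

```python
import urllib.request


def fetch_validators(url):
    """Send a HEAD request and return the page's ETag and Last-Modified values."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return {
            "etag": response.headers.get("ETag"),
            "last_modified": response.headers.get("Last-Modified"),
        }


def page_changed(stored, current):
    """Compare validators saved at the last crawl with the current ones."""
    # Prefer the ETag when both crawls saw one.
    if stored.get("etag") and current.get("etag"):
        return stored["etag"] != current["etag"]
    # Otherwise fall back to the Last-Modified value.
    if stored.get("last_modified") and current.get("last_modified"):
        return stored["last_modified"] != current["last_modified"]
    # No usable validators: assume the page has to be fetched again.
    return True
```

A bot could call `page_changed(saved, fetch_validators(url))` and download the full page only when the result is true. HTTP also lets a client send the saved values back in `If-None-Match` or `If-Modified-Since` request headers, in which case an unchanged page is answered with `304 Not Modified`.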
Rendering The Graphical Representation of the page.
After the page, CSS, and potentially the images have been downloaded by the bot, the page is rendered to determine its graphical structure: what information is above the fold (visible to the user when they open the page without scrolling) and how much real estate the words above the fold are using. The above-the-fold area is an important zone for ranking the page.
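To make "how much real estate" concrete, here is a rough Python sketch that measures what share of the first screen is covered by text blocks. Everything in it is an assumption for illustration: the `Box` type, the pixel coordinates, the 1000 x 800 viewport, and the simplification that boxes do not overlap. It is not a description of how any search engine measures this.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rectangle of a rendered text block, in pixels from the page's top-left."""
    x: float
    y: float
    width: float
    height: float


def above_fold_share(boxes, viewport_width, fold_height):
    """Return the fraction of the first screen covered by the given boxes.

    Assumes the boxes do not overlap each other.
    """
    covered = 0.0
    for box in boxes:
        # Clip each box to the visible first screen before measuring it.
        visible_width = max(0.0, min(box.x + box.width, viewport_width) - max(box.x, 0.0))
        visible_height = max(0.0, min(box.y + box.height, fold_height) - max(box.y, 0.0))
        covered += visible_width * visible_height
    return covered / (viewport_width * fold_height)


# Hypothetical layout: a headline, a paragraph cut off by the fold, and a footer.
text_boxes = [
    Box(x=0, y=0, width=1000, height=100),     # headline, fully visible
    Box(x=0, y=700, width=1000, height=200),   # paragraph, half visible
    Box(x=0, y=2000, width=1000, height=100),  # footer, below the fold
]
share = above_fold_share(text_boxes, viewport_width=1000, fold_height=800)
```

In this layout the headline and the visible half of the paragraph each cover 100,000 of the 800,000 pixels on the first screen, so a quarter of the above-the-fold area is text.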
Ranking Sites (pre-search scores) pre and post indexing
Before search, and before indexing, pages can be scored and categorized on a large number of factors. For many factors it makes zero sense to compute them over and over again on every search. Factors such as topical relevance remain constant until the page is changed. Ranking can require more CPU resources than rendering.
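The idea of scoring once and reusing the result until the page changes can be sketched in Python as a cache keyed by a hash of the page content. The `ScoreCache` class and the `toy_topical_score` function are hypothetical stand-ins; a real ranking factor would be far more involved.

```python
import hashlib


class ScoreCache:
    """Hypothetical store of pre-search scores, recomputed only on change."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}  # url -> (content_hash, score)
        self.computations = 0  # how many times the scorer actually ran

    def get_score(self, url, content):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        cached = self._cache.get(url)
        if cached is not None and cached[0] == digest:
            return cached[1]  # page unchanged: reuse the stored score
        score = self._score_fn(content)
        self.computations += 1
        self._cache[url] = (digest, score)
        return score


def toy_topical_score(content):
    """Stand-in scorer: share of words that equal 'crawling'."""
    words = content.lower().split()
    return words.count("crawling") / len(words) if words else 0.0


cache = ScoreCache(toy_topical_score)
first = cache.get_score("https://example.com/page", "crawling and indexing")
second = cache.get_score("https://example.com/page", "crawling and indexing")
```

The second call returns the stored value without running the scorer again, so `cache.computations` stays at 1 until the page content changes.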
Not all sites need the same number of zones or level of factors. Incoming links (the anchor text) and off-page schema can suggest additional keywords or topics for a page. These external factors can signal additional zones on the page that need to be weighted higher than they were when the page was first indexed. This information can be stored and used when other pages from the site are indexed.
Many of Google's search snippet features are available via a tool where the data is highlighted. Once the data on one page has been highlighted and unambiguously marked, Google applies that knowledge to other parts of the page and to other pages, as the structure to look for (the zone to use) to find the information (keywords).
The pre-sorting and categorization of pages can be very site specific. Historically established sites may easily rank with a few keywords in the text, while new sites can only rank for those keywords if they are found in the title and headlines.
NOTE: It is normal for a page to gain traffic and long-tail keyword search volume over time. Before going crazy trying to expand the keyword zones on one's pages, keep in mind that it is also normal for Google to run Core Updates, re-evaluate pages to determine how helpful they are to Google Search listings, and purge long-tail keyword terms that are deemed not helpful to its listings.
Adding to the index
After scoring, Google may realize that the page is never going to rank in the top 2000 for the terms on the page. Or the page may be spammy. Or the page may be a copy of a page from another site. Google indexes only 4% of the documents it encounters when crawling the web (400 billion/10 trillion). Search Engine Land: "Spammy AI content all over the web - Invisible junk: Google claims to protect searchers from spam in 99% of query clicks"