September 05, 2017

Understanding the Google “Crawl Budget”

By Daniel Olagbami, Senior Technical Executive



Daniel Olagbami, Senior Technical Executive, explains the Google "Crawl Budget".


What is a crawler and how does it work?

Part of the focus for search marketers when optimising a website is to ensure it is fully crawlable and accessible. This matters because content can only appear in search results, and therefore reach users, once a search engine crawler has been able to access and index it.

A crawler is like a web spider, traversing the internet and the pages on various sites to collect information and data. In the context of SEO, the main web crawler responsible for sites appearing in the Google search index is called the “Googlebot”.
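At its core, a crawler is a simple loop: fetch a page, record what it finds, then follow the links on that page to discover more. The sketch below illustrates that loop in Python, assuming the third-party requests and beautifulsoup4 libraries; the start URL and page limit are placeholders, and a real crawler such as Googlebot is of course far more sophisticated (respecting robots.txt, throttling requests, and so on).

    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup


    def crawl(start_url, max_pages=20):
        """Fetch pages breadth-first, following the links found on each one."""
        queue = deque([start_url])
        seen = {start_url}
        fetched = 0
        while queue and fetched < max_pages:
            url = queue.popleft()
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue  # skip pages that fail to load
            fetched += 1
            print(response.status_code, url)
            # Collect links on the page and queue any not seen before.
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if link not in seen:
                    seen.add(link)
                    queue.append(link)


    crawl("https://www.example.com/")  # placeholder start URL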

As there are billions of pages on the web, it is unfeasible for the Googlebot to crawl every single one, every second of the day. Doing so would consume bandwidth and slow sites down drastically, not to mention trip the firewalls of sites configured against aggressive crawling.

To allay this concern, Google allocates a “crawl budget” for each website. This helps it to determine how often the Googlebot will crawl a site to index its pages.

What is a crawl budget?

Google defines crawl budget as “the number of URLs Googlebot can and wants to crawl.” The number of pages crawled on a site each day may vary slightly, but is generally quite stable. This number, or “budget”, is normally determined by the size of a site, its “health” (for example, how many errors Googlebot encounters) and the number of links pointing to it. In short, crawl rate limit and crawl demand together make up the crawl budget set by Google.

Unfortunately, it is not possible to predict exactly how a site’s crawl budget is formed, but generally, Google mentions two factors it takes into consideration when determining crawl budget:

• Site Popularity — pages that are more popular (e.g. with high organic, referral and direct traffic) tend to be crawled more often
• Staleness of Content — Google tries to prevent URLs from becoming stale in its index, so pages that are updated often tend to be crawled more frequently

Factors affecting crawl budget

A number of things can affect the level of crawl budget allocated to a site. For instance, having many low-value-add URLs (pages that add little value to the site’s search performance) negatively affects a site’s crawling and indexing. Google has found that low-value-add URLs tend to fall into the following categories, in order of significance (a quick check for soft error pages is sketched after the list):

• Duplicate content
• Soft error pages
• Hacked pages
• Low quality and spam pages
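
Soft error pages are simple to test for: request a URL that definitely should not exist and see whether the server still answers with a 200 status instead of a 404. A minimal sketch follows, assuming the third-party requests library; the domain and path are placeholders.

    import requests

    # Deliberately request a URL that should not exist on the site.
    response = requests.get("https://www.example.com/this-page-should-not-exist-12345")

    if response.status_code == 200:
        print("Possible soft error pages: missing URLs return 200 instead of 404")
    else:
        print("Missing URLs correctly return", response.status_code)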

How to know when there is an issue with crawl budget?

There are several ways of determining whether a site has crawl budget issues. For a quick check, the following steps can help (best suited to small and medium-sized sites):

1. Firstly, determine how many pages are currently on the site (checking the XML sitemap file(s) may help).
2. Go into Google Search Console.
3. Go to Crawl -> Crawl stats and take note of the average pages crawled per day.
4. Divide the number of pages by the “Average crawled per day” number.
5. If you end up with a number higher than ~10 (ten times more pages than Google crawls each day), you should optimise your crawl budget. If you end up with a number lower than 3, there is no need to be concerned.
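
To illustrate the arithmetic in steps 4 and 5, here is a minimal sketch in Python; the two input figures are placeholders you would replace with your own sitemap count and the “Average crawled per day” number from Search Console.

    total_pages = 12000          # URLs listed in the XML sitemap(s) - placeholder
    avg_crawled_per_day = 900    # "Average crawled per day" from Crawl -> Crawl stats - placeholder

    ratio = total_pages / avg_crawled_per_day
    print(f"Ratio: {ratio:.1f}")

    if ratio > 10:
        print("Crawl budget likely needs optimising")
    elif ratio < 3:
        print("No cause for concern")
    else:
        print("Borderline - keep an eye on the crawl stats")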


Increasing crawl budget

If you have realised that your site’s crawl budget is lower than it should be and would like to increase it, there are a few things to consider.

Block parts of your site

There are almost certainly going to be parts of the site you do not need Google to index, and these folders/pages should simply be blocked via the site’s robots.txt file. This should only be done if you know what you’re doing and are aware of the possible consequences of blocking the wrong folder. A good example of effective blocking is filter pages, usually prominent on large e-commerce websites, where the filters can create an endless combination of URL possibilities. You really want to make sure that Googlebot is only indexing one or two of these and not all of them.
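
As a purely illustrative example, an e-commerce site whose filter pages are generated with query parameters might block them with directives along these lines (the parameter names are placeholders, and any rules should be tested, for instance with the robots.txt Tester in Search Console, before going live):

    User-agent: *
    Disallow: /*?colour=
    Disallow: /*?sort=
    Disallow: /*?price=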

Reduce errors

One of the most important steps in ensuring all your pages are crawled efficiently is to make sure as many as possible return a 200 (OK) status code, rather than a 301 (Moved Permanently) or 404 (Not Found) code. These other codes extend the crawler’s journey to find an indexable page, and by reducing their number you also increase the chance of improving the site’s crawl rate.
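
As a rough illustration, the sketch below reports any URL in a list that does not return a 200. It assumes the third-party requests library and a placeholder urls.txt file containing one URL per line (for example, exported from your sitemap).

    import requests

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        try:
            # allow_redirects=False so 301s are reported rather than silently followed
            response = requests.head(url, allow_redirects=False, timeout=10)
        except requests.RequestException as exc:
            print("ERROR", url, exc)
            continue
        if response.status_code != 200:
            print(response.status_code, url)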

Reduce redirect chains

As mentioned above, redirect chains can reduce the efficiency of Googlebot in crawling your site’s pages. When a URL is redirected (using a 301 permanent redirect), Google does not always follow the redirect immediately; instead, the new URL is often added to a to-do list and visited later, whilst Googlebot crawls other active pages. Redirect chains (more than two redirects before the final destination URL) can therefore make it take longer for Googlebot to crawl or index the redirected page.
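
To see how long a chain actually is, a small sketch like the one below can follow the Location headers hop by hop. It assumes the third-party requests library and uses a placeholder URL; the cap of ten hops simply guards against redirect loops.

    from urllib.parse import urljoin

    import requests

    url = "https://www.example.com/old-page"  # placeholder starting URL
    hops = 0

    while hops < 10:
        response = requests.head(url, allow_redirects=False, timeout=10)
        if response.status_code in (301, 302, 307, 308) and "Location" in response.headers:
            hops += 1
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, response.headers["Location"])
        else:
            break

    print(f"{hops} redirect hop(s) before reaching {url} ({response.status_code})")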

Increase links to the site

This is usually easier said than done. Increasing the number of backlinks to the site is not about tooting your own horn, but about making others see that your site offers great content that should be seen by as many people as possible. Utilising the correct PR and social channels to get your content shared and linked to from other reputable publishers ensures Google also “passes your way” more frequently whilst crawling those sites.

Overall, Google recommends that minimal time should be wasted on pages that drain crawl activity, especially if they provide no real value to the user, as they can delay the discovery of great content and severely inflate the index. Although it may not always be possible to control exactly what is indexed in the search results, utilising the correct robots.txt disallow directives and meta robots tags (to ‘noindex’ unwanted pages) can still provide strong enough control to improve crawl budget efficiency, especially for large domains.
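
For reference, the standard meta robots tag used to keep a page out of the index looks like this, placed within the page’s <head>:

    <meta name="robots" content="noindex">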