Crawl Budget

Every site is allocated a crawl budget, which determines how many pages and resources search engines will crawl. Our SEO Office Hours Notes below cover recommendations for optimizing crawl budget, along with insights from Google on how crawl budget is controlled.

APIs & Crawl Budget: Don’t block API requests if they load important content

June 22, 2022 Source

An attendee asked whether a website should disallow subdomains that are sending API requests, as they seemed to be taking up a lot of crawl budget. They also asked how API endpoints are discovered or used by Google.

John first clarified that API endpoints are normally used by JavaScript on a website. When Google renders the page, it will try to load the content served by the API and use it for rendering the page. Depending on your API and JavaScript set-up, it might be hard for Google to cache the API results, which means Google may make a lot of API requests to get a rendered version of your page for indexing.

You could help avoid crawl budget issues here by making sure the API results are cached well and don’t contain timestamps in the URL. If you don’t care about the content being returned to Google, you could block the API subdomains from being crawled, but you should test this out first to make sure it doesn’t stop critical content from being rendered. 
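
As a rough way to check both points before changing anything, the sketch below flags API URLs whose query strings carry timestamp-like parameters and tests whether a draft robots.txt for the API subdomain would block a given URL. The parameter names in `SUSPECT_PARAMS`, the `api.example.com` host, and the robots.txt rules are illustrative assumptions, not values from the Office Hours session.

```python
from urllib.parse import urlsplit, parse_qsl
from urllib.robotparser import RobotFileParser

# Assumed names for timestamp or cache-busting parameters; adjust for your API.
SUSPECT_PARAMS = {"t", "ts", "timestamp", "_", "cb", "cachebust"}


def cache_busting_params(url: str) -> list[str]:
    """Return the query parameter names in `url` that look like cache-busters."""
    query = urlsplit(url).query
    return [
        name
        for name, _ in parse_qsl(query, keep_blank_values=True)
        if name.lower() in SUSPECT_PARAMS
    ]


def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text disallows `agent` from `url`.

    robots.txt applies per host, so pass the file served by the API subdomain.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical API request URL and draft robots.txt for the API subdomain.
    api_url = "https://api.example.com/v1/products?id=42&ts=1655900000"
    draft_robots = "User-agent: *\nDisallow: /v1/\n"

    print("Cache-busting parameters:", cache_busting_params(api_url))
    print("Blocked for Googlebot:", is_blocked(draft_robots, api_url))
```

This only checks the rules on paper. Whether blocking the API breaks rendering still needs to be tested on a real page.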

John suggested making a test page that doesn’t call the API, or uses a broken URL for it, and seeing how the page renders in the browser (and for Google).

Having a high ratio of ‘noindex’ vs indexable URLs could affect website crawlability

November 17, 2021 Source

Having noindex URLs normally does not affect how Google crawls the rest of your website—unless you have a large number of noindexed pages that need to be crawled in order to reach a small number of indexable pages.

John gave the example of a website that has millions of pages with 90% of them noindexed. As Google needs to crawl a page first in order to see the noindex, it could get bogged down crawling millions of pages just to find the comparatively few indexable ones. If you have a normal ratio of indexable to noindexed URLs and the indexable ones can be discovered quickly, he doesn’t see that as an issue for crawlability. This is not a quality issue, but a technical one caused by the high number of URLs that need to be crawled to see what is there.
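
The arithmetic behind this can be sketched as follows. The function name and the page counts are illustrative assumptions, and the calculation assumes every page is fetched exactly once, which is a simplification of how Google actually schedules crawling.

```python
def fetches_per_indexable_page(total_pages: int, noindexed_pages: int) -> float:
    """Average number of pages fetched for each indexable page found.

    Assumes every page is fetched once, since a noindex can only be
    seen after the page has been crawled.
    """
    indexable = total_pages - noindexed_pages
    if indexable <= 0:
        raise ValueError("there must be at least one indexable page")
    return total_pages / indexable


if __name__ == "__main__":
    # One million pages with 90% noindexed: 10 fetches per indexable page.
    print(fetches_per_indexable_page(1_000_000, 900_000))
    # A more extreme, hypothetical ratio: only 100 indexable pages in a million.
    print(fetches_per_indexable_page(1_000_000, 999_900))
```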

Rendered Page Resources Are Included in Google’s Crawl Rate

March 20, 2020 Source

The resources that Google fetches when it renders a page are included in Google’s crawl budget and reported in the Crawl Stats data in Search Console.

Redirects Can Impact Crawl Budget Due to Added Time for URLs to be Fetched

August 9, 2019 Source

If there are a lot of redirects on a site, this can impact crawl budget as Google will detect that URLs are taking longer to fetch and will limit the number of simultaneous requests to the website to avoid causing any issues to the server.
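
One way to see how much extra fetching redirects add is to count the hops in each chain. The sketch below works on a redirect map you would export from your own crawl data. The function name, the `max_hops` limit, and the sample paths are assumptions for illustration.

```python
def redirect_chain(start: str, redirects: dict[str, str], max_hops: int = 10) -> list[str]:
    """Follow `redirects` from `start` and return every URL visited, in order.

    Stops at the first URL with no redirect, when a loop is detected,
    or after `max_hops` hops.
    """
    chain = [start]
    seen = {start}
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if target in seen:  # loop detected; the repeated URL is kept in the chain
            break
        seen.add(target)
    return chain


if __name__ == "__main__":
    # Hypothetical redirect map: source path -> redirect target.
    redirects = {"/old": "/older", "/older": "/new"}
    chain = redirect_chain("/old", redirects)
    print(" -> ".join(chain), f"({len(chain) - 1} hops)")
```

Each hop is an extra request before the final URL is fetched, so chains that can be collapsed into a single redirect are the first candidates to fix.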

Excluded Pages in GSC Are Included in Overall Crawl Budget

August 9, 2019 Source

The pages that have been excluded in the GSC Index Coverage report count towards overall crawl budget. However, your important pages that are valid for indexing will be prioritised if your site has crawl budget limitations.

Crawl Budget Not Affected by Response Time of Third Party Tags

May 10, 2019 Source

For Google, crawl budget is determined by how many pages and resources it fetches from a website per day. If a page has a slow response time, Google may crawl the site less to avoid overloading the server, but this will not be affected by any third party tags on the page.

Putting Resources on a Separate Subdomain May Not Optimize Crawl Budget

May 3, 2019 Source

Google can still recognise if subdomains are part of the same server and will therefore distribute crawl budget for the server as a whole, as the server still has to process all of the requests. However, putting static resources on a CDN will balance crawling across the two sources independently.

Check Server Logs If More Pages Crawled Than Expected

May 1, 2019 Source

If Googlebot is crawling many more pages than actually need to be crawled on the site, John recommends checking the server logs to determine exactly which pages Googlebot is crawling. For example, it could be that JavaScript files with a session ID attached are being crawled and bloating the total number of crawled pages.
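
A minimal sketch of such a log check is shown below. It assumes access logs in the common combined format, and the session parameter names in `SESSION_PARAMS` are guesses that should be replaced with whatever your site actually uses. Matching on the user-agent string alone does not prove a request came from Google, so treat the counts as a first pass.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Pulls the request URL and user agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<url>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
# Assumed session parameter names; adjust for your site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def googlebot_hits(lines):
    """Count Googlebot requests per path, and those carrying a session parameter."""
    hits, session_hits = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        parts = urlsplit(match.group("url"))
        hits[parts.path] += 1
        if any(name.lower() in SESSION_PARAMS for name, _ in parse_qsl(parts.query)):
            session_hits[parts.path] += 1
    return hits, session_hits


if __name__ == "__main__":
    # Hypothetical log lines for illustration.
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    sample = [
        f'66.249.66.1 - - [01/May/2019:10:00:00 +0000] "GET /static/app.js?sid=abc HTTP/1.1" 200 5120 "-" "{bot}"',
        f'66.249.66.1 - - [01/May/2019:10:00:05 +0000] "GET /static/app.js?sid=def HTTP/1.1" 200 5120 "-" "{bot}"',
        '203.0.113.7 - - [01/May/2019:10:00:09 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    ]
    hits, session_hits = googlebot_hits(sample)
    for path, count in hits.most_common():
        print(path, count, "with session ID:", session_hits[path])
```

Paths where most hits carry a session parameter point to the same file being crawled repeatedly under different URLs.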

Crawl Budget Limitations May Delay JavaScript Rendering

December 21, 2018 Source

Sometimes the delay in Google’s JavaScript rendering is caused by crawl budget limitations. Google is actively working on reducing the gap between crawling pages and rendering them with JavaScript, but it will take some time, so they recommend dynamic, hybrid, or server-side rendering for sites with a lot of content.
