Disallow Directives in Robots.txt

The disallow directive (added within a website’s robots.txt file) is used to instruct search engines not to crawl a page on a site. Because the page’s content is never fetched, this usually also keeps it out of search results — although, as covered below, a disallowed URL can still be indexed if Google discovers links pointing to it.

Within the SEO Office Hours recaps below, we share insights from Google Search Central on how they handle disallow directives, along with SEO best practice advice and examples.

Use rel="canonical" or robots.txt instead of nofollow tags for internal linking

June 22, 2022 Source

A question was asked about whether it was appropriate to use the nofollow attribute on internal links to avoid unnecessary crawl requests for URLs that you don’t wish to be crawled or indexed.

John replied that it’s an option, but it doesn’t make much sense to do this for internal links. In most cases, it’s recommended to use the rel=canonical tag to point at the URLs you want to be indexed instead, or use the disallow directive in robots.txt for URLs you really don’t want to be crawled.

He suggested figuring out if there is a page you would prefer to have indexed and, in that case, use the canonical — or if it’s causing crawling problems, you could consider the robots.txt. He clarified that with the canonical, Google would first need to crawl the page, but over time would focus on the canonical URL instead and begin to use that primarily for crawling and indexing.
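As a sketch of the two options John describes (all URLs and paths here are hypothetical placeholders): a canonical tag on the duplicate page points Google at the preferred URL, while a robots.txt disallow stops crawling entirely.

```html
<!-- On a parameterized duplicate such as /shirts?sort=price,
     pointing Google at the URL you prefer to have indexed -->
<link rel="canonical" href="https://www.example.com/shirts" />
```

```
# robots.txt — reserved for URLs you really don't want crawled at all
User-agent: *
Disallow: /internal-search/
```

The trade-off John notes applies here: the canonical still requires the duplicate to be crawled at least occasionally, while the disallow prevents crawling but also prevents Google from seeing anything on the page.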

APIs & Crawl Budget: Don’t block API requests if they load important content

June 22, 2022 Source

An attendee asked whether a website should disallow subdomains that are sending API requests, as they seemed to be taking up a lot of crawl budget. They also asked how API endpoints are discovered or used by Google.

John first clarified that API endpoints are normally used by JavaScript on a website. When Google renders the page, it will try to load the content served by the API and use it for rendering the page. It might be hard for Google to cache the API results, depending on your API and JavaScript set-up — which means Google may crawl a lot of the API requests to get a rendered version of your page for indexing. 

You could help avoid crawl budget issues here by making sure the API results are cached well and don’t contain timestamps in the URL. If you don’t care about the content being returned to Google, you could block the API subdomains from being crawled, but you should test this out first to make sure it doesn’t stop critical content from being rendered. 

John suggested making a test page that doesn’t call the API, or that uses a broken URL for it, and seeing how the page renders in the browser (and for Google).
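If testing shows the API responses aren’t needed for rendering, blocking the subdomain is a one-line rule. A minimal sketch, assuming a hypothetical `api.example.com` subdomain (each subdomain serves its own robots.txt):

```
# robots.txt served at https://api.example.com/robots.txt
# Blocks all crawling of the API. Verify first that Googlebot
# doesn't need these responses to render important content.
User-agent: *
Disallow: /
```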

Either Disallow Pages in Robots.txt or Noindex, Not Both

August 23, 2019 Source

Noindexing a page and blocking it in robots.txt will mean the noindex will not be seen, as Googlebot won’t be able to crawl it. Instead, John recommends using one or the other.
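To illustrate the conflict (paths are hypothetical): once a directory is disallowed, any noindex tag on pages inside it can never be fetched, so it has no effect.

```
# robots.txt
User-agent: *
Disallow: /private/   # Googlebot never requests anything under /private/
```

```html
<!-- /private/page.html — this tag is never seen, because the
     robots.txt rule above blocks the page from being crawled -->
<meta name="robots" content="noindex">
```

To reliably keep the page out of the index, use the noindex tag alone and leave the URL crawlable.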

Disallowed Pages With Backlinks Can be Indexed by Google

July 9, 2019 Source

Pages blocked by robots.txt cannot be crawled by Googlebot. However, if a disallowed page has links pointing to it, Google can determine it is worth indexing despite not being able to crawl the page.

Google Supports X-Robots Noindex to Block Images for Googlebot

December 21, 2018 Source

Google respects a noindex sent via the X-Robots-Tag HTTP header on image responses, which can be used to keep images out of Google Images.
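As one hedged example of how a server might attach that header, here is an Apache mod_headers sketch (the file pattern is illustrative, not from the recap):

```
# Apache config sketch — sends "X-Robots-Tag: noindex" with image
# responses so they are dropped from Google Images without
# blocking Googlebot from crawling them.
<FilesMatch "\.(png|jpe?g|gif|webp)$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```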

Focus on Search Console Data When Reviewing Links to Disavow

August 21, 2018 Source

If you choose to disavow links, use the data in Google Search Console as this will give you an accurate picture of what you need to focus on.

Block Videos From Search By Adding Video URL & Thumbnail to Robots.txt or Setting Expiration Date in Sitemap

July 13, 2018 Source

You can signal to Google for a video not to be included in search by blocking the video file and thumbnail image in robots.txt or by specifying an expiration date using a video sitemap file.
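The sitemap approach can be sketched as follows, assuming a hypothetical video entry on example.com; a `video:expiration_date` in the past signals that the video should no longer be shown in search:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://www.example.com/videos/launch</loc>
    <video:video>
      <video:title>Product launch</video:title>
      <video:description>Hypothetical example entry.</video:description>
      <video:thumbnail_loc>https://www.example.com/thumbs/launch.jpg</video:thumbnail_loc>
      <video:content_loc>https://www.example.com/media/launch.mp4</video:content_loc>
      <!-- A past date signals the video should be removed from search -->
      <video:expiration_date>2018-01-01T00:00:00+00:00</video:expiration_date>
    </video:video>
  </url>
</urlset>
```

The robots.txt alternative is to disallow both the video file URL and its thumbnail image URL.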

Don’t Rely on Unsupported Robots Directives in Robots.txt Being Respected By Google

July 13, 2018 Source

Don’t rely on noindex directives in robots.txt, as they aren’t officially supported by Google. John says it’s fine to use robots directives in robots.txt, but make sure you have a backup in case they don’t work.
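A supported backup is to put the noindex where Google documents it: in a per-page robots meta tag, or in an `X-Robots-Tag: noindex` HTTP response header for non-HTML files. A minimal sketch:

```html
<!-- Supported alternative to a robots.txt noindex line -->
<meta name="robots" content="noindex">
```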

Google Uses the Most Specific Matching Rule in Robots.txt

January 12, 2018 Source

When different levels of detail exist in robots.txt Google will follow the most specific matching rule.
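The precedence logic can be sketched in a few lines of Python. This is an illustrative simplification, not Google’s parser: it handles plain path prefixes only (no `*` or `$` wildcards, which Google also supports), with the longest matching rule winning and Allow beating Disallow on a tie.

```python
def is_allowed(rules, path):
    """Sketch of "most specific rule wins" matching.

    rules: list of ("allow" or "disallow", path-prefix) tuples.
    Returns True if crawling `path` would be allowed.
    """
    matches = [(len(prefix), verdict) for verdict, prefix in rules
               if path.startswith(prefix)]
    if not matches:
        return True  # no rule matches: crawling is allowed by default
    # Longest prefix wins; on a length tie, "allow" takes precedence
    _, verdict = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return verdict == "allow"


rules = [("disallow", "/shop/"), ("allow", "/shop/sale/")]
print(is_allowed(rules, "/shop/sale/item"))  # /shop/sale/ is longer → True
print(is_allowed(rules, "/shop/other"))      # only /shop/ matches → False
```

So even though `/shop/` is disallowed, the more specific `Allow: /shop/sale/` rule wins for URLs underneath it.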
