URLs Blocked by robots.txt Getting Indexed May Point to Insufficient Content on the Site's Accessible Pages
Why might an eCommerce site's faceted or filtered URLs that are blocked by robots.txt (and have a canonical in place) still get indexed by Google? Would adding a noindex tag help? John replied that a noindex tag would not help in this situation, because the robots.txt block means the tag would never be seen by Google.
He pointed out that URLs blocked in robots.txt can get indexed without content (as Google cannot crawl them), but they are unlikely to be shown to users in the SERPs, so they should not cause issues. However, if you do see these blocked URLs being returned for real-world queries, it can be a sign that the rest of your website is hard for Google to understand: the visible content may not be sufficient for Google to recognize that the normal, accessible pages are relevant for those queries. He recommended first checking whether searchers are actually finding the URLs that are blocked by robots.txt. If not, the situation is fine; otherwise, you may need to look at other parts of the website to understand why Google is struggling to interpret it.
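As a rough illustration of the setup being discussed, here is a hypothetical robots.txt sketch; the parameter and path names are invented for this example. Because crawling is disallowed, Google never fetches these URLs, so it can only index them from external signals such as links, and any noindex tag on the pages themselves can never be seen:

```
# Hypothetical robots.txt for an eCommerce site blocking faceted/filtered URLs.
# The Disallow patterns below are examples only.
User-agent: *
Disallow: /*?color=
Disallow: /*?size=
Disallow: /filter/
```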
Empty or Thin Pages Can Be Indexed if Different Content Is Served Depending on Location
Empty or thin pages may end up in Google's index if different content is served based on the visitor's location. For example, if a full content page is served to US visitors but not to visitors elsewhere, the page may still be indexed with its full content (because Googlebot crawls from the US), even though non-US visitors never see that content.
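A minimal sketch of how this situation arises, using Flask and an invented geolocation helper (both are assumptions for illustration, not anything implied by the hangout):

```python
# Minimal Flask sketch of location-based content serving.
# country_for_ip() is a hypothetical stand-in for a real geo-IP lookup.
from flask import Flask, request

app = Flask(__name__)

def country_for_ip(ip: str) -> str:
    # Placeholder: a real site would consult a geo-IP database here.
    # 66.249.* is a range Googlebot commonly crawls from (US-based).
    return "US" if ip.startswith("66.249.") else "OTHER"

@app.route("/product")
def product():
    ip = request.remote_addr or ""
    if country_for_ip(ip) == "US":
        # Googlebot crawls predominantly from US IP addresses,
        # so this full version is what ends up indexed.
        return "<h1>Product</h1><p>Full product description...</p>"
    # Visitors elsewhere get a thin page, even though the indexed
    # result was built from the full US version above.
    return "<h1>Product</h1><p>Not available in your region.</p>"
```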
Noindex Thin Pages That Provide Value to Users on the Site But Not in Search
Some pages on your site may have content too thin to be valuable when indexed and shown in search. If those pages are still useful to users navigating your website, you can noindex them rather than removing them.
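For reference, a noindex can be applied with the standard robots meta tag (or the equivalent X-Robots-Tag HTTP header). Crucially, the page must remain crawlable: as noted above, a robots.txt block would stop Google from ever seeing the directive.

```html
<!-- In the <head> of the thin-but-useful page. Do NOT also block the
     URL in robots.txt, or Google will never see this tag. -->
<meta name="robots" content="noindex">
```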
Focus on Creating Fewer Stronger Pages Rather Than Splitting Them Up
John recommends focusing on having fewer, stronger pages rather than splitting up longer pieces of content into separate pages to target different queries.
A Small Proportion of Thin Pages Is Not an Issue
Thin content is a normal occurrence on websites and shouldn't be considered a critical issue if it only affects a small proportion of pages. For example, large news publishers may have some shorter articles that still provide unique content, and category pages can naturally be thin; neither is a problem for Google, provided such pages make up only a small proportion of the site.
Microsites Can Be Seen as Doorway Pages
Microsites often look like a collection of doorway pages. If you plan to build these microsites up into substantial standalone sites in the long run, they may be an option, but if they have no value beyond driving traffic to another site, microsites aren't recommended for search and should be noindexed.
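One common way to noindex an entire microsite is the X-Robots-Tag response header set at the server level. The nginx configuration below is a hypothetical sketch (the hostname and paths are invented), and the same header can be sent from any server or framework; note that the site stays crawlable, so Google can actually see the directive, unlike with a robots.txt block.

```
# Hypothetical nginx sketch: noindex a whole microsite via HTTP header.
server {
    listen 80;
    server_name microsite.example.com;   # hypothetical hostname
    root /var/www/microsite;             # hypothetical path

    add_header X-Robots-Tag "noindex" always;
}
```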
Google Tries to Figure Out the Full Content When It Encounters a 206 Response Code
For pages returning a 206 (Partial Content) response code, which indicates that only part of the content was delivered, Google follows the response and tries to piece together the full content of the page so it can be indexed. Google doesn't do anything special for a 206 response; it simply tries to follow the HTTP standard.
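To see what a 206 response looks like in practice, here is a small sketch using Python's requests library against a hypothetical URL. Whether you actually receive a 206 depends on the server supporting byte-range requests:

```python
# Sketch: a byte-range request typically yields a 206 (Partial Content)
# response from servers that support ranges. Google handles these per
# the HTTP standard rather than treating the fragment as the whole page.
import requests

resp = requests.get(
    "https://example.com/large-page.html",  # hypothetical URL
    headers={"Range": "bytes=0-1023"},      # request only the first 1 KB
    timeout=10,
)
print(resp.status_code)                     # 206 if the range was honored
print(resp.headers.get("Content-Range"))    # e.g. "bytes 0-1023/48231"
```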
Doorway’ Pages May Result in a Manual Penalty
A large number of thin pages, with boilerplate content and nothing unique except a few changed keywords, may be considered doorway pages, which could result in a manual penalty from Google's spam team.