How to Crawl a Website with Deepcrawl

Sam Marsden
Sam Marsden

On 18th October 2021 • 10 min read

Running frequent and targeted crawls of your website is a key part of improving its technical health and improving rankings in organic search.

In this guide, you’ll learn how to crawl a website efficiently and effectively with Deepcrawl.

The six steps to crawling a website include:

1. Understanding the domain structure
2. Configuring the URL sources
3. Running a test crawl
4. Adding crawl restrictions
5. Testing your changes
6. Running your crawl
 

Step 1: Understanding the Domain Structure

Before starting a crawl, it’s a good idea to get a better understanding of your site’s domain structure:

Check the www/non-www and http/https configuration of the domain when you add the domain.

You can also enable JavaScript rendering to analyze the rendered version of the pages on your site.

Understanding the Domain Structure with Deepcrawl

 
 

Step 2: Configuring the URL sources

There are seven types of URL sources you can include in your Deepcrawl projects. Consider running a crawl with as many URL sources as possible, to supplement your linked URLs with XML Sitemap and Google Analytics, and other data.

  1. Web crawl: Crawl only the site by following its links to deeper levels. 

If you are not sure about sub-domains, check the ‘Crawl all sub-domains’ option and they will automatically be discovered if they are linked.

You can also choose to crawl both HTTP and HTTPS pages, to get visibility over pages served over a non-secure protocol.

  1. Sitemaps: Crawl a set of sitemaps, and the URLs in those sitemaps. Links on these pages will not be followed or crawled.
  2. Backlinks: Upload backlink source data, and crawl the URLs, to discover additional URLs with backlinks on your site. The backlink data will be available in various reports.

If our Majestic Integration is part of your package, use it to automatically bring in data such as backlink count and backlink domain count.

  1. Google Search Console: Use our Google Search Console integration to enrich your reports with data such as search impressions, positions on a page, devices used etc. and to discover additional pages on your site which may not be linked. 
  2. Analytics: Similarly, you can use our Google Analytics or Adobe Analytics integration, or upload analytics source data to discover additional landing pages on your site which may not be linked. The analytics data will be available in various reports.
  3. Log Files: Upload log file summary data from log file analyzer tools, such as Splunk and Logz.io to get a view of how bots interact with your site. You also have the option to upload log file data manually.
  4. URL Lists: Crawl a fixed list of URLs. Links on these pages will not be followed or crawled.

Configuring URL sources with Deepcrawl
 

Step 3: Running a Test Crawl

Start with a small ‘Web Crawl’ to look for signs that the site is uncrawlable.

Before starting the crawl, ensure that you have set the ‘Crawl Limit’ to a low quantity. This will make your first checks more efficient, as you won’t have to wait very long to see the results.

Running a test crawl with Deepcrawl

Problems to watch for include:

When you run the test crawl, also watch for performance issues caused by the crawler.

If you see connection errors, or multiple 502/503 type errors, you may need to reduce the crawl rate.

Setting the crawl speed in Deepcrawl
If you have a robust hosting solution, you may be able to crawl the site at a faster rate.

The crawl rate can be increased at times when the site load is reduced – 4 a.m. for example.

For more information about crawl troubleshooting, you can also refer to the article, How to Fix Your Failed Website Crawls.

 
 

Step 4: Adding Crawl Restrictions

One you’ve confirmed that your site is crawlable, you can launch your crawl. 

Ideally, a website should be crawled in full (including every linked URL on the site). However, very large websites, or sites with many architectural problems, may not be able to be fully crawled immediately. It may be necessary to restrict the crawl to certain sections of the site, or limit specific URL patterns. Such restrictions can be added within the ‘Advanced Settings’ tab.

Filter URLs and URL Paths

Use the ‘Include/Exclude’ URL fields to limit the crawl to specific areas of interest. 

Adding restrictions ensures you are not wasting time (or credits) crawling URLs that are not important to you.

Adding crawl restrictions in Deepcrawl - Filter URLs and URL paths

Add Crawl Limits for Groups of Pages

Use the ‘Page Grouping’ feature, under ‘Advanced Settings’ to restrict the number of URLs crawled for groups of pages based on their URL patterns.

Add crawl limits for groups of pages in Deepcrawl

Here, you can create a new group and add a name.

In the ‘Page URL Match’ column you can add a regular expression.

Set the crawl limit percentage. For example, if you were crawling your product pages (e.g. /products) and set this to 10%, 90% of your product pages would be excluded from the crawl.

URLs matching the designated path are counted. When the limits have been reached, all further matching URLs go into the ‘Page Group Restrictions’ report and are not crawled.

Use Robots Overwrite Settings

Deepcrawl’s ‘Robots Overwrite’ feature allows you to identify additional URLs that can be excluded using a custom robots.txt file – allowing you to test the impact of pushing a new file to a live environment.

Upload the alternative version of your robots file under ‘Advanced Settings’ and select ‘Use Robots Overwrite’ when starting the crawl: 

Use robot overwrite settings in Deepcrawl

Remove Parameters

If you have excluded any parameters from search engine crawls with URL parameter tools like Google Search Console, enter these in ‘Remove URL Parameters’ under ‘Advanced Settings’.

Remove URL parameters in Deepcrawl
 

Step 5: Testing Your Changes

Run one more test crawl to ensure your configuration is correct and you’re ready to run a full crawl.

 
 

Step 6: Running your Crawl

Ensure you’ve increased the ‘Crawl Limit’ before running a more in-depth crawl.

Set ‘Scheduling’ for your crawls to track your progress. Click in the ‘Schedule crawl (optional)’ field and choose the crawl frequency. You’ll then be able to choose the date and time of your first crawl. 

Running your crawl in Deepcrawl
 

Handy Tips

Settings for Specific Requirements

If you have a test/sandbox site you can run a ‘Comparison Crawl’ by adding your test site domain and authentication details in ‘Advanced Settings’ (see the ‘Test Site Domain and Custom DNS’ option).

For more about the Test vs Live feature, check out our guide to Comparing a Test Website to a Live Website.

To crawl an AJAX-style website, with an escaped fragment solution, use the ‘URL Rewriting’ function in ‘Advanced Settings’ to modify all linked URLs to the escaped fragment format.

Analyze Outbound Links

Sites with a large quantity of external links may want to ensure that users are not directed to dead links.

To check this, take a look at ‘Crawl Restrictions and ‘Link Restrictions’ under ‘Advanced Settings’. This will add an HTTP status code next to external links within your report.

Analyze outbound links with Deepcrawl

Change User Agent

See your site through a variety of crawlers’ eyes (Googlebot, Googlebot Smartphone, etc.) by changing the ‘User Agent’ in ‘Advanced Settings’.

Add a custom user agent to determine how your website responds.

Change the user agent in Deepcrawl

Remember, the more you experiment and crawl, the closer you get to becoming an expert crawler.

 

 

Author

Sam Marsden
Sam Marsden

Sam Marsden is Deepcrawl's Former SEO & Content Manager. Sam speaks regularly at marketing conferences, like SMX and BrightonSEO, and is a contributor to industry publications such as Search Engine Journal and State of Digital.

Choose a better way to grow

With tools that will help you realize your website’s true potential, and support to help you get there, growing your enterprise business online has never been so simple.

Book a Demo