Crawl budget is a bit like New Yorker cartoons or modern art. We’ve all seen it and like to pretend we understand what it is, but the truth is many SEO experts don’t have a deep understanding of crawl budget optimization or of how it can be improved.
Most importantly, you need to know what impact an increased crawl budget has on SEO and page rankings. To help demystify crawl budget, I’ve put together this guide to answer these key questions and more.
By the end of this post, you’ll have enough insight into crawl budget optimization to not only impress your SEO friends around the watercooler, but also to improve your SEO strategies and deliver even better results to clients.
You’re on your own with interpreting New Yorker cartoons and modern art, though. #sorrynotsorry
What Is Crawl Budget?
Googlebot and other web spiders are mysterious, busy creatures. They spend all of their time visiting, or crawling, web pages to collect information. In terms of SEO, you really just want to focus on search engine bots, as these are the ones that index pages and help determine search results. To a much lesser extent, we’re also concerned with certain web service crawlers that power the SEO tools we use.
Your website’s crawl budget is how many times these crawling bots hit your website in a given period of time. As busy as these spiders are, they can’t allocate the same amount of time and resources to every website each day, so they divide their attention across all websites and prioritize those that need the most attention (the highest crawl budget).
Why Should I Worry About Crawl Budget?
The truth is that the majority of websites don’t need to worry about their crawl budget all that much, which is arguably why the topic doesn’t get enough attention.
Crawl budget plays a much bigger role for larger websites, or for ones that automatically create pages based on URL parameters. Unfortunately, a lot of SEO experts hear this and decide that crawl budget isn’t something they need to pay any attention to.
While your crawl budget stats are not at the same level as keyword rankings, organic traffic, page load speeds and other more vital SEO metrics, they aren’t something that should be ignored just because your website doesn’t have a lot of pages.
After all, you want Googlebot and other crawlers to find all of the most important pages on your website. And, you want them to find new content quickly. With proper crawl budget optimization, you can achieve both of these goals.
How Do I Know What My Crawl Budget Is?
Determining your crawl budget is rather simple. Google Search Console, Bing Webmaster Tools and other SEO tools track crawl budget and have these stats available to users. In Google Search Console, you can find your crawl stats under the Crawl drop-down menu. This will show you your average daily crawl stats over the last 90 days.
From there, calculating your monthly crawl budget is simply a matter of taking the average number of crawls your pages receive each day and multiplying it by 30 days. This will give you a nice baseline for your crawl budget to determine how many pages from your site you can expect to be crawled over the month.
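To illustrate, here is a minimal Python sketch of that calculation; the daily figure is a hypothetical value you would read from your own crawl stats report.

```python
# Rough monthly crawl budget from the average daily crawl figure in Search Console.
pages_crawled_per_day = 320  # hypothetical value taken from your crawl stats report
monthly_crawl_budget = pages_crawled_per_day * 30
print(monthly_crawl_budget)  # 9,600 pages you can expect to be crawled this month
```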
You can also look at crawl stats of individual pages, but it is a little trickier because there’s no immediate way to do it using only Google Search Console. Instead, you have to look at server logs to track how the bots move across your pages.
You may have to get in contact with your system administrator or website hosting provider to get access to these logs. For servers running Apache, you can typically find the log file in one of these locations:
- /var/log/httpd/access_log
- /var/log/apache2/access.log
- /var/log/httpd-access.log
Once you’ve found the raw log file, you need a separate tool to effectively analyze and “read” the information. Weblog Expert and AWStats are two great options.
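If you would rather dig into the raw file yourself first, a short script can give you a quick look. Below is a minimal Python sketch, assuming the Apache combined log format and one of the file paths listed above; note that matching on the user-agent string alone is only a rough filter, since it can be spoofed.

```python
import re
from collections import Counter

LOG_PATH = "/var/log/apache2/access.log"  # adjust to wherever your access log lives
REQUEST_PATTERN = re.compile(r'"(?:GET|HEAD) (\S+)')

crawled_paths = Counter()
with open(LOG_PATH) as log_file:
    for line in log_file:
        # Quick first pass: keep only lines whose user-agent mentions Googlebot.
        # For a rigorous audit, verify hits with a reverse DNS lookup as well.
        if "Googlebot" in line:
            match = REQUEST_PATTERN.search(line)
            if match:
                crawled_paths[match.group(1)] += 1

# The URLs Googlebot requested most often, with the number of hits each received.
for path, hits in crawled_paths.most_common(20):
    print(hits, path)
```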
How Do I Get The Most Out Of My Crawl Budget?
There are a number of steps you can take to improve your crawl budget and the efficiency with which bots crawl your pages. Ultimately, the goal here isn’t to increase crawl budget (that comes later), but to make sure that your current crawl budget is being used appropriately and you aren’t accidentally wasting any of it.
1. Determine Which Pages Need To Be Crawled The Most
The very first step to optimizing your crawl budget is to ensure that the pages that are most valuable and critical to your website are crawled first. While you can’t directly control how your site is crawled, you can help guide the bots a little bit by steering them away from pages that are less valuable and don’t need to be crawled.
For example, you may have a lot of non-indexable pages that are getting in the way of the pages that actually matter.
You can check this by looking at the total number of pages that the crawlers found (available under your crawl stats) and then querying Google or Bing to see how many pages from your website are indexed.
You can do this by simply searching site:yourwebsite.com. The top of the results page will show you the total number of results, which is all of the pages that Google has indexed.
If that number is far less than the number of pages crawled, then you may have a lot of non-indexable pages being crawled, which are wasting your budget. To keep these pages from being crawled, your first step is to add a Disallow rule for them in your robots.txt file. Note that this alone does not guarantee that the pages won’t be indexed.
So, you’ll want to take your efforts a step further by adding the noindex meta tag (<meta name="robots" content="noindex" />) to the <head> section of the pages you don’t want indexed. Or, you can return an X-Robots-Tag: noindex directive in the HTTP response header.
Both of these send a signal to future crawling bots not to index a page, so they’ll move right along to other, more valuable pages.
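If you want to verify that these signals are actually in place, a quick script can check both. The following is a minimal Python sketch using only the standard library; the site and page URLs are hypothetical placeholders for your own.

```python
from urllib import robotparser
import urllib.request

SITE = "https://example.com"  # hypothetical site
PAGE = f"{SITE}/internal-search?q=widgets"  # hypothetical page you want kept out of the index

# Does the robots.txt Disallow rule actually apply to Googlebot?
robots = robotparser.RobotFileParser(f"{SITE}/robots.txt")
robots.read()
print("Googlebot may crawl:", robots.can_fetch("Googlebot", PAGE))

# Is the server sending an X-Robots-Tag header on that page?
with urllib.request.urlopen(PAGE) as response:
    print("X-Robots-Tag:", response.headers.get("X-Robots-Tag"))
```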
2. Be Wary Of Rich Media And Site Features
Websites have really evolved over the years and, for the most part, so too have crawling bots. Crawlers used to be unable to process Flash, JavaScript and other types of rich media content.
That’s changed for the most part, but rich media can still slow down your site’s ability to be crawled. As an alternative to having the rich media itself indexed, you could create text versions of pages that rely heavily on Flash, Silverlight or other rich media formats.
Other features, like product filters, blog tags and search bars, can also really hurt your crawl budget. Product filters, for example, use a number of different criteria and values within each filter option.
This is helpful for the user because it allows them to quickly sift through products based on their desired criteria. But, it is a nightmare for crawl bots because each unique combination of filter options can create its own URL. With just a few filter criteria options you can quickly have an endless stream of pages for crawl bots to deal with. The same goes for an internal search function and its subsequent results pages.
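To see how quickly this gets out of hand, here is a small Python sketch with a hypothetical set of filter facets; each combination of values could end up as its own parameterized URL.

```python
from itertools import product

# Hypothetical filter facets on a product listing page.
filters = {
    "color": ["red", "blue", "green", "black"],
    "size": ["s", "m", "l", "xl"],
    "brand": ["acme", "globex", "initech"],
    "sort": ["price", "newest", "popularity"],
}

# Every full combination of filter values can become its own crawlable URL,
# and that is before counting partial combinations or pagination.
combinations = list(product(*filters.values()))
print(len(combinations))  # 4 * 4 * 3 * 3 = 144 parameter URLs from just four filters
```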
Your best solution is to take the necessary steps to configure these pages to not be crawled or indexed.
3. Fix Any And All Redirect Chains Or Broken Links
When links become broken or there is a long chain of redirects, it’s essentially a dead end for crawler bots. The destination page may never get crawled or indexed at the end of a long series of 301 or 302 redirects, and each hop along the way wastes valuable crawl budget.
Thus, you should use redirects as sparingly as possible and include a maximum of two in a series. There are a number of SEO tools that will show you a complete listing of your redirects.
Fixing or removing long redirect chains and broken links is a habit that any SEO professional should get into. These dead links and redirect loops hurt the user experience of your website as much as they hurt your SEO efforts.
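If you want to spot-check individual URLs yourself, here is a minimal Python sketch using the third-party requests library; the URL at the bottom is purely hypothetical.

```python
# pip install requests
import requests

def audit_redirects(url):
    """Print the redirect chain (if any) that a URL resolves through."""
    response = requests.get(url, allow_redirects=True, timeout=10)
    if response.status_code >= 400:
        print(f"{url}: broken link ({response.status_code})")
    for hop in response.history:
        print(f"  {hop.status_code} -> {hop.url}")
    if len(response.history) > 2:
        print(f"{url}: chain of {len(response.history)} redirects, "
              f"consider linking straight to {response.url}")

audit_redirects("https://example.com/old-page")  # hypothetical URL
```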
4. Clean Up Your XML Sitemaps
Search engines rely heavily on XML sitemaps when crawling and indexing pages because they give bots a convenient way to find all of the pages on your site right away. To prepare your XML sitemaps for crawling, you want to make sure that none of those non-indexable URLs or broken links end up in your XML sitemap.
You should do this regularly by accessing the Sitemaps menu in your Google Search Console or Bing Webmaster Tools.
To really optimize your XML sitemaps and get the most out of your crawl budget, you can split a large XML sitemap into several smaller sitemaps.
Not only will this help you stay organized and bolster the strength of your internal links, especially if you design a sitemap for each section of your website, but it will also make your life a lot easier when you run into an issue with your pages being indexed.
For example, if you have one sitemap with 2,000 links, but only 1,500 are being indexed, there’s probably an issue somewhere. Figuring out where the issue is can be a challenge because the sitemap includes so many links.
Conversely, if you divided that sitemap into four smaller chunks of roughly 500 links each, it would be much easier to see where the problems are occurring. Three of the sitemaps would have almost all of their links indexed, while the fourth might have only half or fewer of its links indexed. Well, there’s your problem!
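If you want to automate the splitting, here is a minimal Python sketch; it assumes a standard urlset sitemap named sitemap.xml (not a sitemap index) and a chunk size of 500, both of which you would adjust for your own site. You can then submit each of the smaller files in Search Console, or list them in a sitemap index, and compare their indexing stats side by side.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
CHUNK_SIZE = 500  # how many URLs to put in each smaller sitemap

def split_sitemap(path="sitemap.xml"):
    # Collect every <loc> URL from the existing sitemap.
    urls = [loc.text for loc in ET.parse(path).getroot().iter(f"{{{NS}}}loc")]

    # Write the URLs back out in chunks, one file per chunk.
    ET.register_namespace("", NS)
    for i in range(0, len(urls), CHUNK_SIZE):
        urlset = ET.Element(f"{{{NS}}}urlset")
        for url in urls[i:i + CHUNK_SIZE]:
            entry = ET.SubElement(urlset, f"{{{NS}}}url")
            ET.SubElement(entry, f"{{{NS}}}loc").text = url
        ET.ElementTree(urlset).write(f"sitemap-{i // CHUNK_SIZE + 1}.xml",
                                     encoding="utf-8", xml_declaration=True)

split_sitemap()
```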
5. Set URL Parameters When Necessary
URL parameters help notify Googlebot and other crawlers that your website may have lots of different URLs pointing to the same page, as is the case with pages that use dynamic URLs. Without URL parameter settings, bots will treat each of these as a separate page, thereby wasting your crawl budget.
To help Googlebot correctly crawl these dynamic URLs, you’ll want to access your Google Search Console. Under the Crawl menu there is a URL Parameters option. This menu will allow you to make the necessary changes so that these dynamic URLs are treated as a single page and not as copies of the same content.
6. Build External Links And Gain Authority
Like many of the steps so far, building external links is something that you’re (hopefully) already doing as part of your SEO strategy. It may also be the secret to increasing your crawl budget. Matt Cutts, a former Google webspam team lead, mentioned this in an interview:
“The number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well.”
We may not be able to see PageRank values anymore, but PageRank essentially measured a page’s authority. Thus, we can surmise that there’s a strong connection between page authority and crawl budget. Having lots of external sites with links pointing to your page helps build that authority.
MarketingProfs’ Aleh Barysevich tested the idea that growing an external link profile enhances crawl budget, and found a strong correlation between the number of external links built and the number of crawl bot hits the site received.
Conclusion
The beauty of improving your crawl budget is that most of these tactics improve your overall SEO, too. You’re probably already diligently doing many of the tasks on this list, so increasing crawl budget doesn’t take a lot of extra effort.
By routinely checking in on your crawl budget, you can help ensure that your pages are not just optimized for search engines, but also for the crawling bots that index and help determine those search engines’ results pages.