Did you know that you can control how Google ranks your site with a single line of text? It’s true!
Of course, that’s not the whole story. It would be more accurate to say you can control IF Google ranks your website AT ALL.
That’s right. You have a nuclear option to keep your site from being crawled by the “spidering” program called Googlebot, which hops from website to website, following each link and reporting its findings back to Google’s indexing systems. You can do this with the robots.txt file, a simple text document hosted at the root of your site, which plugins like Yoast SEO can add to WordPress sites.
Robots Get Lost
That’s right, robots are easily confused by what may seem obvious to humans. Our ability to ignore chaos is legendary; if you need proof, just visit any teenager’s room or any men’s restroom. Your website may SEEM visually appealing and accessible, but it’s very likely that the way you’ve configured your site will give Googlebot a migraine and accidentally make 1,000 pages appear to be part of your site when you really only have seven.
Avoiding Robots.txt Mistakes
Directing Googlebot on how to crawl your site is a serious power, and as pretty much every comic book reader can tell you, that comes with its own measure of responsibility. Here’s a checklist of problems to avoid when you’re configuring your robots.txt file. Some of them are simple, and several get trickier as the size of your site and the complexity of your URLs increase.
Nuking Your Site By Accident
“If the client says it’s not disallowed, it still might be disallowed.” – Doc Sheldon
Did you check to see if the site crawl is being disallowed? The offending line looks like this:
Disallow: /
You’d think that people could avoid pressing “The Big Red Button,” but it happens more often than it really should.
According to Alan Morte, sometimes this happens because “the robots.txt file never gets changed from a development site (why it’s not locked down, let alone on the web, I don’t know) that replaces a currently live site. In short, they disallow every page on their site with ‘*’ and drop goes the search rankings.”
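If you’re inheriting or relaunching a site, compare what’s live against what should be live. A rough before-and-after sketch (the /staging/ path is purely illustrative):

# Left over from the development server – blocks the entire site
User-agent: *
Disallow: /

# What the live site should serve – crawl everything except genuinely private areas
User-agent: *
Disallow: /staging/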
Slowing Down Google’s Crawl Is Dumb
You may see something like this in older robots.txt files:
Crawl-delay: 10
This crawl-delay directive tells crawlers to space their requests to your site a number of seconds apart. This is dumb, partly because it’s only honored by crawlers from Ask, Yandex, and possibly Bing, but not Google. Google Webmaster Tools has a crawl-rate setting, but Googlebot’s own programming already makes it the most efficient crawler it can be, so you should leave it up to Google anyway.
And aside from all that, your hosting should not be so overburdened by the piddling number of requests these crawls represent. If it is, that’s an indicator of a crappy hosting situation that should be resolved.
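If you really do want to throttle one of the crawlers that honors the directive, scope it to that bot instead of everyone; a quick sketch (Googlebot will ignore it either way):

User-agent: Yandex
Crawl-delay: 10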
Blocking The Wrong Things
Stop reading! Keep reading! Keep reading! Stop reading!
This is what you’re doing when you include a URL in your site’s XML sitemap but have also disallowed it in your robots.txt file. Long story short: don’t cross the streams.
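For example, this combination (the /widgets/ URL is purely illustrative) tells Google two opposite things at once. The XML sitemap submits the page for indexing:

<url><loc>http://example.com/widgets/</loc></url>

while the robots.txt file tells Googlebot not to even fetch it:

Disallow: /widgets/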
The Dreaded Trailing Slash/
“Instead of blocking ‘/example’ (the intended page), they block the whole directory, /example/. Oh ho… your highest ranking category for your ecommerce site just dropped from search… DOHHH!” – Alan Morte of Three Ventures
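As a rough guide to how the matching works (using a made-up /shoes path), robots.txt rules match URL prefixes, so the trailing slash changes how much gets caught:

Disallow: /shoes/   # blocks only URLs inside that directory, e.g. /shoes/red-sneakers
Disallow: /shoes    # blocks /shoes itself plus anything starting with that path: /shoes/, /shoes/red-sneakers, even /shoes-sale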
The Oops Index
“Someone [air quotes] removes the admin / login / core functionality pages from being disallowed for WordPress, Drupal, or other CMS, and search engines index pages you don’t want indexed.” – Alan Morte
What You SHOULD Block
“I block the plugins directory [ like this: Disallow: /wp-content/plugins/ ] because some plugin developers have the annoying habit of adding index.php files to their plugin directories that link back to their websites.” – Joost de Valk
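Putting the last two points together, a minimal WordPress-flavored sketch might look like this (adjust the paths to your own install before copying anything):

User-agent: *
Disallow: /wp-admin/            # keep the admin and login screens out of the crawl
Disallow: /wp-content/plugins/  # and keep those plugin index.php files out too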
Wildcard Parameters
Disallow: /*?
Wildcard patterns like this one, which blocks any URL containing a query string, can be powerful, so use them carefully.
“Wildcards can be a lifesaver to disallow a few directories deep. Always disallow secure/back end pages just in case.” – Tanner Petroff of Fit Marketing
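A few illustrative patterns (the paths are made up, and you should test anything like this in Google Webmaster Tools before deploying it):

Disallow: /*?         # any URL containing a query string
Disallow: /*/filter/  # filter pages buried a few directories deep
Disallow: /*.pdf$     # URLs ending in .pdf (the $ anchors the end of the URL)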
Panda 4.0 Dropped Traffic To Sites Blocking CSS & Templates Via Robots.txt
http://www.youtube.com/watch?v=KBdEwpRQRD0
If you’re not allowing Google to crawl elements that are part of your site’s design and template, you may be picking up a penalty from the Panda algorithm.
“We recommend making sure Googlebot can access any embedded resource that meaningfully contributes to your site’s visible content or its layout. Make sure your css/js resources are crawlable. Use the fetch as Google to make sure they are rendering and remember to prioritize the solid server performance.” – Maile Ohye
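In practice the culprits are often lines like these, common in older WordPress setups (your directory names may differ):

Disallow: /wp-content/themes/
Disallow: /wp-includes/

Removing them lets Googlebot fetch your CSS and JavaScript and render the page the way a visitor sees it.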
Double Bag It: No Index/No Follow
“[The] disallow is…dumb. I prefer to double up with meta no index/no follow on pages/directories I’m serious about.” – Tanner Petroff
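For reference, the meta robots tag sits in the page’s <head> and looks like this:

<meta name="robots" content="noindex, nofollow">

Bear in mind that Googlebot can only see that tag on pages it’s allowed to crawl, which is part of why relying on disallow alone is risky.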
What Else Can You Do With Your Robots.Txt File?
Hurt People’s Feelings
Protect The Human Race With Robots.txt
http://yelp.com/robots.txt
http://www.last.fm/robots.txt
http://www.google.com/killer-robots.txt
Order A Cup of Coffee with Robots.txt
http://www.starbucks.co.uk/robots.txt
Hire An SEO via Robots.txt
# Hi there,
#
# If you’re sniffing around this file, and you’re not a robot, we’re looking to meet curious folks such as yourself.
#
# Think you have what it takes to join the best white-hat SEO growth hackers on the planet?
#
# Run – don’t crawl – to apply to join TripAdvisor’s elite SEO team
#
# Email [email protected]
#
# Or visit https://tripadvisor.taleo.net/careersection/2/jobdetail.ftl?job=41102
Make Art With Your Robots.txt
Well, Keyboard Cat at http://sharkseo.com/robots.txt – play us out!
User-Agent: Contributors
Allow: /thanks
Nicolas Chimonas – https://twitter.com/NChimonas
Wayne Barker (@wayneb77) – http://www.boom-online.co.uk/
A.J. Ghergich (@SEO) – http://ghergich.com
Tanner Petroff (@TannerPetroff) – http://www.tannerpetroff.com/