Did you know that you can control how Google ranks your site with a single line of text? It’s true!
Of course, that’s not the whole story. It would be more accurate to say you can control IF Google ranks your website AT ALL.
That’s right. You have a nuclear option to keep your site from being crawled by the “spidering” program called Googlebot, which hops from website to website, following each link and reporting its findings back to Google’s indexing systems. You can do this with the robots.txt file, a simple text document hosted at the root of your site, which plugins like Yoast SEO can add to WordPress sites.
Robots Get Lost
That’s right, robots are easily confused by what may seem obvious to humans. Our ability to ignore chaos is legendary; if you need proof, just visit any teenager’s room or any men’s restroom. Your website may SEEM visually appealing and accessible, but it’s very likely that the way you’ve configured your site will give Googlebot a migraine and accidentally make 1,000 pages appear to be part of your site when you really only have seven.
Avoiding Robots.txt Mistakes
Directing Googlebot on how to crawl your site is a serious power, and as pretty much every comic book reader can tell you, that comes with its own measure of responsibility. Here’s a checklist of problems to avoid when you’re configuring your robots.txt file. Some of them are simple, and several get trickier as the size of your site and the complexity of your URLs increase.
Nuking Your Site By Accident
“If the client says it’s not disallowed, it still might be disallowed.” – Doc Sheldon
Did you check to see if the site crawl is being disallowed? The offending line looks like this:
Disallow: /
You’d think that people could avoid pressing “The Big Red Button,” but it happens more often than it really should.
According to Alan Morte, sometimes this happens because “the robots.txt file never gets changed from a development site (why it’s not locked down, let alone on the web, I don’t know) that replaces a currently live site. In short, they disallow every page on their site with ‘*’ and drop goes the search rankings.”
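If you’re inheriting or relaunching a site, compare what’s live against what should be live. A rough before-and-after sketch (the /staging/ path is purely illustrative):

# Left over from the development server – blocks the entire site
User-agent: *
Disallow: /

# What the live site should serve – crawl everything except genuinely private areas
User-agent: *
Disallow: /staging/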
Slowing Down Google’s Crawl Is Dumb
You may see something like this in older robots.txt files:
Crawl-delay: 10
This crawl-delay directive tells crawlers to space their requests to your site a number of seconds apart. This is dumb, partly because it’s only honored by crawlers from Ask, Yandex, and possibly Bing, but not Google. Google Webmaster Tools has a crawl-rate setting, but Googlebot’s own programming already makes it the most efficient crawler it can be, so you should leave it up to Google anyway.
And aside from all that, your hosting should not be so overburdened by the piddling number of requests these crawls represent. If it is, that’s an indicator of a crappy hosting situation that should be resolved.
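If you really do want to throttle one of the crawlers that honors the directive, scope it to that bot instead of everyone; a quick sketch (Googlebot will ignore it either way):

User-agent: Yandex
Crawl-delay: 10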
Blocking The Wrong Things
Stop reading! Keep reading! Keep reading! Stop reading!
This is what you’re doing when you include a URL in your site’s XML sitemap but have also disallowed it in your robots.txt file. Long story short: don’t cross the streams.
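For example, this combination (the /widgets/ URL is purely illustrative) tells Google two opposite things at once. The XML sitemap submits the page for indexing:

<url><loc>http://example.com/widgets/</loc></url>

while the robots.txt file tells Googlebot not to even fetch it:

Disallow: /widgets/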
The Dreaded Trailing Slash/
“Instead of blocking ‘/example’ (the intended page), they block the whole directory, /example/. Oh ho… your highest ranking category for your ecommerce site just dropped from search… DOHHH!” – Alan Morte of Three Ventures
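As a rough guide to how the matching works (using a made-up /shoes path), robots.txt rules match URL prefixes, so the trailing slash changes how much gets caught:

Disallow: /shoes/   # blocks only URLs inside that directory, e.g. /shoes/red-sneakers
Disallow: /shoes    # blocks /shoes itself plus anything starting with that path: /shoes/, /shoes/red-sneakers, even /shoes-sale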
The Oops Index
“Someone [air quotes] removes the admin / login / core functionality pages from being disallowed for WordPress, Drupal, or other CMS, and search engines index pages you don’t want indexed.” – Alan Morte
What You SHOULD Block
“I block the plugins directory [ like this: Disallow: /wp-content/plugins/ ] because some plugin developers have the annoying habit of adding index.php files to their plugin directories that link back to their websites.” – Joost de Valk
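Putting the last two points together, a minimal WordPress-flavored sketch might look like this (adjust the paths to your own install before copying anything):

User-agent: *
Disallow: /wp-admin/            # keep the admin and login screens out of the crawl
Disallow: /wp-content/plugins/  # and keep those plugin index.php files out too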
Wildcard Parameters
Disallow: /*?
Wildcard patterns like this one, which blocks any URL containing a query string, can be powerful, so use them carefully.
“Wildcards can be a lifesaver to disallow a few directories deep. Always disallow secure/back end pages just in case.” – Tanner Petroff of Fit Marketing
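A few illustrative patterns (the paths are made up, and you should test anything like this in Google Webmaster Tools before deploying it):

Disallow: /*?         # any URL containing a query string
Disallow: /*/filter/  # filter pages buried a few directories deep
Disallow: /*.pdf$     # URLs ending in .pdf (the $ anchors the end of the URL)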
Panda 4.0 Dropped Traffic To Sites Blocking CSS & Templates Via Robots.txt
http://www.youtube.com/watch?v=KBdEwpRQRD0
If you’re not allowing Google to crawl elements that are part of your site’s design and template, you may be picking up a penalty from the Panda algorithm.
“We recommend making sure Googlebot can access any embedded resource that meaningfully contributes to your site’s visible content or its layout. Make sure your css/js resources are crawlable. Use the fetch as Google to make sure they are rendering and remember to prioritize the solid server performance.” – Maile Ohye
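In practice the culprits are often lines like these, common in older WordPress setups (your directory names may differ):

Disallow: /wp-content/themes/
Disallow: /wp-includes/

Removing them lets Googlebot fetch your CSS and JavaScript and render the page the way a visitor sees it.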
Double Bag It: No Index/No Follow
“[The] disallow is…dumb. I prefer to double up with meta no index/no follow on pages/directories I’m serious about.” – Tanner Petroff
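For reference, the meta robots tag sits in the page’s <head> and looks like this:

<meta name="robots" content="noindex, nofollow">

Bear in mind that Googlebot can only see that tag on pages it’s allowed to crawl, which is part of why relying on disallow alone is risky.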
What Else Can You Do With Your Robots.Txt File?
Hurt People’s Feelings
Protect The Human Race With Robots.txt
http://yelp.com/robots.txt
http://www.last.fm/robots.txt
http://www.google.com/killer-robots.txt
Order A Cup of Coffee with Robots.txt
http://www.starbucks.co.uk/robots.txt
Hire An SEO via Robots.txt
# Hi there,
#
# If you’re sniffing around this file, and you’re not a robot, we’re looking to meet curious folks such as yourself.
#
# Think you have what it takes to join the best white-hat SEO growth hackers on the planet?
#
# Run – don’t crawl – to apply to join TripAdvisor’s elite SEO team
#
# Email [email protected]
#
# Or visit https://tripadvisor.taleo.net/careersection/2/jobdetail.ftl?job=41102
Make Art With Your Robots.txt
Well, Keyboard Cat at http://sharkseo.com/robots.txt – play us out!
User-Agent: Contributors
Allow: /thanks
Nicolas Chimonas – https://twitter.com/NChimonas
Wayne Barker (@wayneb77) – http://www.boom-online.co.uk/
A.J. Ghergich (@SEO) – http://ghergich.com
Tanner Petroff (@TannerPetroff) – http://www.tannerpetroff.com/