Very nice. Wondering how a search engine will process your robots.txt file? Google now provides a way to check on that through the Google Sitemaps program. The official Inside Google Sitemaps blog post, More stats and analysis of robots.txt files, explains the new feature. Below, I'll give you a real-life example of how useful this is in action, along with a plea that the robots.txt standard needs to become, well, more standard.
About two weeks ago, we wanted to stop Google from doing things on our Search Engine Watch Forums such as trying to reply to every thread over there. That meant blocking any URL that begins like this:
http://forums.searchenginewatch.com/newreply.php
See the newreply.php part? We disallowed that in our robots.txt file, like this:
Disallow: /newreply.php
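For context, that one rule sits inside a record something like the stripped-down sketch below. Our actual file has more sections and rules than this, so treat it purely as an illustration:

User-agent: *
Disallow: /newreply.php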
However, we weren’t sure if that would stop spidering of variations like this:
http://forums.searchenginewatch.com/newreply.php?do=newreply&noquote=1&p=73140
One of our technical people felt that, the way the robots.txt protocol is written, it should do a prefix match. That means if a URL begins with what you've disallowed, it won't be spidered. So neither of these URLs would get indexed:
http://forums.searchenginewatch.com/newreply.php
http://forums.searchenginewatch.com/newreply.php?do=newreply&noquote=1&p=73140
because they both begin with /newreply.php, with "begin" meaning what comes right after the domain name forums.searchenginewatch.com.
I wasn’t so certain. To be safe, I wondered if we should make use of the wildcard option that Google allows, such as:
Disallow: /newreply.php*
Looking around, I found one WebmasterWorld discussion where one person said prefix matching did NOT seem to be working, while another said it should.
I contacted Google to get a definitive answer. They had to do a quick test to be certain. Yes, prefix matching does work. This was all we needed:
Disallow: /newreply.php
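These days you can also sanity-check a question like this locally before bothering anyone. The sketch below is my own, not anything official from Google; it assumes Python's standard-library robots.txt parser, which applies the same prefix matching the protocol describes:

import urllib.robotparser

# Build a parser from just the rule in question; the real file has more rules.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /newreply.php",
])

urls = [
    "http://forums.searchenginewatch.com/newreply.php",
    "http://forums.searchenginewatch.com/newreply.php?do=newreply&noquote=1&p=73140",
]

for url in urls:
    # can_fetch() returns False when the URL is blocked for the given user agent.
    print(url, "->", "blocked" if not rp.can_fetch("*", url) else "allowed")

Both URLs come back as blocked, since each path begins with /newreply.php.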
Today, the new tool means I don’t have to bug a Google contact for an answer. Even better, anyone can get the answer themselves without needing to know someone at Google.
Plug a URL from your site that you think your robots.txt file is supposed to be blocking into the robots.txt checker at Google Sitemaps. If it's blocked, you'll be told something like this:
Blocked by line 23: Disallow: /newreply.php
For me, that shows exactly what in my robots.txt file is keeping that content out. It’s also a helpful way to find out if there’s something in your robots.txt file accidentally blocking content that you DO want in Google.
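If you're curious how that "blocked by line" style of report works in principle, here's a rough approximation of my own (not how Google's checker actually works): scan the robots.txt lines and report the first Disallow whose prefix matches the path being tested.

from urllib.parse import urlparse

def first_blocking_line(robots_lines, url):
    # Return (line_number, rule) for the first Disallow prefix that matches,
    # or None if nothing blocks the URL. Ignores user-agent grouping and
    # wildcards, so it's only a rough approximation of the real checker.
    parsed = urlparse(url)
    path = parsed.path or "/"
    if parsed.query:
        path = path + "?" + parsed.query
    for number, line in enumerate(robots_lines, start=1):
        rule = line.split("#", 1)[0].strip()
        if rule.lower().startswith("disallow:"):
            prefix = rule.split(":", 1)[1].strip()
            if prefix and path.startswith(prefix):
                return number, rule
    return None

robots = ["User-agent: *", "Disallow: /newreply.php"]
hit = first_blocking_line(robots, "http://forums.searchenginewatch.com/newreply.php?do=newreply")
if hit:
    print("Blocked by line %d: %s" % hit)

Run against that two-line example file, it prints: Blocked by line 2: Disallow: /newreply.php.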
One odd thing. Google reports not understanding the crawl-delay values in our robots.txt file:
Crawl-Delay: 10 Syntax not understood
Google doesn’t support this option, but Ask, MSN & Yahoo do. Since the crawl-delay command is called out in sections of our robots.txt file aimed specifically at MSN and Yahoo, rather than at Google, I was surprised the tool bothered analyzing those sections at all. It should have just ignored them, rather than risk confusing people into thinking something was wrong.
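For anyone who hasn't used the directive, those sections look roughly like this. This is a simplified sketch rather than our exact file, using msnbot and Slurp as the MSN and Yahoo crawler names:

# MSN's crawler
User-agent: msnbot
Crawl-Delay: 10

# Yahoo's crawler
User-agent: Slurp
Crawl-Delay: 10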
Overall, I’m thrilled with the new tool. I’d like to see the other search engines add similar ones. Even better, I’d like to see them all come together on creating an enhanced and more standardized robots.txt standard. Consider:
- Google allows wildcards, but others don’t.
- Ask, MSN & Yahoo allow crawl delays (but don’t define minimum or maximum values). Google does not.
- Ask & Google have ALLOW commands that the others don't support (see the sketch after this list).
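To make the fragmentation concrete, here's a hedged sketch of a single file that uses all three extensions. The paths are illustrative, and each engine only honors the lines it understands:

# Wildcard matching (Google) and an Allow rule (Ask & Google)
User-agent: Googlebot
Disallow: /newreply.php*
Allow: /faq.php

# Crawl delay (Ask, MSN & Yahoo); Yahoo's crawler shown here
User-agent: Slurp
Crawl-Delay: 10
Disallow: /newreply.php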
Postscript: Matt Cutts from Google has some good comments over here, pointing out that Google also has an allow command (I've updated my list above) and, further down in the comments to that post, explaining that they don't support crawl-delay yet because of concerns that some webmasters might set it too low by mistake.