Very nice. Wondering how a search engine will process your robots.txt file? Google now provides a way to check on that through the Google Sitemaps program. The post More stats and analysis of robots.txt files, from the official Inside Google Sitemaps blog, explains more. Below, I’ll give you a real-life example of how nice this is in action, along with a plea for the robots.txt standard to become, well, more standard.
About two weeks ago, we wanted to stop Google from doing things on our Search Engine Watch Forums such as trying to reply to every thread over there. That meant blocking any URL that begins like this:
See the bold part? We made that disallowed by our robots.txt file, like this:
However, we weren’t sure if that would stop spidering of variations like this:
One of our technical people felt that the way the robots.txt protocol is written, it should do a prefix match. That means if a URL begins with what you’ve disallowed, it won’t be spidered. So neither of these URLs would get indexed:
because they both begin with newreply.php, with the beginning being whatever comes after the domain name of forums.searchenginewatch.com.
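To make the prefix-match reading concrete, here is a minimal Python sketch of it. This is not Google's code, just an illustration of the idea. The function name is_blocked and the query strings in the usage note are made up for the example; only the /newreply.php rule and the forums domain come from our own setup.

```python
from urllib.parse import urlsplit


def is_blocked(url, disallow_rules):
    """Return True if the URL's path (plus query) starts with any rule.

    Hypothetical helper illustrating plain prefix matching only.
    """
    parts = urlsplit(url)
    # Compare against everything after the domain name: path and query.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # An empty Disallow value blocks nothing, so skip empty rules.
    return any(rule and target.startswith(rule) for rule in disallow_rules)
```

Under that reading, is_blocked("http://forums.searchenginewatch.com/newreply.php?do=newreply&p=123", ["/newreply.php"]) comes back True, because the part after the domain name begins with the disallowed text, while a thread URL such as /showthread.php?t=1 is left alone.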
I wasn’t so certain. To be safe, I wondered if we should make use of the wildcard option that Google allows, such as:
Looking around, I found one WebmasterWorld discussion where prefix matching did NOT seem to be working according to one person while another said it should be.
I contacted Google to get a definitive answer. They had to do a quick test to be certain. Yes, prefix matching does work. This was all we needed:
Today, the new tool means I don’t have to bug a Google contact for an answer. Even better, anyone can get the answer themselves without needing to know someone at Google.
Plug a URL from your site that you think your robots.txt file is supposed to be blocking into the robots.txt checker at Google Sitemaps. If it’s blocked, you’ll be told something like this:
Blocked by line 23: Disallow: /newreply.php
For me, that shows exactly what in my robots.txt file is keeping that content out. It’s also a helpful way to find out if there’s something in your robots.txt file accidentally blocking content that you DO want in Google.
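For the curious, a report like that can be approximated in a few lines of Python. This is only a rough sketch of the idea, not how Google's checker works. The function name find_blocking_line is hypothetical, it uses plain prefix matching, and it simplifies how user-agent sections are chosen by taking the first matching rule in any section that applies.

```python
def find_blocking_line(robots_txt, path, agent="googlebot"):
    """Return (line_number, line) for the first Disallow that blocks path.

    Hypothetical helper. Returns None if nothing blocks the path.
    """
    applies = False          # does the current section apply to our agent?
    in_agent_lines = False   # still reading a run of User-agent lines?
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            name = value.lower()
            hit = name == "*" or (name != "" and name in agent.lower())
            # Consecutive User-agent lines share one section.
            applies = hit if not in_agent_lines else (applies or hit)
            in_agent_lines = True
            continue
        in_agent_lines = False
        if field == "disallow" and applies and value and path.startswith(value):
            return number, line
    return None
```

Feed it the text of a robots.txt file and a path such as /newreply.php?p=1, and it hands back the line number and the rule responsible, or None when the path is not blocked.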
One odd thing. Google reports not understanding the crawl-delay values in our robots.txt file:
Crawl-Delay: 10 Syntax not understood
Google doesn’t support this option, but Ask, MSN & Yahoo do. But since the delay command is specifically called out in the robots.txt file for them (in our case for MSN and Yahoo), rather than for Google, I was surprised it bothered analyzing these sections of the robots.txt file at all. It should have just ignored them, rather than risk confusing people into thinking something was wrong.
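Here is a small Python sketch of the behavior I would have expected: only complain about lines in sections that actually apply to the crawler doing the checking. The function name warnings_for, the sample file, and the crawler names in it are assumptions for illustration, not a copy of our real robots.txt or of any search engine's parser.

```python
def warnings_for(robots_txt, agent, supported=("disallow",)):
    """List (line_number, line) for unsupported fields that apply to agent.

    Hypothetical helper. Sections addressed to other crawlers are skipped.
    """
    applies = False
    in_agent_lines = False
    found = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            name = value.lower()
            hit = name == "*" or (name != "" and name in agent.lower())
            applies = hit if not in_agent_lines else (applies or hit)
            in_agent_lines = True
            continue
        in_agent_lines = False
        if applies and field not in supported:
            found.append((number, line))
    return found


# Assumed sample file: crawl delays set only for two other crawlers.
SAMPLE = (
    "User-agent: msnbot\nCrawl-Delay: 10\n\n"
    "User-agent: Slurp\nCrawl-Delay: 10\n\n"
    "User-agent: *\nDisallow: /newreply.php\n"
)
```

With that sample, warnings_for(SAMPLE, "googlebot") returns an empty list, since neither crawl-delay line sits in a section meant for Googlebot. Running it as "msnbot" with the default supported fields flags line 2 instead.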
Overall, I’m thrilled with the new tool. I’d like to see the other search engines add similar ones. Even better, I’d like to see them all come together on creating an enhanced and more standardized robots.txt standard. Consider:
- Google allows wildcards, but others don’t.
- Ask, MSN & Yahoo allow crawl delays (but don’t define minimum or maximum values). Google does not.
- Ask & Google have ALLOW commands that no others support.
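To show why the first difference matters, here is a hedged Python sketch that compares the same rule read with and without wildcard support. It assumes that * means "any run of characters" and that rules are still anchored at the start of the path. The function names and the sessionid rule are invented for the example.

```python
import re


def rule_to_regex(rule):
    """Turn a rule containing * into a regex anchored at the path start."""
    pieces = [re.escape(piece) for piece in rule.split("*")]
    return re.compile(".*".join(pieces))


def rule_matches(rule, path, wildcards=True):
    """Hypothetical matcher: wildcard-aware, or plain prefix matching."""
    if wildcards:
        # re.match only matches from the beginning of the string.
        return rule_to_regex(rule).match(path) is not None
    # A crawler without wildcard support treats * as a literal character.
    return path.startswith(rule)
```

A rule such as /*sessionid= blocks /showthread.php?t=1&sessionid=abc for a crawler that understands wildcards, but does nothing for one that only does prefix matching. A plain rule like /newreply.php behaves the same either way.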
Postscript: Matt Cutts from Google has some good comments over here, pointing out that Google also has an allow command (I’ve updated my list above). Further down, in the comments to the post, he explains that Google doesn’t support crawl-delay yet because of concerns that some webmasters might set it too low by mistake.