Google Launches Robots.txt File Checker; Now We Need Robots.txt Standardization

Date published: 7 February 2006

Author: Danny Sullivan

Categories: Industry

Very nice. Wondering how a search engine will process your robots.txt file? Google now provides a way to check on that through the Google Sitemaps program. The post "More stats and analysis of robots.txt files" on the official Inside Google Sitemaps blog explains more. Below, I’ll give you a real-life example of how nice this is in action, along with a plea that the robots.txt standard needs to become, well, more standard.

About two weeks ago, we wanted to stop Google from doing things on our Search Engine Watch Forums such as trying to reply to every thread over there. That meant blocking any URL that begins like this:

http://forums.sewprod.wpenginepowered.com/newreply.php

See the newreply.php part at the end? We disallowed that in our robots.txt file, like this:

newreply.php

However, we weren’t sure if that would stop spidering of variations like this:

http://forums.sewprod.wpenginepowered.com/newreply.php?do=newreply&noquote=1&p=73140

One of our technical people felt that the way the robots.txt protocol is written, it should do a prefix match. That means if a URL begins with what you’ve disallowed, it won’t be spidered. So neither of these URLs would get indexed:

http://forums.sewprod.wpenginepowered.com/newreply.php
http://forums.sewprod.wpenginepowered.com/newreply.php?do=newreply&noquote=1&p=73140

because they both begin with newreply.php, where "begin" refers to what comes after the domain name, forums.sewprod.wpenginepowered.com.
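As a rough illustration of the prefix matching described above, here is a minimal Python sketch. It is not how any search engine actually implements robots.txt. The helper names path_of and is_disallowed are made up for this example, the /index.php URL is hypothetical, and the rule is assumed to be stored as /newreply.php, with the leading slash that the path has after the domain name.

```python
from urllib.parse import urlsplit


def path_of(url: str) -> str:
    """Return what comes after the domain name: the path plus any query string."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return f"{path}?{parts.query}" if parts.query else path


def is_disallowed(url: str, disallow_rules: list[str]) -> bool:
    """Prefix match: a URL is blocked if its path begins with any non-empty rule."""
    target = path_of(url)
    return any(bool(rule) and target.startswith(rule) for rule in disallow_rules)


# Assumed rule, written the way it would follow "Disallow:" in robots.txt.
rules = ["/newreply.php"]

urls = [
    "http://forums.sewprod.wpenginepowered.com/newreply.php",
    "http://forums.sewprod.wpenginepowered.com/newreply.php?do=newreply&noquote=1&p=73140",
    "http://forums.sewprod.wpenginepowered.com/index.php",  # hypothetical URL
]

for url in urls:
    print("blocked" if is_disallowed(url, rules) else "allowed", url)
```

Under these assumptions, the first two URLs come back as blocked, because the query string simply extends a path that already starts with the disallowed prefix.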

I wasn’t so certain. To be safe, I wondered if we should make use of the wildcard option that Google allows, such as:

newreply.php*

Looking around, I found one WebmasterWorld discussion where prefix matching did NOT seem to be working, according to one person, while another said it should be.

I contacted Google to get a definitive answer. They had to do a quick test to be certain. Yes, prefix matching does work. This was all we needed:

newreply.php
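To see why the trailing wildcard adds nothing once prefix matching is in play, here is a small Python sketch. It assumes that * stands for "any run of characters" and that a rule is matched from the start of the path. That is my reading of the wildcard option, not a statement of Google's actual code. The names rule_to_regex and rule_matches are invented for the example, and the rules are written with a leading slash.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Turn a rule with '*' wildcards into a regex; literal text is escaped."""
    pieces = [re.escape(piece) for piece in rule.split("*")]
    return re.compile(".*".join(pieces))


def rule_matches(rule: str, path: str) -> bool:
    """re.match anchors at the start only, which gives prefix semantics."""
    return rule_to_regex(rule).match(path) is not None


path = "/newreply.php?do=newreply&noquote=1&p=73140"

# Both forms of the rule match the same path under these assumptions.
print(rule_matches("/newreply.php", path))
print(rule_matches("/newreply.php*", path))
```

In this sketch the plain rule and the wildcard rule behave identically, since a prefix match already accepts anything after the prefix. A wildcard only changes the result when it sits in the middle or at the front of the rule.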

Today, the new tool means I don’t have to bug a Google contact for an answer. Even better, anyone can get the answer themselves without needing to know someone at Google.

Plug a URL from your site that you think your robots.txt file is supposed to be blocking into the robots.txt checker at Google Sitemaps. If it’s blocked, you’ll be told something like this:

Blocked by line 23: Disallow: /newreply.php
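For anyone curious how a report like that could be produced, here is a simplified Python sketch. It is not Google's checker. The function name find_blocking_line and the sample robots.txt content (including the /private/ and /search.php paths) are hypothetical. It also takes shortcuts: it uses plain prefix matching, ignores Allow lines and wildcards, and returns the first matching Disallow line in any group that applies to the given user agent.

```python
from urllib.parse import urlsplit


def find_blocking_line(robots_txt, url, agent="*"):
    """Return (line_number, line_text) of the first Disallow that blocks url, else None."""
    parts = urlsplit(url)
    target = (parts.path or "/") + ("?" + parts.query if parts.query else "")
    applies = False       # does the current group apply to `agent`?
    in_agent_run = False  # are we inside consecutive User-agent lines?
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (piece.strip() for piece in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            hit = value == "*" or (value != "" and value.lower() in agent.lower())
            applies = (applies or hit) if in_agent_run else hit
            in_agent_run = True
            continue
        in_agent_run = False
        # Unknown fields such as Crawl-delay are silently skipped here.
        if field == "disallow" and applies and value and target.startswith(value):
            return number, line
    return None


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /search.php
Disallow: /newreply.php
"""

result = find_blocking_line(
    SAMPLE,
    "http://forums.sewprod.wpenginepowered.com/newreply.php?do=newreply&noquote=1&p=73140",
)
if result:
    print(f"Blocked by line {result[0]}: {result[1]}")
else:
    print("Allowed")
```

With the sample file above, the sketch reports line 7 as the blocking line. Reporting the line number along with the rule is what makes it easy to spot a rule that is blocking content by accident.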

For me, that shows exactly what in my robots.txt file is keeping that content out. It’s also a helpful way to find out if there’s something in your robots.txt file accidentally blocking content that you DO want in Google.

One odd thing. Google reports not understanding the crawl-delay values in our robots.txt file:

Crawl-Delay: 10 Syntax not understood

Google doesn’t support this option, but Ask, MSN & Yahoo do. Since the delay command is specifically called out in the robots.txt file for them (in our case for MSN and Yahoo), rather than for Google, I was surprised the checker bothered analyzing these sections of the robots.txt file at all. It should have just ignored them, rather than risk confusing people into thinking something was wrong.

Overall, I’m thrilled with the new tool. I’d like to see the other search engines add similar ones. Even better, I’d like to see them all come together on creating an enhanced and more standardized robots.txt standard. Consider:

Google allows wildcards, but others don’t.
Ask, MSN & Yahoo allow crawl delays (but don’t define minimum or maximum values). Google does not.
Ask & Google have ALLOW commands that no others support.

Postscript: Matt Cutts from Google has some good comments over here, pointing out that Google also has an allow command (I’ve updated my list above). Further, in comments to the post, he explains that Google doesn’t support crawl-delay yet because of concerns that some webmasters might set it too low by mistake.
