
Blocking Crawlers With Robots.txt

Explains how the robots.txt standard lets you tell search engines which web pages NOT to index via a text file placed in your server's root HTML directory.

The robots.txt standard uses a plain text file placed in your server’s root HTML directory. For example, if I did not want the entire calafia.com site to be indexed, I would create a file that would be found at the following URL:

http://calafia.com/robots.txt

An engine respecting the standard would ask for the file before trying to index any page within the site. To exclude the entire site, the file would say:

User-agent: *
Disallow: /
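To see how a crawler that respects the standard applies these rules, here is a minimal sketch using Python’s standard urllib.robotparser module. It feeds the parser the two lines above directly rather than fetching them over the network, and the index.htm address is a hypothetical example:

from urllib.robotparser import RobotFileParser

# Feed the parser the exclude-everything rules shown above.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# With "Disallow: /", no page on the site may be fetched or indexed.
print(parser.can_fetch("*", "http://calafia.com/"))           # False
print(parser.can_fetch("*", "http://calafia.com/index.htm"))  # False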

The user-agent line lets you specify which engines or browsers should obey the rules that follow. Chances are, you want them all to do so, and the * is a wildcard that means everything.

The disallow lines are where you specify directories or file names. In the example above, the / covers everything within the site. You can also be more specific and block particular directories or pages:

User-agent: *
Disallow: /webmasters/
Disallow: /access/
Disallow: /classroom/stats.htm

Now the engines respecting the standard will not index anything under these addresses:

http://calafia.com/webmasters/
http://calafia.com/access/

And this page is also blocked:

http://calafia.com/classroom/stats.htm
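Run against the same Python robotparser sketch, the blocked addresses are refused while the rest of the site stays open to spiders. The tips.htm and intro.htm pages are hypothetical examples:

from urllib.robotparser import RobotFileParser

# The more specific rules from the example above.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /webmasters/",
    "Disallow: /access/",
    "Disallow: /classroom/stats.htm",
])

print(parser.can_fetch("*", "http://calafia.com/webmasters/tips.htm"))  # False
print(parser.can_fetch("*", "http://calafia.com/classroom/stats.htm"))  # False
print(parser.can_fetch("*", "http://calafia.com/classroom/intro.htm"))  # True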

Because the robots.txt file must go in the server’s root directory, many people using free web space will not be able to use it. You cannot simply place the file within your own space. For example, here’s a scenario with AOL:

OK
http://members.aol.com/robots.txt

Not OK
http://members.aol.com/mysite/robots.txt

The first works because the file is in the server’s root directory. The second doesn’t work because it is located in a sub-directory.
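That is because a crawler builds the robots.txt address from the host name alone, throwing away any path. Here is a minimal sketch of that derivation using Python’s urllib.parse; the page.htm address is a hypothetical example:

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Crawlers keep only the scheme and host, so a robots.txt file
    # sitting in a sub-directory is never consulted.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://members.aol.com/mysite/page.htm"))
# Prints http://members.aol.com/robots.txt -- not .../mysite/robots.txt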

Because of this problem, the meta robots tag was created to help those without access to the robots.txt file. It is described on the Blocking Search Engines With The Meta Robots Tag page.

Security Issues

If you don’t want something to be accessed, don’t put it on the web. Period. Certainly don’t expect the robots.txt file to protect it. Not every search engine respects the convention, though all the majors do. More importantly, humans may take advantage of the file. All anyone has to do is enter the address to your robots.txt file, and they can read the contents in their web browser. They can see exactly what you consider off-limits for spiders, which sometimes also means off-limits for humans.
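To illustrate how little effort this takes, here is a minimal sketch that retrieves and prints a robots.txt file, assuming the calafia.com file from the examples above exists and is reachable:

from urllib.request import urlopen

# Anyone can fetch the file, exactly as a browser would.
with urlopen("http://calafia.com/robots.txt") as response:
    print(response.read().decode("utf-8", errors="replace"))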

Consider this as you create your robots.txt file. You don’t want it to be a roadmap to sensitive areas on your server. If you do list them, password protect the areas. Keeping them off the web, of course, is the safest route of all.

Other Notes

Occasionally, reports surface about problems with having either a blank robots.txt file or no robots.txt file at all. In either case, the issue seems to be that because there is no valid robots.txt file explicitly allowing indexing of some or all pages within the site, no pages are indexed at all. This really shouldn’t happen, but if you are having trouble getting indexed, try installing a robots.txt file that allows some or all of your pages to be indexed.
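For example, the simplest robots.txt file that allows everything looks like this; an empty Disallow line means nothing is off-limits:

User-agent: *
Disallow: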

More Resources

The Web Robots Pages: The Robots Exclusion Protocol
http://www.robotstxt.org/wc/exclusion.html#robotstxt

The official word on using a robots.txt file.

Robotcop
http://www.robotcop.org/

The motto for the Robotcop project is “robots.txt: it’s the Law.” The robots.txt file is the mechanism that web site owners can use to block spiders from crawling all or portions of their web sites. It’s widely recognized and honored by the major crawlers, but it remains an unofficial law. Even worse, it’s a law with no law enforcement agency. Enter Robotcop, an open source project designed to produce plug-in modules for popular web servers. Did a crawler just fly past your robots.txt file? Robotcop can spot this and give you a variety of options with real teeth to them, such as blocking or trapping the spiders. It is currently available as a beta for Apache 1.3, with plans to support Apache 2.0 and ISAPI web servers such as Zeus and IIS in the future.
