The robots.txt standard calls for a plain text file placed in the root directory of the web server. For example, if I did not want the entire calafia.com site to be indexed, I would make a file that would be found under the following URL: http://calafia.com/robots.txt
An engine respecting the standard would ask for the file before trying to index any page within the site. To exclude the entire site, the file would say:
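Per the standard, that takes just two lines:

```
User-agent: *
Disallow: /
```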
The user-agent portion lets you specify which engines or browsers should obey the line that follows. Chances are, you want them all to do so, and the * is a wildcard that matches every robot.
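You can see how these rules are interpreted using Python's standard urllib.robotparser module; the rules and URL below are illustrative, not taken from any real site's file:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: block the entire site for all robots.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /",
])

# The * rule matches any crawler name, so nothing may be fetched.
print(parser.can_fetch("AnyBot", "http://calafia.com/index.html"))  # False
```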
The disallow portion is where you specify directories or file names. In the example above, the / blocks everything within the site. You can also be more specific and block particular directories or individual pages.
Engines respecting the standard will then skip any address that falls under a blocked directory, as well as any page that is blocked by name.
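For instance, suppose you wanted to keep spiders out of an images directory and away from one particular page; the directory and file names here are only examples:

```
User-agent: *
Disallow: /images/
Disallow: /private.html
```

With this file in place, a compliant crawler would skip every address beginning with http://calafia.com/images/ and the single page http://calafia.com/private.html, while remaining free to index the rest of the site.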
Because the robots.txt file must go in the server's root directory, many people using free web space will not be able to use it; you cannot simply put the file within your own area of the server. For example, here's a scenario with AOL:
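Imagine a member page under the screen name myname (the name is just an example):

```
http://members.aol.com/robots.txt          (root directory)
http://members.aol.com/myname/robots.txt   (sub-directory)
```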
The first works because the file is in the server’s root directory. The second doesn’t work because it is located in a sub-directory.
Because of this problem, the meta robots tag was created to help those without access to the robots.txt file. It is described on the Blocking Search Engines With The Meta Robots Tag page.
If you don’t want something to be accessed, don’t put it on the web. Period. Certainly don’t expect the robots.txt file to protect it. Not every search engine respects the convention, though all the majors do. More importantly, humans may take advantage of the file. All anyone has to do is enter the address of your robots.txt file, and they can read its contents in their web browser. They can see exactly what you consider off-limits for spiders, which sometimes also means off-limits for humans.
Consider this as you create your robots.txt file. You don’t want it to be a roadmap to sensitive areas on your server. If you do list them, password protect the areas. Keeping them off the web, of course, is the safest route of all.
Occasionally, reports surface about problems with having either a blank robots.txt file or no robots.txt file at all. In either case, the issue seemed to be that because there was no valid robots.txt file explicitly allowing indexing of some or all pages within the site, no pages were indexed at all. This really shouldn’t happen, but if you are having trouble getting indexed, try installing a robots.txt file that allows some or all of your pages to be indexed.
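A robots.txt file that explicitly welcomes all crawlers can be as simple as this; under the standard, an empty Disallow line means nothing is off-limits:

```
User-agent: *
Disallow:
```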
The Web Robots Pages: The Robots Exclusion Protocol
The official word on using a robots.txt file.
The motto for the Robotcop project is “robots.txt: it’s the Law.” The robots.txt file is the mechanism that web site owners can use to block spiders from crawling all or portions of their web sites. It’s widely recognized and honored by the major crawlers, but it remains an unofficial law. Even worse, it’s a law with no law enforcement agency. Enter Robotcop, an open source project designed to produce plug-in modules for popular web servers. Did a crawler just fly past your robots.txt file? Robotcop can spot this and give you a variety of options with real teeth to them, such as blocking or trapping the spiders. It is currently available as a beta for Apache 1.3, with plans to support Apache 2.0 and ISAPI web servers such as Zeus and IIS in the future.