
Blocking Crawlers With Robots.txt

Explains how the robots.txt standard lets you tell search engines which web pages NOT to index via a text file placed in your server's root HTML directory.

The robots.txt standard uses a plain text file placed in your server’s root HTML directory. For example, if I did not want the entire calafia.com site to be indexed, I would create a file that would be found at the following URL:

http://calafia.com/robots.txt

An engine respecting the standard would ask for the file before trying to index any page within the site. To exclude the entire site, the file would say:

User-agent: *
Disallow: /
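To see how a crawler that respects the standard applies these rules, here is a minimal sketch using Python’s standard urllib.robotparser module. It feeds the parser the two lines above directly rather than fetching them over the network, and the index.htm address is a hypothetical example:

from urllib.robotparser import RobotFileParser

# Feed the parser the exclude-everything rules shown above.
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# With "Disallow: /", no page on the site may be fetched or indexed.
print(parser.can_fetch("*", "http://calafia.com/"))           # False
print(parser.can_fetch("*", "http://calafia.com/index.htm"))  # False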

The user-agent line lets you specify which engines or browsers should obey the rules that follow. Chances are, you want them all to do so, and the * is a wildcard that means everything.

The disallow lines are where you specify directories or file names. In the example above, the / covers everything within the site. You can also be more specific and block particular directories or pages:

User-agent: *
Disallow: /webmasters/
Disallow: /access/
Disallow: /classroom/stats.htm

Now the engines respecting the standard will not index anything under these addresses:

http://calafia.com/webmasters/
http://calafia.com/access/

And this page is also blocked:

http://calafia.com/classroom/stats.htm
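Run against the same Python robotparser sketch, the blocked addresses are refused while the rest of the site stays open to spiders. The tips.htm and intro.htm pages are hypothetical examples:

from urllib.robotparser import RobotFileParser

# The more specific rules from the example above.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /webmasters/",
    "Disallow: /access/",
    "Disallow: /classroom/stats.htm",
])

print(parser.can_fetch("*", "http://calafia.com/webmasters/tips.htm"))  # False
print(parser.can_fetch("*", "http://calafia.com/classroom/stats.htm"))  # False
print(parser.can_fetch("*", "http://calafia.com/classroom/intro.htm"))  # True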

Because the robots.txt file must go in the server’s root directory, many people using free web space will not be able to use it. You cannot simply place the file within your own space. For example, here’s a scenario with AOL:

OK
http://members.aol.com/robots.txt

Not OK
http://members.aol.com/mysite/robots.txt

The first works because the file is in the server’s root directory. The second doesn’t work because it is located in a sub-directory.
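That is because a crawler builds the robots.txt address from the host name alone, throwing away any path. Here is a minimal sketch of that derivation using Python’s urllib.parse; the page.htm address is a hypothetical example:

from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Crawlers keep only the scheme and host, so a robots.txt file
    # sitting in a sub-directory is never consulted.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://members.aol.com/mysite/page.htm"))
# Prints http://members.aol.com/robots.txt -- not .../mysite/robots.txt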

Because of this problem, the meta robots tag was created to help those without access to the robots.txt file. It is described on the Blocking Search Engines With The Meta Robots Tag page.

Security Issues

If you don’t want something to be accessed, don’t put it on the web. Period. Certainly don’t expect the robots.txt file to protect it. Not every search engine respects the convention, though all the majors do. More importantly, humans may take advantage of the file. All anyone has to do is enter the address to your robots.txt file, and they can read the contents in their web browser. They can see exactly what you consider off-limits for spiders, which sometimes also means off-limits for humans.
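To illustrate how little effort this takes, here is a minimal sketch that retrieves and prints a robots.txt file, assuming the calafia.com file from the examples above exists and is reachable:

from urllib.request import urlopen

# Anyone can fetch the file, exactly as a browser would.
with urlopen("http://calafia.com/robots.txt") as response:
    print(response.read().decode("utf-8", errors="replace"))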

Consider this as you create your robots.txt file. You don’t want it to be a roadmap to sensitive areas on your server. If you do list them, password protect the areas. Keeping them off the web, of course, is the safest route of all.

Other Notes

Occasionally, reports surface about problems with having either a blank robots.txt file or no robots.txt file at all. In either case, the issue seems to be that because there is no valid robots.txt file explicitly allowing indexing of some or all pages within the site, no pages are indexed at all. This really shouldn’t happen, but if you are having trouble getting indexed, try installing a robots.txt file that allows some or all of your pages to be indexed.
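For example, the simplest robots.txt file that allows everything looks like this; an empty Disallow line means nothing is off-limits:

User-agent: *
Disallow: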

More Resources

The Web Robots Pages: The Robots Exclusion Protocol
http://www.robotstxt.org/wc/exclusion.html#robotstxt

The official word on using a robots.txt file.

Robotcop
http://www.robotcop.org/

The motto for the Robotcop project is “robots.txt: it’s the Law.” The robots.txt file is the mechanism that web site owners can use to block spiders from crawling all or portions of their web sites. It’s widely recognized and honored by the major crawlers, but it remains an unofficial law. Even worse, it’s a law with no law enforcement agency. Enter Robotcop, an open source project designed to produce plug-in modules for popular web servers. Did a crawler just fly past your robots.txt file? Robotcop can spot this and give you a variety of options with real teeth to them, such as blocking or trapping the spiders. It is currently available as a beta for Apache 1.3, with plans to support Apache 2.0 and ISAPI web servers such as Zeus and IIS in the future.
