Mark Jackson

Proper SEO and the Robots.txt File

When it comes to SEO, most people understand that a Web site must have content, "search engine friendly" site architecture/HTML, and meta data -- i.e., title tags, meta description, and meta keywords tags.

But lately, I'm seeing a lot of "optimized" Web sites that have totally disregarded the robots.txt file. When optimizing a Web site, don't disregard the power of this little text file.

What is a Robots.txt File?

Simply put, if you go to domain.com/robots.txt, you should see a list of directories that the site owner is asking the search engines to "skip" (or "disallow"). However, if you're not careful when editing this file, you could add rules that really hurt your business.

There's plenty of information about the robots.txt file available at the Web Robots Pages, including the proper usage of the disallow feature and how to block "bad bots" from crawling your Web site.

The general rule of thumb is to make sure a robots.txt file exists at the root of your domain (e.g., domain.com/robots.txt). To exclude all robots from crawling part of your Web site, your robots.txt file would look something like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

The above syntax tells all robots not to crawl the /cgi-bin/, /tmp/, and /junk/ directories on your Web site.
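If you want to double-check how rules like these are read, here's a minimal sketch using Python's standard urllib.robotparser module. The robot name "ExampleBot" and the sample paths are made up for illustration, and real search engine crawlers may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the article, as they would appear in robots.txt.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(RULES)

# "ExampleBot" and these paths are hypothetical, used only for the demo.
for path in ("/cgi-bin/form.pl", "/tmp/cache.html", "/products/index.html"):
    allowed = parser.can_fetch("ExampleBot", "http://domain.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

With the rules above, the first two paths are reported as blocked and the third as allowed.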

Real Life Examples of Robots.txt Gone Wrong

I recently reviewed a Web site that had a good amount of content and several high-quality backlinks, yet it had virtually no presence in the SERPs. What happened? The site's owner had included a disallow to "/", which told the search engine robots not to crawl any part of the Web site.
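A mistake like this is easy to catch with a quick check. The sketch below, again using Python's standard urllib.robotparser, reports which robots are shut out of the home page. The function name blocked_agents is my own, and "Googlebot" and "bingbot" are used only as example robot names; substitute whichever robots matter to you.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt, agents=("Googlebot", "bingbot")):
    """Return the robots that are not allowed to fetch the home page ("/")."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]


# A robots.txt with the site-wide disallow described above.
BROKEN = "User-agent: *\nDisallow: /\n"

# A robots.txt that only blocks one directory.
HEALTHY = "User-agent: *\nDisallow: /tmp/\n"

print(blocked_agents(BROKEN))   # ['Googlebot', 'bingbot']
print(blocked_agents(HEALTHY))  # []
```

An empty list doesn't prove the file is healthy, only that the home page itself isn't blocked for the robots you listed.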

In another case, an SEO company edited the robots.txt file to disallow indexing of all parts of a Web site after the site's owner stopped paying the company.

And just yesterday, I reviewed a company's Web site and noticed that several directories that were part of their former site were disallowed in their robots.txt file. Instead of disallowing the search engines from indexing the old legacy pages, the company should have set up 301 permanent redirects to pass the value from the old pages to the new ones. As it was, all of that value was lost.

Robots.txt Dos and Don'ts

There are many good SEO reasons to stop the search engines from indexing certain directories on a Web site while allowing others. Let's look at some examples.

Here's what you should do with robots.txt:

  • Take a look at all of the directories in your Web site. Most likely, there are directories that you'd want to disallow the search engines from indexing, including directories like /cgi-bin/, /wp-admin/, /cart/, /scripts/, and others that might include sensitive data.
  • Stop the search engines from indexing certain directories of your site that might include duplicate content. For example, some Web sites have "print versions" of Web pages and articles that allow visitors to print them easily. You should only allow the search engines to index one version of your content.
  • Make sure that nothing stops the search engines from indexing the main content of your Web site.
  • Look for certain files on your site that you might want to disallow the search engines from indexing, such as certain scripts, or files that might contain e-mail addresses, phone numbers, or other sensitive data.
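The checklist above can be turned into a small audit. This is only a sketch built on Python's standard urllib.robotparser: the audit function, the robot name "ExampleBot", and the /print/ and /articles/ paths are assumptions for illustration, so swap in the directories from your own site.

```python
from urllib.robotparser import RobotFileParser


def audit(robots_txt, must_block, must_allow, agent="ExampleBot"):
    """Compare robots.txt rules against the paths you expect to be blocked or open."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for path in must_block:
        if parser.can_fetch(agent, path):
            problems.append(f"{path} is crawlable but should be blocked")
    for path in must_allow:
        if not parser.can_fetch(agent, path):
            problems.append(f"{path} is blocked but should be crawlable")
    return problems


# A hypothetical file that forgets /cart/ and accidentally blocks /articles/.
ROBOTS = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /print/
Disallow: /articles/
"""

for problem in audit(
    ROBOTS,
    must_block=["/cgi-bin/", "/wp-admin/", "/cart/", "/print/"],
    must_allow=["/", "/articles/seo.html"],
):
    print(problem)
```

Here the audit flags /cart/ as crawlable and /articles/seo.html as blocked, which are the two kinds of slip the list warns about.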

Here's what you should not do with robots.txt:

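  • Don't clutter your robots.txt file with comments. The format allows them (text after a "#"), but robots ignore them and the file is publicly readable.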
  • Don't list all your files in the robots.txt file. Listing the files allows people to find files that you don't want them to find.
  • Don't add an "Allow" line just to permit crawling. The original robots.txt standard has no such command, and anything you don't disallow is open to robots by default.

By taking a good look at your Web site's robots.txt file and making sure that the syntax is set up correctly, you can avoid many search engine ranking problems. By disallowing the search engines from indexing duplicate content on your Web site, you can potentially overcome duplicate content issues that might hurt your search engine rankings.

One last note: if you aren't sure whether you can do this correctly, please consult with an SEO specialist.


Biography
Mark Jackson

Mark Jackson is President and CEO of Vizion Interactive, a search engine optimization company. Mark joined the interactive marketing fray in early 2000. His journey began with Lycos/Wired Digital and then AOL/Time Warner. After having witnessed the bubble burst and its lingering effects on stability on the job front (learning that working for a "large company" does not guarantee you a position, no matter your job performance), Mark established an interactive marketing agency and has cultivated it into one of the most respected search engine optimization firms in the United States.

Vizion Interactive was founded on the premise that honesty, integrity, and transparency forge the pillars that strong partnerships should be based upon. Vizion Interactive is a full service interactive marketing agency, specializing in search engine optimization, search engine marketing/PPC management, SEO friendly Web design/development, social media marketing, and other leading edge interactive marketing services, including being one of the first 50 beta testers of Google TV.

Mark is a board member of the Dallas/Fort Worth Search Engine Marketing Association (DFWSEM), a member of the Dallas/Fort Worth Interactive Marketing Association (DFWIMA), and a regular speaker at the Search Engine Strategies and Pubcon conferences.

Mark received a BA in Journalism/Advertising from The University of Texas at Arlington in 1993 and spent several years in traditional marketing (radio, television, and print) prior to venturing into all things "Web."

