Managing Your Robots.txt File Effectively

Date published: 22 July 2014

Author: Josh Mccoy

The robots.txt file has been in the headlines of late: we celebrated its 20th birthday last month, and Google Webmaster Tools updated its Blocked URLs section with the new Robots.txt Tester, which helps you spot the errors or warnings Google finds in your robots.txt file.

For a seasoned search marketer, the robots.txt file may seem like a simple but foundational element of SEO. For those with less knowledge of this file, simple robots.txt mistakes may prevent search engines from crawling your sites.

Robots.txt and Why We Need It

The robots.txt file tells search engines which pages, site sections, or types of pages they shouldn’t spend time crawling. Managed improperly, it can be dangerous to SEO. Used well, it is a benefit: it points Google away from content that isn’t critical to the search index, duplicate content, and anything else you don’t want crawled.

Be Very, Very Careful

While this is a handy tool for a webmaster, you have to understand how to use the robots.txt file and how to test it. There are three types of robots.txt directives:

Page Level

Disallow: /examplepage.html

Folder Level

Disallow: /example-folder/

Wildcard Directive

Any child page of a folder (Disallow: /example-folder/*)

A file type (Disallow: /*.pdf)
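
To make the three directive types concrete, here is a minimal Python sketch of how a crawler that honors wildcards might match a Disallow path against a URL path. It assumes the wildcard behavior that Google and Bing document, where * stands for any run of characters and rules match from the start of the path. The function names rule_to_regex and rule_matches are hypothetical, and the trailing $ anchor is an extra that the directives above don't use. Treat it as an illustration, not as any search engine's actual implementation.

```python
import re


def rule_to_regex(rule_path):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters, and a trailing '$'
    anchors the end of the URL path.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape everything literally, then turn each '*' into '.*'.
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    if anchored:
        pattern += "$"
    return re.compile(pattern)


def rule_matches(rule_path, url_path):
    """Return True if the Disallow pattern covers the given URL path."""
    # re.match anchors at the start, which gives prefix matching.
    return rule_to_regex(rule_path).match(url_path) is not None
```

With this sketch, rule_matches("/example-folder/", "/example-folder/page.html") and rule_matches("/*.pdf", "/docs/whitepaper.pdf") both return True, while rule_matches("/*.pdf", "/docs/whitepaper.html") returns False. Because matching is by prefix, the folder-level rule and the /example-folder/* wildcard rule cover the same URLs.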

Common Mistakes

Below are common robots.txt errors that should be avoided.

Disallow: /

Launching a site with the staging robots.txt carried over to production, which disallows crawling of the entire site.

Disallow: /images/

Disallow: /videos/

Disallowing folders that hold indexable content such as site images or videos. The first sign is often a drop in, or a complete absence of, image impressions in the Search Queries data in Google Webmaster Tools.

Disallow: /*.css

Disallow: /*.js

Disallowing search engines from accessing your CSS and JavaScript file locations. Withholding your page template code from search engines can make you look sketchy to the bots.

Disallow: /*.pdf

Disallow: /*.doc

Disallowing a page type just because it isn’t an HTML page. Hey, I hate PDFs, too, but they rank and can initiate visits to the site from search results.
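
The mistakes above are easy to check for before a robots.txt file goes live. Below is a rough Python sketch of such a check. The RISKY_RULES table and the find_risky_rules function are hypothetical names made up for this example, and the list only covers the patterns named in this article. It also ignores which User-agent group a rule belongs to, so treat a hit as a prompt to look closer rather than as proof of a problem.

```python
# Hypothetical lint: the rule list mirrors the mistakes described above.
RISKY_RULES = {
    "/": "blocks the entire site",
    "/images/": "blocks image content",
    "/videos/": "blocks video content",
    "/*.css": "blocks stylesheets",
    "/*.js": "blocks JavaScript",
    "/*.pdf": "blocks PDF documents",
    "/*.doc": "blocks Word documents",
}


def find_risky_rules(robots_txt):
    """Return (line number, path, reason) for each risky Disallow line."""
    findings = []
    for line_no, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue  # not a Disallow directive
        path = value.strip()
        if path in RISKY_RULES:
            findings.append((line_no, path, RISKY_RULES[path]))
    return findings


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /\nDisallow: /blog/\nDisallow: /*.css\n"
    for line_no, path, reason in find_risky_rules(sample):
        print(f"line {line_no}: Disallow: {path} ({reason})")
```

Running a check like this as part of a site launch is one way to catch a staging robots.txt before it reaches production.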

Managing Robots.txt

Yes, Google has done a better job of helping you manage your robots.txt file. But, let’s first take a look at Bing.

The Crawl Information section of Bing Webmaster Tools shows content excluded by robots.txt. It’s quite good because you’re able to see all the pages that are excluded from search engine view, along with the link authority that search engines aren’t considering. See the example below, which shows that the first excluded page has 295 inbound links pointing to it.

Bing Webmaster Tools Robots.txt Exclusion

Aside from Bing Webmaster Tools, you can also review what the SEMrush Beta Site Audit shows as the URLs that are excluded via robots.txt.

SEMrush Blocked from Crawling

If you would like to view similar data, there is yet another tool you can use. While SEMrush helps you look at on-site SEO issues and competitive intelligence, Siteliner lets you review robots.txt exclusions while delving into on-site content duplication issues. One caveat: alongside the inbound link equity each page holds, Siteliner also provides a “Page Power” grade, in other words, how heavily the page is linked across the site.

Siteliner Skipped Pages

Last, Google Webmaster Tools provides the Robots.txt Tester. The other tools help you understand what you’re withholding, while Google shows you what it considers to be an error in the file. It also gives immediate feedback on whether the test pages you enter would be excluded.

Google Webmaster Tools robots.txt Tester
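
The idea of testing sample pages against a robots.txt file can also be approximated locally. The sketch below uses Python’s standard urllib.robotparser module. The check_urls helper, the sample rules, and the test paths are assumptions made up for this example. Note that this parser does plain prefix matching and does not interpret * wildcards, so it is only a rough stand-in for Google’s tester, not a replica of it.

```python
from urllib.robotparser import RobotFileParser


def check_urls(robots_txt, urls, user_agent="Googlebot"):
    """Map each URL path to True (crawlable) or False (disallowed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text, no network fetch
    return {url: parser.can_fetch(user_agent, url) for url in urls}


# Hypothetical rules, reusing the page-level and folder-level examples.
ROBOTS = """\
User-agent: *
Disallow: /example-folder/
Disallow: /examplepage.html
"""

results = check_urls(
    ROBOTS,
    ["/", "/examplepage.html", "/example-folder/child.html", "/blog/post.html"],
)

for url, allowed in results.items():
    print(f"{url}: {'allowed' if allowed else 'blocked'}")
```

With these sample rules, the home page and the blog post come back as allowed, while the page-level and folder-level examples come back as blocked.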

Conclusion

Used correctly, robots.txt helps search engines crawl your site. Blocking a page does not immediately remove it from search engines the way the noindex meta tag does, but over time pages that are no longer crawled will begin to fall out of the index.

Hopefully this article has helped take your Google blinders off and shown you that there are other robots.txt tools out there. I also hope you can now see which simple robots.txt mistakes could be removing content from crawling and hiding link equity from view.
