The robots.txt file has been in the headlines of late: we celebrated its 20th birthday last month, and Google Webmaster Tools updated its Blocked URLs section with the new Robots.txt Tester, which gives you insight into the errors and warnings Google finds in your robots.txt file.
For a seasoned search marketer, the robots.txt file may seem like a simple but foundational element of SEO. For those less familiar with the file, simple robots.txt mistakes can prevent search engines from crawling your site.
Robots.txt and Why We Need It
The robots.txt file tells search engines which pages, site sections, or types of pages they shouldn’t spend time crawling. It can be dangerous to SEO if managed improperly, but it is a benefit when you use it to point Google away from content that isn’t critical to the search index, duplicate content, and other content you don’t want crawled.
Be Very, Very Careful
While this is a handy tool for a webmaster, you have to understand how to use the robots.txt file and how to test it. A Disallow directive can target any of the following:
A single page (Disallow: /examplepage.html)
A folder (Disallow: /example-folder/)
Anything as a child page of a folder (Disallow: /example-folder/*)
A file type (Disallow: /*.pdf)
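To make the patterns above concrete, here is a minimal Python sketch of how a Disallow path can be matched against a URL path. It assumes the wildcard conventions the major search engines document ("*" matches any run of characters, a trailing "$" anchors the end of the URL, and everything else is a prefix match). The function names rule_to_regex and is_disallowed are my own, and a real crawler also weighs Allow rules and rule length, which this sketch ignores.

```python
import re


def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a Disallow path into a compiled regex.

    Assumed conventions: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL path, and anything else is a literal prefix.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    pattern = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(rule_path: str, url_path: str) -> bool:
    """Return True if a single Disallow rule blocks the given URL path."""
    if not rule_path:
        # An empty 'Disallow:' value blocks nothing.
        return False
    # re.match anchors at the start, which gives prefix matching.
    return rule_to_regex(rule_path).match(url_path) is not None
```

With this sketch, is_disallowed("/example-folder/", "/example-folder/page.html") and is_disallowed("/*.pdf", "/docs/guide.pdf") both return True, while is_disallowed("/*.pdf", "/docs/guide.html") returns False.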
Common Mistakes
Below are common robots.txt errors that should be avoided.
Disallow: /
A common launch mistake: the staging site’s robots.txt is carried over to the production site, disallowing the entire site from being crawled.
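One way to catch this before launch is a small check in your deployment routine. The sketch below uses Python's standard urllib.robotparser module to ask whether the site root is fetchable. The helper name blocks_entire_site is hypothetical, and you would feed it the text of whatever robots.txt is about to go live.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text blocks the site root
    for the given user agent."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


staging_rules = "User-agent: *\nDisallow: /"
production_rules = "User-agent: *\nDisallow: /example-folder/"

print(blocks_entire_site(staging_rules))     # True: do not ship this file
print(blocks_entire_site(production_rules))  # False
```

Note that urllib.robotparser does plain prefix matching and does not interpret "*" wildcards inside paths, so treat this as a guard against the blanket Disallow: / case only.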
Disallow: /images/
Disallow: /videos/
This disallows folders that hold indexable content, such as site images or videos. You may first catch this as a drop in, or complete absence of, image impressions in the Search Queries report in Google Webmaster Tools.
Disallow: /*.css
Disallow: /*.js
This disallows search engines from accessing your CSS and JavaScript files. Withholding your page template code from search engines can make you look sketchy to the bots.
Disallow: /*.pdf
Disallow: /*.doc
This disallows a file type just because it isn’t an HTML page. Hey, I hate PDFs, too, but they rank and can bring visits to the site from search results.
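The folder and file-type mistakes above are easy to scan for. Here is a rough Python sketch that flags Disallow lines touching media folders or asset and document file types. The lists RISKY_FOLDERS and RISKY_SUFFIXES simply mirror the examples in this article and are assumptions you would adjust for your own site, and find_risky_rules is a hypothetical helper name.

```python
# Assumed lists, taken from the examples in this article; adjust per site.
RISKY_FOLDERS = ("/images/", "/videos/")
RISKY_SUFFIXES = (".css", ".js", ".pdf", ".doc")


def find_risky_rules(robots_txt: str) -> list[str]:
    """Return the Disallow paths that block media folders or the
    asset/document file types listed above."""
    risky = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        path = value.strip()
        # Ignore a trailing '$' anchor when checking the file extension.
        if path in RISKY_FOLDERS or path.rstrip("$").endswith(RISKY_SUFFIXES):
            risky.append(path)
    return risky


sample = """User-agent: *
Disallow: /images/
Disallow: /*.css
Disallow: /private/
Disallow: /*.pdf$  # old brochures
"""
print(find_risky_rules(sample))  # ['/images/', '/*.css', '/*.pdf$']
```

A flagged rule is not automatically wrong; the point is to make you look at each one and confirm it is intentional.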
Managing Robots.txt
Yes, Google has done a better job of helping you manage your robots.txt file. But, let’s first take a look at Bing.
The Crawl Information section of Bing Webmaster Tools shows content excluded by robots.txt. It’s quite good because you can see all the pages that are excluded from search engine view, along with the link authority that search engines aren’t considering. In the example, the first excluded page has 295 inbound links pointing to it.
Aside from Bing Webmaster Tools, you can also review what the SEMrush Beta Site Audit shows as the URLs that are excluded via robots.txt.
If you would like to view similar data, there is yet another tool you can use. While SEMrush helps you look at on-site SEO issues and competitive intelligence, Siteliner lets you review robots.txt exclusions while delving into on-site content duplication issues. One difference with this tool is that, besides showing the inbound link equity each page holds, it also provides a “Page Power” grade, in other words, how heavily the page is linked across the site.
Last, Google Webmaster Tools provides the Robots.txt Tester. Other tools help you understand what you’re withholding, but Google shows you what it considers to be an error in the file. The tester also gives immediate insight into whether the test pages you provide would be excluded.
Conclusion
Used correctly, robots.txt can help you aid search engines with site crawling. It won’t immediately remove content from search engines the way the noindex meta tag does, but over time pages that are no longer crawled will begin to fall out of the index.
Hopefully this article has helped take your Google blinders off and shown that there are other robots.txt tools out there. I also hope you can now see which simple robots.txt mistakes could be removing content from crawling and link equity from view.