Website Indexation Audit: How to Find & Remove Non-Essential Content

There's been a lot of talk about quality content in the SEO world for the past couple of years. We do our best to be mindful of Google's Content Quality Guidelines and know that a human rater may review the page quality of the content we provide for consumption by users.

Looking at the Big Picture

Creating fresh content that is insightful, entertaining, or informational is important. But that's not all you have to worry about. What about the content that's already on your site?

Site managers and marketers can become complacent with their sites, viewing the same top level and product/service pages, blog pages, news pages, and so on. Looking at your entire index through the lens of a website indexation audit will allow you to gain an understanding of what Google sees and indexes.

How many junky, non-essential pages make up your site as a whole?

Finding Non-Essential Content

The first step in a site indexation audit is completing a crawl of your site’s URLs. My favorite tool for this is Screaming Frog.

While all the URLs you discover might not be indexed in search engines, this list will show you what you're putting out there for crawler consumption.

A URL scrape may unearth some surprising results. Perhaps you'll find that you need to redirect a lot of pages or use robots.txt to exclude content that has no place being seen by the search engine bots.

Review the URL scrape list for:

  • Paginated result pages: These are often named with numbers such as /1/, /2/, and so on. This is non-essential content that will likely carry a duplicated title element. At the least, each of these pages should contain a canonical tag in its page source pointing the ranking focus to the absolute page. I would noindex these pages.
  • Case Duplication: These are URLs that exist in both upper and lower case versions. This is sloppy duplication, and the upper case versions should be redirected to the lower case version of the URL.
  • SSL Duplication: Transactional pages should be served over SSL (Secure Sockets Layer), but there are times when every site page is duplicated at both a secure and a non-secure URL. In that case, the non-transactional pages should be forced to render only at their non-SSL URLs.
  • www/non-www: Ensure that all site pages redirect to either the www or the non-www version. This is a common culprit of duplication on websites.
  • Shopping Cart Dynamic Pages: Depending on your ecommerce cart platform, you may see pages that are similar to the absolute/true product pages, but these pages (e.g., /browse/, sort=, mode=, etc.) are used for product sorting or comparing. These are a good example of non-essential pages and should be excluded so they don't get indexed.
  • Blog Tag/Category Pages: These pages are a regurgitation of blog teaser content snippets and don't need to be featured in search results. Noindex tagging or robots exclusion should be considered in this case.
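
If your crawl export runs to thousands of URLs, a short script can do a first pass over the checklist above. The following is a minimal Python sketch, not a finished tool. The URL patterns (a trailing number for pagination, /tag/ and /category/ for blog taxonomy, /browse/, sort= and mode= for cart pages) are assumptions for illustration and will need adjusting to your own platform.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs

# Assumed patterns for illustration; adjust to your platform's URL structure.
PAGINATION = re.compile(r"/\d+/?$")          # paths ending in /1/, /2/, ...
SORT_PARAMS = {"sort", "mode"}               # cart sorting/comparison parameters
TAXONOMY_PREFIXES = ("/tag/", "/category/")  # blog tag and category archives


def classify(url):
    """Return the non-essential content labels that apply to one URL."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    labels = []
    if PAGINATION.search(parts.path):
        labels.append("pagination")
    if parts.path != parts.path.lower():
        labels.append("uppercase-path")
    if parts.path.startswith("/browse/") or SORT_PARAMS & params:
        labels.append("cart-dynamic")
    if parts.path.startswith(TAXONOMY_PREFIXES):
        labels.append("blog-taxonomy")
    return labels


def find_duplicate_variants(urls):
    """Group URLs that differ only by scheme, a leading www., or letter case."""
    groups = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        host = parts.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        groups[(host, parts.path.lower(), parts.query)].add(url)
    # Keep only the keys that were reached through more than one URL variant.
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}
```

Run classify over every URL in the export to tag candidates, and find_duplicate_variants over the full list to surface SSL, www and case duplicates that need a redirect. Treat the output as a list of suspects to review by hand: a numeric product ID at the end of a path, for instance, would be mislabeled as pagination by this pattern.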

Is it Indexed?

Now, take some of the folder names and page names found above and run a site: operator search in Google and Bing to assess whether these pages are indexed. In Bing you can also review the Index Explorer to assess indexed pages. (Google, why don’t you have this?)
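
If you have a long list of folders to check, you can generate the site: queries rather than typing each one. This is a small Python sketch that only builds the query strings and search URLs for you to open by hand. The domain and folder names are placeholders, and it assumes each engine accepts a path after the domain in a site: query, which is worth confirming on your own results.

```python
from urllib.parse import quote_plus


def site_queries(domain, path_prefixes):
    """Build a site: query and matching search URLs for each folder to check."""
    queries = []
    for prefix in path_prefixes:
        query = f"site:{domain}{prefix}"
        queries.append({
            "query": query,
            "google": "https://www.google.com/search?q=" + quote_plus(query),
            "bing": "https://www.bing.com/search?q=" + quote_plus(query),
        })
    return queries


# Placeholder domain and folders; substitute the ones from your own crawl.
for item in site_queries("example.com", ["/tag/", "/browse/"]):
    print(item["query"], item["google"], item["bing"])
```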

If these non-essential pages are getting indexed, you can now start making preparations for robots.txt exclusion, meta robots usage, or at the least canonical tag utilization.
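
Before publishing new robots.txt rules, it helps to test a draft against the URLs from your crawl so you know exactly what would be blocked. The sketch below uses Python's standard urllib.robotparser module. The draft rules and folder names are assumptions for illustration, and this parser only does simple path-prefix matching, so wildcard rules that some search engines accept are not evaluated here.

```python
from urllib.robotparser import RobotFileParser

# A draft rule set for illustration only; the folder names are assumptions.
DRAFT_ROBOTS_TXT = """\
User-agent: *
Disallow: /browse/
Disallow: /tag/
Disallow: /search/
"""


def blocked_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text would stop a crawler fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]
```

Feed it the crawl list and check two things: that the junk is blocked, and that no true product or service page is caught by accident. Keep in mind that robots.txt stops crawling, while a meta robots noindex tag has to be crawled to be seen, so applying both to the same URL tends to defeat the tag.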

Before you take any leaps of exclusion or canonicalization, you have to review organic landing pages in your analytics profile. Use filters to search for the above content within Organic Landing Pages. It is highly unlikely that Google is crediting this content and that it is driving traffic, but if for some reason it is, consider the consequences of removing this content from the indices.
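
If your analytics tool can export organic landing pages to CSV, the same filtering can be scripted. This is a minimal Python sketch under stated assumptions: the column names landing_page and sessions are hypothetical, the marker strings are illustrative, and the sample rows are made up purely to show the shape of the input.

```python
import csv
import io

# Assumed URL fragments that mark non-essential content; adjust to your site.
NON_ESSENTIAL_MARKERS = ("/browse/", "/tag/", "/category/", "sort=", "mode=")


def traffic_at_risk(csv_text, markers=NON_ESSENTIAL_MARKERS):
    """Return (landing_page, sessions) for flagged pages with traffic, busiest first."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        page = row["landing_page"]        # hypothetical column name
        sessions = int(row["sessions"])   # hypothetical column name
        if sessions > 0 and any(marker in page for marker in markers):
            rows.append((page, sessions))
    return sorted(rows, key=lambda item: item[1], reverse=True)


# Made-up sample export, only to illustrate the expected layout.
SAMPLE_EXPORT = """landing_page,sessions
/products/widget,120
/tag/seo/,4
/browse/widgets?sort=price,37
/category/news/,0
"""

for page, sessions in traffic_at_risk(SAMPLE_EXPORT):
    print(page, sessions)
```

Any page this returns is one to look at before excluding it, since it is non-essential by pattern but still receiving organic visits.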

While you're reviewing some of the above page types in potential indexation situations, it doesn’t hurt to review your overall index results in Google or Bing. You might be surprised at what shows up.

Do you see those pages from the old site eight years ago that never got redirected and are long forgotten in a dark corner on the server? Do you notice that many of your site’s site search result pages are being indexed? These are a few good examples of the types of content you may not know are getting indexed by search engines.

Clean Up Your Site & Take Out the Garbage!

Don't make web crawlers weed through the chaff to figure out what your site is about. Take the time to take out the garbage your site has been accumulating and feeding the search engines.

After you've polished your overall site theme, you can get back to creating quality content that will hopefully have a better chance of “shining” on your site because you found and removed all of the non-essential content.