This is the third in our series of articles about SEO on Web sites with tens of thousands to millions of pages. Large sites require a content management system (CMS). From an SEO perspective, problems can arise from the duplicate content a CMS creates.
CMS software is a wonderful tool for managing the content on a large site, a task you certainly wouldn't want to take on by hand.
Top 4 Duplicate Content Problems
- Search engines prefer to show only one version of a particular document in their SERPs, so they simply filter out the duplicates. Having your duplicates filtered out is only the first problem.
- When search engines come to your site, they have a plan regarding how many pages they are going to crawl. Maybe today they'll look at 1,000 pages. If 300 of those are duplicate content, 30 percent of that "crawl budget" has been spent on pages that will never rank highly in their index, if at all. If you have a site with 100,000 pages, wasting your crawl budget becomes a big problem.
- Wasted link juice occurs because pages on your site that link to duplicate content are voting some of their link juice for those pages, even though they'll never rank. You'd be much better off if those pages instead voted that link juice for pages that might rank.
- Sites with a high percentage of duplicate content may simply get penalized because the search engines perceive them as low-value Web sites.
How a CMS Duplicates Content
There are a few ways a CMS might render duplicate content on your site. I covered one of these in detail in "SEO Hell, A CMS Production."
Sites that use a multi-tiered hierarchy, such as a product catalog, may be designed to start with a category, then a sub-category, and then a product. The basic way you end up accessing a product is at a URL such as: www.yourdomain.com/category/subcategory/product.html.
No problem yet. You run into problems when the same product page can also be accessed using www.yourdomain.com/subcategory/product.html or www.yourdomain.com/product.html. This becomes a much bigger problem when the CMS itself actually uses the different variants of the URL in its links and therefore exposes them to search engine crawlers.
In similar fashion, you might be able to access a page using both www.yourdomain.com/category/subcategory/product.html and www.yourdomain.com/category/subcategory/product.html?id=12343. Once again, this becomes a big issue when the CMS itself uses varying versions of the URL and makes them visible to crawlers.
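To see how many URL variants point at the same page, you can group a crawl export by a normalized key. The following is a minimal Python sketch, not a feature of any particular CMS. It assumes that the last path segment (product.html) identifies the page and that the query string does not change what is rendered; both assumptions need to be checked against your own site. The function names and the sample URL list are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def canonical_key(url: str) -> str:
    """Return a key that is identical for URL variants of the same page.

    Assumption: the last path segment identifies the page, and the
    query string does not change the content that is rendered.
    """
    # urlsplit only recognizes the host when a scheme is present.
    parts = urlsplit(url if "://" in url else "http://" + url)
    last_segment = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return f"{parts.netloc.lower()}/{last_segment}"


def find_duplicate_groups(urls):
    """Group URLs by canonical key and keep only groups with more than one URL."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


# Hypothetical crawl export using the URL patterns discussed above.
crawled = [
    "www.yourdomain.com/category/subcategory/product.html",
    "www.yourdomain.com/subcategory/product.html",
    "www.yourdomain.com/product.html",
    "www.yourdomain.com/category/subcategory/product.html?id=12343",
    "www.yourdomain.com/category/index.html",
]

duplicates = find_duplicate_groups(crawled)
for key, members in duplicates.items():
    print(key, "is reachable at", len(members), "URLs")
```

On a real site, two different products could share a file name, so treat the groups this produces as candidates to review rather than confirmed duplicates.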
Session IDs represent an old form of this problem, but you still see it from time to time. A session ID is a parameter appended to the end of a URL by the CMS on a user-by-user basis. In other words, every visitor to the site gets their own unique session ID.
Older CMSs used session IDs to track each visitor's visit. This is extremely helpful if a user has already made several selections on the site, for example, by entering a zip code. The system wants to ensure it remembers that data as the user surfs around the site.
However, it produces a deadly form of duplicate content, because the search engine crawler is treated like any other visitor and sees a different set of URLs for the site on every single visit it makes.
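The effect is easy to demonstrate. The Python sketch below strips a session parameter from URLs so that two visits to the same page collapse to one address. The parameter names in SESSION_PARAMS are assumptions chosen for illustration; check which name, if any, your own CMS appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names; your CMS may use a different one.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def strip_session_id(url: str) -> str:
    """Return the URL with any session ID query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# The same page as seen on two separate visits.
visit_one = "http://www.yourdomain.com/product.html?sid=abc123"
visit_two = "http://www.yourdomain.com/product.html?sid=xyz789"

print(visit_one == visit_two)                                      # False
print(strip_session_id(visit_one) == strip_session_id(visit_two))  # True
```

Without the stripping step, every visit yields a URL the crawler has never seen before, even though the page behind it is unchanged.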
Date and category archiving of content on a blog platform such as WordPress can create lots of duplicate content pages. The most extreme cases of this are when you have a category or date archive (e.g. July 2007) with only one article in it. As a result, that page ends up being an exact duplicate of the permalink page for the post itself.
However, even a category with five posts in it isn't adding any new content value to the site. After all, everything on that category page is content found on other pages of the site.
The devil is in the details here. The way to resolve these issues depends on the particular CMS. Sometimes you can configure the CMS itself to eliminate some of these problems.
Other times, you need to use your robots.txt file to tell the search engines you don't want them to crawl the pages in question.
For example, if you don't want the search engines to crawl your date-based archive pages, you have to specify that group of pages in your robots.txt file using the Disallow directive. This is definitely a task for a programmer, and you want to make sure this is done absolutely correctly, as making a mistake in your robots.txt file has the potential to be disastrous.
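One way to lower that risk is to test a draft robots.txt before publishing it. The sketch below uses Python's standard urllib.robotparser module to confirm that a rule blocks the archive pages and nothing else. The /archives/date/ prefix and the sample paths are hypothetical; substitute the URL pattern your CMS actually uses for date archives.

```python
from urllib.robotparser import RobotFileParser

# Draft rules. The /archives/date/ prefix is a hypothetical location
# for date-based archive pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archives/date/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(path: str) -> bool:
    """Report whether a generic crawler may fetch the given path under the draft rules."""
    return parser.can_fetch("*", "http://www.yourdomain.com" + path)


# Pages that should be blocked, and pages that must stay crawlable.
print(is_crawlable("/archives/date/2007/07/"))             # False
print(is_crawlable("/category/subcategory/product.html"))  # True
print(is_crawlable("/"))                                   # True
```

Disallow rules match by path prefix, so a rule that is too short (for example, a bare "Disallow: /") blocks far more than intended. Checking a handful of pages that must remain crawlable catches that kind of mistake early.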
If you are trying to clean up some of the duplicate content problems inherent in WordPress, you can use Joost de Valk's Meta Robots WordPress plugin. This greatly simplifies and speeds up the cleanup of the duplicate content created by WordPress.
Ultimately, you need to understand the exact nature of the problem on your site, and devise a strategy for coping with it. It's not easy to do, but the benefits it brings to your search engine strategy are well worth the trouble.