This is the third in our series of articles about SEO on Web sites with tens of thousands to millions of pages. Large sites require Content Management Systems (CMS). From an SEO perspective, problems can arise from duplicate content created by a CMS.
CMS software is a wonderful tool in managing the content on a large site, a task you certainly wouldn’t want to take on by hand.
Top 4 Duplicate Content Problems
- Search engines prefer to show only one version of a particular document in their search engine results pages (SERPs), so they simply filter out the duplicates. Being filtered out is only the first problem.
- When search engines come to your site, they have a plan for how many pages they are going to crawl. Maybe today they’ll look at 1,000 pages. If 300 of those are duplicate content, you’ve wasted part of your “crawl budget” on pages that will never rank highly in their index, if at all. On a site with 100,000 pages, wasted crawl budget becomes a big problem.
- Wasting link juice occurs because pages on your site that link to duplicate content are voting some of their link juice for pages that will never rank. You’d be much better off if those pages instead voted that link juice for pages that might rank.
- Sites with a high percentage of duplicate content may simply get penalized as a result of being perceived by the search engines as a low-value Web site.
How a CMS Duplicates Content
There are a few ways a CMS might render duplicate content on your site. I covered one of these in detail in “SEO Hell, A CMS Production.”
Sites that use a multi-tiered hierarchy, such as a product catalog, may be designed to start with a category, then a sub-category, then a product. The basic way you end up accessing a product is at a URL such as: www.yourdomain.com/category/subcategory/product.html.
No problem yet. Where you run into problems is when you can also access the same product page using www.yourdomain.com/subcategory/product.html or www.yourdomain.com/product.html. This becomes a much bigger problem when the CMS itself actually uses the different variants of the URL and thereby exposes them to search engine crawlers.
In similar fashion, you might be able to access a page using www.yourdomain.com/category/subcategory/product.html, and www.yourdomain.com/category/subcategory/product.html?id=12343. Once again, this becomes a big issue when the CMS actually uses varying versions of the URL itself and makes them visible to crawlers.
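As a rough illustration, if the ?id= variants can’t be suppressed in the CMS itself, a wildcard Disallow rule can keep crawlers off them. Keep in mind that wildcard patterns are an extension honored by Googlebot and some other major crawlers, not part of the original robots.txt standard, and the pattern below assumes the id parameter appears exactly as in the example URL above:

```
User-agent: Googlebot
# Block any URL whose query string starts with id=
Disallow: /*?id=
```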
Session IDs represent an old form of this problem, but you still see it from time to time. A session ID is a parameter the CMS appends to the end of a URL on a user-by-user basis. In other words, every visitor to the site gets their own unique session ID.
Older CMSes used this for tracking each visitor’s session. This is extremely helpful if a user has made several selections on the site already, for example, and entered a zip code. The system wants to ensure it remembers that data as the user surfs around the site.
However, it produces a deadly form of duplicate content, because the search engine sees a different set of URLs for the site for every single visit that person makes.
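To see why this multiplies URLs, and what normalizing them looks like, here is a minimal sketch in Python. It is purely illustrative: the parameter names are placeholders, since real platforms use names like PHPSESSID or jsessionid, so adjust the list for your own CMS. Stripping the session parameter collapses every visitor’s URLs back to a single version:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical session parameter names -- substitute whatever
# your CMS actually appends (e.g. PHPSESSID, jsessionid).
SESSION_PARAMS = {"sid", "phpsessid"}

def canonicalize(url: str) -> str:
    """Strip session-ID parameters so per-visitor URL
    variants collapse to one canonical URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

# Two visitors, two session IDs, one underlying page:
a = canonicalize("http://www.yourdomain.com/product.html?sid=a1b2c3")
b = canonicalize("http://www.yourdomain.com/product.html?sid=x9y8z7")
```

Without normalization like this (or, better, not exposing session IDs to crawlers at all), the engine sees a and b as two distinct pages with identical content.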
Date and category archiving of content on a blog platform such as WordPress can create lots of duplicate content pages. The most extreme case is a category or date archive (e.g. July 2007) with only one article in it. That archive page ends up being an exact duplicate of the permalink page for the post itself.
However, even a category with five posts in it isn’t adding any new content value to the site. After all, all of the content on that category page is found on other pages of the site.
The devil is in the details here. The way to resolve these issues depends on the particular CMS. Sometimes you can configure the CMS itself to eliminate some of these problems.
Other times, you need to use your robots.txt file to tell the search engines you don’t want them to crawl the pages in question.
For example, if you don’t want the search engines to crawl your date-based archive pages, you have to specify that group of pages in your robots.txt file using the Disallow directive. This is definitely a task for a programmer, and you want to make sure it’s done absolutely correctly, as a mistake in your robots.txt file has the potential to be disastrous.
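For instance, assuming a WordPress-style permalink structure where date archives live at URLs like /2007/07/, the rules might look like the sketch below. This is only an illustration; the exact paths depend entirely on your permalink settings, and a wrong path here would block legitimate pages:

```
User-agent: *
# Keep crawlers out of date-based archive pages
Disallow: /2007/
Disallow: /2008/
```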
If you are trying to clean up some of the duplicate content problems inherent in WordPress, you can use Joost de Valk’s Meta Robots WordPress plugin. It greatly simplifies and speeds up the cleanup of the duplicate content created by WordPress.
Ultimately, you need to understand the exact nature of the problem on your site, and devise a strategy for coping with it. It’s not easy to do, but the benefits it brings to your search engine strategy are well worth the trouble.