We all know that duplicate content is a bad thing, right? But what is it? And why is it bad?
To fully understand the issue of duplicate content, we need to look at it not from the context of a website owner, but rather from the perspective of a search engine trying to provide the best possible experience to its users. From that standpoint we can then carry the principles forward and consider how duplicate content can impact a website and its organic traffic.
Understanding Duplicate Content
Because this is a basics article, we're going to keep things simple (or at least, as simple as anything in SEO can be):
Duplicate content is content which appears in more than one location on the Internet.
Inherently there is nothing specifically wrong with duplicate content.
Let's take for example an article on blue widgets written by Bill over at abcwidgets.com. I run xyzwidgets.com and really like the article. With Bill's permission I copy the article into my site providing proper reference to its source. What's wrong with that? Legally, ethically, and even from a business standpoint? Nothing.
In this example I'd decided that the content is so useful I'd like to share it with my own visitors, but want to keep them on my site. But what happens when I view the same scenario from the perspective of a search engine?
The question then has to be posed: which of these two articles deserves to rank, and how does the engine know?
Many factors come into play at this point – the amount of duplicated content on the page and on the site as a whole, the relative strength of the sites, and which copy got seen first. But at its core, I always assume that the last factor here (first seen) gets the credit. This isn't always the case, but if we have to pick a rule-of-thumb, this works.
So what happens to my site with the duplicated page?
- It won't rank for that page.
- The weight of that page will be negligible.
- A point against the site as a reliable source of quality, unique content will be registered.
Now, this may seem unfair (and you'd have a decent argument), but we have to remember that the example above isn't the sum total of what the engines have to deal with. We'll be discussing below a few of the more common "ethical" duplicate content issues, but it's important to keep in mind that not all strategies that have been used are in the best interest of searchers or even visitors.
Whole networks of websites have been built that focus only on duplicating content found elsewhere on the web in hopes of capturing search traffic. They're not built to add value and they generally don't.
One need only consider mass article syndication to get a feel for where it went wrong in the eyes of the engines: the same content on hundreds of sites, with little or no quality control, across a massive range of subjects. Essentially we had pages with no value to the user and even less to Google, all in an effort by the submitters to game the system for links and by the article sites to attract traffic for their impression-based ads.
To deal with all these types of issues, search engines had to adjust the way they valued duplicate content. Let's remember, they have to use an algorithm, and algorithms aren't great at making exceptions.
So, knowing that we're not here to argue with the engines about who's right, and knowing that being right or wrong doesn't earn you traffic in this context, we need to ensure that even if what we're doing is right for our visitors, it can't be confused with something that isn't. Fortunately there are methods in place to deal with the variety of duplicate content types. So let's explore those.
While there is a wide array of duplicate content types, the majority of sites contain one or more of only a handful. Here we're going to look at the most common types of duplicate content and discuss how to address them and what this means to the site owner.
Duplicated Articles

The Situation: Let's start with the example we touched on above. I'm a site owner who has found a great piece of content on a different site I would like to share on my site.
The Issue: The issue you'll face is that this content is going to be valued poorly on your site and may contribute to an overall drop in the domain's perceived quality.
The Fix: A cross-domain canonical tag is the only fix here. You'll need to add a canonical tag to the page indicating that the original source of the content is at a different location. It would resemble:
<link rel="canonical" href="http://www.abcwidgets.com/copied-article.html"/>
This will tell the engines that you know the article is copied, it is intentionally placed on your site and all link weight to that page should pass to the original location of the article.
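As a quick sketch, the tag sits in the head of the copied page; the domains and page titles here are the hypothetical ones from the example above:

```html
<!DOCTYPE html>
<html>
<head>
  <title>Blue Widgets Explained</title>
  <!-- Cross-domain canonical: tells the engines the original article
       lives on abcwidgets.com and link weight should pass there -->
  <link rel="canonical" href="http://www.abcwidgets.com/copied-article.html"/>
</head>
<body>
  <!-- The copied article goes here, with a visible credit to its source -->
</body>
</html>
```

Note that the tag belongs in the head of the document; a canonical tag placed in the body will be ignored by the engines.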
The Downside: All link weight will pass to the original source of the article. Looking at how PageRank passes between pages: if you have 10 internal links on a page and one points to a canonicalized page, you will retain only 90 percent of the weight (an over-simplification, but it will do as an example). That said, if the content is useful to your visitors, then the increase in time on site and visitor loyalty will exceed any decrease in PageRank.
Duplicated Product Information
The Situation: You run an ecommerce site selling blue widgets from a variety of manufacturers. These manufacturers provide you with product information (titles, descriptions, specs, and images) to post on your site.
The Issue: The manufacturers are also providing the exact same information to everyone else who's selling their products.
The Fix: While the specs will remain the same, and that duplication is acceptable across multiple sites, you need to set your site apart elsewhere. This will generally involve writing new product descriptions, taking new photos, and hopefully adding content unique to your site such as reviews.
The Downside: The only real downside here is time. It takes a lot of time to write custom product descriptions, but if it's not worth the time to write them, one has to wonder if it's worth having the product on your site at all (i.e., if the ROI is so low at that level, is the product really going to be profitable?)
To hear it straight from the horse's mouth, Google's Matt Cutts covers exactly this topic in one of his webmaster videos.
Sorting and Multi-Page Product Lists
The Situation: You run an ecommerce site, and that site has sorting options that generate unique URLs or has multiple pages of the same core products. An example of this would be eBay, which has a large number of product pages in most categories, where the order (or the products in the list) changes depending on how the list is sorted or which page of the category you're on.
The Issue: If you have a page with 20 items and a different URL is generated when those items are sorted by price as opposed to alphabetically (for example) then you essentially end up with 2 pages with the same content at different URLs.
The Fix: Once again the solution is the canonical tag. On each page that is a sub-page or sorted variant of the initial category URL, you would add a canonical tag pointing to the initial category URL. This will ensure that the variants aren't picked up as duplicate content and further ensure that the link weight is all passed in the right direction.
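As a sketch (the domain and URL parameters are hypothetical), every sorted or paginated variant carries the same tag pointing back to the base category page:

```html
<!-- Placed in the head of http://www.xyzwidgets.com/blue-widgets?sort=price,
     http://www.xyzwidgets.com/blue-widgets?page=2, and every other variant,
     all pointing back to the one base category URL -->
<link rel="canonical" href="http://www.xyzwidgets.com/blue-widgets"/>
```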
The Downside: For once there is no downside. On top of addressing the issue of duplicate content, this strategy will also ensure that any weight passed to the sub-pages in the sorting options (either internally or from external links) will get passed back to the core category page, resulting in a stronger landing page.
WWW vs. Non-WWW & Duplicate Homepages
The Situation: Your site can be found at both the www and non-www URLs (www.abcbluewidgets.com and abcbluewidgets.com) and/or your homepage can be found at both the root level (www.abcbluewidgets.com) and a file-based URL (www.abcbluewidgets.com/index.html).
The Issue: While the engines are generally good at figuring this issue out, it's not a good idea to rely on "generally". This can create a duplicate content issue and also cause links to the "wrong" URL to not get credited to your site correctly.
The Fix: While a canonical tag will fix this, the better route is a 301, permanently redirecting to the proper location. This will ensure that all requests for a resource resolve to the same location.
Different servers have different methods for accomplishing this, and there are a variety of things you may wish to accomplish (redirecting index.html to the root of all folders vs. simply the homepage, for example). Most of the necessary codes can be found at http://www.seobook.com/archives/001714.shtml (hat tip to Tony Spencer on that one).
These codes are for Apache servers. If your site is hosted on a Windows server, I recommend chatting with your system admin, as it'll require access to IIS for some of the more advanced functions.
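To give a feel for the Apache approach, here is one common sketch of an .htaccess file at the site root. The rewrite rules and domain are illustrative (and assume mod_rewrite is enabled), not the exact codes from the link above:

```apache
RewriteEngine On

# Force the www version of the domain (301 = permanent redirect)
RewriteCond %{HTTP_HOST} ^abcbluewidgets\.com$ [NC]
RewriteRule ^(.*)$ http://www.abcbluewidgets.com/$1 [R=301,L]

# Redirect any request for index.html to the root of its folder
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html
RewriteRule ^(([^/]*/)*)index\.html$ http://www.abcbluewidgets.com/$1 [R=301,L]
```

The second rule matches against THE_REQUEST (the raw request line) rather than the rewritten URL, which avoids the redirect loop that would otherwise occur when Apache internally serves index.html as the directory index.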
The Downside: There is a small amount of link juice that dissipates through a 301 redirect effectively reducing the weight being passed to the target page. For that reason, even with all the right 301's and canonical tags in place, it's important to ensure that all internal links or links you create point to the correct desired URL whenever possible.
As long as you're aware of the issues, duplicate content is nothing to fear. It happens, and Google knows it happens (even accidentally).
Ensuring that you do everything you can to take ownership of how weight and authority pass through that duplication ensures that you don't "get spanked" by appearing to be unethically attempting to manipulate the results. Further, addressing duplicate content will help ensure that weight passes efficiently through your site with priority given to the correct pages.
While many of the fixes can take time, anything worth doing right generally does. Fixing duplicate content issues can generate some of the biggest ROI, from an hours-in, visitors-out perspective, of almost any SEO activity.