WordPress Caught Spamming After Enlisting To Fight Spam

Back in January, blogging software provider WordPress was one of several vendors that signed on to support the new nofollow attribute designed to stem blog and search spam. That's why it was so ironic when it emerged yesterday that WordPress has been spamming search engines itself.

There's quite a debate that has since emerged over whether WordPress was really spamming and, if so, whether it should be deemed OK because the aim was to help support the open source blogging platform that many bloggers use.

Was It Spam? Yes It Was!

Let's clear up the spamming question right away. This was spam of the search variety. As I've written before, the search engines themselves are the ultimate arbiters of what's search spam. Google has declared the pages so here in comments from GoogleGuy (and yes, Google confirms to me it was the real GoogleGuy):

There definitely appear to be hidden links on the root page of wordpress.org using CSS, e.g. "text-indent: -9000px; overflow: hidden". That's clearly against our quality guidelines at http://www.google.com/webmasters/guidelines.html#quality

What's more, it looks like the company responsible for doing this (hotnacho.com) is also responsible for creating duplicate content in the form of posting the articles in multiple places, as you can see with this url: http://tinyurl.com/3omjj (these duplicate pages probably won't last long).

Yahoo says the same in the Wordpress Article Spam Being Removed post from Tim Mayer, Director of Product Management for Yahoo Search:

We are in the process of removing the WordPress article spam.

What Did They Do?

Wordpress Website's Search Engine Spam from Andrew Baio at Waxy.org broke the news yesterday of how he discovered nearly 200,000 pages of low quality content designed to attract people from search engines and hopefully get them to click on Google AdSense ads, generating revenue for the site. A screenshot helps explain the situation more:

This is the top of one of the pages in question. You can try to view it yourself here, but there's a good chance it will be removed shortly. Most of the other pages have been removed from the site.

I've added all the colored boxes. The big red one at the top highlights the AdSense ad on the page. That's the goal -- get someone to come to this page from a search engine, then hope they'll click on one of those ads (or the four that were at the bottom of the page). Do that, and the site earns money.

On first glance, the content doesn't sound bad. It does give you basic information about mesothelioma. But it's like junk food content, not really saying anything of real substance that fills you up.

Fingerprints Of Spamming

More important, the act of hosting all these pages shows all the fingerprints of content designed primarily to attract search engines, rather than to please humans. Note the relatively high repetition of the word "mesothelioma," a sign that the page is trying to do well for this term. Notice how the word "mesothelioma" is always a link, as I've illustrated with the blue boxes. That's an attempt to help search engines believe the pages being linked to are about that word.
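As a rough illustration of the "high repetition" fingerprint, here is a minimal sketch that measures how much of a page's text is taken up by one term. The function name and the word-splitting rule are my own assumptions for illustration; search engines don't publish how they weigh repetition, and this is not how any of them actually scores a page.

```python
import re


def keyword_density(text, keyword):
    """Return the share of words in `text` that equal `keyword` (case-insensitive)."""
    # Split into lowercase word tokens; punctuation is discarded.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


sample = "Mesothelioma facts. Coping with mesothelioma is hard."
print(keyword_density(sample, "mesothelioma"))
```

A high number by itself proves nothing, since a genuine medical article would also repeat the term. It is only one fingerprint among the several described here.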

Most important is the fact that this page has no relevance to this site at all. The WordPress home page gave human visitors no idea that hidden within its bowels was a resource area about mesothelioma. Instead, the site seems to be all about the WordPress software itself. This content was not being openly promoted to visitors. That's because it was instead hoped that it would be found only by search engines themselves.

Hidden Links

How did the search engines get to find the content? Down at the bottom of the WordPress home page were (and still are at the time of this writing) these hidden links:

Sponsored Articles on Credit, Health, Insurance, Home Business, Home Buying and Web Hosting

They were hidden through the use of a style attribute that kept them from being seen by anyone using a fairly modern browser. But a search engine sees things generally like old-style browsers, which means the links were visible. You can see an example of how this was so by making use of the Lynx Viewer here to imitate how a search engine crawler might have viewed the page.
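To make the mechanics concrete, here is a minimal sketch of how one might scan a page for links tucked inside an element that inline CSS pushes off-screen, using the "text-indent: -9000px" pattern GoogleGuy quoted. The class and function names are hypothetical, the sample markup is invented rather than copied from wordpress.org, and the sketch assumes well-formed HTML with inline styles only. It is not how any search engine detects hidden links.

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Matches a negative text-indent such as "text-indent: -9000px".
NEGATIVE_INDENT = re.compile(r"text-indent\s*:\s*-\d+", re.IGNORECASE)


class HiddenLinkFinder(HTMLParser):
    """Collect hrefs of links nested inside elements with a negative text-indent."""

    def __init__(self):
        super().__init__()
        self.stack = []          # one boolean per open element: does it hide its contents?
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        hides = bool(NEGATIVE_INDENT.search(attrs.get("style") or ""))
        inside_hidden = hides or any(self.stack)
        self.stack.append(hides)
        if tag == "a" and inside_hidden and attrs.get("href"):
            self.hidden_links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def find_hidden_links(html):
    """Return the hrefs of all links found inside off-screen containers."""
    finder = HiddenLinkFinder()
    finder.feed(html)
    return finder.hidden_links


page = (
    '<div style="text-indent: -9000px; overflow: hidden">'
    '<p>Sponsored <a href="/articles/">Articles</a></p></div>'
    '<p><a href="/about/">About</a></p>'
)
print(find_hidden_links(page))
```

A crawler that ignores CSS reads both links the same way, which is why the off-screen one was followed and the pages behind it indexed.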

As the links weren't hidden to search engines, they found the special "articles" area of the WordPress site -- http://wordpress.org/articles/ -- and indexed the content inside there, thousands and thousands of pages.

Targeting High Priced Terms

If all these fingerprints weren't enough to tell you that the site was involved in trying to grab search traffic, you need only look at the topic being targeted. Advertisers regularly pay extremely high per click fees to rank well for "mesothelioma," because attorneys hope lawsuits involving this cancer will bring high settlements. The top spot for that word is currently going for $52.08 per click on Overture right now.

Indeed, as I've written before, the high earnings that ads for that term can bring is one reason another blog site recently started up, specifically to generate content that's hoped will earn money off mesothelioma ads. The author of that site was upfront about his motivation, and the content is certainly better than the junk food search fodder hosted on WordPress. But nevertheless, as I wrote, the quest for AdSense money in that case created new content we might really not need and which possibly might push out better content from top search listings.

Did It Work?

So back on WordPress, the content in question was spam. We don't actually know whether it was successfully bringing in search traffic or not, much less AdSense revenue. No one I've seen has posted any top ranking examples for these pages -- and now that both Google and Yahoo have removed the pages, it's even harder to check. I did a few queries last night on things like "mesothelioma" and "coping with mesothelioma" and didn't spot them ranking well. Nevertheless, with nearly 200,000 entries in the search engine lottery, they probably pulled some traffic related to that term or for a myriad of other topics that were targeted, such as "web hosting" or "diabetes."

The person who leads WordPress, Matthew Mullenweg, turns out to be traveling at the moment so hasn't been able to respond to the current debate. We do have his response from when questions about the content were first raised on a WordPress support forum thread back in mid-February, however:

The content in /articles is essentially advertising by a third party that we host for a flat fee. I'm not sure if we're going to continue it much longer, but we're committed to this month at least, it was basically an experiment. However around the beginning of February donations were going down as expenses were ramping up, so it seemed like a good way to cover everything. The adsense on those pages is not ours and I have no idea what they get on it, we just get a flat fee. The money is used just like donations but more specifically it's been going to the business/trademark expenses so it's not entirely out of my pocket anymore.

An Innocent Mistake? Hard To Believe

Some have argued the statement above suggests Mullenweg didn't realize this content would be seen as spamming the search engines, nor apparently that hiding links would be a no-no, either. Perhaps, but you'd think he would have had some inkling they might not like this. He'd already signed on to the nofollow comment spam fighting initiative. You'd expect he'd make some connection that doing funny things with links might be seen as bad by search engines.

In addition, last month WordPress was part of a web spam summit that was held, also described here. Since that summit covered the problem of "fake weblogs" or "spam blogs" designed to capture search engine traffic just to make money, you'd think some similarity between those and these pages would have rung a bell. True, these pages weren't blog posts. Still, they had many of the same basic goals behind having fake blogs.

What Was The Punishment?

However the content got there, innocent mistake or not, two major search engines have deemed the content spam and removed it from their indexes. That doesn't mean the WordPress site has gone, however. All that appears to have been specifically removed are the spam pages.

The WordPress home page does appear to have been penalized at Google, probably as a result of the hidden links it had and still has. The home page no longer shows a score in the Google Toolbar PageRank meter, whereas yesterday it scored an 8 out of 10. That's almost certainly a penalty that's been applied. But other pages in the site still have high scores, such as the About page, so this isn't a site-wide penalty.

Also yesterday, a search for blog software I did brought the home page up in the top 10 results on Google. Today, it's not in the top results. That's another sign that a penalty has been applied to that page. In fact, a search for WordPress itself doesn't bring up the home page on Google (it does on Yahoo still, and it was first on Google last night).

That's something that won't last. It hurts Google's relevancy for people not to get the WordPress.org home page when they do a search for the company (WordPress.com, which is now first, appears to be run by someone other than WordPress). After a short period of time, WordPress's home page will undoubtedly find its ban lifted. After all, do a search for WhenU, and you get that company's home page tops in Google despite it having been banned for cloaking last year. After 42 days, it was back in.

WordPress Users Need Not Fear

I've seen some comments worrying that because the WordPress home page has been penalized, anyone using WordPress might be banned on Google or Yahoo. That's not a concern, I'd say. This isn't an issue as with the SearchKing case where people using WordPress might be seen as part of a network of sites to be penalized.

On the flipside, plenty of people running WordPress now have links from their blogs to the site. Is WordPress now a "bad neighborhood," something search engines say not to link to lest you be penalized? Possibly, but I doubt it.

If you want to be absolutely safe, then ironically make use of the nofollow attribute. It never was going to be a complete solution to comment spam, nor has it been. But as I wrote before, it is a perfect way to link to other sites without worry that you'll be penalized by doing so with search engines. More about this in my past article, More On Link Condom & Blogger Worries Over Nofollow.
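For anyone unsure what that looks like in practice, here is a minimal sketch of a link carrying the attribute. The helper name is hypothetical. The markup it produces, an anchor with rel="nofollow", is the form the attribute takes.

```python
from html import escape


def nofollow_link(url, text):
    """Build an HTML anchor marked rel="nofollow" so the link isn't treated as a vote."""
    # Escape both values so stray quotes or angle brackets can't break the markup.
    return '<a href="{}" rel="nofollow">{}</a>'.format(
        escape(url, quote=True), escape(text)
    )


print(nofollow_link("http://wordpress.org/", "WordPress"))
```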

Spamming, But For All The Right Reasons?

The links to WordPress are also fueling a debate over whether those who have done so to show their support have now been duped. I'll leave that for those in the WordPress community to argue. I've used WordPress, liked it, have recommended it before and still recommend it despite what's happened. It's good software. But that doesn't entitle it to some of the excuses I've seen some make on its behalf, to justify the spamming.

Just because WordPress is an open source project, asks for donations and needs more support doesn't entitle it to free rein to spam search engines, "experimentally" or not. If it wants to spam, then it pays the same price anyone else pays if they want to be aggressive with search engine optimization and get caught breaking rules.

Given this, seeing a comment like this really annoyed me:

Hot Nacho is a company that supports open-source software, specifically WordPress. All the web geeks need to remember that there are worse companies out there than those that try to "screw with Google" for PageRank, etc. It's fun to say "spammers are scum" and I certainly don't like them, but get some perspective, there is worse evil in this world.

All that said, I don't have a big problem with what Matt did, he said it wasn't something he wanted to do long term, but if it could help bootstrap the community it would be nice.

Search engine guidelines against spam don't say something like, "Don't spam us, unless you're just trying to make a start and help other people, then it's OK." They don't say, "Spam us for a little bit, then you can stop when you've earned enough." They say don't spam, period. If you don't want to follow those rules, fine. That's a risk you can take, and others do as well. But don't expect to be let off for free, if you're caught.

Here's another comment I disliked:

I don't begrudge someone earning money from something they have put a great deal of effort and time into. Particularly when it seems to be putting back into the product and to the benefit of the community.

Well, I do begrudge someone earning money if it's screwing up the quality of my search results. Fair to say, the searching community (anyone who searches for information) is a little broader than the WordPress community.

Misleading Spam, An Important Tangent

I've written about search spam many times and generally try to cover various viewpoints and illustrate how tricky defining what "spam" is can be. But as said, my view is the search engines are the ultimate arbiters of what they consider spam, for banning purposes.

Beyond that, we individually decide what we consider spam. I come across search spam all the time -- which to me is irrelevant content that overtly attempts to get a good ranking. I dislike it immensely when I hit this type of content, because I know exactly what the person has done to be misleading. Here's a recent example.

I wanted the phone number of a chicken place near our home. I typed in king chicken amesbury into Google, then saw this promising "Amesbury Business Directory" page in the top listings. The page wasn't a real directory at all. It came up because it was generically designed to work for a variety of cities and topics. All these cities were named on the page:

Amesbury Box Bradford-on-Avon Calne Chippenham Corsham Cricklade Devizes Downton Durrington Hawthorn Highworth Malmesbury Market Lavington Marlborough Melksham Mere Pewsey Ramsbury Salisbury Sherston Swindon Tisbury Trowbridge Warminster Westbury Wootton Bassett

They worked in combination with what were called keywords related to Amesbury:

1 litre 1 ton 100% funding 16 18 22mm 16 bit 1-6 people 16bit 1880 clu 2 litres 2 post 2 ton 2.4ghz 200 kilos 24 hour 24 hr 3 day 35mm aps 3d 3g models 4 post 5 to 1 5:1 sf 500 kilos 50s 5-1 sf 60s 68 briefs 6ixty 8ight 6mm to 25mm 7 8 9.5 mm 7 day 7 day opening 70s 76 cm 7650 games 91cm 99cm a la carte ab1 ab2 ab3 abrasive abs pp academy acce access control accessories accident accidents accommodation account accountancy accountant accountants accounting accounts accurate acerbis acrylic acrylics acryllic acton adams adapters additional address adhesives admin administration administrative adsl advance adventure adverse adverse debt advice advice advisor adviser advisers

I'm not printing the entire list. It goes on and on, and the few lines above make the point. This wasn't a relevant page. This did nothing to satisfy my query. This was simply created in hopes of getting me to the page and clicking on some of the ads. It wasted space in Google, and it wasted my time.

So let's bring it back to WordPress. The content it had existed solely to make money, not really to inform or help. It took away space and resources for good content I'd rather see. At least sophisticated spammers would have ensured that if they got a top ranking, I would have been delivered to something with far more useful content. That's just a prerequisite to ensure people don't end up reporting your content as spam.

Yeah, Your Search Spam Did Contaminate!

Another comment that caught my eye was this:

Spamming is unsolicited. All of these posts are on a sanctioned area of WordPress and don't exist anywhere else. It'd be different if these posts were dropped into blogs and wikis all over the place but they aren't. Linking them in off-screen content is a little bit of trickery but there isn't any leeching there.

It's similar to what Jonas Luster of WordPress argued here:

Let's get the first response over with - please, please, please stop calling it "Spamming". Regardless of how you stand towards the deeper issue at hand, diluting a word by mixing pretty much everything into the basket of spamming is not a good idea. Yes, the postings were made to improve the Google rank of someone else, yes there was a financial transaction involved, and yes, the postings were not topical to the wider sense of the site, but it's not spam. Spam involves other, involuntary, carriers. No comment boxes were contaminated, no mailboxes, no Usenet forums, and certainly no one spent a single byte of extra bandwidth (with the exception of the links from Wordpress.Org) on it. It's not spam.

Honestly, statements like that are simply frightening. Spam isn't only something that happens if you drop comments or trackbacks on blogs. Neither is it some new term we've suddenly co-opted for SEO. I've personally been using it since I started writing about search engines in 1996.

Push misleading or irrelevant content into a search engine overtly just to get traffic, and I call that spam. Break the rules a search engine sets out, and they call that spamming. The search industry has been using the terms "spam" and "spamming" for nearly a decade. Heck, even legal cases have cited spam in relation to search engines. Trying to redefine the term as it applies to search to put a better spin on the situation at WordPress isn't going to help things.

But it didn't really hurt anyone! That's sort of the tone of this unreasonable justification:

Matt could have put out announcements asking for donations. He could have plastered flashing advertisements all over the WordPress sites. He could have used every available opportunity to "pass the cup". Instead he chose an avenue which was out-of-sight. And instead of perceiving this as "polite", people have chosen to view it as "sneaky". "Et tu, Brutè?"

I see. It was polite to get nearly 200,000 low content pages into the search engines, where they consumed crawler time in being found and regularly revisited, time that might have been spent on other pages. It was polite that people hitting these unsolicited pages via search engines wasted time having to go back and seek again the solid information they really wanted. Thanks for that. Next time, just put the ads up on some real content. Or yes, do tell people you need money.

In the end, the big deal really isn't that WordPress was caught spamming. People get caught for spamming all the time. But we have never, ever had a situation I can recall where someone was caught spamming at the same time they were supposedly working with the search engines to prevent spam!

The creation and rallying of industry support around the nofollow attribute was unprecedented. We never before had any unified effort among search engines in that way to fight spam, much less having other parties like WordPress cooperate with that. Yes, nofollow was designed to combat link spam. What WordPress did was content spam, a different tactic. But the aim of both tactics is the same -- get more traffic from search engines by trying to aggressively manipulate them. WordPress ultimately did the very thing it was supposedly fighting against. That was a very big deal indeed.

For more on this, check out Wordpress gettin' Slammed for Spamming? at Threadwatch, for a lot of good comments especially from aggressive SEO types. Over at InsideGoogle, WordPress Caught Spamming has some nice links at the end to a collection of comments from various blogs on the situation. Want to comment or discuss what I've written or the situation in general? Then join the forum thread I've started, WordPress In Spamming Uproar.

Postscript: As mentioned, Mullenweg is currently on vacation and generally without internet access. He's posted a brief note here on having just now seen the concerns over the spamming, and further posts will likely eventually follow on the home page of his blog.

Postscript 2: In mid-April, I heard from Mullenweg on some questions I sent across to him. He responded that he hadn't realized what was presented to him as "advertising" was a form of "web spam," saying:

My mindset in terms of spam is very focused on the type I deal with and fight on a daily basis, I did not think of things in terms of what search engines such as Google deal with because I've never been in that position. I'm not going to argue semantics, but that sort of artificial content, hosted or otherwise, is not something I would ever participate in again.

Hidden links were also an issue. Mullenweg said:

  • They were wrong and shouldn't have been done.

  • He added the hidden links himself.

I never got an answer to the final follow-up question of why the links were hidden in the first place.