Life After Google Penguin – Going Beyond the Name

In looking back at my recent posts here it seems, though not by design, there was a theme emerging. Have a look...

And that was all pre-Penguin no less. Seems my Spidey-sense was tingling. The world of search engine optimization just keeps getting more convoluted. Now more than ever, very little is clear.

To date I have not touched upon the Penguin update because, well, we just didn't know. There wasn't enough data to say much. Of course that really hasn't changed, but there are a few things we can certainly look at to help better understand the situation at hand.

But let's give it a go anyway shall we?

Penguins at the Googleplex

A Name is Just a Name

The first thing we need to consider is that there are numerous Google algorithm updates, some of which aren't named. In the weeks before the infamous Penguin rolled out, there was a Panda hit and another link update. The three of them, falling within a five-week period, make a lot of the analysis problematic.

And that's the point worth mentioning. Don't try too hard to look for dates and names. Look more to the effects.

We're here to watch the evolution of the algos and adapt accordingly. Named or not, doesn't matter. Sure, it can be great for diagnosing a hit, but beyond that, it means little.

Regardless of the myriad posts on the various named updates, none of us really knows what is going on. That's where the instinct part of the job comes in. Again, knowing the evolution of search goes a long way.

What is Web Spam?

To understand how web spam is defined, you need to look at how search engineers view SEO. While there are many definitions, I like this one:

“any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page's true value.” (from Web Spam Taxonomy, Stanford)


“Most SEOs claim that spamming is only increasing relevance for queries not related to the topic(s) of the page. At the same time, many SEOs endorse and practice techniques that have an impact on importance scores to achieve what they call "ethical" web page positioning or optimization. Please note that according to our definition, all types of actions intended to boost ranking, without improving the true value of a page, are considered spamming.” (emphasis mine)

Well la-dee-da huh? We can infer that Google has eased that stance by trying to define white hat and black hat, but at the end of the day any and all manipulation is seen in a less than favorable light.

The next part of your journey is to establish in your mind what types of activities are commonly seen as web spam. Here are a few:

  • Link manipulation: Paid links, hidden, excessive reciprocal, shady links etc.
  • Cloaking: Serving different content to users and Google.
  • Malware: Serving nastiness from your site.
  • Content: Spam/keyword stuffing, hidden text, duplication/scraping.
  • Sneaky JavaScript redirects.
  • Bad neighborhoods: Links, server, TLD.
  • Doorway pages.
  • Automated queries to Google: Tools on your site, probably a bad idea.

That's about the core of the main offenders. To date with the Penguin update, people have been mostly talking about links. Imagine that... SEOs obsessed with links!

However, we should go a bit deeper and surely consider the other on-site aspects. If not on your site, then on the site links are coming from.

On-site Web Spam

Hopefully most people reading this, those with experience in web development and SEO (or running websites), don't use borderline tactics with their sites. We do know there are certainly on-site elements to both the Penguin and Panda updates... so it's worth looking at.

Here are some common areas search engines look at for on-site web spam:

  • Domain: Some testing has shown that .info and .biz domains are far more spam laden than more traditional TLDs.
  • Words per page: Interestingly it seems spam pages have more text than non-spam pages (although over 1,500 words, the curve receded). Studies have shown the spam sweet spot to be in the 750-1,500 word region.
  • Keywords in title: This was mentioned in more than a few papers and should be high on the audit list. Avoid stuffing; be concise.
  • Text to anchor text: In other studies engineers looked at the ratio of text to anchor text on a page.
  • Percentage of visible text: This involves hidden text and nasty ALT text. What percentage of the text is actually being rendered on the page?
  • Compressibility: As a mechanism used to fight keyword stuffing, search engines can also look at compression ratios. More specifically, they can look for repetitious or spun content, which compresses unusually well.
  • Globally popular words: Another good way to find keyword stuffing is to compare the words on the page to existing query data and known documents. Essentially, if someone is keyword stuffing around given terms, those terms will show up in a less natural pattern than they do in user queries and known good pages.
  • Query spam: By looking at the pattern of the queries, in combination with other signals, behavioral data manipulation would become statistically apparent.
  • Phrase-based: looking for textual anomalies in the form of related phrases. This is like keyword stuffing on steroids. Looking for statistical anomalies can often highlight spammy documents.
(some snippets taken from my post "Web Spam; the Definitive Guide")
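
To make the compressibility signal from the list above a bit more concrete, here is a minimal Python sketch. It is only an illustration, not how any search engine actually does it: the `compression_ratio` helper and both sample texts are made up for this example, and nobody outside the engines knows what cutoff, if any, is used.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return uncompressed size / compressed size (higher = more repetitive)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Hypothetical sample pages, invented for illustration only.
stuffed = "cheap blue widgets buy cheap blue widgets online " * 40
natural = (
    "Our shop has sold hand-made widgets since 1998. Each one is cut, "
    "sanded and painted in a small workshop, then checked twice before "
    "it ships. Questions about sizes or colours? Send us a note."
)

print(f"stuffed: {compression_ratio(stuffed):.1f}")
print(f"natural: {compression_ratio(natural):.1f}")
```

The stuffed text compresses far better than the natural one because it repeats itself. Run something like this across your own pages and look for outliers rather than chasing a magic number.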

And yes, there's actually more. The main thing to take from this is that there are often many ways that the search engines look at on-site spam, not just the obvious ones. Once more, this is about your site and the sites linking to you.

A lot of the on-site web spam that poses a true risk will come from hacking. Sure, your CMS might be spitting out some craziness, or your WordPress plug-in created a zillion internal links, but those are the exceptions. If you're using on-site spam tactics, I am sure you know it. Few people actually use on-site crap post-Panda; many times it's the site being hacked that causes issues. So be vigilant.

Link Spam

Is the Penguin update all about links? I'd go against the grain and say no. Not only do we have to consider some of the above elements, but also there seems to be an element of 'trust' and authority at play here as well. If anything, we may be seeing a shift away from the traditional PageRank model of scoring, which of course many may perceive as a penalty, due to links.

But what is link spam? That answer has been a bit of a moving target over the years, but here are some common elements:

  • Link stuffing: Creating a ton of low-value pages and pointing all the links (even on-site) to the target page. Spam sites tend to have a higher ratio of these types of unnatural appearances.
  • Nepotistic links: Everything from paid links to traded (reciprocal) ones and three-way links.
  • Topological spamming (link farms): Search engines will look at the percentage of links in the graph compared to known "good" sites. Typically those looking to manipulate the engines will have a higher percentage of links from these locales.
  • Temporal anomalies: Another area where spam sites generally stand out from other pages in the corpus is the historical data. There will be an average rate of link acquisition and decay among "normal" sites in the index. Temporal data can be used to help detect spammy sites participating in unnatural link building habits.
  • TrustRank: This method has more than a few names, TrustRank being the Yahoo flavor. The concept revolves around having "good neighbors". Research shows that good sites link to good ones and vice versa.

(some snippets taken from my post "Web Spam; the Definitive Guide")
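
The temporal anomalies idea from the list above is easy to play with on your own data. Below is a rough Python sketch, assuming you have a count of newly discovered links per month from whatever link tool you use. The `flag_link_spikes` function, the `factor` of 5 and the sample numbers are all my own inventions for illustration; Google's actual approach is not public.

```python
from statistics import median


def flag_link_spikes(monthly_new_links: list[int], factor: float = 5.0) -> list[int]:
    """Return indexes of months whose new-link count exceeds factor x the median."""
    if not monthly_new_links:
        return []
    # Use at least 1 as the baseline so a mostly-zero history
    # still needs more than `factor` links in a month to be flagged.
    baseline = max(median(monthly_new_links), 1)
    return [i for i, n in enumerate(monthly_new_links) if n > factor * baseline]


# Hypothetical history: steady growth with one sudden burst.
history = [12, 15, 9, 14, 11, 240, 13, 10]
print(flag_link_spikes(history))
```

The median is used rather than the mean so that a single huge month doesn't drag the baseline up and hide itself. A flagged month isn't proof of anything; it's just a date worth digging into.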

I could spend hours on each of these, but you get the idea. With many people theorizing about networks, anchor texts, etc... the larger picture often evades us. There are so many ways that Google might be dealing with 'over optimization' that we're not talking about.

Over the last 18 months or so we have seen a lot of changes, including the spate of unnatural-linking messages that went out. Again, Penguin or not doesn't matter. What matters is that Google is certainly looking harder at link spam, so you should be too.

It wouldn't hurt to keep a tinfoil hat handy as well… Look no further than this Microsoft patent that talks about spying on SEO forums. Between that and the fact that SEOs write about their tactics far and wide, it's not exactly hard for search engineers to see what we're up to.

Google Groups Therapy

How Are We Adapting in a Post-Penguin World?

What's it all mean? Well I haven't a bloody clue. Anyone who says they've got it sorted, likely needs to take their head out of a certain orifice.

What you should do is become more knowledgeable in how search engines work and the history of Google. Operate from intelligence, not ignorance.

Have you considered the elements outlined in this post when analyzing data and trying to figure out what's going on? I know I didn't. It was researching this post that reminded me of the myriad of various spam signals Google might look at.

Here's some of my thinking so far:

  • It really is a non-optimized world: Don't try too hard for that perfect title. Avoid obsessing over on-page ratios. You don't need that exact match anchor all the time, in fact you don't even need a link (think named entities). In many ways, less-is-more is the call of the day.
  • Keep a history: Be sure to always track everything. And when doing link profile or other types of forensic audits, compare fresh and historic data (such as in Majestic).
  • Watch on-site links: From internal link ratios to anchors and outbound links, they all matter. From spam signals to trust scoring, they can potentially affect your site.
  • Faddish: Another interesting thought, though how much it plays into things we know not, is that Google might take issue with the tactic du jour.
  • Watch your profile: In the new age of SEO it likely pays to be tracking your link profiles. If something malicious pops up, deal with it and make notes of dates and contact attempts.
  • On site: Hammer it and make it squeaky clean. The harder links get, the more one needs to watch the on-site. Schedule audits more frequently to watch for issues.
  • Topical-relevance: When looking at links think about topical-relevance. Are the links coming from sites/pages that are overly diverse (and have weak authority)?
  • Link ratios: Watch for a low spread in anchor texts as well as the ratio of total links to referring domains (the lower the better; it generally means fewer site-wide links).
  • Cleaning up: When possible look at link profiles and clean up suspect links. And I wouldn't wait until you get an unnatural linking message or tanked rankings.
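
For the link ratio checks in the list above, here is a small Python sketch of the sort of thing I mean. It assumes you can export your backlinks as (source URL, anchor text) pairs from your tool of choice; the `link_profile_summary` function, the field names and the sample URLs are hypothetical. It also treats each hostname as its own "domain", which is a simplification.

```python
from collections import Counter
from urllib.parse import urlparse


def link_profile_summary(backlinks: list[tuple[str, str]]) -> dict:
    """Summarise (source_url, anchor_text) pairs into a few rough ratios."""
    total = len(backlinks)
    # Hostnames stand in for referring domains (a simplification).
    domains = {urlparse(url).netloc.lower() for url, _ in backlinks}
    anchors = Counter(anchor.strip().lower() for _, anchor in backlinks)
    top_anchor, top_count = anchors.most_common(1)[0] if anchors else ("", 0)
    return {
        "total_links": total,
        "referring_domains": len(domains),
        "links_per_domain": total / len(domains) if domains else 0.0,
        "top_anchor": top_anchor,
        "top_anchor_share": top_count / total if total else 0.0,
    }


# Hypothetical export, invented for illustration only.
sample = [
    ("http://blog.example.com/post-1", "cheap widgets"),
    ("http://blog.example.com/post-2", "cheap widgets"),
    ("http://blog.example.com/post-3", "cheap widgets"),
    ("http://news.example.org/story", "Acme Widgets"),
]
print(link_profile_summary(sample))
```

A high `links_per_domain` hints at site-wide links, and a high `top_anchor_share` hints at a narrow anchor spread. What counts as "too high" is anyone's guess, so compare against competitors that are still ranking rather than against a fixed number.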

We've seen a ton of data (this one is interesting) since this all went down and while there are common elements, nothing is conclusive (again, there has been a spate of updates). What is more important is to understand what Google wants and where they're headed. It's just another step in the long road of search evolution, so don't get caught up in the names.

Taking the easy way out rarely works for success in life. SEO is no different.

Understand how a threshold might be used. This thing of ours is like the old story of the two of us in the woods when a hungry bear appears. I don't have to outrun the bear; just you. Ensure your strategy is within a safe threshold and it should work out just fine.

It's About Time

To close out, there is one part of this that keeps nagging: history. If you've been squashed by the recent updates (including Penguin) it may not entirely be about recent activities. There is a sense that Google is indeed keeping a history and that this may be playing into the larger scheme of things.

Some of the most interesting Google patents were the series on historical elements. Be sure to go back and read some of these older posts:

Sure, they're 3-4 years old, but they probably capture some of the more telling parts of the mindset change many in the world of SEO need.

More Reading

Google spam patents:

Link spam papers:

Note: if you have a penalty-related story, be sure to get in touch as I am always interested in hearing about them and helping when I can.