Dissecting Google's Penguin algorithm has been a passion for many SEO pros since first learning about the update. Last year, U.K.-based MathSight used reverse engineering to identify which factors Penguin 2.0 was targeting on a website. More recently, MathSight crunched numbers to pull apart Penguin 2.1, and revealed additional clues about what this particular algorithm was after.
Prior to Penguin 2.1, Andreas Voniatis, managing director of MathSight, said it was important to think beyond the link when it came to Penguin by understanding the root cause.
"Many people forget that an inbound/outbound link profile originates from website's pages," he said. "So by analyzing the on-site SEO, we are effectively finding the stylistic properties of those external linking pages, which provides predictive value for Penguin 2.0."
But that was 2.0, which MathSight said was all about targeting "low readability" levels of content on a website, specifically looking at body text, anchor text, hyperlinks, and meta information. So what about this time around with Penguin 2.1?
MathSight's data showed websites that gained and lost traffic from Penguin 2.1 had links from web pages that contained:
- A higher (good) or lower (bad) proportion of rare words in the body text.
- A higher (good) or lower (bad) number of words per sentence in the body text.
- A higher (good) or lower (bad) number of syllables per word in the body text.
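The article does not say how MathSight defines a "rare" word. As a minimal sketch, assume a Dale-Chall-style definition in which a word counts as rare when it is missing from a list of familiar words. The `FAMILIAR_WORDS` set and the `rare_word_proportion` function below are hypothetical stand-ins for illustration, not MathSight's method; a real analysis would load a full familiar-word list.

```python
import re

# Hypothetical, tiny stand-in for a familiar-word list (assumption for
# illustration only; a real list would hold thousands of entries).
FAMILIAR_WORDS = {"the", "a", "and", "is", "on", "of", "to", "in",
                  "cat", "sat", "mat"}


def rare_word_proportion(text, familiar=FAMILIAR_WORDS):
    """Return the share of words in `text` that are not in `familiar`."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    if not words:
        return 0.0  # no words, so nothing can be rare
    rare = sum(1 for w in words if w not in familiar)
    return rare / len(words)


if __name__ == "__main__":
    print(rare_word_proportion("The cat sat on the mat"))
    print(rare_word_proportion("The cat sat on the ottoman"))
```

Under this assumption, body text with more unfamiliar vocabulary yields a higher proportion, which corresponds to the "higher (good)" side of the first bullet.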
MathSight's data may support theories SEO pros have about linking to poor-quality sites, and that "quality" factor hinges on content, Voniatis said.
"The readability of content from a linking web page is highly influential to how Penguin views the destination site, that is, the site being linked to. Websites should eliminate links from sites that don't meet the readability thresholds Penguin demands," he said.
He added, "Readability is how Penguin cleans up the linking ecosystem on the premise that the more intellectual the text reads, the more authentic the content is likely to be."
So how does 2.1 differ in the nitty-gritty metrics?
"When we compared Penguin 2.1 to 2.0, we found the algorithm had been refined so that the readability metrics were more heavily weighted towards Flesch-Kincaid than Dale-Chall readability," said Voniatis. "So it looks like Google is trying to find the limits of web spam by tweaking its readability formulas."
Voniatis said the formula used to determine readability using the Flesch-Kincaid scale was as follows:
RE = 206.835 – (1.015 x ASL) – (84.6 x ASW)
- RE = Reading Ease
- ASL = Average Sentence Length (the number of words divided by the number of sentences)
- ASW = Average number of syllables per word (the number of syllables divided by the number of words)
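To make the formula concrete, here is a minimal Python sketch that computes ASL, ASW, and RE for a block of text. The function names are illustrative, and the sentence splitter and syllable counter are rough heuristics assumed for this example (dedicated readability tools use more careful rules), so scores may differ slightly from those of online checkers.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, dropping a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)  # every word has at least one syllable


def flesch_reading_ease(text):
    """RE = 206.835 - (1.015 x ASL) - (84.6 x ASW)."""
    # Naive sentence split on ., ! and ? (an assumption of this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    asl = len(words) / len(sentences)                          # words per sentence
    asw = sum(count_syllables(w) for w in words) / len(words)  # syllables per word
    return 206.835 - (1.015 * asl) - (84.6 * asw)


if __name__ == "__main__":
    print(round(flesch_reading_ease("The cat sat on the mat."), 3))
```

Longer sentences and longer words both push the score down, so denser text produces a lower RE.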
"The lower the score, that is, the harder the text is to read, the more beneficial content is for Penguin algorithm updates," said Voniatis. "ANOVA (analysis of variance) statistics showed that the certainty of Flesch-Kincaid causing a change in traffic due to Penguin was 99.999 percent."
The red bars in the graph above indicate those factors in sites examined that triggered Penguin, according to MathSight data. The green bars indicate those factors that websites had which benefitted from Penguin 2.1.
So what does all this mean, and what's an SEO to do with the data?
Voniatis said the statistics "tell us the secret ingredients but not the reason why Google is using readability. I suspect Google finds readability an easy way of discounting links from guest posts written by non-experts."
He added that SEO professionals could manually check the content of every linking web page for Flesch-Kincaid and Dale-Chall readability by using free online tools. But he said MathSight's API does this more efficiently by crawling links on- and off-site, evaluating readability, and returning "a delta to the optimum readability threshold, so SEOs can disavow the links or recondition on-site landing page content." And, he said, "The thresholds are updated with each algorithm update."