How Search Engines Use Machine Learning for Pattern Detection

Date published: 1 December 2011

Author: Peter Van Der Graaf

Search engines use machine learning for pattern detection. While it’s impossible to explain in one short article how machine learning influences our lives, understanding the basics of machine learning can give you some insight into search algorithm updates, such as Google’s Panda update.

Correlation Between Variables

To predict the outcome of future tests, scripts can use supervised learning on past outcomes to define a hypothetical prediction line. The three images below show how plotted examples define averages. These averages are more likely to represent some truth as the training set grows.

machine-learning-formula-straight-line

machine-learning-averages-shift

In this case, a measure of duplicate content is plotted against a score based on negative reviews by manual quality raters. As the training set grows, the prediction becomes more certain.
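As a rough illustration of such a prediction line, the sketch below fits a straight line to a handful of points with ordinary least squares. The data values and the names `duplicate_score` and `negative_reviews` are invented for this example; nothing here reflects how any search engine actually measures these things.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set: duplicate-content score (x) against
# a negative-review score from quality raters (y).
duplicate_score = [1.0, 2.0, 3.0, 4.0, 5.0]
negative_reviews = [1.2, 1.9, 3.2, 3.8, 5.1]

slope, intercept = fit_line(duplicate_score, negative_reviews)

# Use the fitted line to predict the outcome for an unseen site.
predicted = slope * 6.0 + intercept
print(f"slope={slope:.2f} intercept={intercept:.2f} predicted={predicted:.2f}")
```

Each new example added to the training set nudges the slope and intercept, which is the shifting of averages that the images illustrate.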

Anomalies can initially have a huge impact on the hypothesis, but their effect diminishes in areas with a lot of reference material. This also indicates that not every part of the hypothesis line is equally rigid, and the level of certainty can play a separate role in computer learning.
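The diminishing effect of an anomaly can be checked with a small experiment. In the sketch below, which uses made-up data, every regular point lies exactly on the line y = x and a single anomaly is added at (10, 0). The function names are hypothetical; the point is only to compare how far the fitted slope moves when the reference set is small versus large.

```python
def fit_slope(xs, ys):
    """Least squares slope of the line through the given points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_shift(n_points):
    """How far one anomaly pulls the slope away from the true value of 1."""
    # n_points regular examples spread evenly over [0, 10], all on y = x.
    xs = [i * 10 / (n_points - 1) for i in range(n_points)]
    ys = list(xs)
    # One anomaly: a high x value with an unexpectedly low outcome.
    xs.append(10.0)
    ys.append(0.0)
    return abs(1.0 - fit_slope(xs, ys))


shift_small = slope_shift(5)    # anomaly among 5 reference points
shift_large = slope_shift(101)  # same anomaly among 101 reference points
print(f"shift with 5 points:   {shift_small:.3f}")
print(f"shift with 101 points: {shift_large:.3f}")
```

With only five reference points the anomaly drags the slope far from 1, while with 101 points the same anomaly barely moves it.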

The hypothesis in the previous example is based on a formula that produces a straight line. Other formulas with incremental steps or curves could in some cases represent the given data much more closely.

In advanced learning problems there are additional learning scripts that compare various formula types and automatically select which one to use for the closest approximation. Below are two examples of different prediction formulas.

prediction-formula-hypothesis
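A minimal sketch of such automatic formula selection, assuming only two candidate formulas (a straight line and a single step) and using the sum of squared errors as the measure of closeness. All names and data are invented for illustration. A real system would normally compare candidates on held-out examples rather than on the training data itself, since flexible formulas can otherwise win simply by memorizing noise.

```python
def sse(ys, preds):
    """Sum of squared errors between actual and predicted outcomes."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


def fit_line(xs, ys):
    """Least squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_step(xs, ys):
    """Single step: one average left of a threshold, another right of it."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighbouring x values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        err = sse(ys, [left_mean if x <= t else right_mean for x in xs])
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def select_model(xs, ys):
    """Fit every candidate formula and return the name of the closest one."""
    candidates = {"line": fit_line(xs, ys), "step": fit_step(xs, ys)}
    errors = {name: sse(ys, [f(x) for x in xs]) for name, f in candidates.items()}
    return min(errors, key=errors.get), errors


xs = [1, 2, 3, 4, 5, 6, 7, 8]
gradual = [2 * x for x in xs]          # outcome rises steadily
sudden = [1, 1, 1, 1, 5, 5, 5, 5]      # outcome jumps at one point

choice_gradual, _ = select_model(xs, gradual)
choice_sudden, _ = select_model(xs, sudden)
print(choice_gradual, choice_sudden)
```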

Combining Variables

The previous forecast shows a likely correlation between duplicate content and a bad user experience. But is it a causal issue or do sites with a lot of duplicate content also often have old-fashioned web designs, a bad brand reputation, or excessive advertising? And where do you draw the line between good and bad?

The example below shows that combining the forecast graphs for positive and negative outcomes often reveals the optimum middle: the threshold that best separates good from bad. This distinction is required for decision learning with multiple variables.

machine-learning-threshold
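One simple way to locate such a middle point, sketched here under the assumption that a higher score means a worse site, is to try every threshold between neighbouring scores and keep the one that misclassifies the fewest examples. The scores and the function name `best_threshold` are hypothetical.

```python
def best_threshold(good_scores, bad_scores):
    """Find the cut-off that misclassifies the fewest examples.

    Assumption for this sketch: scores above the threshold are
    treated as "bad", scores at or below it as "good".
    """
    values = sorted(set(good_scores) | set(bad_scores))
    best_t, best_errors = None, None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        # Good sites wrongly flagged plus bad sites wrongly passed.
        errors = sum(1 for s in good_scores if s > t)
        errors += sum(1 for s in bad_scores if s <= t)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors


# Hypothetical duplicate-content percentages for rated sites.
good_sites = [10, 20, 30, 50]
bad_sites = [40, 60, 70, 90]

threshold, errors = best_threshold(good_sites, bad_sites)
print(f"threshold={threshold} misclassified={errors}")
```

Because the two groups overlap, no threshold is perfect; the search simply reports the first cut-off with the lowest error count.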

Decision learning is best done with only two or three possible outcomes at each step. A decision tree can look like this. This is nothing like the real Panda tree, so don’t read anything into these fictitious numbers.

machine-learning-decision-tree

To automatically generate such a tree, a script can try multiple variables to find out which shows the most distinct difference between two groups. While certain variables might not be the best way to divide the entire dataset, they might show a great distinction within an already divided subset. At each step the script can choose which variable and which threshold provide the best division.
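The split-selection step can be sketched as follows. This example uses Gini impurity to score how mixed each resulting group is, which is one common choice and an assumption here, not something the article specifies. The variables `duplicate_pct` and `ads_per_page` and all values are invented.

```python
def gini(labels):
    """Gini impurity of a group of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Try every variable and threshold; return (impurity, variable, threshold)."""
    best = None
    n = len(rows)
    for variable in rows[0]:
        values = sorted({row[variable] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [l for row, l in zip(rows, labels) if row[variable] <= t]
            right = [l for row, l in zip(rows, labels) if row[variable] > t]
            # Weighted impurity of the two groups after dividing.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, variable, t)
    return best


# Hypothetical sites; label 1 means quality raters judged the site as bad.
sites = [
    {"duplicate_pct": 10, "ads_per_page": 5},
    {"duplicate_pct": 20, "ads_per_page": 1},
    {"duplicate_pct": 30, "ads_per_page": 4},
    {"duplicate_pct": 60, "ads_per_page": 2},
    {"duplicate_pct": 70, "ads_per_page": 6},
    {"duplicate_pct": 80, "ads_per_page": 3},
]
rated_bad = [0, 0, 0, 1, 1, 1]

impurity, variable, threshold = best_split(sites, rated_bad)
print(variable, threshold, impurity)
```

To grow a full tree, the same function would be called again on each of the two resulting subsets until the groups are pure enough or too small to divide further.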

Forest Models

Google doesn’t only use its human quality raters to create decision trees. It can use numerous indicators like bounce rate, social signals, links, and visitor data. These can be used as a desired or undesired outcome and create their own trees.

Mixing different prediction formulas and division methods can produce many different trees. Each addition to the dataset is also likely to alter each tree slightly. Scripts can even use unsupervised learning, where no desired outcome is taught in advance.

In these automated trees a script keeps dividing the data into groups on all available variables, including quality indicators. Afterwards it is possible to see what all branches with undesirable quality have in common.

With all these trees and possibilities, it isn’t smart to choose just the single best tree. A better approach is to find the recurring patterns across all trees, along with their respective outcome certainty. Combined, these patterns are the best bamboo forests to feed your Panda (forests represent combinations of trees).
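The idea of combining many trees can be sketched with a deliberately tiny forest. Each "tree" below is a single-split stump trained on a random resample of the data (a technique known as bagging), and the forest reports both a majority vote and the share of trees that agree, as a crude stand-in for outcome certainty. This is an illustrative assumption, not a description of Panda; all names and values are hypothetical, and the stump assumes that higher scores mean worse quality.

```python
import random


def train_stump(scores, labels):
    """One-split tree: returns (threshold, label_below, label_above)."""
    if len(set(labels)) == 1 or len(set(scores)) == 1:
        # Nothing to divide: always predict the majority label.
        majority = 1 if sum(labels) * 2 > len(labels) else 0
        return (float("inf"), majority, majority)
    values = sorted(set(scores))
    best_errors, best_t = None, None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        # Count examples where "score above threshold" disagrees with "bad".
        errors = sum(1 for s, l in zip(scores, labels) if (s > t) != (l == 1))
        if best_errors is None or errors < best_errors:
            best_errors, best_t = errors, t
    return (best_t, 0, 1)


def train_forest(scores, labels, n_trees=25, seed=0):
    """Train each stump on a random resample (with replacement) of the data."""
    rng = random.Random(seed)
    n = len(scores)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(train_stump([scores[i] for i in idx],
                                  [labels[i] for i in idx]))
    return forest


def predict(forest, score):
    """Majority vote plus the share of trees voting 'bad'."""
    votes = sum(above if score > t else below for t, below, above in forest)
    share_bad = votes / len(forest)
    return (1 if share_bad > 0.5 else 0), share_bad


# Hypothetical duplicate-content percentages; label 1 means rated bad.
scores = [10, 15, 20, 25, 30, 70, 75, 80, 85, 90]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

forest = train_forest(scores, labels)
label_low, share_low = predict(forest, 20)
label_high, share_high = predict(forest, 80)
print(label_low, share_low, label_high, share_high)
```

An odd number of trees avoids tied votes. Scores near the boundary between the two groups would typically split the vote more evenly, which is where the certainty figure becomes informative.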

The Details of Machine Learning

Machine learning is very geeky stuff, and even this extremely simplified explanation might seem mind-boggling. If you are a mathematician, however, the next step is to follow the free Stanford video lectures on the topic.

The material has been hard to simplify without leaving out certain details. Comment if you see any important inaccuracies.
