How Search Engines Use Machine Learning for Pattern Detection

Search engines use machine learning for pattern detection. While it’s impossible to explain in one short article how machine learning influences our lives, understanding the basics of machine learning can give you some insight into search algorithm updates, such as Google’s Panda update.

Correlation Between Variables

To predict the outcome of future tests, scripts can use supervised learning on past outcomes to define a hypothetical prediction line. The three images below show how plotted examples define averages. These averages are more likely to represent some truth as the training set grows.

In this case, the plot shows the correlation between a score for duplicate content and a score based on negative reviews by manual quality raters. As the learning set grows, the prediction becomes more certain.
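For readers who like to see this in code, here is a minimal sketch of how a straight prediction line could be fitted to past outcomes. It uses ordinary least squares, which is one common choice and not necessarily what any search engine uses. The function name fit_line and all of the numbers are made up for illustration.

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept.

    Assumes at least two points with different x values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical examples: (duplicate content score, negative rater score).
samples = [(10, 22), (20, 18), (30, 35), (40, 41),
           (50, 48), (60, 63), (70, 66), (80, 84)]

# Refit the line as the training set grows and predict a new site with score 90.
for size in (3, 5, 8):
    slope, intercept = fit_line(samples[:size])
    print(size, "examples -> prediction for 90:", round(slope * 90 + intercept, 1))
```

Running the loop shows how the prediction for the same new site shifts as more examples are added to the training set.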

Anomalies can initially have a huge impact on the hypothesis, but their effect diminishes in areas with a lot of reference material. This also indicates that not every part of the hypothesis line is equally rigid, and the level of certainty can play a separate role in computer learning.
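The shrinking effect of an anomaly is easy to check with a simple average. The sketch below is only an illustration with invented numbers: it measures how far a single odd value moves the average when there are two reference points versus fifty.

```python
def shift_from_anomaly(reference, anomaly):
    """Return how far one anomalous value moves the average of the reference values."""
    before = sum(reference) / len(reference)
    after = (sum(reference) + anomaly) / (len(reference) + 1)
    return after - before


# Hypothetical rater scores around 41, first with little and then with a lot of reference material.
sparse = [40, 42]
dense = [40, 42] * 25  # 50 reference points
anomaly = 100

print("shift with 2 reference points:", round(shift_from_anomaly(sparse, anomaly), 2))
print("shift with 50 reference points:", round(shift_from_anomaly(dense, anomaly), 2))
```

The same anomaly pulls the average much further in the sparse area than in the dense one, which is why the certainty of a prediction depends on how much reference material sits nearby.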

The hypothesis in the previous example is based on a formula that produces a straight line. Other formulas with incremental steps or curves could in some cases represent the given data much more closely.

In advanced learning problems there are additional learning scripts that compare various formula types and automatically select which one to use for the closest approximation. Below are two examples of different prediction formulas.
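As a rough sketch of such a selection script, the code below fits two formula types, a straight line and a single step, and keeps whichever has the lowest squared error. The names fit_line, fit_step and select_model, and the data, are assumptions made up for this example. It also scores the formulas on the same data they were fitted to, which real systems usually avoid by testing on held-out examples.

```python
def squared_error(model, points):
    """Sum of squared differences between the model's predictions and the data."""
    return sum((model(x) - y) ** 2 for x, y in points)


def fit_line(points):
    """Least squares straight line; assumes at least two distinct x values."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_step(points):
    """Best single step: one constant below a threshold, another at or above it."""
    xs = sorted({x for x, _ in points})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in points if x < threshold]
        right = [y for x, y in points if x >= threshold]
        low, high = sum(left) / len(left), sum(right) / len(right)
        model = lambda x, t=threshold, a=low, b=high: a if x < t else b
        error = squared_error(model, points)
        if best is None or error < best[0]:
            best = (error, model)
    return best[1]


def select_model(points, fitters):
    """Fit every candidate formula and return the best name plus all errors."""
    errors = {name: squared_error(fit(points), points) for name, fit in fitters.items()}
    return min(errors, key=errors.get), errors


CANDIDATES = {"line": fit_line, "step": fit_step}

# Hypothetical data sets: one that jumps suddenly, one that rises steadily.
stepped = [(1, 10), (2, 11), (3, 9), (4, 10), (5, 30), (6, 31), (7, 29), (8, 30)]
straight = [(1, 2), (2, 4), (3, 6), (4, 8)]

print(select_model(stepped, CANDIDATES)[0])
print(select_model(straight, CANDIDATES)[0])
```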

Combining Variables

The previous forecast shows a likely correlation between duplicate content and a bad user experience. But is it a causal issue or do sites with a lot of duplicate content also often have old-fashioned web designs, a bad brand reputation, or excessive advertising? And where do you draw the line between good and bad?

The example below shows that combining the forecast graphs for positive and negative examples often reveals the best middle point between them. Such a dividing point is required for decision learning with multiple variables.

Decision learning is best done with only two or three possible outcomes at each step. A decision tree can look like this. This is nothing like the real Panda tree, so don’t read anything into these fictitious numbers.

To automatically generate such a tree, a script can try multiple variables to find out which one shows the most distinct difference between two groups. While certain variables might not be the best way to divide the entire dataset, they might show a great distinction within an already divided subset. At each step the script can choose which variable and which threshold provide the best division.
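The tree-building step described above can be sketched in a few lines. This version measures how well a split separates the groups with Gini impurity, a standard measure that is an assumption here, since the article does not say which measure is used. The variable names (duplicate_pct, ads_per_page), the sites and the labels are all invented, and this has nothing to do with the real Panda tree.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels; 0 means the group is pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (impurity, variable, threshold) for the cleanest two-way division."""
    best = None
    for variable in rows[0]:
        values = sorted({row[variable] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[variable] < threshold]
            right = [lab for row, lab in zip(rows, labels) if row[variable] >= threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or weighted < best[0]:
                best = (weighted, variable, threshold)
    return best


def build_tree(rows, labels, depth=0, max_depth=2):
    """Keep splitting until a group is pure or the depth limit is reached."""
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return {"bad_share": sum(labels) / len(labels)}
    _, variable, threshold = split
    below = [(r, l) for r, l in zip(rows, labels) if r[variable] < threshold]
    above = [(r, l) for r, l in zip(rows, labels) if r[variable] >= threshold]
    return {
        "variable": variable,
        "threshold": threshold,
        "below": build_tree([r for r, _ in below], [l for _, l in below], depth + 1, max_depth),
        "above": build_tree([r for r, _ in above], [l for _, l in above], depth + 1, max_depth),
    }


# Hypothetical sites; label 1 means the raters judged the site as poor.
rows = [
    {"duplicate_pct": 5, "ads_per_page": 1},
    {"duplicate_pct": 10, "ads_per_page": 6},
    {"duplicate_pct": 15, "ads_per_page": 2},
    {"duplicate_pct": 60, "ads_per_page": 1},
    {"duplicate_pct": 70, "ads_per_page": 5},
    {"duplicate_pct": 80, "ads_per_page": 2},
]
labels = [0, 1, 0, 1, 1, 1]

tree = build_tree(rows, labels)
print(tree)
```

In this toy data the first split is on duplicate content, and the advertising variable only becomes the best divider inside the low-duplication subset.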

Forest Models

Google doesn’t only use its human quality raters to create decision trees. It can use numerous indicators like bounce rate, social signals, links, and visitor data. Each of these can serve as a desired or undesired outcome and produce its own trees.

Mixing different formulas of prediction and division can produce many trees. Each addition to the dataset is also likely to alter them slightly. Scripts can even use unsupervised learning, in which they are not taught any outcome.

In these automated trees a script keeps dividing into groups on all available variables, including quality indicators. Afterwards, it is possible to see what all branches with undesirable quality have in common.
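A very simplified sketch of that idea: divide sites on every variable without using any taught outcome, then look at which branches ended up with poor quality and what they share. Splitting at the median is a stand-in chosen for brevity, not a claim about how real unsupervised trees divide. Unlike the description above, the quality score is kept aside for the inspection step. The function common_traits and all data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def common_traits(sites, variables, quality_key, poor_below):
    """Split every variable at its median, then report what poor branches share."""
    medians = {name: median(site[name] for site in sites) for name in variables}

    # A branch is the combination of sides (low/high) a site falls on.
    branches = defaultdict(list)
    for site in sites:
        branch = tuple(site[name] >= medians[name] for name in variables)
        branches[branch].append(site[quality_key])

    # Only now look at quality: which branches have a poor average score?
    poor = [branch for branch, scores in branches.items() if mean(scores) < poor_below]

    shared = {}
    for position, name in enumerate(variables):
        sides = {branch[position] for branch in poor}
        if len(sides) == 1:  # every poor branch sits on the same side
            shared[name] = "high" if sides.pop() else "low"
    return shared


# Hypothetical sites with an invented quality score from 1 (poor) to 5 (good).
sites = [
    {"duplicate_pct": 5, "ads_per_page": 1, "quality": 4.5},
    {"duplicate_pct": 10, "ads_per_page": 6, "quality": 2.0},
    {"duplicate_pct": 15, "ads_per_page": 2, "quality": 4.0},
    {"duplicate_pct": 60, "ads_per_page": 1, "quality": 2.5},
    {"duplicate_pct": 70, "ads_per_page": 5, "quality": 1.5},
    {"duplicate_pct": 80, "ads_per_page": 2, "quality": 2.0},
]

print(common_traits(sites, ["duplicate_pct", "ads_per_page"], "quality", poor_below=3.0))
```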

With all these trees and possibilities, it isn’t smart to choose just the single best tree. A better approach is to find the recurring patterns across all trees, along with their respective outcome certainty. Combined, these patterns are the best bamboo forests to feed your Panda (forests represent combinations of trees).
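To make the forest idea concrete, here is a small sketch that trains many one-split trees (often called stumps), each on a random resample of the data, and then lets them vote. This is a simplified form of bagging and only a stand-in for whatever combination method is really used. It also assumes that higher values of each variable are worse. All names and numbers are invented.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Pick the (variable, threshold) pair that misclassifies the fewest examples.

    Assumes a value at or above the threshold means 'poor'.
    Returns None when the sample has no variable with two distinct values.
    """
    best = None
    for variable in rows[0]:
        values = sorted({row[variable] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            errors = sum((row[variable] >= threshold) != bool(lab)
                         for row, lab in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, variable, threshold)
    return best[1:] if best else None


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a random resample (with replacement) of the data."""
    rng = random.Random(seed)
    forest = []
    while len(forest) < n_trees:
        picks = [rng.randrange(len(rows)) for _ in rows]
        stump = train_stump([rows[i] for i in picks], [labels[i] for i in picks])
        if stump is not None:
            forest.append(stump)
    return forest


def poor_share(forest, row):
    """Share of trees voting 'poor'; values near 0 or 1 mean the trees agree."""
    return sum(row[variable] >= threshold for variable, threshold in forest) / len(forest)


# Hypothetical sites; label 1 means the raters judged the site as poor.
rows = [
    {"duplicate_pct": 5, "ads_per_page": 1},
    {"duplicate_pct": 10, "ads_per_page": 6},
    {"duplicate_pct": 15, "ads_per_page": 2},
    {"duplicate_pct": 60, "ads_per_page": 1},
    {"duplicate_pct": 70, "ads_per_page": 5},
    {"duplicate_pct": 80, "ads_per_page": 2},
]
labels = [0, 1, 0, 1, 1, 1]

forest = train_forest(rows, labels)
usage = Counter(variable for variable, _ in forest)  # recurring patterns across trees
print("variables chosen across trees:", dict(usage))
print("vote for a new site:", poor_share(forest, {"duplicate_pct": 40, "ads_per_page": 3}))
```

The count of how often each variable is chosen is a crude version of the recurring patterns, and the vote share is a crude version of outcome certainty.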

The Details of Machine Learning

Machine learning is very geeky stuff, and even this extremely simplified explanation might seem mind-boggling. If you are a mathematician, however, the next step is to follow the free Stanford video lectures on the topic.

The material has been hard to simplify without leaving out certain details. Comment if you see any important inaccuracies.