The concept of a search engine is very simple: store the location of pages in a database and return the most relevant results.
Primitive search engines only understood words on the pages in their database and returned pages that included those searched words. Google essentially came along with the ability to reorder those search results based on how many other pages on the Internet linked to them.
Since then, Google and other search engines have improved these foundational ranking factors and developed thousands of ways to reorder search results so only the most relevant are displayed to the user. Examples include the "freshness" of content and the location where the search took place.
Panda is just another way Google reorders results based on the quality of a site's content.
Throughout this article we speculate in detail about what can be learned from the "Panda Patent". The patent's wording is often vague and sometimes makes the meaning hard to follow, so here's a quick concept breakdown that should help throughout.
- Panda refreshes gathered information about links and queries associated with a site.
- Upon a user search, each result listing (URL / Page) is given an initial score based on relevance to search and page quality.
- Calculations from the first two steps determine whether the result listing (URL / Page) is above or below a threshold.
- Results are reordered by final values.
Looking Back and the New "Panda Rank"
Way back in February 2011, Google released a major change in how they order search results. In the announcement they explain:
This update is designed to reduce rankings for low-quality sites—sites which are low-value add for users, copy content from other websites or sites that are just not very useful.
They go on to say:
...it is important for high-quality sites to be rewarded, and that's exactly what this change does.
Panda looked at the quality of content as it related to a site.
In summary, Google assigned each page eligible for search results a value that reordered search results for any particular query based on the quality of the group (for example, a site) the page was associated with. That's a jam-packed summary and seems complex to programmatically automate.
Danny Sullivan, founder of Search Engine Land, speaks to this in an article published four months after the introduction of Panda: "Right now, it's too much computing power to be running this particular analysis of pages. Instead, Google runs the filter periodically to calculate the values it needs."
We've come to know what is described here as a "refresh": the occasion when sites are hit by, or released from, the Panda threshold.
Note: According to Google, this has recently been integrated into their normal process, hence no more "refreshes".
Fast forward three years and more than 25 Panda updates: Bill Slawski discovered a granted patent bearing the name of Navneet Panda, the Google engineer who authored the Panda update. Bill notes that Panda "...is aimed at improving search results rather than penalizing sites or identifying attempts to manipulate search results." This is a key differentiator when comparing Panda to other Google updates.
To help provide some context to the patent analysis below, we're going to call the new group-based quality score Google created "Panda Rank", an ode to PageRank, the way Google originally determined relevant pages for queries.
'Panda Rank' Patent
As much as we can gather from the complexity of a patent, Panda in a most basic summary seems to have been a two-part process. For each URL eligible for a particular search query:
- An initial score is generated – URL level, relevance to query, and/or quality score.
- If applicable, group-based modification factors are applied.
The first step in generating the Panda Rank was generating an initial score for every URL listing within the context of a particular query. The patent goes into more detail, "...a measure of the relevance of the resource (URL listing) to the search query, a measure of the quality of the resource (URL listing), or both."
STEP ONE: Given the search term, assign an initial value for all eligible URLs based on:
- Relevance to query.
- Quality measurement.
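As a rough illustration of Step One, the sketch below combines a relevance measure and a quality measure into a single initial score for a URL. The patent excerpt only says the score may reflect relevance, quality, "or both"; the function name `initial_score`, the weighted-sum formula, the `relevance_weight` parameter, and the 0-to-1 value range are all assumptions made for illustration, not details taken from the patent.

```python
from typing import Optional


def initial_score(relevance: Optional[float] = None,
                  quality: Optional[float] = None,
                  relevance_weight: float = 0.5) -> float:
    """Hypothetical initial score built from relevance, quality, or both."""
    if relevance is None and quality is None:
        raise ValueError("need a relevance measure, a quality measure, or both")
    if quality is None:
        return relevance          # relevance only
    if relevance is None:
        return quality            # quality only
    # Both available: a simple weighted sum (an assumed formula).
    return relevance_weight * relevance + (1 - relevance_weight) * quality


# Made-up values for two URLs that are eligible for the same query.
candidates = {
    "http://www.domain.com/resource1": initial_score(relevance=0.9, quality=0.4),
    "http://host.example.com/resource1": initial_score(relevance=0.7),
}
```

The point is only that every eligible URL receives a query-specific starting value before any group-level adjustment happens.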
Initial Scoring Speculation and Summary
Initial scores were generated from factors that were not new. Panda is special because it was a feasible way to automate reordering search results based on group versus URL quality.
Group Based Modification Factor
Once an initial score has been generated for every URL for a given search query, that score is then modified based on the quality of the group the URL has been assigned to.
A site is one type of group. Google refers to address-based groups several times throughout the patent, but it also lists other possibilities for what a group could be:
...a portion of the resources on the Internet. A group can be defined in any of a variety of ways. An address-based group of resources is a group of resources that is defined by the Internet addresses, e.g., Uniform Resource Locators (URLs), of the resources in the group. Resources are grouped so that a resource cannot be included in more than one group of resources. For example, a group of resources can include each resource that can be accessed using a particular domain name. That is, the group could include http://www.domain.com/resource1, http://wwww.domain.com/resource2, http://www.domain.com/resourceN, and so on, without regard to when the resources first become available to the search engine 130 for indexing. Alternatively, a group of resources can include each resource that can be accessed using a particular host name, e.g., http://host.example.com/resource1, http://host.example.com/resource2, http://host.example.com/resourceN, and so on. Other address-based groupings are possible. For example, a particular group can include only a portion of the resources that can be accessed using a particular host name or a particular domain name.
Patent speak likes to project that anything is possible, which makes sense for litigation purposes.
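To make the address-based grouping in the excerpt concrete, here is a minimal sketch that assigns each URL to exactly one group, keyed either by host name or by domain name. The helper names (`host_group`, `domain_group`, `group_urls`) are hypothetical, and the "domain name" rule (the last two labels of the host) is a deliberate simplification: handling names such as example.co.uk correctly would require a public-suffix list.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_group(url: str) -> str:
    """Group key: the full host name, e.g. 'host.example.com'."""
    return (urlsplit(url).hostname or "").lower()


def domain_group(url: str) -> str:
    """Group key: a naive domain name made of the last two host labels.

    Assumption: this is good enough for illustration only.
    """
    labels = host_group(url).split(".")
    return ".".join(labels[-2:])


def group_urls(urls, key=host_group):
    """Place every URL in exactly one group, as the excerpt requires."""
    groups = defaultdict(list)
    for url in urls:
        groups[key(url)].append(url)
    return dict(groups)


urls = [
    "http://www.domain.com/resource1",
    "http://www.domain.com/resource2",
    "http://host.example.com/resource1",
    "http://other.example.com/resource1",
]
by_host = group_urls(urls)                      # three host-name groups
by_domain = group_urls(urls, key=domain_group)  # two domain-name groups
```

Switching the `key` function changes the group boundaries, which is one way to picture the excerpt's "other address-based groupings are possible".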
STEP TWO: Once an initial score is assigned for eligible URLs given a search term, modify the scores according to the quality of the groups with which they are associated. For all URLs in a group, the quality score is based on identifying a count of:
- Reference queries.
- Independent links.
Reference Query Group Count
A reference query is when a searcher uses a search query to find a specific URL.
Note: More detail is provided about reference queries in the Panda Flow and Speculation Summary below.
Independent Link Group Count
Independent links are especially "vetted" links pointing to a URL.
Group Based Modification Speculation and Summary
Panda took initial URL level relevance and quality scores and modified them according to a group based factor looking at reference queries and independent links.
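The group-level step can be sketched as two parts: roll per-URL counts up into per-group totals, then turn those totals into a factor that scales the initial score. The counting follows the description above. The `placeholder_factor` formula is purely a stand-in: this article does not establish how the two counts are combined, which direction a ratio would run, or whether a ratio is used at all. All names and numbers below are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def group_of(url):
    """Assumed grouping rule: one group per host name."""
    return urlsplit(url).hostname


def group_counts(per_url_counts):
    """Sum per-URL counts into per-group totals."""
    totals = {}
    for url, counts in per_url_counts.items():
        totals.setdefault(group_of(url), Counter()).update(counts)
    return totals


def placeholder_factor(counts):
    """Stand-in formula (an assumption): independent links per reference query."""
    if counts["reference_queries"] == 0:
        return 1.0  # no data for the group: leave the score unchanged
    return counts["independent_links"] / counts["reference_queries"]


def modified_score(initial, url, totals):
    """Scale a URL's initial score by its group's factor."""
    return initial * placeholder_factor(totals[group_of(url)])


# Made-up counts for two URLs that share one host-name group.
per_url_counts = {
    "http://host.example.com/resource1": {"reference_queries": 30, "independent_links": 12},
    "http://host.example.com/resource2": {"reference_queries": 10, "independent_links": 8},
}
totals = group_counts(per_url_counts)
score = modified_score(0.8, "http://host.example.com/resource1", totals)
```

Note that both URLs receive the same factor because it is computed from the group's totals, not from each URL's own counts.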
The Panda Flow
The patent image below describes the flow of Panda.
- Generate initial scores for all eligible URL listings for a given search query.
- Determine if query is navigational, if so leave initial score alone.
- If the initial score of the URL is below the group quality score threshold, modify initial score.
Note: In the actual patent, they seem to refer to a URL as a "resource".
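The three steps of the flow can be sketched as a small re-ranking function. This follows the steps exactly as listed in this article: navigational queries keep their initial score, and otherwise a score below the group's threshold is modified. The `Result` fields, the multiplication by `group_factor`, and the sample numbers are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    initial_score: float
    group_threshold: float  # hypothetical threshold for the URL's group
    group_factor: float     # hypothetical modification factor for the group


def final_score(result: Result, query_is_navigational: bool) -> float:
    """Apply the flow to a single result listing."""
    if query_is_navigational:
        return result.initial_score                        # leave score alone
    if result.initial_score < result.group_threshold:
        return result.initial_score * result.group_factor  # modify score
    return result.initial_score


def rerank(results, query_is_navigational=False):
    """Order result listings by final score, highest first."""
    return sorted(results,
                  key=lambda r: final_score(r, query_is_navigational),
                  reverse=True)


results = [
    Result("http://www.domain.com/resource1", 0.70, 0.75, 0.5),
    Result("http://host.example.com/resource1", 0.60, 0.50, 0.5),
]
ordered = rerank(results)
```

With these made-up numbers, the first listing falls below its group's threshold and is scaled down, so the second listing overtakes it. For a navigational query the original order is preserved.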
The Panda Flow Speculation and Summary
One confusing aspect of the Panda flow is what the difference between navigational and reference queries might be. We think that navigational queries may have something indicating the brand name within the query itself.
Something like searching [nike shoes on zappos] might be considered a navigational query. Reference queries might be based on timing. If someone runs a search and clicks a link within a certain time frame, it could be flagged as a "reference query".
Another interesting aspect to this flow to consider is the possibility of multiple thresholds. It makes sense that Google wants a quality factor based on site or group of URLs and found that with Panda, but having multiple thresholds could have different outcomes. A URL listing for a given search query could be:
- Negatively impacted by multiple group score factors.
- Negatively impacted by one group score factor and positively impacted by others.
- Positively impacted by multiple group score factors.
The patent does state, "Resources are grouped so that a resource cannot be included in more than one group of resources."
Despite this being outright stated, one thought is that a URL could have modification factors applied based on both address and non-address based groups.
Google could set the address-based modification factor to have more of an impact on a URL than non-address-based groups, effectively demoting URLs / pages that have internal duplication issues to a higher degree than URLs / pages with external duplication issues.
Along the same line of thought, if a page has both internal and external group quality issues, it could be negatively impacted twice.
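The speculation above about weighting two kinds of group factors can be made concrete with a small sketch. Everything in it is hypothetical: the patent excerpt says a resource belongs to only one group, so applying both an address-based and a non-address-based factor, and using weights as exponents, are assumptions made only to show how a page could be "impacted twice".

```python
def apply_group_factors(initial: float,
                        address_factor: float = 1.0,
                        non_address_factor: float = 1.0,
                        address_weight: float = 2.0,
                        non_address_weight: float = 1.0) -> float:
    """Hypothetical: scale a score by two group factors with different weights.

    A factor below 1 demotes, a factor above 1 promotes, and a larger
    weight makes that factor count for more.
    """
    return (initial
            * address_factor ** address_weight
            * non_address_factor ** non_address_weight)


internal_only = apply_group_factors(1.0, address_factor=0.5)
external_only = apply_group_factors(1.0, non_address_factor=0.5)
both = apply_group_factors(1.0, address_factor=0.5, non_address_factor=0.5)
```

Under these assumed weights, the same factor of 0.5 costs more when it comes from the address-based group than from the non-address-based one, and a page with both issues ends up lower than either case alone.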
Panda Today and Ranking Factors to Consider
Without thinking about the details, one thing we can agree on is that site and group-based quality scores are a ranking factor in Google search results today. No doubt things have changed since this patent was written, but it does seem to provide some valuable insight into how to think about, avoid, and overcome Google's Panda.
As it is described in the patent, URLs are given a unique score that is modified according to any query searched. Furthermore, that modification factor is dynamic depending on ever changing variables. That translates to Panda having a different impact on every URL. That makes for hard quantification, but there are factors that have been discussed in detail above to consider further incorporating into best practices:
Panda URL Level Ranking Factors
- Relevance to query.
Panda Group Level Ranking Factors
- Count of reference queries.
- Count of independent links.