So what's Google done that's caused so many sites to drop? My best guess is that the company may be making use of Teoma-like "local ranking" to filter out irrelevant links that can throw its link analysis system off. Stemming is also a factor, and other techniques may be involved.
Google Versus Teoma
Consider how Google traditionally has ranked things and explained its system to the general public. Google calculates a score for each and every page it knows about, based on the links it sees to that page from all other pages it has seen on the web. This is the page's "PageRank" score, which can be viewed with the Google Toolbar.
When someone does a search, Google uses the PageRank score plus many other factors to determine if a page should rank well. Pages with PageRank scores less than other pages may still outrank those other pages if factors such as page content, hyperlink content and so on give the page a boost.
Google has tweaked both the PageRank counting system and its entire search algorithm over time. It's widely assumed that guestbook links may not be counted, or may not count as much. Internal links, especially those that use the same text and style on every page in a web site, are also suspected to have been discounted. Despite these tweaks, Google is still examining every link it finds from across the web.
Now consider Teoma (and see this article for more details on how Teoma works). When you search at Teoma, it finds all the pages considered to have matched your query. Look for "travel," and it gets all the pages matching travel based on body text and traditional on-the-page factors. Then for this particular set of pages, Teoma quickly calculates its own version of PageRank for each page in the set. Only the links within the set are deemed relevant.
- Google = link score based on all web pages
- Teoma = link score based on only pages found within a query
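The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `pagerank` function is a textbook power-iteration version, not Google's or Teoma's actual code, and the page names, link graph and result set are all made up for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a dict of page -> outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


def restrict_to_results(links, result_set):
    """Keep only the links that start and end inside the query's result set."""
    return {
        page: [t for t in links.get(page, []) if t in result_set]
        for page in result_set
    }


# Hypothetical link graph: hotel_a is linked from off-topic pages,
# hotel_b from on-topic ones.
web_links = {
    "pet_store": ["hotel_a"],
    "florist": ["hotel_a"],
    "blog": ["hotel_a"],
    "travel_guide": ["hotel_b"],
    "hotel_directory": ["hotel_b"],
    "hotel_a": [],
    "hotel_b": [],
}

# Hypothetical set of pages matching a search for "hotel".
hotel_results = {"hotel_a", "hotel_b", "travel_guide", "hotel_directory"}

global_score = pagerank(web_links)  # "Google-style": all pages count
local_score = pagerank(restrict_to_results(web_links, hotel_results))  # "Teoma-style"
```

In this toy graph, `hotel_a` scores higher than `hotel_b` when every link on the web counts, but the order flips once only links among the matching pages are considered.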
In general, LocalRank is very Teoma-like. The key difference, as I understand it, is that Google gets all the pages it would in the past where PageRank would be involved, then it recalculates PageRank based on what's found in response to the query.
The WebmasterWorld.com summary says that only the top 1,000 results would be considered. That's not actually the case. The patent says that any depth could be analyzed. Google might look at links between pages in the top 100 results, the top 1,000, the top 100,000 or all pages found.
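To see why the depth matters, here is a rough sketch that counts links only among the top results of an initial ranking. It is not the method described in the patent, just a simplified stand-in: the function name, the `depth` parameter and the four placeholder pages are assumptions made for illustration.

```python
def local_inlink_counts(ranked_pages, links, depth):
    """Count links each top page receives from other pages in the top `depth` results."""
    top = ranked_pages[:depth]
    top_set = set(top)
    counts = {page: 0 for page in top}
    for source in top:
        for target in set(links.get(source, [])):
            if target in top_set and target != source:
                counts[target] += 1
    return counts


# Hypothetical initial ranking and link graph.
initial_ranking = ["a", "b", "c", "d"]
links = {"c": ["b"], "d": ["a"]}

shallow = local_inlink_counts(initial_ranking, links, depth=3)
deep = local_inlink_counts(initial_ranking, links, depth=4)
```

With a depth of three, page `d` falls outside the set, so its link to `a` is ignored. With a depth of four, that same link counts. The deeper the analysis goes, the closer the result gets to counting every link among the pages found.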
Google might also make use of some enhancements for discounting links that aren't covered in this particular patent. The main point is really that the pages actually found in a QUERY suddenly have much more influence than ALL pages across the web.
This type of change would be a major departure for Google. It wouldn't be a tweak -- it would be a massive upgrade of its algorithm. It's also likely to have a huge impact on some sites.
Consider all the people that may have done well with link exchanges from pages not related to a particular topic, such as the hotel site that got a link from a pet store. If that pet store page doesn't make it into the top 1,000 results for a search on "hotel," as you would expect it not to, then the pet store link to the hotel site doesn't help as much as with Google's old system.
Consider someone with good internal linking. If many of the internal pages are off the topic of a search, then they get discounted in a local rank system. Or consider that Google might do something like say, "OK, I see they've got many relevant internal pages, but I'm only going to count links on those I'd actually display in the top results." And given how Google "indents" when there's more than one relevant page from a particular site, these internal links suddenly might count for less.
Running Two Systems
Retrieving documents in response to a query is no easy task. Recalculating ranking scores based on the documents within a set you've retrieved is even harder. That's why Teoma has been so happy. They've got a system that lets them do this quickly. In my past interviews with Teoma, they've felt it would be harder for others to catch up. But if anyone was going to, Google would be the top candidate.
Of course, Google handles a lot of traffic -- more than any other search engine in the world. Using a new, more processor intensive system for every query still wouldn't be easy. In addition, it might not be necessary. As a result, I think Google initially rolled out LocalRank plus other new algorithm changes for "easy" queries, those involving only a few words and no exclusions.
In particular, I remember a conversation with Infoseek several years ago, where they criticized Google for being an "ALL" search engine. In other words, Infoseek at the time would look for matches that involved ANY of your words, and it took more effort for them to do this. Google was actually rare in being an ALL search engine that looked for ALL your terms, and Infoseek felt they did this because they lacked the processor power to do an ANY search quickly.
If true, then you understand why Google was showing "old" matches when you did a search involving an exclusion of some type. Since that's going to require extra computation power, Google might have been falling back to its old style ranking system that's less processor intensive. In other words, it couldn't easily do LocalRank and exclusion at the same time.
To underscore this, the filter test most people tried involved doing a search plus a made up word. But it also worked with "real" words. For instance, I had tried this:
travel agent -holly
"Holly" is a real word, but there were only about 50,000 pages that matched a search for all three words, "travel agent holly." So doing the second search did drop some pages, but not enough that it should have made a major ranking impact. Yet doing it produced some reorganization.
In another test, I searched for "alcoholics anonymous" minus any matches that were .XLS files. This is another form of exclusion, but one that doesn't involve words at all. Nevertheless, it produced some subtle changes. In addition, excluding .XLS files worked as a filter test on some other queries just as effectively as using a made-up word.
In short, any type of exclusion caused Google to process a query differently. I think the main factor behind this processing is a new algorithm -- and this could be one that takes LocalRank into account.
Google, for its part, won't confirm this. They'll only say that "medium" level changes have been made to their ranking system. So what I've described is simply my own speculation.
If Google won't talk, how about Teoma? Do they feel Google is trying to imitate what they've done from the outset?
"We prefer not to comment specifically on how other engines work. However, it is worth noting that it appears that Google seems to have recognized the value of what we call subject specific popularity or local rank when used in addition to global rank methods [i.e. PageRank, in their case]. I should note that both these measures of popularity, in combination, have always been significant to Teoma's process. If your speculation is correct and Google is now doing this (even if it is on a smaller sub-data set versus Teoma's process which scales to the size of the returned dataset no matter how large) then it should improve their ranking method to the extent that they can apply it," emailed Paul Gardi, senior vice president, strategy & growth initiatives at Ask Jeeves, which owns Teoma.
What else is going on? When Google rolled out its new ranking system, it also introduced stemming. This means that when someone searches without using any special settings or search commands such as + or - in front of a word, they'll also get variations of the words they looked for. Plural forms and -ing endings will be matched, for example.
Consider a search for home decorating. It actually brings back pages that contain at least this many variations:
To see this, look at the search results for home decorating. Note the words that are bolded in the descriptions. These are the words Google has found that match your query. You'll see the variations I've listed.
Stemming will have a natural impact on rankings. There are more possible pages included in the query than in the past, and it's difficult to know which exact words or phrases Google may decide to weight more.
Search Engine Watch will be returning to look more closely at stemming in a future article. However, for a good webmaster's perspective, see this current thread on the topic at SearchGuild.com.
If you're trying to understand if stemming may be the main factor causing your ranking drop, here's how to test. Rerun your query with + symbols in front of your key terms, such as +home +decorating. That forces Google to stop stemming. If your page comes back up, then you know it's mainly stemming that was to blame.
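The logic behind this test can be sketched with a toy matcher. Everything here is an assumption for illustration: `toy_stem` is a crude suffix stripper that handles only plurals, -ing endings and a trailing "e," and it is certainly not the stemmer Google uses. The sample page text is made up as well.

```python
def toy_stem(word):
    """Crude stemmer: strips -ing or a plural -s, then a trailing 'e'."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
    elif w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    if w.endswith("e") and len(w) > 3:
        w = w[:-1]
    return w


def matches(query, page_text):
    """True if every query term is found; '+term' must match exactly, others by stem."""
    words = page_text.lower().split()
    exact_words = set(words)
    stems = {toy_stem(w) for w in words}
    for term in query.lower().split():
        if term.startswith("+"):
            if term[1:] not in exact_words:
                return False
        elif toy_stem(term) not in stems:
            return False
    return True


page_with_variations = "ideas to decorate homes"
page_with_exact_words = "home decorating tips"

stemmed_hit = matches("home decorating", page_with_variations)
forced_exact_hit = matches("+home +decorating", page_with_variations)
```

Without the + symbols, the page that only contains variations still matches, so it competes with yours. With them, it drops out, which is why a page that reappears under the + version was most likely pushed down by stemming.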
Hilltop In Action?
Above, I talked about the idea that Google might be making use of LocalRank. Phil Craven of PageRank Explained has an article suggesting that Google might be using another system, one known as Hilltop.
Hilltop was coauthored by Krishna Bharat, who is now a Google employee -- in fact, he's also the same person who authored the LocalRank patent. In Hilltop, the system is designed to look for "expert" pages, pages that have links (often many) to other pages on a particular topic. Teoma's system not only finds expert pages but also places them into a special "Resources" area.
Hilltop could certainly have come into play. Google could be trusting links from certain pages more, such as those on directory pages. However, Google needn't use just the Hilltop algorithm to get some of these benefits and produce some of the changes we've seen.
One thing that's been remarked on is that Google may be showing more "informational" pages and that more content from educational or governmental sites is appearing. This could be due to Google configuring its systems to give more weight to links from .edu and .gov sites. It may feel these sources are more impartial.
This is not to say that something like Hilltop isn't happening. In fact, it could very well be -- along with some form of LocalRank and other tweaks. In fact, I think it's reasonable to assume that Google's using a whole mixture of new techniques.
In addition, I think it's reasonable to assume that Google may use certain techniques for certain classes of queries. I don't think they have a predefined list of "commercial" words. However, they may have a system in place to determine the "commercialness" of a query on the fly.
Andrew Goodman correctly recalls InfoSpace doing something like this on Dogpile. My own recollection was that Dogpile did this in part by looking at how many results were coming back from the paid listing providers it was partnered with. If there were a lot for a particular term, then a query might reasonably be assumed as commercial.
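A heuristic along those lines might look something like the sketch below. The function name, the provider names and the threshold of five listings are all hypothetical. Nothing here reflects how Dogpile or Google actually made the call.

```python
def looks_commercial(paid_listing_counts, threshold=5):
    """Guess that a query is commercial if enough paid listings came back for it.

    paid_listing_counts maps a paid listing provider to the number of ads
    it returned for the query. The threshold is an arbitrary assumption.
    """
    total_paid = sum(paid_listing_counts.values())
    return total_paid >= threshold


# Hypothetical counts for two queries.
many_ads = {"provider_a": 8, "provider_b": 4}
few_ads = {"provider_a": 0, "provider_b": 1}

busy_query_is_commercial = looks_commercial(many_ads)
quiet_query_is_commercial = looks_commercial(few_ads)
```

The appeal of this kind of test is that it needs no predefined word list. Advertisers effectively vote on which terms are commercial by bidding on them.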
Google could be doing something like this or something more sophisticated. But the point is, should a search be deemed "commercial," Google might try using more aggressive spam filtering or perhaps trusting certain types of links more.
As noted earlier, Google's not saying much about its changes. I spoke with Matt Cutts, a software engineer who oversees webmaster issues for Google, earlier this week. He acknowledged that Google is making use of some new "signals" about the quality and content of web pages and that use of more information is planned.
"Over the next six weeks to two months, we're going to bring in another two or three signals," Cutts said.
What are these mystery signals? Google could make use of data it collects from its Google Toolbar to impact rankings, a favorite theory of WebmasterWorld founder Brett Tabke.
Google has denied that toolbar data is being used for ranking purposes in the past. Now it simply refuses to comment on whether toolbar data is involved at all.
Clickthrough data could be used. But clickthrough data has never been proven over the long term to be that helpful in improving ranking, since it's prone to spam. There's also no sign that Google has increased the limited clickthrough measurement it does. In the past, the company has measured clickthrough on about one percent of its listings to help with quality assessment. That seems to still be the case, though for the record, Google won't confirm one way or another if clickthrough data is coming into play.
Is Google classifying pages as commercial or non-commercial? Again, no comment. However, Andrew Goodman managed to at least get Google to acknowledge that it may indeed be classifying pages in this way and into other types.
Google will say that the changes introduced some new spam fighting techniques, but it also stressed that most of the changes people are seeing aren't because of spam factors.
"It's definitely not penalties or [spam] filtering but new scoring that takes advantage of these new signals," Cutts said. "It's not that they are in some penalty sandbox, and even if they change things, they'll never be better. Instead, we've got a better view of those sites, and some sites don't measure up as well as they did before."
In other words, it's simply that a new measurement yardstick is being used.
"If a stock picker used a formula for what to pick, then said 'Now I care more about incoming revenue,' then you'd end up with a different portfolio of companies," Cutts explained.