Jean Véronis takes another look at the Google count oddities covered before, such as why an OR search for two different words brings back FEWER results than a search for either of the words individually (it should bring back more, since the possible matching set is larger).
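To make the expectation concrete, here is a minimal sketch of why an OR count should never be smaller than either single-word count. The tiny index, the words in it and the helper names are all made up for illustration; this is not how Google stores or counts anything.

```python
# Toy illustration only: a made-up "index" mapping each word to the IDs
# of the pages that contain it.
index = {
    "cat": {1, 2, 3, 5},
    "dog": {3, 4, 5, 6, 7},
}

def count(word):
    """Number of pages containing the word."""
    return len(index.get(word, set()))

def count_or(a, b):
    """Number of pages containing either word (set union)."""
    return len(index.get(a, set()) | index.get(b, set()))

cat_or_dog = count_or("cat", "dog")

# A union can never be smaller than either of its parts.
assert cat_or_dog >= max(count("cat"), count("dog"))
print(count("cat"), count("dog"), cat_or_dog)  # 4 5 7
```

An OR count that comes back lower than either single-word count breaks that rule, which suggests the reported numbers are not straight tallies of matching pages.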
In Google's missing pages: mystery solved?, he surmises:
A possible scenario is that the real index used by Google is considerably smaller than the counts officially announced.
Indeed, it's not a scenario. It's reality. We know that Google has at least two indexes it uses for web searches -- the "regular" one and a "supplemental" index (see Search Engine Size Wars & Google's Supplemental Results).
Google has never said how many pages are in that supplemental index, when exactly it gets hit and so on. As a matter of fact, I was just asking them about it last night and still didn't get any further information about what exactly is in there or how it is used.
Véronis does some testing and mathematical calculations to determine that the "real" index is about 5 billion pages. I'd translate that into saying that the main index is 5 billion pages -- near the number that Google long used to report on its home page. And the page increase it recently announced? Seems like that was an expansion of content to the supplemental index.
Véronis also speculates that the two indexes may be divided into pages that are actually indexed plus pages that Google knows about but hasn't actually indexed. Perhaps, except that Google has for years had what it calls partially-indexed URLs that aren't actually indexed at all. As I've written before, link-only listings is a better description, as these are pages Google knows about only through link data, not from having indexed anything at all.
In either case -- whether these are in a separate index or part of the main one -- the idea that Google might be guessing how many link-only pages contain particular words and then extrapolating an overall figure is interesting. And then further, if Google does some type of processor-intensive search, the extrapolation might be dropped.
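Here is a rough sketch of what that kind of extrapolation could look like. It is pure speculation in code form: the function, its parameters and the scaling rule are hypothetical, the 5 billion figure is Véronis's estimate, and the claimed total is a placeholder rather than a number Google has confirmed.

```python
def reported_count(main_matches, main_size, claimed_size, extrapolate=True):
    """Hypothetical model: scale the exact number of matches found in the
    main index up to the size of the larger, claimed index."""
    if not extrapolate:
        # Imagined "processor-intensive" path: the raw count is returned
        # and the scaling step is skipped.
        return main_matches
    return main_matches * claimed_size // main_size

MAIN_SIZE = 5_000_000_000     # Véronis's estimate of the "real" index
CLAIMED_SIZE = 8_000_000_000  # placeholder for an announced total (assumption)

print(reported_count(1_000_000, MAIN_SIZE, CLAIMED_SIZE))         # 1600000
print(reported_count(1_000_000, MAIN_SIZE, CLAIMED_SIZE, False))  # 1000000
```

If something along these lines were happening, any query that skipped the scaling step would show a noticeably lower count than one that didn't, even with the same pages actually matching.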
The other specter lurking out there is that Google might be using different algorithms for different types of searches, something that really came up during the big "Florida" update in Nov-Dec. 2003. For Search Engine Watch members, my Speculation On Google Changes article looks at this in depth.
Interestingly, we're in the midst of the most significant update since then, as my Feeling Like Google Dance Time post explains. People are speculating on all types of things that Google might be trying that are resulting in the changes -- but it may very well be that a number of things are in use, depending on the type of query you are performing.
Véronis concludes with this simple advice:
In all likelihood, the Google engineers simply forgot to plug the extrapolation routine at the end of the boolean module! Therefore, if you want to know the real index count for any word, simply type it twice.
Try it yourself, and you'll see how the count drops. But this isn't something new. Tara Calishain talked about it in her excellent Google Hacks book from back in 2003 -- she touches on it a bit in this post, as well.
Going back to her book (page 22, if you've got it), Tara talks about how repeating a word more than once both lowers the count and also changes the order of the search. Google gave her no explanation for this.
I'll throw out one other thing to contemplate. I've been looking at getting a Tom Tom GPS system this week. That is, a TomTom system -- but sometimes by mistake, I spell it as two words. In these cases, it's a good thing that I get different counts.
The results for tom are much different than for tom tom, with the latter bringing in a lower count and getting the TomTom site I want at the top, despite my misspelling. Interestingly, do tom tom tom, and the count drops again significantly -- though I don't see this when going from two words to three on other queries.
So yes -- Google is showing oddities. Exactly what these are and why, I can't say. I've asked about some of this before but gotten cagey answers, at best. I'll go back again to see if I have more luck, because these types of things have an impact on searchers.
Also, going back to my TomTom query, the same thing happens on Yahoo in terms of a rank change, though the count drop isn't as significant. And you get oddities there, as well. tom tom tom gives 62.5 million matches but tom tom tom tom gives 64 million? At MSN, three toms gives slightly fewer matches than four?
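If you want to check your own queries for this kind of oddity, a small sketch like the one below will flag it. It assumes that repeating a word adds no new requirement to the query, so the count should not go up as copies are added; the function name is made up, and the only numbers used are the two Yahoo counts mentioned above.

```python
def rising_counts(counts_by_repeats):
    """Given {number of copies of the word: reported count}, return the
    (fewer, more) pairs where the count rose as copies were added."""
    ns = sorted(counts_by_repeats)
    return [
        (a, b)
        for a, b in zip(ns, ns[1:])
        if counts_by_repeats[b] > counts_by_repeats[a]
    ]

# Yahoo counts quoted above for "tom tom tom" and "tom tom tom tom".
yahoo_tom = {3: 62_500_000, 4: 64_000_000}
print(rising_counts(yahoo_tom))  # [(3, 4)]
```

Counts typed in by hand are enough for this; nothing here talks to a search engine.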
At least with Ask Jeeves, the count for tom versus tom tom doesn't change -- but try three or more, and you get no web search results at all. But searching for ask, all the way up through ask ask ask ask, worked, and it was nice to see the number pretty much stayed rock solid.
For more background on Google counts and oddities, also be sure to see my recent post, Questioning Google's Counts.