Further to my previous post on the Google index update/size increase, there appears to be a new way to count all the pages within Google. Find a term that doesn't exist, then search for minus that term, and you get a full count. Well, sort of.
This is the technique we used to be able to use at Northern Light to verify all the pages it had. AllTheWeb used to give a similar method to Greg Notess, as he says here, which he used as part of his long-time documentation of search engine sizes.
I was emailing with Google last week about wanting that type of command to exist at Google. It didn't work last week when I tried, nor had I seen it working before. But if we had it, then anyone could see exactly the total number of pages Google should have in its index.
That's important, because as I've written, the count on Google's home page doesn't change in line with the index growing. In addition, searching for a common word like "the" sometimes doesn't work well because of stopword issues. There are also plenty of non-English language pages that won't contain the word "the."
Today, I noticed the technique suddenly did work! To see it in action, I've provided an example below. This exact example won't work in the long term, because once this post gets indexed, the word will suddenly exist in Google's index. But you can easily do it with other words.
A search for djfdkjkfjkdjdfk comes back with "Your search - djfdkjkfjkdjdfk - did not match any documents." OK, then we know there are no documents in the Google index with this term.
Now I do a search for -djfdkjkfjkdjdfk. That means, "Show me all the pages you have that don't have this word on them." Since we know that NO pages have that word, asking for all pages without it should show us everything.
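For anyone who wants to try this with other words, here is a minimal Python sketch of the two steps: make up a nonsense term, then build the plain query and the minus query from it. The helper names are hypothetical, and the base URL and the `q` parameter are assumptions used only for illustration. The sketch only builds the query strings; it doesn't contact any search engine, and you would still need to confirm by hand that the plain search really matches no documents.

```python
import random
import string
from urllib.parse import urlencode


def make_nonsense_term(length=15, seed=None):
    """Build a random lowercase string that is unlikely to appear on any page."""
    rng = random.Random(seed)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_queries(term):
    """Return the (plain, minus) query pair used by the technique.

    The plain query should match no documents; the minus query then
    asks for every page that does NOT contain the term.
    """
    return term, "-" + term


def search_url(base, query):
    """Attach the query to a search URL (base and 'q' are assumed, not confirmed)."""
    return base + "?" + urlencode({"q": query})


plain, minus = build_queries("djfdkjkfjkdjdfk")
print(plain)   # djfdkjkfjkdjdfk
print(minus)   # -djfdkjkfjkdjdfk
print(search_url("https://www.google.com/search", minus))
```

A longer random term lowers the odds that the word already exists somewhere in the index, which is the whole point of the first search.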
Count? About 9,560,000,000 pages. Count on the Google home page? "Searching 8,168,684,336 web pages." So Google should have at least about 1.4 billion more pages in its index than it currently claims.
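The gap between those two figures is simple subtraction, but here is a short Python sketch for anyone repeating the comparison with fresh counts. The function name is hypothetical, and the inputs are just the two numbers quoted above, one of which Google itself labels as "about."

```python
def index_gap(negative_search_count, home_page_count):
    """Return the absolute gap and the gap as a share of the home page count."""
    gap = negative_search_count - home_page_count
    return gap, gap / home_page_count


# Figures quoted in this post: the minus-search count and the home page count.
gap, share = index_gap(9_560_000_000, 8_168_684_336)
print(f"{gap:,} more pages than the home page claims")  # 1,391,315,664 more pages ...
print(f"{share:.1%} above the home page figure")
```

Because the minus-search count is an estimate, the result is best read as a rough lower bound rather than an exact tally.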
I actually think the real number is much higher, as I'll explain in a future post. That's why I'm saying this may "sort of" work to show all the pages. Certainly PhilC on our SEW Forums has tried this technique and gotten 11.3 billion results. I can't get the same, but it's just another sign that the counts aren't adding up in the many ways you want to slice them.
By the way, I tried the negative technique at some other places. It won't work for Yahoo and Ask Jeeves. But at MSN, -djfdkjkfjkdjdfk came back with a count of 5,304,186,736, which is right in line with the self-reported figure of 5 billion MSN gave last year.
Of course, even if all the search engines make this technique work, it doesn't necessarily mean we've got apples-to-apples comparisons. What depths are the pages indexed to? How well are duplicates removed? Are these pages actually indexed, or are they just links to pages the search engine knows about? Those are just some of the issues.
More important, as I've written before and will come back to again, having higher counts won't mean you're more comprehensive. For more on this, see my post from yesterday, Googlewhacks Show More Signs That Google's Increased Its Index; Time To Drop The Hamburger Count.
Want to comment or discuss? Visit our forum thread, Sept. 2005 Google Index Update & Size Increase Coming?