Ah, summer. Time to play on the beach, head out on vacation and, if you're a search engine, announce to the world that you've got the largest index.
Around this time last year, AllTheWeb kicked off a round of "who's biggest" by claiming the largest index size. Now it's happened again, when AllTheWeb said last month that its index had increased to 3.2 billion documents, toppling the leader, Google.
Google took only days to respond, quietly but deliberately notching up the number of web pages listed on its home page that it claims to index. Like the McDonald's signs of old that were gradually increased to show how many customers had been served, Google went from 3.1 billion to 3.3 billion web pages indexed.
Yawn. Actually, not yawn. Instead, I'm filled with Andrew Goodman-style rage (and that's a compliment to Andrew) that the search engine size wars may erupt once again. In terms of documents indexed, Google and AllTheWeb are now essentially tied for biggest -- and hey, so is Inktomi. So what? Knowing this still gives you no idea which is actually better in terms of relevancy.
Size figures have long been used as a surrogate for the missing relevancy figures that the search engine industry as a whole has failed to provide. Size figures are also a bad surrogate, because having more pages in no way guarantees better results.
How Big Is Your Haystack?
There's a haystack analogy I often use to explain this, the idea that size doesn't equal relevancy. If you want to find a needle in the haystack, then you need to search through the entire haystack, right? And if the web is a haystack, then a search engine that looks through only part of it may miss the portion with the needle!
That sounds convincing, but the reality is more like this. The web is a haystack, and even if a search engine has every straw, you'll never find the needle if the haystack is dumped over your head. That's what happens when the focus is solely on size, with relevancy ranking a secondary concern. A search engine with good relevancy is like a person equipped with a powerful magnet -- you'll find the needle without digging through the entire haystack because it will be pulled to the surface.
Google's Supplemental Index
I especially hate when the periodic size wars erupt because examining the latest claims takes time away from other more important things to write about. In fact, it was a great relief to have my associate editor Chris Sherman cover this story initially in SearchDay last week (Google to Overture: Mine's Bigger). But I'm returning to it because of a twist in the current game: Google's new "supplemental results."
What are supplemental results? At the same time Google posted new size figures, it also unveiled a new, separate index of pages that it will query if it fails to find good matches within its main web index. For obscure or unusual queries, you may see some results appear from this index. They'll be flagged as "Supplemental Result" next to the URL and date that Google shows for the listing.
Google's How To Interpret Your Search Results page illustrates this, but how about some real-life examples you can try? Here are some provided by Google to show when supplemental results might kick in:
- "St. Andrews United Methodist Church" Homewood, IL
- "nalanda residential junior college" alumni
- "illegal access error" jdk 1.2b4
- supercilious supernovas
Two Web Page Indexes Not Better Than One
Using a supplemental index may be new for Google, but it's old to the search engine industry. Inktomi did the same thing in the past, rolling out what became known as the small "Best Of The Web" and larger "Rest Of The Web" indexes in June 2000.
It was a terrible, terrible system. Horrible. As a search expert, you never seemed to know which of Inktomi's partners was hitting all of its information or only the popular Best Of The Web index. As for consumers, well, forget it -- they had no clue.
It also doesn't sound reassuring to say, "we'll check the good stuff first, then the other stuff only if we need to." What if some good stuff for whatever reason is in the second index? That's a fear some searchers had in the past -- and it will remain with Google's revival of this system.
Why not simply expand the existing Google index, rather than go to a two tier approach?
"The supplemental is simply a new Google experiment. As you know we're always trying new and different ways to provide high quality search results," said Google spokesperson Nate Tyler.
OK, it's new, it's experimental -- but Google also says there are currently no plans to eventually integrate it into the main index.
Deconstructing The Size Hot Dog
Much as I hate to, yeah, let's talk about what's in the numbers that are quoted. The figures you hear are self-reported, unaudited and don't come with a list of ingredients about what's inside them. Consider the hot dog metaphor. It looks like it's full of meat, but if you analyze it, it could be there's a lot of water and filler making it appear plump.
Let's deconstruct Google's figure, since it has the biggest self-reported number, at the moment. The Google home page now reports "searching 3,307,998,701 web pages." What's inside that hot dog?
First, "web pages" actually includes some things that aren't web pages, such as Word documents, PDF files and even text documents. It would be more accurate to say "3.3 billion documents indexed" or "3.3 billion text documents indexed," because that's what we're really talking about.
Next, not all of those 3.3 billion documents have actually been indexed. Google may list some documents in search results based only on links it has seen pointing to them. The links give Google some very rough idea of what a page may be about.
For example, try a search for pontneddfechan, a little village in South Wales where my mother-in-law lives. You should see in the top results a listing simply titled "www.estateangels.co.uk/place/40900/Pontneddfechan". That's a partially indexed page, as Google calls it. It would be fairer to say it's an unindexed page, since in reality, it hasn't actually been indexed.
What chunk of the 3.3 billion has really been indexed? Google's checking on that for me. They don't always provide an answer to this particular question, however. Last time I got a figure was in June 2002. Then, 75 percent of the 2 billion pages Google listed as "searching" on its home page had actually been indexed. If that percentage holds true today, then the number of documents Google actually has indexed might be closer to 2.5 billion, rather than the 3.3 billion claimed.
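The back-of-the-envelope estimate above can be written out as a small sketch. It assumes, as the paragraph does, that the 75 percent share reported in June 2002 still holds; that assumption is unconfirmed, and the function and variable names here are purely illustrative.

```python
def estimated_indexed(reported_total: int, indexed_share: float) -> int:
    """Estimate how many documents are fully indexed, given a self-reported
    total and an assumed share of that total that was actually indexed."""
    if not 0.0 <= indexed_share <= 1.0:
        raise ValueError("indexed_share must be between 0 and 1")
    return round(reported_total * indexed_share)


# Figure shown on Google's home page, as quoted in the article.
reported = 3_307_998_701
# Share from June 2002; carrying it forward to today is an assumption.
share_june_2002 = 0.75

estimate = estimated_indexed(reported, share_june_2002)
print(f"{estimate:,} documents, or about {estimate / 1e9:.1f} billion")
```

The point is not the exact result but how sensitive the headline number is to one unpublished percentage: change the assumed share and the "real" index size moves by hundreds of millions of documents.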
But wait! The supplemental index has yet to be counted. Sorry, we can't count it, as Google isn't saying how big it is. Certainly it adds to Google's overall figure, but how much is a mystery.
Let's mix in some more complications. For HTML documents, Google only indexes the first 101K that it reads. Given this, some long documents may not be totally indexed -- so do they count as "whole" documents in the overall figure? FYI, Google says only a small minority of documents are over this size.
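To make the truncation question concrete, here is a rough sketch of how much of a long HTML document would fall inside a fixed cutoff. It assumes "101K" means 101 * 1024 bytes, which Google has not spelled out here, and the names are hypothetical rather than anything Google publishes.

```python
# Assumed cutoff: "101K" read as 101 * 1024 bytes. This is a guess at the unit.
ASSUMED_LIMIT_BYTES = 101 * 1024


def indexed_fraction(document_bytes: int, limit: int = ASSUMED_LIMIT_BYTES) -> float:
    """Return the share of a document that falls within the indexing cutoff."""
    if document_bytes <= 0:
        return 1.0  # nothing to truncate
    return min(document_bytes, limit) / document_bytes


# A document twice the assumed cutoff would be only half covered,
# yet it still counts as one whole document in a size figure.
print(indexed_fraction(2 * ASSUMED_LIMIT_BYTES))
```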
OK, we've raised a lot of questions about what's in Google's size figure. There are even more we could ask -- and the same questions should be directed at the other search engines, as well. AllTheWeb's 3.2 billion figure may include some pages only known by seeing links and might include some duplicates, for example. But instead of asking questions, why not just test or audit the figures ourselves?
That's exactly what Greg Notess of Search Engine Showdown is especially known for. You can expect Greg will probably take a swing at these figures in the near future -- and we'll certainly report on his findings. The last test was done in December. His test involves searching for single word queries, then examining each result that appears -- a time-consuming task. But it's a necessary one, since the counts from search engines have often not been trustworthy.
Grow, But Be Relevant, Too
I'm certainly not against index sizes growing. I do find self-reported figures to be useful as well, at least as a means of figuring out who is roughly in the same range. Maybe Google is slightly larger than AllTheWeb or maybe AllTheWeb just squeaks past Google -- the more important point is that both are without a doubt well above a small service like Gigablast, which has only 200 million pages indexed.
However, that's not to say that a little service like Gigablast isn't relevant. It may very well be, for certain queries. Indeed, Google gained many converts back when it launched with a much smaller index than the established major players. It was Google's greater relevancy -- the ability to find the needle in the haystack, rather than bury you in straw -- that was the important factor. And so if the latest size wars should continue, look beyond the numbers listed at the bottom of the various search engine home pages and consider instead the key question: is the search engine finding what you want?
By the way, the baby of the current major search engine line-up, Teoma, did some growing up last month. The service moved from 500 million to 1.5 billion documents indexed.
Paul Gardi, vice president of search for Ask Jeeves, which owns Teoma, wants to grow even more. He adds that Teoma is focused mainly on English language content at the moment -- so the perceived smaller size of Teoma may not be an issue for English speakers. Subtract non-English language pages from Teoma's competitors, and the size differences may be much less.
"Comparatively speaking, I would argue that we are very close to Google's size in English," Gardi said.