Ah, summer. Time to play on the beach, head out on vacation and, if you're a search engine, announce to the world that you've got the largest index.
Around this time last year, AllTheWeb kicked off a round of "who's biggest" by claiming the largest index size. Now it's happened again, when AllTheWeb said last month that its index had increased to 3.2 billion documents, toppling the leader, Google.
Google took only days to respond, quietly but deliberately notching up the number of web pages listed on its home page that it claims to index. Like the McDonald's signs of old that were gradually increased to show how many customers had been served, Google went from 3.1 billion to 3.3 billion web pages indexed.
Yawn.
Actually, not yawn. Instead, I'm filled with Andrew Goodman-style rage (and that's a compliment to Andrew) that the search engine size wars may erupt once again. In terms of documents indexed, Google and AllTheWeb are now essentially tied for biggest -- and hey, so is Inktomi. So what? Knowing this still gives you no idea which is actually better in terms of relevancy.
Size figures have long been used as a surrogate for the missing relevancy figures that the search engine industry as a whole has failed to provide. Size figures are also a bad surrogate, because more pages in no way guarantees better results.
How Big Is Your Haystack?
There's a haystack analogy I often use to explain this, the idea that size doesn't equal relevancy. If you want to find a needle in the haystack, then you need to search through the entire haystack, right? And if the web is a haystack, then a search engine that looks through only part of it may miss the portion with the needle!
That sounds convincing, but the reality is more like this. The web is a haystack, and even if a search engine has every straw, you'll never find the needle if the haystack is dumped over your head. That's what happens when the focus is solely on size, with relevancy ranking a secondary concern. A search engine with good relevancy is like a person equipped with a powerful magnet -- you'll find the needle without digging through the entire haystack because it will be pulled to the surface.
Google's Supplemental Index
I especially hate when the periodic size wars erupt because examining the latest claims takes time away from other more important things to write about. In fact, it was a great relief to have my associate editor Chris Sherman cover this story initially in SearchDay last week (Google to Overture: Mine's Bigger). But I'm returning to it because of a twist in the current game: Google's new "supplemental results."
What are supplemental results? At the same time Google posted new size figures, it also unveiled a new, separate index of pages that it will query if it fails to find good matches within its main web index. For obscure or unusual queries, you may see some results appear from this index. They'll be flagged as "Supplemental Result" next to the URL and date that Google shows for the listing.
Google's How To Interpret Your Search Results page illustrates this, but how about some real-life examples you can try? Here are some provided by Google to show when supplemental results might kick in:
- "St. Andrews United Methodist Church" Homewood, IL
- "nalanda residential junior college" alumni
- "illegal access error" jdk 1.2b4
- supercilious supernovas
How do you get into the supplemental results, you might wonder? Google will find your web pages through the course of its normal web crawling, it says. But you really don't want to be in the supplemental results. In fact, seeing your pages labeled this way is essentially a sign that Google doesn't consider them important enough to be in the main index. It's better than nothing, but it's not something you want to hope for. Google is also not saying how often the supplemental index will be refreshed.
Two Web Page Indexes Not Better Than One
Using a supplemental index may be new for Google, but it's old to the search engine industry. Inktomi did the same thing in the past, rolling out what became known as the small "Best Of The Web" and larger "Rest Of The Web" indexes in June 2000.
It was a terrible, terrible system. Horrible. As a search expert, you never seemed to know which of Inktomi's partners was hitting all of its information or only the popular Best Of The Web index. As for consumers, well, forget it -- they had no clue.
It also doesn't sound reassuring to say, "we'll check the good stuff first, then the other stuff only if we need to." What if some good stuff for whatever reason is in the second index? That's a fear some searchers had in the past -- and it will remain with Google's revival of this system.
Why not simply expand the existing Google index, rather than go to a two tier approach?
"The supplemental is simply a new Google experiment. As you know we're always trying new and different ways to provide high quality search results," said Google spokesperson Nate Tyler.
OK, it's new, it's experimental -- but Google also says there are currently no plans to eventually integrate it into the main index.
Changes Under The Hood
Revisiting Inktomi, the company appears to have initially gone to a two index solution because its existing system may not have been able to handle one giant index -- despite Inktomi's long-standing claims to be scalable. It took until August 2002 for Inktomi to return to a single index situation. At that time, Inktomi also explained that it had completely reengineered its systems to make such a system possible and for it to grow significantly in the future.
Given this, one obvious reason why Google may now be using two indexes is that its current index can't handle more. In fact, Google Watch pushed this idea in June. For its part, Google utterly denied any such problems when I asked about it a few weeks ago.
As mentioned, Inktomi eventually solved the dual index issue by reengineering everything and moving to a new system. Google may be doing the same. Certainly the company has left enough public clues that something like this is going on. In addition to the new supplemental index, the normal monthly refresh is slowly being replaced with a continuous crawl of many documents.
Indeed, popular search forum WebmasterWorld.com was generally staggered by hundreds of posts every time the monthly Google Dance, or refresh, happened. That's been noticeably absent now, as more and more webmasters have reported content being updated regularly. WebmasterWorld.com founder Brett Tabke succinctly summed things up for one forum member looking for news of the next dance:
"Dude - the dance is dead," Tabke posted.
Deconstructing The Size Hot Dog
Much as I hate to, yeah, let's talk about what's in the numbers that are quoted. The figures you hear are self-reported, unaudited and don't come with a list of ingredients about what's inside them. Consider the hot dog metaphor. It looks like it's full of meat, but if you analyze it, it could be there's a lot of water and filler making it appear plump.
Let's deconstruct Google's figure, since it has the biggest self-reported number, at the moment. The Google home page now reports "searching 3,307,998,701 web pages." What's inside that hot dog?
First, "web pages" actually includes some things that aren't web pages, such as Word documents, PDF files and even text documents. It would be more accurate to say "3.3 billion documents indexed" or "3.3 billion text documents indexed," because that's what we're really talking about.
Next, not all of those 3.3 billion documents have actually been indexed. There are some documents that Google has never actually indexed. It may list these in search results based on links it has seen to the documents. The links give Google some very rough idea of what a page may be about.
For example, try a search for pontneddfechan, a little village in South Wales where my mother-in-law lives. You should see in the top results a listing simply titled "www.estateangels.co.uk/place/40900/Pontneddfechan." That's a partially indexed page, as Google calls it. It would be fairer to say it's an unindexed page, since in reality, it hasn't actually been indexed.
What chunk of the 3.3 billion has really been indexed? Google's checking on that for me. They don't always provide an answer to this particular question, however. Last time I got a figure was in June 2002. Then, 75 percent of the 2 billion pages Google listed as "searching" on its home page had actually been indexed. If that percentage holds true today, then the number of documents Google actually has indexed might be closer to 2.5 billion, rather than the 3.3 billion claimed.
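For anyone who wants to check that back-of-the-envelope estimate, here is a minimal Python sketch. It uses only the two figures quoted above, and it rests on the same untested assumption: that the 75 percent share from June 2002 still applies to today's home page count. The function and variable names are made up for illustration.

```python
# Figures quoted above. The 75 percent share dates from June 2002 and is
# only ASSUMED to still hold; Google has not confirmed a current figure.
REPORTED_TOTAL = 3_307_998_701  # "searching ... web pages" on the home page
INDEXED_SHARE = 0.75            # share actually indexed, as of June 2002


def estimate_fully_indexed(reported_total, indexed_share):
    """Return the estimated number of documents that were really indexed."""
    return reported_total * indexed_share


estimate = estimate_fully_indexed(REPORTED_TOTAL, INDEXED_SHARE)
print(f"Estimated fully indexed: {estimate / 1e9:.1f} billion documents")
```

If the real share has drifted since 2002, the estimate moves with it, so treat the result as a rough guide rather than a measurement.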
But wait! The supplemental index has yet to be counted. Sorry, we can't count it, as Google isn't saying how big it is. Certainly it adds to Google's overall figure, but how much is a mystery.
Let's mix in some more complications. For HTML documents, Google only indexes the first 101K that it reads. Given this, some long documents may not be totally indexed -- so do they count as "whole" documents in the overall figure? FYI, Google says only a small minority of documents are over this size.
To test this, I did a quick search for all the pages from imdb.com that are listed by AllTheWeb. Why AllTheWeb? Because you can restrict your search by file size there. With no file size restriction, I learned there were 701,148 pages listed. When I looked for only those over 101K, the number was a tiny 2,771 -- less than a percent. That was a test on only one site, and there may be further complications (such as AllTheWeb perhaps having its own page size limitation). But it does suggest that a 101K file size limit isn't a major problem.
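The "less than a percent" figure is easy to verify. The short Python sketch below simply divides the two AllTheWeb counts quoted above; the names are hypothetical, and the result says nothing about sites other than imdb.com.

```python
# Counts reported by AllTheWeb for imdb.com, as quoted above.
TOTAL_PAGES = 701_148      # no file size restriction
PAGES_OVER_LIMIT = 2_771   # only pages larger than 101K


def share_percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


oversized = share_percent(PAGES_OVER_LIMIT, TOTAL_PAGES)
print(f"Pages over 101K: {oversized:.2f}% of those listed")
```

The share works out to well under half a percent, which fits Google's statement that only a small minority of documents exceed the limit.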
What about duplicates? Are they counted in the overall totals? Let's go back to pontneddfechan, to see how this might be confusing. A search for that on Google shows 479 matches. The same search over on AllTheWeb shows 181 matches -- less than half the Google results. But now try to get to the last result at Google. You'll find that only 139 will be shown, followed by a message that Google has "omitted some entries very similar" to those already displayed. Choose the option to see all the results, and you'll discover plenty of duplicate pages that get uncovered.
Google does the right thing in "rolling up" these duplicates -- but perhaps the figure it reports for the query ought to be the 139 figure, rather than the 479. As for whether duplicates like this make up part of the 3.3 billion figure cited, I honestly don't know, at the moment.
OK, we've raised a lot of questions about what's in Google's size figure. There are even more we could ask -- and the same questions should be directed at the other search engines, as well. AllTheWeb's 3.2 billion figure may include some pages only known by seeing links and might include some duplicates, for example. But instead of asking questions, why not just test or audit the figures ourselves?
That's exactly what Greg Notess of Search Engine Showdown is especially known for. You can expect Greg will probably take a swing at these figures in the near future -- and we'll certainly report on his findings. The last test was done in December. His test involves searching for single word queries, then examining each result that appears -- a time-consuming task. But it's a necessary one, since the counts from search engines have often not been trustworthy.
For example, try the now classic search for "the" on Google, and you'll find it reports 5.2 billion pages indexed for that word -- but Google's home page says it has only indexed 3.3 billion. (FYI, back at the end of June, Google was reporting 3.1 billion pages indexed but 3.7 billion matches for a search on "the.")
Google's response when I asked about this at the end of June was that the sizes it reports in response to searches are simply estimates, and the size it lists on its home page may not reflect whether the Google crawler has indexed more than usual in a given period.
ResourceShelf's Gary Price found similar count oddities when he tried running searches for words and eliminating particular web sites. But that's OK, Gary, I can top those.
A search for "prince charles" (as a phrase) brings up 287,000 results. Now see the fourth listing, for "Prince Charles Pipe Band"? Let's eliminate it with this search: "prince charles" -"pipe band". Great -- all gone, but even though we subtracted pages from the original search, the count has now gone up to 373,000 results!
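Why is that count an oddity? Excluding a term can only remove documents from a result set, so the restricted count should never exceed the original. Here is a minimal Python sketch of that sanity check, using the two counts quoted above. The function names are invented for illustration, and this is not how Google computes its estimates.

```python
# Result counts quoted above for the two Google searches.
BASE_COUNT = 287_000        # "prince charles"
RESTRICTED_COUNT = 373_000  # "prince charles" -"pipe band"


def is_consistent(base_count, restricted_count):
    """An exclusion can only shrink a result set, never grow it."""
    return restricted_count <= base_count


def percent_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return 100 * (new - old) / old


if not is_consistent(BASE_COUNT, RESTRICTED_COUNT):
    change = percent_change(BASE_COUNT, RESTRICTED_COUNT)
    print(f"Inconsistent: count rose by about {change:.0f}% after excluding a term")
```

A check like this can't say which of the two numbers is wrong, only that they can't both be right.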
I asked Google about this problem about two weeks ago. The response was again that counts are just estimates. However, Google also said that they're continuing to do engineering work to upgrade some of their systems, which is making some of the counts get thrown off.
Grow, But Be Relevant, Too
I'm certainly not against index sizes growing. I do find self-reported figures to also be useful, at least as a means of figuring out who is approximately near each other. Maybe Google is slightly larger than AllTheWeb or maybe AllTheWeb just squeaks past Google -- the more important point is that both are without a doubt well above a small service like Gigablast, which has only 200 million pages indexed.
However, that's not to say that a little service like Gigablast isn't relevant. It may very well be, for certain queries. Indeed, Google gained many converts back when it launched with a much smaller index than the established major players. It was Google's greater relevancy -- the ability to find the needle in the haystack, rather than bury you in straw -- that was the important factor. And so if the latest size wars should continue, look beyond the numbers listed at the bottom of the various search engine home pages and consider instead the key question: is the search engine finding what you want?
By the way, the baby of the current major search engine line-up, Teoma, did some growing up last month. The service moved from 500 million to 1.5 billion documents indexed.
Paul Gardi, vice president of search for Ask Jeeves, which owns Teoma, wants to grow even more. He adds that Teoma is focused mainly on English language content at the moment -- so the perceived smaller size of Teoma may not be an issue for English speakers. Subtract non-English language pages from Teoma's competitors, and the size differences may be much less.
"Comparatively speaking, I would argue that we are very close to Google's size in English," Gardi said.