As Chris blogged, Google has raised the stakes in the search engine size wars by claiming an index of 8 billion pages. Microsoft had planned to seize the title of biggest search engine by announcing today that it has indexed 5 billion pages. That would have put it above the 4.2 billion mark Google has self-reported for about a year.
We've been through these size wars before. They erupt any time a search engine seeks some type of concrete evidence that it is better than another. Size figures don't "prove" this at all, of course. A search engine with lots of pages might actually be worse than one with fewer, if the index isn't refreshed often or if the relevancy simply isn't there.
My Search Engine Sizes page lays out the past size wars we've had, for the curious, along with plenty of reference material and past articles. The figures haven't been updated since Size Wars IV in 2003, so I'll be off to fix that soon. Meanwhile, here's where we stand:
|Search Engine|Reported Size|Page Depth|
|Ask Jeeves|2.5 billion|101K+|
Now time for all the caveats!
Reported Size Figures
Reported Size is just that -- whatever the search engines claim. With Google, this has sometimes included what they call "partially-indexed" pages or what would more fairly be called link-only pages. These are pages Google knows about solely through links pointing at them. Nothing on the pages themselves has been indexed.
Typically, search engine sizes shouldn't count duplicate pages, spam pages and so on. But we're not auditing here, so they might.
As for Yahoo, it's trying to stay out of the size game. When it launched its own search technology earlier this year, it refused to provide a size figure, instead saying it was "comparable" to others. The company is sticking with this.
"As in the past, we are not disclosing the size of our index for competitive reasons. That said, we believe our index is highly competitive. Search quality is comprised of a variety of factors including freshness, relevance etc. and we continue to deliver high quality results for our consumers to ensure that they are able to find the best results for what they are looking for," said spokesperson Stephanie Iwamasa.
I both love and hate Yahoo for this. I love the idea of not getting into the size wars again, which are never that productive. But I also hate the idea we don't have a clue where they are at. I want those numbers -- I just want the search engines to put them out without the hype.
Since Yahoo won't release a figure, I'm putting them at 4.2 billion. That was the figure Google had long claimed -- and I read Yahoo's past statements of being comparable to mean they were at least level with Google's figure.
Page Depth Amount
Page Depth is much more interesting. So you've got tons of pages -- do you actually index the full text on them, every word? That used to be how some search engines operated. Google almost singlehandedly made it acceptable to only partially index some pages.
In the past, if a page was longer than 101K, Google indexed only the first 101K of text and ignored everything else. My assumption right now is that Google still operates this way. If not, we'll post an update as we learn more.
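To make the page depth idea concrete, here is a minimal sketch of depth-limited indexing. It is an illustration only, not Google's actual method: the function name `indexed_terms` is hypothetical, and reading "101K" as 101 * 1024 bytes is an assumption.

```python
import re

# Assumption: "101K" is read here as 101 * 1024 bytes.
DEPTH_LIMIT_BYTES = 101 * 1024


def indexed_terms(page_text, limit=DEPTH_LIMIT_BYTES):
    """Return the lowercase words found in the first `limit` bytes of a page.

    Hypothetical helper: anything past the limit is simply never seen.
    """
    raw = page_text.encode("utf-8")[:limit]
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    visible = raw.decode("utf-8", errors="ignore")
    return set(re.findall(r"[a-z0-9]+", visible.lower()))


# A page of about 120,000 bytes with a unique word at the very bottom.
page = "alpha " * 20000 + "zebra"
terms = indexed_terms(page)

print("alpha" in terms)  # word near the top of the page
print("zebra" in terms)  # word past the depth limit
```

In this sketch a search for "zebra" would never match the page, even though the page counts toward the index size. Note that a hard byte cut can also leave a partial word at the boundary.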
MSN's page depth figure comes from statements they gave during the Meet The Crawlers session at Search Engine Strategies San Jose last August. It may not be true for the current release. I'll double-check this and update if it has changed.
Yahoo's figure is from that same session. Ask Jeeves declined to state a figure during the session, going with, "We're in the ballpark of others." So, I've made them equal to Google, for now.
It's pretty easy to figure this stuff out. You just find a big long page, then do searches to see which search engines find text at the bottom of it. Tara Calishain did this recently to Yahoo and found Yahoo actually picking up some pages to a depth of 800K.
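The test described above can be sketched roughly as follows. This is a hedged illustration, not the procedure Tara used: the helper names `depth_probes` and `deepest_hit` and the 50K step are assumptions, and the actual searching is still done by hand, with each phrase entered in quotes at each search engine.

```python
def depth_probes(page_text, step_bytes=50 * 1024, words_per_probe=6):
    """Pick one short phrase roughly every `step_bytes` bytes down the page.

    Returns a list of (byte_offset, phrase) pairs to search for by hand.
    """
    raw = page_text.encode("utf-8")
    probes = []
    for offset in range(step_bytes, len(raw), step_bytes):
        tail = raw[offset:].decode("utf-8", errors="ignore").split()
        # Skip the first token: the byte offset may have cut it mid-word.
        phrase = " ".join(tail[1:1 + words_per_probe])
        if phrase:
            probes.append((offset, phrase))
    return probes


def deepest_hit(probes, found_phrases):
    """Return the largest offset whose phrase the search engine found."""
    hits = [offset for offset, phrase in probes if phrase in found_phrases]
    return max(hits, default=0)
```

Calling `depth_probes` on the text of a long page gives the phrases to try. Once you have noted which ones a search engine returns the page for, `deepest_hit` gives a lower bound on that engine's page depth. Phrases should be distinctive enough that a match clearly points to the test page.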
Greg Notess of Search Engine Showdown is the historic star of this type of auditing. In the past, Greg has run tests to determine whether search engine sizes measure up to what is reported. If he jumps back into this, we'll let you know. We may also jump in on the page depth side. Trying to audit the index size is much more time-consuming.
In the meantime, I'll leave you with the refrain that Chris, Gary and I all agree with. Search engine size figures are useful but by no means should they be taken as a surrogate for a relevancy figure. Google having an index twice as large as Yahoo does NOT mean it is twice as good.
For more, see my past article, Search Engine Size Wars & Google's Supplemental Results. It goes into even more depth on all the issues relating to search engine index size and the games that can be -- and have been -- played.
Our recent Search Memories article is also a good read for those who want to hear some first-hand accounts of the first search engine size wars that were sparked by AltaVista.