More on the Total Database Size Battle and Googlewhacking With Yahoo

The total web database size claims that Yahoo released this week continue to have people talking. It’s so not a big deal, at least for me, the searcher. For Gary, the search industry watcher, it’s interesting to see another round of database size wars up and running, but it’s still not a big deal in the searching sense. We’ve been through this before. What Total Size Wars 2005 illustrates is that pr/bragging rights and mindshare are so crucial in today’s search business.

Yesterday, Google went on the record with John Battelle:

“Our scientists are not seeing the increase claimed in the Yahoo! index. The data we have doesn’t support the 19.2 (billion page) claim and we’re confused by that.”

Both Google and Yahoo officials have also talked to the Search Engine Watch team. In fact, during the GoogleDance the other night, Danny, Chris, and I spent about 90 minutes chatting and looking over some of the same research that they also shared with Battelle. I’m sure Danny will have more to say about our meeting next week.

Total size battles are nothing new. For example, back in the summer of 2003, we had something similar go down with total size claims between Google and AllTheWeb. Of course, the search biz in 2003 wasn’t what it is today.

So, what are possible next steps or is this something that will be repeated over and over again and not just between Google and Yahoo?

First, remain calm, all is well. Enjoy the weekend.

Second, if total size claims are so important to Yahoo, Google, and others, how about these companies sitting down and agreeing to have an independent third party audit and certify future size claims? I just wonder if each company would be willing to disclose the info a third party would need to make accurate verifications. Btw, in the short term, I’m also hoping that noted search engine expert Greg Notess will run some tests and offer his search size estimates.

Again, do total size numbers mean anything in the first place to the searcher? No. We all know what does matter. However, as we’ve seen this week, this number sure means something in the pr/marketing/branding/press coverage game.

Those of you who decide to do your own size tests need to remember that without knowing precisely what each company is counting, it’s very difficult to compare apples with apples.

For example, to get accurate total size numbers it would be useful to know how each engine handles the following and other variables:

  • Precisely how do Google and Yahoo handle stemming?
  • How does each company count non-html docs in their totals?
  • Does each company have a cutoff for the amount of content on a page or in a document they crawl and index?
  • Since both companies include links on results pages that don’t explicitly contain each and every search term, it would be important to understand how Google and Yahoo handle anchor text.
  • How is punctuation handled?
  • The differences between what is seen on a results page and the total database. Do dupes count in the overall total? What about mirrored pages?
  • How does each handle discovered but uncrawled urls in their totals?
  • What does spam filtering mean to all of this?
  • When searching are you hitting the entire database? Could some portions of the dbase be inaccessible at the instant you click the search button?
  • What was the size of the Yahoo database before this week’s increase?

Also, attempting to make estimates simply by running a bunch of searches on both engines and only looking at total page estimates is not productive. The page estimates listed at the top of a web results page are not accurate, especially as a measurement tool. To get total page counts, you’re going to have to literally count each and every result and know how each engine handles the variables listed above (and others). Very time consuming.

Googlewhacking with Yahoo

Since a Googlewhack is only a Googlewhack if it returns just one result, I thought it would be interesting, but far from scientific, to see if that one result would or wouldn’t appear if you ran the same Googlewhack-producing search with Yahoo. Would more results appear? Fewer? Since these searches produce just one result, counting would likely be easy. I selected 20 recent “whacks” from the current Googlewhack stack.

I’ll leave the interpretation, if any, up to you. For me, the following was just a fun exercise and proves nothing. Btw, I wonder if Googlewhack founder, Gary Stock, and his crew of “whackers” are going to start Yahoowhacks?

A Googlewhack equals a Google search producing one result.

10 Googlewhacks were not found (zero results) in Yahoo.
6 Googlewhacks were found in Yahoo. In other words, the same single result at both Yahoo and Google.
4 Googlewhacks found more than one unique result at Yahoo.
++ 3 Googlewhacks searched with Yahoo found one additional result.
++ 1 Googlewhack searched with Yahoo found 6 additional results.

Specific Searches
+ tartiest dieing
Not found in Yahoo

+ intergalactically janitorial
Not found in Yahoo

+ icebreaking snaggletooth
Not found in Yahoo

+ poboys moneybag
Not found in Yahoo

+ pangea anthropocentrically
Not found in Yahoo

+ bedtimes downshifter
Not found in Yahoo

+ obverse tartiness
Not found in Yahoo

+ hubristic sweatsuits
Not found in Yahoo

+ overload underkills
Not found in Yahoo

+ tailgated winnebagoes
Not found in Yahoo

+ supercharged disestablishmentarianism
2 results found in Yahoo. One unique

+ wildebeest colonoscopies
7 results found in Yahoo. Six unique

+ fictionizing rumsfeld
2 results found in Yahoo. One unique

+ arachnophobic swashbuckler
2 results found in Yahoo. One unique

+ semipublicly popularized
Same result found in both

+ gifting twoonies
Same result found in both

+ cruddiness pretentiousness
Same result found in both

+ congratulating schoolchilds
Same result found in both

+ fabulator marsupial
Same result found in both

+ overaggressively tapped
Same result found in both
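
For anyone who wants to double-check my arithmetic, here is a minimal Python sketch that rebuilds the summary tallies from the 20 searches listed above. It is just bookkeeping, not a measurement method. The names (yahoo_results, classify) are my own for illustration, and I'm assuming that "Same result found in both" means exactly one Yahoo result and that "One unique" or "Six unique" means that many results beyond the single Google one.

```python
from collections import Counter

# Number of results Yahoo returned for each Googlewhack in the list above.
# Assumption: "Not found" = 0, "Same result found in both" = 1,
# otherwise the stated Yahoo result count.
yahoo_results = {
    "tartiest dieing": 0,
    "intergalactically janitorial": 0,
    "icebreaking snaggletooth": 0,
    "poboys moneybag": 0,
    "pangea anthropocentrically": 0,
    "bedtimes downshifter": 0,
    "obverse tartiness": 0,
    "hubristic sweatsuits": 0,
    "overload underkills": 0,
    "tailgated winnebagoes": 0,
    "supercharged disestablishmentarianism": 2,
    "wildebeest colonoscopies": 7,
    "fictionizing rumsfeld": 2,
    "arachnophobic swashbuckler": 2,
    "semipublicly popularized": 1,
    "gifting twoonies": 1,
    "cruddiness pretentiousness": 1,
    "congratulating schoolchilds": 1,
    "fabulator marsupial": 1,
    "overaggressively tapped": 1,
}


def classify(count):
    """Bucket a Yahoo result count relative to the single Google result."""
    if count == 0:
        return "not found"
    if count == 1:
        return "same"
    return "more"


# Tally of the three top-level buckets.
tally = Counter(classify(n) for n in yahoo_results.values())

# For the "more" bucket, how many results Yahoo had beyond Google's one.
additional = Counter(n - 1 for n in yahoo_results.values() if n > 1)

print(len(yahoo_results), "searches")
for bucket in ("not found", "same", "more"):
    print(bucket, tally[bucket])
for extra, searches in sorted(additional.items()):
    print(searches, "search(es) with", extra, "additional result(s)")
```

Twenty searches is far too small a sample to say anything about either database, which is why the sketch stops at counting.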

Notes: I’ll run a new random Googlewhack test next week and report if I find anything different or interesting. Also, no word on whether I was searching the “new” larger Yahoo web database. OK, that’s it. Remember, TGIF!

Postscript: I just noticed that John posted a few more comments on his blog, including the following. He writes, “Would I be surprised if Google announced shortly that its index was magically up to, oh, 22 billion or so? No, I would not.” I agree with John. Total size numbers from all engines are just claims. They’ve always just been claims. To move beyond this, some type of agreed-upon standards and methods are needed. Otherwise, this week’s headlines will likely happen over and over again.
