Today, Google is dropping from its home page the famous count of pages in its index while simultaneously claiming it has the most comprehensive collection of web documents available to searchers. Yes, this is the long-expected counterblow to Yahoo's claim last month to have outdistanced Google. However, dropping the home page count is a positive move that I think helps defuse the entire size wars situation. That's because it divorces page counting from the notion of "proving" comprehensiveness, a move long overdue for the industry to make.
I've written before about the Yahoo-Google size fight that broke out in August, and I'm going to recount some of that history as part of this analysis. Suffice it to say, I've (along with Gary Price and Chris Sherman) had many conversations with both companies over the past few weeks. In pondering the various arguments and statements, I was ultimately left feeling how little counting pages, either self-reported index counts or those seen in response to actual queries, means in terms of whether a search engine is comprehensive.
I'm going to explain that more below with some examples, plus I'll dive in about the current Google news. However, I think the best thing to do is start back at the beginning. So I hope you'll indulge me with a little history both distant and more recent, because I think it will help explain the issues more.
The Bigger Is Better Attitude
Way, way back one century ago
Not long after the web began
AltaVista lived in the land of search
A fine example of a big search engine
AltaVista, AltaVista and rivals
Saying you were bigger could help attract users
AltaVista, AltaVista and rivals
Spent all of their days counting up pages
Apologies to Andrew Lloyd Webber and Tim Rice for playing with one of the Joseph And The Amazing Technicolor Dreamcoat songs. But I couldn't resist!
Last century, in December 1995 to be exact, AltaVista burst upon the search engine scene with what was at that time a giant index of 21 million pages, well above rivals that were in the 1 million to 2 million range. The web was growing fast, and the more pages you had, the greater the odds you really were going to find that needle in a haystack. Bigger did to some degree mean better.
That fact wasn't wasted on the PR folks. Games to seem bigger began in earnest. Lycos would talk about the number of pages it "knew" about, even if these weren't actually indexed or in any way accessible to searchers through its search engine. That irritated search engine Excite so much that it even posted a page on how to count URLs, as you can see archived here.
Size As Surrogate For Relevancy
While bigger initially DID mean better, that soon stopped being true once indexes grew from millions of pages to tens of millions. Bigger no longer meant better because for many queries, you could get overwhelmed with matches.
I've long played with the needle-in-the-haystack metaphor to explain this. You want to find the needle? You need to have the whole haystack, size proponents will say. But if I dump the entire haystack on your head, can you find the needle then? Just being biggest isn't good enough.
That's why I and others have been saying don't fixate on size since as long ago as 1997 and 1998. Bigger no longer meant better, regardless of the many size wars that continued to erupt. Remember, Google -- when it came to popular attention in 1998 and 1999 -- was one of the tiniest search engines at around 20 to 85 million pages. Despite that supposed lack of comprehensiveness, it grew and grew because of the quality of its results.
Why have the size wars persisted? Search engines have seen an index size announcement as a quick, effective way to give the impression they were more relevant. In lieu of a relevancy figure, size figures could be trotted out and the search engine with the biggest bar on the chart wins! See my Screw Size! I Dare Google & Yahoo To Report On Relevancy and In Search Of The Relevancy Figure articles for more about this.
The Yahoo-Google Dispute Of Aug. 2005
The latest size wars broke out last month, when Yahoo said on its blog that it now provided access to over 19 billion web documents. Yahoo had been silent on its index size since rolling out its own technology in early 2004. Now, we were given a figure -- and a figure over twice that claimed by Google on its home page.
The Yahoo post DID NOT claim that Yahoo was more comprehensive or bigger than any other search engine. But Yahoo did make this claim in an Associated Press article about the self-reported size increase:
"This is a great reason for more people to check us out," said Eckart Walther, the Yahoo vice president for products. "We are more comprehensive than anyone else out there."
That quote, more than anything else, was a red flag to Google. Fair to say, Google wants to be a leader in all things search. The idea that it might be second-best in comprehensiveness wasn't a statement it wanted left unchallenged.
Counting Pages Does Not Equal Measuring Comprehensiveness
But how do you prove comprehensiveness? I'll skip past some of the past attempts (you can read more here) and dive quickly into the study two students at NCSA did recently. In short, they looked at rare words. The idea is that the more matches you get for rare terms, the more comprehensive a search engine must be. If Yahoo really was more than twice as big as Google, it should come back with more than twice as many matches.
Unfortunately, the study has many low level flaws, as even the two students admit in a follow-up to it. It was skewing toward bringing back dictionary lists, rather than more "normal" documents useful to searchers. And were duplicate pages being checked? Was Yahoo somehow filtering out spam pages? To what depth were the pages indexed? Full text or only up to a certain length? See also Seth Finkelstein and Jean Véronis (here and here, linking to Google Translate versions from the French originals) for more issues with the study.
Beyond that, there was a higher level flaw. The students not unreasonably assumed that the 8 billion pages indexed claim on the Google home page was accurate. It was not. In talking with Google over the past weeks, it turned out that the home page count was its most conservative estimate of pages in the index. Beyond what was officially claimed were other documents, including "partially indexed" ones. These were documents that while not actually indexed might still come up in results and add toward a count.
In total, the actual Google index size was above 8 billion -- maybe as high as 9-11 billion pages and possibly a bit more. That higher count, unbeknownst to the students, meant the "gap" between Google and Yahoo counts would naturally be less. It wasn't necessarily that Yahoo was smaller than it claimed. It could also be that Google was bigger than it claimed.
As you can see, there were some low level and higher level flaws. But the biggest flaw was the entire idea that counting pages could equal measuring comprehensiveness. Years and years ago, that might have been more true. But today, especially in a period of syndicated content and much near-duplicated content, measuring comprehensiveness is a much more subjective task.
In other words, assume you search for a "rare" term and one search engine comes up with three matches while its rival comes up with only one. Is the "big" search engine three times better? Not until you look at the actual pages. What if those three pages are simply duplicates or near duplicates of each other? If that's the case, the counts aren't accurately reflecting comprehensiveness.
The Duplicate Content Issue
Let's take a look at some real-life examples to fully understand the problems with depending just on counts. I'll start first with data from Google's new blog search service. It handily keeps me updated with all posts in feeds that are linking to our Search Engine Watch Blog. As it turns out, this has been a great way to simply find those who are carrying my content.
Consider this page, from the Amazezing site. It is simply a summary of my article over here. The Amazezing site's summary potentially could have comments that would make it unique from my article. However, there aren't any. Carrying that page in a search engine's index, at the moment, really doesn't make the search engine any more comprehensive than if it just carried my original article. Nevertheless, if a search engine indexes both pages, then on a pure count basis it would seem twice as good as a search engine that only carries my article.
Head to the Amazezing site's home page. Now look at the Open Directory's home page. Seem familiar? The Amazezing site simply carries a copy of the Open Directory's listings. There's nothing wrong with that. The Open Directory encourages people to use its listings. But where's the unique content?
Look at the Open Directory's Anime Genres category. Now look at the corresponding page at Amazezing. The key difference? Amazezing has Google AdSense links, and the Open Directory doesn't. Carrying the Amazezing category page makes a search engine negligibly more comprehensive to a human eye. But on a pure count basis? Negligible becomes twice as good.
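To make the "negligible becomes twice as good" point concrete, here's a rough sketch in Python. It is not how any search engine is known to detect duplicates. It assumes made-up page texts, compares them by overlapping three-word "shingles" (Jaccard similarity) and treats anything at or above an arbitrary 0.8 similarity as the same document. The function names, the sample texts and the threshold are all mine, for illustration only.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def distinct_count(pages, threshold=0.8):
    """Count pages, treating near-duplicates (similarity >= threshold) as one."""
    kept = []
    for text in pages:
        s = shingles(text)
        if all(jaccard(s, other) < threshold for other in kept):
            kept.append(s)
    return len(kept)


# Hypothetical page texts: a directory category, a copy of it with ads
# appended, and an unrelated page.
original = ("anime genres action adventure comedy drama fantasy horror mecha "
            "mystery romance science fiction slice of life sports supernatural")
copy_with_ads = original + " ads by google"
unrelated = "opticians in kyrenia northern cyprus with contact details and opening hours"

pages = [original, copy_with_ads, unrelated]
raw = len(pages)                  # what a pure page count reports
distinct = distinct_count(pages)  # what a reader would call different pages
print(raw, distinct)
```

With these sample texts the raw count is three pages, but only two survive once the ad-laden copy is folded into the original. Move the threshold and the "size" moves with it, which is part of why counts from different search engines are so hard to compare.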
Counting Pages Indexed Per Site
How about perhaps looking at comprehensiveness in terms of pages from particular sites that are indexed? Surely a site that has fewer pages listed at one search engine than at another may mean the first search engine isn't very comprehensive.
Search Engine Watch reader Sam Davyson certainly felt that way, discovering recently Google had indexed nearly all of his 110 web pages while Yahoo had only five.
Then again, comprehensiveness may be in the eye of the beholder. Another reader, Patrick Mondout, has been frustrated that Yahoo has refused to list any pages other than the home page of his Super70s.com site. In contrast, Google has up to 62,000 pages listed (many of these, however, seem to be "link" only pages that haven't been indexed), and MSN has 3,000. He emailed me his view of the Yahoo comprehensiveness claim:
I was really peeved to see Yahoo trying to suggest that its index is bigger than Google's. Why? Because they steadfastly refuse to list me.
Yahoo's response when I followed up on this was that Mondout's site seemed to be mostly content scraped from Amazon or eBay, plus that he had excessive crosslinking. Mondout's response in turn was that he had plenty of unique content.
Readers can take a look around at the site and judge for themselves. However, there's no doubt that in some cases, a search engine might drop a site or content from a site for good reasons. In doing so, count numbers go down and ironically may make the search engine look less comprehensive when that's not the case. In other situations, of course, it will be the opposite.
Some Real Life Needle Finding
What would be a good way to measure comprehensiveness? Look at actual queries that searchers do that come back with no results or relatively few results. This would be much better than the "rare word" testing that's often been done because those types of searches are artificial in nature.
Actual queries reflect actual needle in the haystack hunts. Look at these types of failure or low count queries at Yahoo and see if Google comes back with matches. Do the same with Google queries at Yahoo. That would be an interesting test, but it would rely on data that the search engines aren't providing.
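If that data ever were available, the test might look something like the sketch below. Everything in it is hypothetical: the query names, the result counts and the `rescue_rate` function are invented for illustration, and "low count" is arbitrarily set at three or fewer matches. It also still only compares numbers. Someone would have to look at the actual pages to know whether the extra matches were any use.

```python
def rescue_rate(source_counts, rival_counts, low=3):
    """Share of the source engine's low-result queries (<= low matches)
    for which the rival engine returns more matches than the source did.

    Returns None if the source engine has no low-result queries.
    """
    hard = [q for q, n in source_counts.items() if n <= low]
    if not hard:
        return None
    rescued = sum(1 for q in hard if rival_counts.get(q, 0) > source_counts[q])
    return rescued / len(hard)


# Hypothetical result counts per real-world query at two engines.
engine_a = {"query 1": 0, "query 2": 2, "query 3": 150, "query 4": 1}
engine_b = {"query 1": 4, "query 2": 2, "query 3": 90, "query 4": 0}

a_rescued_by_b = rescue_rate(engine_a, engine_b)  # A's hard queries, tried at B
b_rescued_by_a = rescue_rate(engine_b, engine_a)  # B's hard queries, tried at A
print(a_rescued_by_b, b_rescued_by_a)
```

Running it both ways matters. An engine can rescue many of its rival's failed queries and still fail plenty of its own.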
In lieu of that, I'll give you a real life example from my end. A few weeks ago, the kids and I were playing Lego Star Wars on our Xbox, which is a great game. There was only one last "minikit" piece that we needed. We knew roughly where it was by using our minikit detector, but our efforts to actually find it came up with nothing.
I turned to search. I honestly can't remember at this point exactly which search engine I used and the exact search terms. I know, I know -- I should have saved all this! But I was playing a game at the weekend and having fun, only thinking of this as a search example later on.
As best I can reconstruct, I ended up with something like [lego star wars defence of kashyyyk minikit]. That's the UK spelling of defense, since we're playing the UK version of the game. I seem to recall getting practically no results on Google or Yahoo but one of them did get me what I was looking for.
Here's the search on Yahoo. Only three results, with the first page having the answer. That's an example of finding the needle in the haystack for me. Over at Google, here's the search there. Many more matches, over 100, though you can see some duplicates getting counted, such as these toward the end:
- www.sweet-cheats.com/?cat=13/ 1380k - Supplemental Result - Cached - Similar pages
- www.sweet-cheats.com/?cat=13&paged=1 1399k - Supplemental Result - Cached - Similar pages
- www.sweet-cheats.com/?cat=13/&paged=1 1399k - Supplemental Result - Cached - Similar pages
Those three results are all essentially the same page, just with slightly different URLs/ways to reach them. Still, Google DID find me the page with the right answer, right at the first page of its results. Needle in the haystack found, both at Google and Yahoo.
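Here's a small Python sketch of how those three URLs could be folded into one before counting. The rules are my own guesses for this one site, not anything Google or Yahoo has said it does: strip a trailing slash from parameter values, and treat paged=1 as the default first page. The `canonical` function is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def canonical(url):
    """Reduce a URL to a comparison key using two assumed, site-specific rules:
    trailing slashes in query values are noise, and paged=1 is the default."""
    if "://" not in url:
        url = "http://" + url  # the listings omit the scheme
    parts = urlsplit(url)
    params = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        value = value.rstrip("/")
        if (key, value) == ("paged", "1"):
            continue  # assumption: page 1 is the same as no page parameter
        params.append((key, value))
    params.sort()
    query = urlencode(params)
    path = parts.path or "/"
    return parts.netloc.lower() + path + ("?" + query if query else "")


urls = [
    "www.sweet-cheats.com/?cat=13/",
    "www.sweet-cheats.com/?cat=13&paged=1",
    "www.sweet-cheats.com/?cat=13/&paged=1",
]
keys = {canonical(u) for u in urls}
print(len(urls), "listed,", len(keys), "distinct")
```

Three listings, one page. A rule that is right for this site could easily be wrong for another, which is one more reason that two search engines counting "pages" may not be counting the same thing.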
Here's another real life example, fresh off the press. Today at lunch, my mother-in-law was over visiting. My wife noticed that her glasses had been bent. She'd sat on them accidentally and fixed them as best she could, but no eyeglasses place in the UK could repair them properly. That's because she'd bought them a couple of years ago when she was on vacation in Northern Cyprus.
I suggested she send them back for repair there, but she couldn't remember the name of the place. So my wife decided it was time to introduce her to web search (my mother-in-law is yet to get a computer or surf the web herself).
They tried a couple of searches first for [northern cyprus optician] and more specifically, on Google, for kyrenia northern cyprus optician that brought back 27 results. The second listing had the name of the place, Akay Optik, ringing a bell in my mother-in-law's head. Success!
The listed page actually no longer existed, as it turned out. Still, they could see the name of the optician in the listing on Google. Using the actual name, a search for akay optik got them to the right place. (When I looked, it also became clear why they didn't find the site more easily. It's a graphical site, rendering it largely invisible to search engines).
So in the end, that initial search with 27 results got the needle in the haystack. Over at Yahoo, the same search came up with 6 results and missed the needle we needed. Google came out in this instance as more comprehensive.
Proof Google is more comprehensive, both in numbers and in quality of actual results! Hold on. If I varied the search, Yahoo pulled through. A search for opticians in kyrenia brought up this page as number two, an excellent overview to services where Akay Optik was easily found (and yes, Google has it too).
How Do You Prove Comprehensiveness?
In the end, I hope some of the above has helped illustrate why counts alone can't be taken as proof of comprehensiveness. They are prone to all types of errors: how you define a page, how you define a duplicate page and how deeply each page is indexed, not to mention whether a page really is of a quality that expands comprehensiveness rather than just the count.
If you can't rely on counting pages as proof, then how does any search engine definitively prove that it is more comprehensive? I asked Yahoo about this in response to the AP quote mentioned above. The response came back that it was given in the context of counting and self-reported figures. If you believe counts equal comprehensiveness -- and you believed the counts both Yahoo and Google were giving at that time -- then they were the most comprehensive by THAT measure.
Aside from that, they could simply say that they were bigger than they were before and felt they were more comprehensive than they were before. Whether others found them to be more comprehensive remained up to them.
Today's Google Claim
That leads to today's claim by Google. First, it's making the claim as part of Google's seventh birthday celebration. I know -- some of you may recall Google's birthday is on September 7, but Google says it's celebrating the September event today.
Google's saying that as part of its birthday celebrations, it's now 1,000 times larger than it was when launched. I'm specifically not going to do the math as part of my "get away from the counts" attitude of this piece. But neither is Google giving a figure. It's only saying that it is three times larger than the closest competition (that's Yahoo, even though Google's not naming Yahoo).
Talking about today's announcement, Marissa Mayer, Google's director of consumer web products, also told me:
We're happy to say at this point we are the largest and most comprehensive by a large factor and have been for some time.
Counts Can't Be Compared
Proven how? Google's not releasing a count. Why not? It feels that the ability to come up with count comparisons is too difficult. It's not an apples-to-apples thing. Google doesn't know exactly how its competitors are counting documents, when they release counts. And for the record, that's exactly what Yahoo has been saying to me and other analysts like Charlene Li.
Eckart Walther, the aforementioned Yahoo vice president of products in the AP article I cited above, told me earlier this month:
We cannot deduce the basic documents they have in their index, and they cannot deduce the number of documents in our index.
Indeed, even the controversial study from the NCSA students covered this:
Although there is no direct way to verify the size of each search engine's respective index, the standard method to measure relative size was developed by Krishna Bharat and Andrei Broder in 1998.
So no player can tell exactly how big each other is in terms of counts, but they feel they can make some guesses at relative size. Google feels it is now three times relatively larger than Yahoo. But it is not saying it is three times more comprehensive by showing a count as proof of that.
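For those wondering how a relative size guess can be made without knowing either count, here's a simplified Python sketch in the spirit of the overlap approach the students cite. It is not the published method in full. It assumes you already have a fair random sample of pages from each search engine, which is the genuinely hard part, and that you've checked each sampled page against the other engine. The sample figures and the `relative_size` function are invented for illustration.

```python
def relative_size(a_pages_found_in_b, b_pages_found_in_a):
    """Estimate size(A) / size(B) from two overlap checks.

    Each argument is a list of booleans: for every page sampled from one
    engine, whether the other engine also had it. Because the overlap is
    the same set seen from both sides,
        size(A) * P(A page is in B) ~= size(B) * P(B page is in A)
    so size(A) / size(B) ~= P(B page is in A) / P(A page is in B).
    """
    frac_a_in_b = sum(a_pages_found_in_b) / len(a_pages_found_in_b)
    frac_b_in_a = sum(b_pages_found_in_a) / len(b_pages_found_in_a)
    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; the ratio is undefined")
    return frac_b_in_a / frac_a_in_b


# Hypothetical samples: 10 pages drawn from each engine.
a_in_b = [True] * 3 + [False] * 7   # only 3 of A's sampled pages are in B
b_in_a = [True] * 6 + [False] * 4   # 6 of B's sampled pages are in A

ratio = relative_size(a_in_b, b_in_a)
print(ratio)  # estimated size of A relative to B
```

In this made-up case A looks about twice the size of B. But notice the estimate inherits every problem covered earlier: a biased sample, or duplicates that one engine counts and the other folds together, will move the ratio without either index actually changing.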
Reprise: Counts Don't Equal Comprehensiveness
Even if Google did trot out count numbers, I've explained that wouldn't convince me, nor do I feel it should necessarily convince others. But if Google was upset over Yahoo's earlier claim not seeming to be proven, how can it then put out its own claim to be most comprehensive without backing?
"We believe the margin of difference is large enough that users should be do a few queries themselves and check it out," Mayer said. "If it's not a commonly occurring term, chances are they'll be able to see a difference themselves."
I agree with that. The proof is in the pudding, so to speak. Rather than another round of trotting out figures and third parties trying to see if rare word lists bring up more or less than would be expected, let's get the focus back on the quality of the results. Quality includes comprehensiveness. So if someone devises a test of real queries, things that don't involve rare words but instead rare information on the web, that's of interest.
Here's one more example of this from me. My wife and I love the watercolors of Annie Williams, a Welsh artist. We have a number of prints and one of her actual paintings. About two or three years ago, I tried to find out more about her on the web. I tried all the major search engines. There was nothing. Believe me, nothing. All my searching skills came to naught.
I just checked things out today. Here's a crafted query at Yahoo, where I've added and eliminated things to narrow in on Annie Williams, the artist. Only seven matches, but useful -- the first a gallery that I might want to follow up with about where I might find another exhibition of her works. And the third listing for spotjockey.com led me to this short but nice bio.
Over at Google, the same query comes back with 20 matches. A few more promising prospects, though a number of dead ends and blank pages. But the bio page is found directly on the second page of results, and there are other things interesting to explore.
Over at Ask Jeeves and MSN, I get three matches -- the same good page for the exhibition web site I found at Yahoo and Google, but that's the end.
So gut feeling for this query? Google slightly more comprehensive than Yahoo, but Yahoo not bad and ahead of Ask and MSN. FOR THIS QUERY ONLY! For other queries, or other wordings, things might change significantly. Thus the challenge of declaring an overall comprehensive winner.
So in the end, it comes back to what I, Gary Price, Chris Sherman and virtually any long-time writer of advice on search engines will have long told you. The major search engines are all great resources. They find lots of things. But you'll find they may be better for some things and not others because they don't have the same listings. Use different search engines and see for yourself the ones that fit best.
A Hearty Goodbye To Counts
I know some will be itching to test things out still by counting. I'll leave that to others. For myself, the dropping of the count from the Google home page is to be applauded. It's not been accurate as I've covered above, for one thing. But more important, it takes the counts out of the equation and puts the focus much more on quality, where it belongs. Any serious study of comprehensiveness, I certainly want to see that and review it. But I'm not going to miss time spent trying to figure out whether pure counting of pages -- rather than measuring comprehensiveness -- was done right. Neither should you!