Who's The Biggest Of Them All?
From The Search Engine Report
Nov. 1, 1999
AltaVista is now claiming to have the largest index of the web, at 250 million pages. But is its claim true? Does the service deserve the title of biggest? Probably -- or at the very least, it shares the title with FAST Search. But it's not possible to say so definitively.
Before exploring the issue, let me repeat my usual caution that being biggest does not mean having the best results. I definitely see advantages to having index sizes grow -- it does mean that we are more likely to find unusual or obscure information. But fixating just on numbers can be misleading as to the actual quality of a service.
Now how do we prove the size of a search engine's index? One technique is to search for a word you know does not exist on any page in the index. For instance, a search for "dffjkdjkf" at Northern Light shows no matches. That means if I do a Boolean "NOT dffjkdjkf" search, I should be shown a count of ALL the pages in Northern Light's index, since all of them fit that search criterion.
This works well -- Northern Light tells me it has 189,060,458 pages indexed. But this technique doesn't work at those services seriously competing against Northern Light in the size game.
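The logic behind the "NOT" trick can be shown with a minimal sketch. It assumes a toy in-memory index, and the three pages and the helper names (build_index, search_not) are invented for illustration. It is not how Northern Light or any other service actually works.

```python
# Toy in-memory index: maps each word to the set of page IDs containing it.
# The pages and their contents are invented for illustration only.
pages = {
    1: "cheap travel deals",
    2: "search engine size test",
    3: "travel guide to toronto",
}

def build_index(pages):
    """Build a word -> set-of-page-IDs mapping from the page texts."""
    index = {}
    for page_id, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(page_id)
    return index

def search_not(index, all_ids, term):
    """Return every page that does NOT contain `term`."""
    return all_ids - index.get(term, set())

index = build_index(pages)
all_ids = set(pages)

# A nonsense word matches nothing...
nonsense_hits = index.get("dffjkdjkf", set())
# ...so negating it matches every page in the index.
not_nonsense = search_not(index, all_ids, "dffjkdjkf")
# Negating a real word only returns the pages that lack it.
not_travel = search_not(index, all_ids, "travel")
```

Here len(not_nonsense) equals the total number of pages, which is why the count returned for such a query can stand in for the size of the index, provided the service supports a standalone NOT search and reports an exact count.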
So what else can be done? You can search for various topics, then compare the total hit count. After all, if one search engine reports 7 million hits for "travel" and a competing one that's supposed to be the same size only finds 3 million matches, you have reason to believe maybe the second service has a smaller index than it claims.
There are two problems here. First, some search engines cluster their results and do not let you turn clustering off. That means there's no way to get an accurate count of ALL the pages actually found for a query. Second, some search engines, such as Lycos, do not report a hit count at all, while others, such as AltaVista, report only an approximate one.
That leaves only one real solution -- to run queries for extremely obscure topics, so that you can easily verify the exact count. An example of this can be found via the URL below, which shows the method I used to try to verify AltaVista's size claim. That test told me that AltaVista and FAST Search seemed about equal. That isn't surprising, given that they both claim indexes in the 250 million page range, which would make them the largest search engines on the web.
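The arithmetic behind such a comparison can be sketched as follows. This is only an illustration, under the assumption that the hits for each obscure query have been counted by hand. The query strings, engine names, and numbers are all hypothetical, and they are not the figures from the actual test.

```python
# Hypothetical, hand-verified hit counts for a few obscure queries.
# Query strings, engine names, and numbers are made up for illustration.
verified_counts = {
    "obscure query 1": {"engine_a": 12, "engine_b": 11},
    "obscure query 2": {"engine_a": 4, "engine_b": 5},
    "obscure query 3": {"engine_a": 9, "engine_b": 9},
}

def total_hits(counts, engine):
    """Sum the verified hits for one engine across all test queries."""
    return sum(per_engine[engine] for per_engine in counts.values())

def relative_size(counts, engine, baseline):
    """Ratio of one engine's verified hits to another's.

    A ratio near 1.0 suggests the two indexes cover the sampled
    topics about equally; it does not prove absolute index size.
    """
    return total_hits(counts, engine) / total_hits(counts, baseline)

ratio = relative_size(verified_counts, "engine_a", "engine_b")
```

With only a handful of queries, a ratio like this is a rough indication at best. The more obscure queries sampled, and the more varied their topics, the more weight it can bear.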
Now let's complicate things. AltaVista has a terrible habit of "timing out." This means that during busy hours, it will search for a short period of time, then return whatever it has found -- even if there's more information lurking in the index.
So even though it might be biggest, or tied for biggest, you might not be querying everything it has available. Nor might AltaVista be alone in this -- other services have suggested that their competitors don't search as completely against their indexes as they could.
This brings us back to the value issue. Unless a query is really obscure, having more isn't helpful. Is anyone really going to look through more than a thousand matching web pages for any topic? No -- but they are certainly going to appreciate having the very best 10 or 25 or 50 of those pages for that topic.
Of course, size will continue to be a selling point for crawler-based search engines, and those services will want their claims to be verified -- as will their users. Fine, then give reviewers the tools they need. Allow results clustering to be turned off. Provide accurate counts, not approximate ones that can change depending on the time of day. With these two features, anyone can run comparative tests.
I'd also like to see all crawler-based search engines add a feature that lets you see the number of pages indexed from any particular web site. It would allow you to compare how deeply search engines crawl various web sites, which is another way to help verify size claims.
Search Engine Size Test
The Search Engine Report, Nov. 1, 1999
Shows the tests I did to check AltaVista's coverage against its claims.
Northern Light Claims Largest Index
The Search Engine Report, Feb. 2, 1999
More thoughts on the difficulty in auditing sizes.