Search Engine Sizes Scrutinized
From The Search Engine Report
April 30, 1998
In early April, the mainstream Internet press went nuts over a study in Science magazine that found no single search engine indexes everything on the web.
Visitors to Search Engine Watch know this isn't a new discovery. I've been reporting on it over the past two years, and the site has a page devoted to the topic of search engine size. The site's Search Engine EKGs also illustrate both search engine size and freshness.
In fact, anyone who received HotBot's press release last December on having the biggest index could have easily proven that search engines don't cover everything, without having to perform painstaking research. In it, HotBot announced it was now indexing 110 million of the 175 million pages it estimated to exist on the web. Simple division could tell any reporter that HotBot, the industry leader at the time, was only covering 63% of the web.
However, the prestige of a Science article on the topic grabbed headlines, which is a good thing as most search engine users are not educated about what goes on under the hood of their favorite service. Better education can help them make better choices. And the painstaking research was crucial in providing freshness ratings for each service, along with a new estimate of the size of the web.
Does Size Matter?
Before discussing the study, it's helpful to ask, does size matter? Yes and no. If you are looking for relatively obscure information, it's extremely helpful to have a service with a big index. It increases the odds that a service will bring back a match.
In contrast, a large index is not necessarily helpful for very general queries, which many users perform. In fact, a smaller index of pages drawn from select sites may be more useful.
Many of the major services made this "better not bigger" argument throughout 1997, when it was clear they weren't keeping up with the growth of the web. As noted, it is valid to a point. However, there is some degree of growth required for them to have a decent sample of what's out there. This issue is discussed in more depth on the "How Big Are The Search Engines" page within the site, linked to below.
Now for some specifics from the study. Researchers at the NEC Research Institute ran the same 575 queries on HotBot, AltaVista, Northern Light, Excite, Infoseek and Lycos. They then counted the matching pages, with a variety of constraints used. Duplicate pages weren't counted, the maximum query limits for each engine were not exceeded, and other controls were used to normalize across services.
Another control was to discard any page from the count if the exact search term did not appear. So if the search was for "crystal," then a page would not be counted unless the exact word "crystal" was found.
The problem with this is that a page may radically change after it is indexed. Listings may be days, weeks or even months out of sync with the actual page. Likewise, some sites deliver pages tailored to search engine spiders. A human visiting the site would see a completely different page.
This probably didn't impact results greatly, especially as the queries were scientific in nature, and thus reveal pages that were not likely to be created by webmasters swapping code. But it does point out a difficulty in conducting this type of research, given that the search engines, and the web itself, are not a controlled environment.
A better solution would be to do a count of pages retrieved from various web sites, but the problem here is that only AltaVista, HotBot and Infoseek allow this to be done with any degree of accuracy. This is something the Melee Indexing survey has tried to do.
Another solution is to track the pages retrieved from known web sites, which is what the Search Engine EKGs within Search Engine Watch do.
After the researchers filtered the results, they had in essence a giant pool of matching pages. They then looked at how many pages from this pool each search engine listed. HotBot covered the most, and its coverage was used as a baseline for estimating search engine size.
For example, AltaVista found only 81% as many pages as HotBot, so the researchers presumed that its index would only be 81% the size of the HotBot index. The researchers had HotBot's size from a recent press release, 110 million web pages. So they multiplied 110 million by 81% to get an estimate for AltaVista of 89 million web pages. A similar calculation was done for the other search engines in the study.
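The scaling step described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic as the article describes it, not the researchers' own code; the function name is invented for the example, and the only inputs are the two figures quoted above (HotBot's published 110 million pages and AltaVista's 81% relative coverage).

```python
def estimate_index_size(baseline_pages, relative_coverage):
    """Scale a baseline engine's published size by another engine's
    relative coverage (hypothetical helper, named for this example)."""
    return baseline_pages * relative_coverage


# Figures quoted in the article.
HOTBOT_PAGES = 110_000_000      # HotBot's published index size
ALTAVISTA_RELATIVE = 0.81       # AltaVista found 81% as many pages as HotBot

altavista_estimate = estimate_index_size(HOTBOT_PAGES, ALTAVISTA_RELATIVE)
print(f"AltaVista estimate: {altavista_estimate / 1e6:.0f} million pages")
```

The weakness of this approach is visible in the code: the estimate is only as good as the baseline figure and the assumption that coverage on the test queries reflects overall index size.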
The numbers aren't too far off from those published by the major search engines, except for Lycos.
The study estimated Lycos to have an index of 8 million web pages, while Lycos says its index is above 30 million. This lower estimate would haunt the service when the overall web coverage numbers were calculated.
(I felt that using HotBot as a baseline might skew the results somehow, so I did the same thing using Infoseek as the baseline, along with its published size of 30 million web pages. The numbers remained nearly the same.)
With estimates of each search engine's size in hand, the researchers then examined the overlap between the two largest services, HotBot and AltaVista, to extrapolate a size for the entire web: 320 million web pages.
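The article does not spell out the formula behind this extrapolation. One common way to estimate a total from the overlap of two samples is a capture-recapture calculation: if a certain fraction of one engine's pages also turn up in the other, that fraction is taken as the second engine's coverage of the whole web. The sketch below assumes that approach and assumes the two indexes are built independently, which may not hold in practice. The overlap counts are hypothetical, chosen only so the result lands near the 320 million figure quoted above; they are not numbers from the study.

```python
def estimate_web_size(size_a, b_pages_checked, b_pages_also_in_a):
    """Capture-recapture style estimate (an assumed method, see lead-in).

    The share of engine B's pages that engine A also indexes is treated
    as A's coverage of the whole web, so web size = size_a / coverage.
    """
    if b_pages_checked <= 0 or b_pages_also_in_a <= 0:
        raise ValueError("counts must be positive")
    coverage_of_a = b_pages_also_in_a / b_pages_checked
    return size_a / coverage_of_a


HOTBOT_PAGES = 110_000_000   # published size quoted in the article

# Hypothetical overlap counts, for illustration only.
ALTAVISTA_PAGES_CHECKED = 1_000
ALSO_FOUND_IN_HOTBOT = 344

web_size = estimate_web_size(
    HOTBOT_PAGES, ALTAVISTA_PAGES_CHECKED, ALSO_FOUND_IN_HOTBOT
)
print(f"Estimated web size: {web_size / 1e6:.0f} million pages")
```

If the two engines tend to index the same popular pages, the overlap is inflated and an estimate like this comes out too low.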
This size estimate was big news, because it exceeded by far most other estimates. In December, HotBot was saying the web was at 175 million web pages (they now estimate 200 million), while several other estimates were in this range or lower.
Finally, with a size estimate for the entire web, they returned to calculate percentages of the web covered by each search engine. This was straightforward division. HotBot had the best score, 110 million out of 320 million web pages, or 34% of the web. Lycos, estimated to have only 8 million web pages, came in last with a paltry 3% coverage.
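The division is easy to reproduce. The short Python sketch below uses only the estimates quoted in this article (index sizes in millions of pages and the 320 million web estimate); the variable and function names are made up for the example.

```python
WEB_SIZE_MILLIONS = 320  # the study's estimate of the whole web

# Index size estimates quoted in the article, in millions of pages.
estimated_sizes = {
    "HotBot": 110,
    "AltaVista": 89,
    "Lycos": 8,
}


def coverage_percent(index_millions, web_millions=WEB_SIZE_MILLIONS):
    """Share of the estimated web that an index of this size covers."""
    return 100 * index_millions / web_millions


coverage = {name: coverage_percent(size) for name, size in estimated_sizes.items()}

for name, pct in coverage.items():
    print(f"{name}: {pct:.1f}% of the web")
```

Shown to one decimal place, Lycos works out to 2.5%, which is the 3% quoted above once rounded up.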
The Lycos Problem
As you might expect, Lycos wasn't very happy. It put forth the argument that size isn't that important, but it also stated that the study's estimate for it was off. It reaffirmed to me, and others, that it has 30 million web pages indexed, if not more.
So what happened with Lycos? Are its published numbers to be believed? Quite possibly. If the count is indeed too low, the most likely culprit is in the queries that were used.
The study's queries were culled from among its staff, who are mostly scientists. Thus, these queries are more likely to retrieve pages from academic resources. If a search service does not index many pages from these places, such as universities, then it would naturally appear to have less coverage than a service that does.
Lycos falls right into this trap. It tends to index pages from "popular" sites. A site with lots of links pointing at it might get indexed in more depth, while a site that is not well publicized may be missed entirely. As you can expect, many university sites are not well publicized.
Were the same survey done with a different set of queries, an entirely different picture of coverage might appear. This is something that the authors readily acknowledge within the study. A more accurate headline for many articles might have been "Search engines fall short for scientists," since this study was primarily aimed at helping them search better.
Ironically, shortly after the Science article appeared, researchers at AltaVista's owner Digital released their study, one that it says did use a wide range of queries. It would have been interesting to see if Lycos performed better with this set, but the search engine was not included.
The Digital study put the size of the web at 275 million web pages in March 1998. It also found -- surprise! -- that AltaVista provided the best coverage at 40%, with HotBot a close second at 36%. Infoseek and Excite tied for third, at 16%.
The Issue Of Freshness
One excellent thing the NEC study did was rate the freshness of each search engine's index. This is very hard to quantify without the painstaking research of physically verifying the existence of each page listed.
Freshness is important, both because it saves people from wasting time and because it shows that search engines are reflecting the current information available on the web. Below are the percentages of bad links found in each service:
Northern Light: 5.0%
While Lycos deserves honors for its low score, it's AltaVista that most deserves to be singled out. It combines one of the largest web indices with a relatively low stale link rate, an excellent balance.
Unfortunately, a count of dead links from Yahoo was not included, and it could have been done. The percentage quite possibly would have exceeded HotBot's bad score. Numerous people complain about out-of-date sites, as well as listings, with the service.
What To Do?
Statistics about index size and freshness aside, the most important thing people want from a service is relevancy. That's difficult to quantify, because relevancy is subjective. Everyone has different expectations and searching styles.
For this reason, most people should think of a search service like a pair of shoes. Try different ones on, and wear the one that fits best. If you like the results you get, don't worry so much that another service may have a bigger index.
Also remember that people wear different shoes for different activities. It's the same with search services. If you are looking for news, use a specialty news service. If you are doing a general search, a service with a smaller index or hand-picked listings may help.
But for the serious researcher, the coverage and freshness numbers are extremely important. They help direct you toward the players more suitable for the serious searcher. This NEC study, my own studies and the search engines' own published sizes indicate these are HotBot, AltaVista and Northern Light. Not surprisingly, these services are among the most named when librarians are asked what they use.
The NEC study also suggests using metacrawlers as a good way to get the best coverage of the entire web, since no one search engine covers everything. See the Search Engine Watch metacrawler page for more information about these tools.
A graphical look at how large each search engine is, with trends over time. You will also find links to information about the Science magazine study, an update to that study, and a link to the similar study by Digital.
Provides an idea of how large and how fresh each search engine is.
Article within Search Engine Watch that explains the issues of index size in more depth. Does size really matter? Also has links to other resources of information, such as the Melee Survey.