In 1998, a landmark study published in Science magazine found that search engines fell well short of listing everything on the web. Last month, a follow-up to this study was published in Nature magazine. Once again, search engines were found to contain only a small percentage of the web pages available for indexing.
Both studies made for great headlines about how search engines seemingly fail users, but the number of pages a search engine has indexed has nothing to do with whether it is any good. Size alone doesn't matter.
Don't believe me? Go to your favorite search engine and look for a company, any company. Was the company's web site listed first? That's what many users want to have happen. This example is just one of many that illustrate the value of relevancy over sheer size. Improving relevancy means being a smarter search engine, not just a bigger one.
Even the study itself explains that bigger doesn't mean better: "There are diminishing returns to indexing all of the web, because most queries made to the search engines can be satisfied with a relatively small database," the study's authors write.
In fact, study coauthor Dr. Steve Lawrence isn't saying search engines are bad. Lawrence credits them for making more information available to the public than ever before, but he thinks they could do better.
"The amount of information that you can quickly and efficiently search for using search engines has been increasing, but what has been decreasing is the amount you can search for versus what you could potentially search for," Lawrence said.
So having provided some balance to the headlines you may have read recently, let's examine details from the study in more depth.
Conducted by scientists at the NEC Research Institute, the study found that there were 800 million indexable pages and 180 million images on the web, as of February 1999. "Indexable" means pages that weren't hidden behind password protection, excluded from indexing by robots.txt files, locked away in databases or otherwise inaccessible to search engines.
The first study found that there were 320 million pages as of December 1997, so it sounds as if the web has more than doubled in size in just over a year. However, the two studies cannot be compared fairly because they used completely different methods of estimating the size of the web.
Previously, the size estimate was derived by comparing the overlap between results at different search engines to extrapolate an overall figure for the web. In the latest study, the researchers used a variety of techniques to determine how many web servers were publicly available (2.8 million) and the mean number of pages per server (289). Multiplying these two figures is where the 800 million web pages count comes from.
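As a rough illustration of the two approaches, here is a minimal Python sketch. The server count and pages-per-server figures come from the study; the function names are made up for illustration, and the overlap function is a generic capture-recapture style estimate that only conveys the general idea, not the researchers' actual procedure.

```python
def estimate_by_servers(num_servers: float, mean_pages_per_server: float) -> float:
    """Latest study's approach: servers multiplied by mean pages per server."""
    return num_servers * mean_pages_per_server


def estimate_by_overlap(size_a: float, size_b: float, overlap: float) -> float:
    """Generic overlap-based estimate (an assumption, not the study's exact method).

    If engine A holds size_a pages, engine B holds size_b pages and they share
    `overlap` pages, the total is extrapolated as size_a * size_b / overlap.
    """
    if overlap <= 0:
        raise ValueError("overlap must be positive to extrapolate a total")
    return size_a * size_b / overlap


# Figures reported in the study: 2.8 million servers, 289 pages per server.
web_size = estimate_by_servers(2.8e6, 289)
print(f"{web_size / 1e6:.0f} million pages")  # prints "809 million pages"
```

The product works out to about 809 million, which is where the rounded figure of 800 million comes from. Because the two functions rest on entirely different inputs, there is no reason to expect them to produce comparable numbers.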
From here, the next step was to determine how much coverage of the web each search engine provided.* You need a known starting point, and Northern Light was selected. The researchers did a search that told them Northern Light had an index of 128 million web pages.** Then they divided Northern Light's index size by the entire web size to find that Northern Light covers 16 percent of the web.
Unfortunately, you can't get a count of index size for some of the other crawler-based search engines, so the researchers instead ran a series of test queries -- 1,050 in all.*** They totaled how many web pages were returned for each search engine, and that told them proportionally how big or small each service was compared to Northern Light (which coincidentally led the pack in terms of coverage). Here's the complete rundown:
Northern Light: 16%
Inktomi (Snap): 15.5%
Inktomi (HotBot): 11.3%
Inktomi (MSN Search): 8.5%
Inktomi (Yahoo): 7.4%
Inktomi provided search results to several services when the survey was done, which is why it appears multiple times. It powered primary results at MSN Search and secondary results at HotBot, Snap and Yahoo. The variation in coverage reflects what I and others have written about in the past -- not all Inktomi partners tap completely into its 110 million page index.
Lawrence said the percentages for HotBot and Lycos might be slightly higher than shown, because those services will only display one page per web site in their results. That could have caused an undercount, he said.
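The coverage arithmetic described above can be sketched in a few lines of Python. The Northern Light index size and the web size are the study's figures; the function names and the query totals for the second engine are hypothetical, chosen only to show how a benchmark percentage is scaled.

```python
WEB_SIZE = 800_000_000              # indexable pages, per the study
NORTHERN_LIGHT_INDEX = 128_000_000  # pages found via the researchers' search


def benchmark_coverage(index_size: int, web_size: int) -> float:
    """Coverage of the benchmark engine: its index size over the web size."""
    return index_size / web_size


def relative_coverage(engine_hits: int, benchmark_hits: int,
                      benchmark_cov: float) -> float:
    """Scale the benchmark's coverage by the ratio of pages returned
    across the same set of test queries."""
    return benchmark_cov * engine_hits / benchmark_hits


northern_light = benchmark_coverage(NORTHERN_LIGHT_INDEX, WEB_SIZE)
print(f"Northern Light: {northern_light:.1%}")  # prints "Northern Light: 16.0%"

# Hypothetical totals, NOT from the study: an engine that returns half as
# many pages as the benchmark over the same test queries.
hypothetical_benchmark_hits = 10_000
hypothetical_engine_hits = 5_000
other = relative_coverage(hypothetical_engine_hits,
                          hypothetical_benchmark_hits,
                          northern_light)
print(f"Hypothetical engine: {other:.1%}")  # prints "Hypothetical engine: 8.0%"
```

This also shows why the one-page-per-site display that Lawrence mentions matters: anything that lowers the number of pages an engine returns for the test queries lowers its estimated coverage in the same proportion.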
The big winner was Northern Light, which saw its traffic triple just after the study was published. That had to be satisfying to CEO David Seuss, who said earlier this year that winning the title of biggest search engine would make the still relatively little-known service more popular. Of course, whether those users will stick with Northern Light remains to be seen, and challenger FAST has just ousted Northern Light from the number one spot with its new 200 million web page index. Excite is also about to weigh in later this month with a new index in the 250 million page range - expect more details in the next newsletter. Northern Light is in the 165 million page range, followed by AltaVista's 150 million pages. All sizes are unaudited, self-reported numbers.
Combined, the search engines were found to cover only 42 percent of the web, meaning that if you searched at every one of them, you'd reach more pages than with any single service alone, but still less than half of the web. In comparison, the last study found a combined coverage of 60 percent. So things are getting worse? Probably, but not conclusively. You can't accurately compare these figures because the methods of estimating the size of the web were so different for each study.
Lawrence agrees that direct comparisons won't be exact, but he feels the overall finding that coverage is dropping is basically correct. "It wouldn't change any of the conclusions, though it might have changed the magnitude," he said of the drop, assuming the same size method had been used in both studies.
The study also reported on the percentage of dead links at each service, which is an indication of how fresh each search engine's index is. The lower the percentage, the fresher the index. Overall results were:
Northern Light: 9.8%
Inktomi (Yahoo): 2.9%
Inktomi (Snap): 2.8%
Inktomi (MSN Search): 2.6%
Inktomi (HotBot): 2.2%
The poor showing by Lycos reflects another fact I've previously reported, that its index was woefully out of date earlier this year. The situation at Lycos has been greatly improved since then. Moreover, primary results at Lycos now come from the Open Directory, not from the spidered index reviewed in the study.
Somewhat related, the survey estimated how long it takes the typical web page to appear at a search engine. The median age is 57 days -- thus, most documents take about two months to appear. Since direct submission will almost always speed up the listing process at the search engines, this median number can be taken in my opinion as a good indication of how long it takes a page that has never been submitted to appear (assuming, of course, that it gets listed at all). Complete figures were:
AltaVista: 33 days
Excite: 47 days
Inktomi (HotBot): 51 days
Inktomi (MSN Search): 57 days
Infoseek: 60 days
Inktomi (Yahoo): 76 days
Northern Light: 84 days
Inktomi (Snap): 91 days
Lycos: 174 days
By the way, to get these figures, the researchers downloaded every single web page for a variety of queries over time. If a page was never seen before in response to a query, it was considered "new" even if the page itself had been online for several months. Because all results for each query were downloaded, the researchers are fairly certain these "new" pages were appearing because they'd been finally indexed and not just because they'd been indexed but never ranked well before.
Another interesting statistic was on meta tag usage. The researchers found that 34.2 percent of web servers make use of either the meta keywords or description tag on their home pages: 31% had at least a meta description tag and 32% had at least a meta keywords tag. The authors concluded, and I'd agree, that such low usage means that uptake of proposed RDF/XML tagging standards is likely to be slow. By the way, usage of Dublin Core tags was found on only 0.3 percent of home pages.
Even more stats: the researchers estimate that 83 percent of web servers are commercial in nature, 6 percent are academic or scientific, nearly 3 percent are health-related, just over 2 percent are personal sites, 1.5 percent are pornographic and just over 1 percent are government-related. Sites could be in more than one category, and remember, these are stats for web servers -- not web pages. The percentages could change significantly if individual pages were categorized.
The study also found that pages from popular and well-known sites were more likely to be indexed. That's in part a function of how search engines work -- spiders are more likely to visit a site if they keep coming across links to it -- plus it is a conscious decision on the part of almost all the major search engines as a means to provide better listings and combat spam.
The study's authors worry that this trend means high-quality but "unpopular" pages may not list well. In my opinion, this is less of a concern. If anything, I feel the trend toward popularity measures means that general users will get more relevant and useful results in response to common queries, which is desperately needed. Meanwhile, those performing more refined and focused queries, such as research professionals, should still be able to locate much information of use.
The solution to better serving the second group probably won't be to create a single, super-comprehensive 800 million web page search engine. Instead, it's likely to be the creation of new specialty services that do in-depth coverage of sites by topic.
Long overdue is probably an academic search engine. I've talked to several people recently who all wish there was a way to do a search across university web sites, with the knowledge that the search engine would have done a deep and frequent crawl of the sites on its list. Whether such a service would make money for its owners is another issue, of course.
But even as specialty services arise, there are definitely advantages for general purpose search engines to enlarge their indexes and keep pace with the web. It means that they can provide more comprehensive coverage for users who search for unique or obscure information, such as details about a rare disease. And users themselves should consider search engine size when selecting among services. Just be wary about using size alone to decide which service is best.
Accessibility and Distribution of Information on the Web
Produced by the authors of the study, this page lets you request a copy to be sent via email, and more details are also promised for the future.
Search Engine Sizes
You'll find comparison charts and links to past articles that deal with size issues, including the previous Science study, all on this page.
* The researchers actually did test queries first, then used Northern Light as a benchmark, but I've reversed the steps to more easily explain the process.
** The researchers did a Boolean NOT search for a term they knew was not in the Northern Light index. As a result, a listing of all pages indexed by Northern Light was displayed.
*** The search engines will provide self-reported size estimates, but with some of them there's no way to audit the figure via a search, as was done with Northern Light.