How Big Are The Search Engines?
NOTE: While this is an older article, the same basic concepts about search engine sizes apply. Also see the Search Engine Sizes page. It links to many more recent articles about size issues.
Index size is hotly debated among the search engines. In 1996, there was a lot of competition to be the largest index. In 1997, I would say the emphasis is shifting toward being the best index. That usually means being large, because most of the search engines want to have a complete account of the web. But it also means having fresh information, something difficult to do when you need to keep checking up on millions and millions of web pages.
Here are two comparisons: HotBot, which wants to be both large and fresh, introduced a new crawling system to help it keep up with the growth of the web. Meanwhile, WebCrawler chooses to opt out of the "bigger is better" competition. Instead, it focuses on fully indexing key sites, while sampling smaller ones.
Unfortunately, most people assume that search engines index everything. In fact, there was quite a bit of discussion after the "discovery" that AltaVista does not index everything on the web (which you can read about more fully on the AltaVista Size Controversy page, link below).
The search engines feed into this impression of indexing everything, with some of the things they say on their submission pages. Here are some examples:
- AltaVista: Please submit only one URL. Our spider will explore your site by following links.
- Excite: This form allows you to add a Web site to Excite's constantly refreshed database of 50 million URLs.
- HotBot: If any other indexed page links to your page, HotBot will automatically find and index your page.
- Lycos: Don't bother registering every page in your site individually, unless you're really anxious. (Our spiders will automatically register all additional screens shortly after your home page.)
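The "submit one URL and our spider will find the rest" promise above rests on link-following: a crawler indexes a page, extracts its links, and repeats. The catch is that any page nothing links to is invisible. A minimal sketch of that process, using a made-up in-memory "site" rather than any real search engine's crawler:

```python
from html.parser import HTMLParser
from collections import deque

# A tiny in-memory "web site": page path -> HTML body.
# Purely illustrative data, not any real site.
SITE = {
    "/": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "/about.html": '<a href="/">Home</a>',
    "/news.html": '<a href="/archive.html">Archive</a>',
    "/archive.html": "No outgoing links here.",
    "/orphan.html": "Nothing links here, so the spider never finds it.",
}

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(start):
    """Breadth-first crawl: index the start page, then every page
    reachable from it by following links."""
    indexed, queue = set(), deque([start])
    while queue:
        page = queue.popleft()
        if page in indexed or page not in SITE:
            continue
        indexed.add(page)
        parser = LinkExtractor()
        parser.feed(SITE[page])
        queue.extend(parser.links)
    return indexed

print(sorted(crawl("/")))  # /orphan.html never appears in the index
```

Starting from the home page, the crawl reaches four of the five pages; the unlinked page is simply never seen, which is one reason no spider-based index can be complete.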
Let's make it clear. None of the search engines -- none of them -- index everything on the web. No search engine can claim to have a perfect record of everything out there.
There are some technical reasons why they miss things, such as problems with frames, image maps or the inability to index dynamically-created web pages.
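Frames illustrate the problem well: a frameset front page typically contains no ordinary links at all, so a spider that only follows <a> tags sees nothing to crawl, while the real content sits in the documents named by <frame src>. A small sketch, using a hypothetical frameset page:

```python
from html.parser import HTMLParser

# A typical mid-90s frameset front page (invented example): the real
# content lives in the documents named by <frame src=...>, and there
# are no <a> links at all for a spider to follow.
FRAMESET_PAGE = """
<frameset cols="20%,80%">
  <frame src="menu.html">
  <frame src="content.html">
</frameset>
"""

class TagCollector(HTMLParser):
    """Records <a href> links and <frame src> targets separately."""
    def __init__(self):
        super().__init__()
        self.anchors, self.frames = [], []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == "a" and "href" in d:
            self.anchors.append(d["href"])
        elif tag == "frame" and "src" in d:
            self.frames.append(d["src"])

p = TagCollector()
p.feed(FRAMESET_PAGE)
print(p.anchors)  # [] -- a link-following spider finds nothing here
print(p.frames)   # ['menu.html', 'content.html'] -- visible only to a frame-aware spider
```

A crawler that doesn't specifically look inside <frame> tags indexes the empty shell and moves on, leaving the site's actual pages out of the index.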
There are also hardware limitations: it takes a lot of space to store everything on the web, and a lot of processing power to sort through the material quickly enough to respond. That means spending more money, and some of the search engines may prefer to perform a sort of web page triage, and index only a representative sample of the web.
In fact, it should be clear to anyone that pages are going missing. In mid-1996, the big search engines were all in the 25 to 50 million web page range. A year later, they're still reporting the same numbers.
Did the web stop growing? Absolutely not. Stats from Internet Archive, which crawls the web to preserve web sites, estimated there were at least 80 million web pages as of January 1997. In April 1997, AltaVista's chief technical officer said his search engine had crawled at least 100 million pages and thought it reasonable to assume there are at least 150 million pages out there.
So are the search engines failures? Hardly. They could be better -- a lot better -- in some cases. But people do use them every day with a great deal of success. Some of them do a phenomenal job of keeping up. Both Excite and HotBot, in particular, crawl a lot and regularly, which is reflected on the Search Engine EKGs in this web site. And while AltaVista has come under fire for sampling some sites, it at least makes constant crawls.
It would be nice if there were some form of product labeling for each search engine. It might tell you that only the home pages from sites are indexed, or that frame pages are excluded, or that certain domains may not be indexed, or any other factors particular to that search engine.
In lieu of this, search engine users and webmasters can make use of some of the resources below. For search engine users, knowing the size and crawl frequency of the different search engines can help you choose appropriately. For webmasters, understanding how the search engines work can help you ensure that your web pages get in, whenever possible.
Search Engine Sizes
A graphical look at how large each search engine is, with trends over time, and links to many articles about size issues.