The Alta Vista Size Controversy
From The Search Engine Report
July 2, 1997
In case you've missed it, earlier this year, discussion was flying fast and furious over Alta Vista indexing only a sample of web pages, rather than everything it finds. To some degree, the search engine has been singled out unfairly, as anyone reading the How Big Are Search Engines? page on this site will see (link below). But Alta Vista may also have set itself up for a fall, with the claim on its home page to be "the largest web index."
The debate started innocently enough, with a March 1997 article on ZDNet regarding ways to publicize your web site. John Pike, webmaster of the Federation of American Scientists web site, responded to the article, complaining that he found only 600 of 6,000 pages from his web site to be indexed by Alta Vista.
Pike's response went on to detail a message he received from Alta Vista regarding this. He was advised that 600 pages were probably the most he'd see for any domain. He was also given the example of GeoCities, which is a popular site that provides web space for its members. He was told that although GeoCities has over 300,000 members (and thus at least 300,000 potential web pages), only 300 pages from the domain had been indexed.
This started ringing alarm bells in Pike's head. He realized that Alta Vista was indexing only a sample of web pages, not the entire web itself.
"I confess that I was not previously aware of this practice of AltaVista, which is certainly not been previously reported anywhere, and is certainly @ variance with their apparent claims that if you supply them with one URL from their site they will spontaneously include the rest of their site in their index."
Of course, visitors to this web site know better. How deeply a search engine crawls has been a top category on the Search Engine Features Chart since the site began back in April 1996. The chart has clearly shown that not all search engines grab every page.
With the red flag flying, Alta Vista's Chief Technical Officer, Louis Monier, posted a response. In it, he pointed out that doing an advanced search actually revealed over 50,000 pages from GeoCities, not 300. Oddly, he blamed Pike for this mistaken number, which Pike had originally been given by Alta Vista.
Monier did correctly point out that any page directly submitted to Alta Vista gets added (in most cases), though he wasn't correct in saying Alta Vista is the only search engine that does this. Infoseek does it, too.
He also correctly stated: "The concept of 'the size of the Web' in itself is flawed, as there are many sites virtually infinite in size: dynamically generated documents, personalized news pages and shopping baskets using cookies, robot traps, scripts, the list goes on. Also unless one spends a lot of effort cleaning it up (we do), an index holds a lot of pages unlikely to ever be retrieved, like multiple copies of the same page and access logs. Size alone is a poor measure of usefulness."
While that's true, it doesn't get around the fact that Alta Vista arbitrarily decides which web pages to index and which to leave out, which is also true of some other search engines. The claim to be the biggest is also not correct. While Alta Vista has a very large index, everything I can determine points to Excite and HotBot being larger.
Meanwhile, more flak came at Alta Vista from the I-Advertising Mailing List, in late April 1997. Tripod member Chris Longley related how he received a "Too many URLs submitted from that site" message when submitting his site to Alta Vista.
Alta Vista responds with that message if you try to submit too many pages all at the same time via its Add URL page. Why? The ability to add pages quickly via this page can make it easier for spammers to experiment with ways of boosting their rankings.
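To make the idea concrete, here is a minimal sketch of how a per-site submission limit could work. This is not Alta Vista's actual code, which has not been published. The class name, the limit of five pages, and the one-hour window are all assumptions made up for illustration. Only the rejection message comes from Longley's report.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

# Hypothetical values; the real thresholds are not known.
MAX_PER_HOST = 5
WINDOW_SECONDS = 3600


class SubmissionLimiter:
    """Illustrative per-host limiter for an Add URL form."""

    def __init__(self, max_per_host=MAX_PER_HOST, window=WINDOW_SECONDS):
        self.max_per_host = max_per_host
        self.window = window
        # Maps each host name to the times of its recent accepted submissions.
        self._recent = defaultdict(deque)

    def submit(self, url, now):
        """Return "accepted" or the rejection message for this URL."""
        host = (urlsplit(url).hostname or "").lower()
        times = self._recent[host]
        # Forget submissions that have aged out of the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        if len(times) >= self.max_per_host:
            return "Too many URLs submitted from that site"
        times.append(now)
        return "accepted"
```

Because the count in this sketch is kept per host name, every member of a shared host such as members.tripod.com draws on the same allowance. One member submitting a batch of pages can use it up for everyone else on that host.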
Alta Vista told Longley that they no longer indexed pages from servers like Tripod or GeoCities, because these were popular places for spammers to create jump pages.
A jump page is one that has been specifically created for a search engine, in hopes that it will do well in the rankings. Spammers may create many different pages and submit the entire lot, usually with a hyperlink to the "real" site. Eventually, if the search engine discovers the spam page, the spammer just moves to another location.
Because anyone can get web page space instantly, and for free, through services like Tripod, these are favorite venues for the spammers to use. Unfortunately, that means that perfectly legitimate web pages under these domains may be barred from the index.
Alta Vista may not be alone in passing over free web space. I'm still exploring this issue, but so far there are reasons to believe that, for anyone who cares about being indexed, free web pages are not the way to go.
Search Engine Sizes
Graphical look at how large each search engine is, with trends over time. Links to information on whether size matters.
How Big Are The Search Engines?
A look at why pages may not make it into the different search engines.
Search Engine Features Chart
Shows the most current sizes reported from each of the search engines, plus it details how often they crawl.
You Want Your Web Site to Get Noticed, Don't You? Here's How
ZD Net, March 19, 1997
The original story that prompted John Pike to complain about Alta Vista only indexing 10% of the Federation of American Scientists web site. His complaint and the response from Alta Vista's Chief Technical Officer are below.
Shocked By Search Engine Indexing
John Pike's original complaint.
AltaVista CTO Responds
Alta Vista's Chief Technical Officer Louis Monier defends the service.
I-Advertising Mailing List
You can check the archives here for Chris Longley's 4/27/97 complaint about Alta Vista excluding certain web sites.
Federation of American Scientists
The site overseen by John Pike, who discovered to his shock that only 10% of it was indexed by Alta Vista, rather than 100%, as he expected.
Chris Longley's Web Site
This is the site Chris Longley tried to submit to Alta Vista, only to have it rejected because of spamming by other pages within the members.tripod.com web space.