Another milestone in the search engine size wars was hit when Google went live with a full-text index of 560 million URLs in June, making it the largest search engine on the web. In addition, because of how Google makes use of link data, its reach extends to a further 500 million URLs that it has never actually visited, the company says. That means searches at Google potentially encompass more than 1 billion pages, which is the size the entire web was estimated to be earlier this year.
So does this mean Google is the first search engine to give 100 percent coverage of the web? No. For one thing, that 1 billion page estimate is several months old, and the web has almost certainly increased in size since then. Nor does that estimate include the millions of pages that search engines typically don't crawl, such as those behind password-protected areas or served up by identifiable dynamic delivery systems. How big the web is now is anyone's guess.
Also, Google has actually visited and recorded the contents of 560 million pages, not 1 billion. Google, unlike any other major search engine, does make clever use of its technology to leverage its reach beyond this core set of pages (as the articles below explain further). It isn't just marketing hype for it to use the 1 billion figure, but those extra pages are more like a bonus that you can't always depend on, rather than the assurance you get from having indexed each and every page.
None of this takes away from Google's accomplishment or the value of using its service. It is now a clear choice for those seeking both highly relevant results and comprehensive searching across the web. Searchers should also have even more choice in the coming months. WebTop just announced its own half-billion page index, and some of Inktomi's partners should go live with Inktomi's new half-billion page index in the very near future.
While Google is running searches against the full index at its own site, its partners may not tap into the entire amount. "We support searches for different partners, so they won't all necessarily be getting the largest index," said Google president Sergey Brin. Google's customers can choose to search against indexes of different sizes, with the smaller indexes containing a higher proportion of the web's most popular pages, as determined by Google's link analysis system. The benefit for customers in using a smaller index is savings. It costs them more to query against the biggest collection of documents, Brin says.
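To make the idea of differently sized indexes concrete, here is a minimal sketch, not Google's actual method. It assumes each page already has a single link-based popularity score and that a smaller index is simply the top slice of pages by that score. The function name `build_tiers`, the example URLs and the scores are all hypothetical.

```python
from typing import Dict, List


def build_tiers(link_scores: Dict[str, float], tier_sizes: List[int]) -> Dict[int, List[str]]:
    """Return, for each requested size, the URLs with the highest link scores.

    Assumption: one popularity score per URL; ties are broken alphabetically
    so the output is deterministic.
    """
    ranked = sorted(link_scores, key=lambda url: (-link_scores[url], url))
    return {size: ranked[:size] for size in tier_sizes}


# Hypothetical scores, purely for illustration.
scores = {
    "a.example/": 0.9,
    "b.example/": 0.7,
    "c.example/": 0.4,
    "d.example/": 0.1,
}

tiers = build_tiers(scores, [2, 4])
small_index = tiers[2]  # cheaper to query, most popular pages only
full_index = tiers[4]   # every page in this toy collection
```

Under these assumptions every smaller index is a subset of the larger one, which is why a partner on a smaller index can miss pages that turn up at Google.com.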
Offering different sized indexes isn't new. Inktomi has also offered its partners this option, and it is one reason why Yahoo's Inktomi-powered results have never matched those of some other Inktomi-powered services. Yahoo has never chosen to hit Inktomi's index as deeply as possible. Whether this will change when Google takes over remains to be seen. Google says Yahoo has the option to do so, and I'll revisit the issue after Yahoo goes live with the Google results.
For webmasters, the index variation is important. As Google begins to add more and more partners, as with Inktomi, you'll need to expect that results will be different than at Google.com. For searchers looking for comprehensiveness, the key issue will be to tell which partners hit the full index. Unfortunately, that probably won't be readily apparent.
In other changes to the index, Google can now also update parts of the collection more often. "We're aiming to keep everything within a month," Brin said, explaining that in the worst case, a document might not be rechecked for a month. However, Google aims to update the index itself at least weekly, and more volatile documents may be refreshed on a daily schedule.
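The refresh targets Brin describes can be pictured as a simple recheck rule. The sketch below is only an illustration under stated assumptions, not a description of Google's crawler. It treats "a month" as 30 days, marks volatile documents with a plain boolean flag, and uses hypothetical URLs and dates.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)      # assumed reading of "within a month"
VOLATILE_AGE = timedelta(days=1)  # assumed daily schedule for volatile documents


def is_due(last_checked: date, today: date, volatile: bool) -> bool:
    """Return True if a document has reached its recheck limit."""
    limit = VOLATILE_AGE if volatile else MAX_AGE
    return today - last_checked >= limit


today = date(2000, 7, 5)

# (url, date last checked, volatile?) -- hypothetical example data.
pages = [
    ("news.example/front", date(2000, 7, 4), True),
    ("static.example/about", date(2000, 6, 20), False),
    ("static.example/old", date(2000, 6, 1), False),
]

due = [url for url, last, volatile in pages if is_due(last, today, volatile)]
```

Here the volatile page checked a day ago and the static page last checked 34 days ago are both due, while the static page checked 15 days ago is not.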
Google also now serves queries from three data centers, two on the West Coast of the US and one on the East Coast. Brin says that the different data centers should be kept in sync with each other, so lag time between mirrors should not be significant.
Google has also begun to float articles from major news wires at the top of its results, in response to current news stories. You can't specifically search for news, but the system should recognize topical queries such as "elian gonzalez" and respond with links that appear beginning with the word "News."
A new page offering a behind-the-scenes look at Googlites at work and play.
Search Engine Size Test
I took a look at how the new Google index performs against other size leaders. Did it live up to its claims? Pretty much, yes. Also see how FAST, Excite, AltaVista and some Inktomi-powered services perform. NOTE: The page isn't ready yet, but results will be posted by July 7.
Search Engine Sizes
The current sizes of major crawler-based search engines, historical numbers and plenty of articles that document the size wars over the past years.
Yahoo Partners With Google
The Search Engine Update, July 5, 2000
Yahoo has selected Google to take over from Inktomi in powering Yahoo's secondary results. These are the listings that appear in the "Web Pages" area of Yahoo's results, after any hits from Yahoo's own human-compiled listings.
Numbers, Numbers -- But What Do They Mean?
The Search Engine Update, March 3, 2000
Explains how Google leverages its link database to expand its coverage, and it also puts other "dual numbers" you may hear into perspective.
The Half Billion Crew: Google, Inktomi GEN3, & WebTop
Search Engine Showdown, June 29, 2000
Greg Notess compares and contrasts leaders in the index size game and runs a current comparison.
Google's Cool Billion
About.com Web Search Guide, June 26, 2000
Another look at the Google size increase from search writer Chris Sherman.
I mentioned WebTop in my June newsletter, and now the company has just announced a half-billion page index. Expect a closer look in the future.
Boo.com's computers used to revamp search engine
Reuters, June 28, 2000
Computers from the failed Boo.com online retailer have been put to new use powering WebTop.com.