Google Fires New Salvo in Search Engine Size Wars

Google announced today that its web index has grown to more than 3 billion documents, including a complete Usenet archive dating back to 1981. The search engine is also putting a major emphasis on freshness, re-indexing several million pages on a daily basis, as well as adding links to relevant news stories for many queries.

Of the 3 billion total searchable documents, 2 billion are web pages, with more than 75% of those pages fully indexed. 700 million are Usenet posts, and 330 million are images. "To search our collection of 3 billion documents by hand, it would take 5,707 years, searching twenty-four hours per day, at one minute per document," said Larry Page, Google's co-founder and president of Products. "With Google, it takes less than a second."
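Page's figure is easy to check. The short Python sketch below reproduces the arithmetic under two assumptions that the quote does not state: a 365-day year, and the component counts taken as the round numbers given above. The function and variable names are purely illustrative.

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumptions (not stated in the article): a 365-day year, and the
# component counts taken as the round numbers given in the text.

MINUTES_PER_YEAR = 60 * 24 * 365  # minutes in a 365-day year


def years_to_search(documents: int, minutes_per_document: int = 1) -> float:
    """Years needed to inspect every document non-stop, one at a time."""
    return documents * minutes_per_document / MINUTES_PER_YEAR


# Component counts as reported in the announcement.
components = {
    "web pages": 2_000_000_000,
    "Usenet posts": 700_000_000,
    "images": 330_000_000,
}
total_documents = sum(components.values())

if __name__ == "__main__":
    print(f"{years_to_search(3_000_000_000):,.2f} years")
    print(f"{total_documents:,} documents in total")
```

At one minute per document, 3 billion documents work out to just under 5,708 years, in line with the 5,707 years quoted. The three component counts add up to 3.03 billion, consistent with "more than 3 billion."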

"We've been pushing up the index size beyond what we've officially said," said Urs HÖlzle, Google Fellow. HÖlzle added that scaling up Google's index hasn't required significant changes. "We're continually monitoring quality, and haven't actually had that many changes since the 1 billion pages announcement."

While the enhanced web index is certainly an impressive achievement, perhaps even more noteworthy is Google's comprehensive Usenet index of 700 million postings in more than 35,000 topical categories, with a full archive going back to 1981 -- shortly after Usenet began.

"One of the bigger complaints in the Usenet community is that even Deja never had anything close to a full Usenet archive," said Hölzle. "We've been able to find all Usenet archives and index them" with the help of a number of individuals who maintained archives and made them available to Google, Hölzle explained.

"The Google Groups Usenet archive reveals a detailed view into two decades of history -- that's ten years' worth of content that existed before the birth of the web," said Sergey Brin, Google's co-founder and president of Technology. Google's Usenet archive, called Google Groups, was released from beta today.

Separately, Google has been quietly testing a feature that includes links to relevant news stories with some types of queries. "When testing this new service it was very enthusiastically received," said David Krane, Google's Director of Corporate Communications.

News links, when they are found, are returned at the top of a result page. Not all queries cause news links to be displayed. "We're trying to make the coverage better while at the same time not decreasing relevance," said Hölzle. "We're also shortening time between when news happens and we have it."

Hölzle said that Google's news crawler is adaptive, and can respond quickly to breaking news, making it available on Google in as little as 15 minutes after a story has been posted.

While Hölzle declined to provide specifics about the news sources Google is crawling, he said it was "hundreds or even thousands" of sites. Most of the sources are identified automatically. "If it even remotely looks like a news site then it should be part of the search," said Hölzle.

News junkies are probably salivating at the prospect of a new search resource. Don't expect a specialized news search, or a "news" tab added to Google's home page any time soon, though. Links to news stories, when they're served, will be treated just like other search results.

Along with adding timely news, Google is now working harder to make its index fresher. "Part of the index gets refreshed every day," said Hölzle. While news sites, which change frequently, are obvious candidates for daily indexing, other sites are being indexed daily as well. "They're chosen by algorithm, not by hand. We concentrate on pages that are identified as important and relevant for updating," said Hölzle.

"This week, it's on the order of 3 million, but that's a number that should rapidly increase over time in a relatively short time frame." Hölzle noted that while 3 million pages are actually re-indexed each day, Google's crawler visits many more than that looking for changes.

"We're planning to scale this fairly rapidly over the next few months, with our goal to have unquestionably the freshest index on the web," Hölzle said.

Although the announcements were made today, it will take some time for the changes to take effect at all of Google's data centers. About 50% of Google's data centers are currently updated, with the rest scheduled to be fully updated by Friday.

Google Opens German Sales Office

Google announced the opening of a sales office in Hamburg, Germany, expanding Google's advertising model directly to advertisers in Germany, Austria and Switzerland.

Why Germany? Worldwide, the German language is the second most popular language used on Google, after English. Google currently offers two types of advertising programs:

1. Premium Sponsorship: German advertisers work closely with Google's Germany-based sales team to develop ad campaigns and monitor performance on Google's German website,

2. AdWords: Self-service, fully automated advertising solution available on

Danny Sullivan will take a detailed look at the new changes at Google and the implications for both webmasters and searchers in the upcoming Search Engine Update, available to Search Engine Watch members. For more information on becoming a member, see


Search Engine Sizes
The charts on this page show the size of each search engine's index, updated December 11, 2001.

Google Unveils More of the Invisible Web
SearchDay, Oct. 31, 2001
Google has quietly extended its index of the web, for the first time making searchable a number of file formats that are all but ignored by other search engines.

Google Does PDF & Other Changes
The Search Engine Report, Feb. 6, 2001
Google now includes listings of Adobe PDF files from across the web, a first for any major search engine and a feature long overdue for them to offer.

How Google Works
A detailed look under the hood at all aspects of Google's operation.

Search Headlines

NOTE: Article links often change. In case of a bad link, use the publication's search facility, which most have, and search for the headline.

Wearable translating computer due out soon...
Nando Times Dec 11 2001 11:14AM GMT
Surfers Find No Refuge From Online Ads...
Dec 10 2001 11:02PM GMT
Hanaro Telecom Creates Mega Internet Portal...
Korea Times Dec 10 2001 10:27PM GMT
Sina's Daniel Mao spells out how the portal differs from
China Online Dec 10 2001 10:06PM GMT
Guarding intellectual property on the Internet...
CNN Dec 10 2001 7:02PM GMT
Interior Dept. Sites Still Down...
Wired News Dec 10 2001 5:18PM GMT
Search engines lead to best buys...
Chicago Tribune Dec 10 2001 5:12PM GMT
Iphrase lands deal with Yahoo Finance...
Boston Globe Dec 10 2001 5:06PM GMT
Global E-Copyrights Treaty To Take Effect In March 2002...
Elcom UK Dec 10 2001 4:17PM GMT
Sites Forlorn When Reborn as Porn...
Wired News Dec 10 2001 2:36PM GMT
Online Archive for Coke Advertising...
New York Times Dec 10 2001 7:57AM GMT
The Net Is 30-Something, But the Web Is a Child...
New York Times Dec 10 2001 7:57AM GMT
Striving to Top the Search Lists...
New York Times Dec 10 2001 7:57AM GMT

About the author

Chris Sherman is a frequent contributor to several information industry journals. He's written several books, including The McGraw-Hill CD ROM Handbook and The Invisible Web: Uncovering Information Sources Search Engines Can't See, co-authored with Gary Price. Chris has written about search and search engines since 1994, when he developed online searching tutorials for several clients. From 1998 to 2001, he was About.com's Web Search Guide.