Google Fires New Salvo in Search Engine Size Wars

Google announced today that its web index has grown to more than 3 billion documents, including a complete Usenet archive dating back to 1981. The search engine is also putting a major emphasis on freshness, re-indexing several million pages on a daily basis, as well as adding links to relevant news stories for many queries.

Of the 3 billion total searchable documents, 2 billion are web pages, with more than 75% of those pages fully indexed. 700 million are Usenet posts, and 330 million are images. “To search our collection of 3 billion documents by hand, it would take 5,707 years, searching twenty-four hours per day, at one minute per document,” said Larry Page, Google’s co-founder and president of Products. “With Google, it takes less than a second.”
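Page's figure can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year and a result truncated to whole years; neither assumption is stated in the announcement, and the variable names are illustrative only.

```python
# Rough check of the "5,707 years" figure quoted above.
# Assumptions (not stated in the article): a 365-day year,
# and a result truncated rather than rounded.
DOCUMENTS = 3_000_000_000          # "3 billion documents"
MINUTES_PER_DOCUMENT = 1           # "one minute per document"
MINUTES_PER_YEAR = 60 * 24 * 365   # searching twenty-four hours per day

years = DOCUMENTS * MINUTES_PER_DOCUMENT / MINUTES_PER_YEAR
print(f"{years:,.2f} years")       # 5,707.76 years

# Sum of the itemized counts: web pages + Usenet posts + images.
itemized = 2_000_000_000 + 700_000_000 + 330_000_000
print(f"{itemized:,} documents")   # 3,030,000,000 documents
```

Under those assumptions the result truncates to the 5,707 years quoted, and the itemized counts sum to slightly more than 3 billion, consistent with the "more than 3 billion" headline number.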

“We’ve been pushing up the index size beyond what we’ve officially said,” said Urs Hölzle, Google Fellow. Hölzle added that scaling up Google’s index hasn’t required significant changes. “We’re continually monitoring quality, and haven’t actually had that many changes since the 1 billion pages announcement.”

While the enhanced web index is certainly an impressive achievement, perhaps even more noteworthy is Google’s comprehensive Usenet index of 700 million postings in more than 35,000 topical categories, with a full archive going back to 1981 — the year Usenet began.

“One of the bigger complaints in the Usenet community is that even Deja never had anything close to a full Usenet archive,” said Hölzle. “We’ve been able to find all Usenet archives and index them” with the help of a number of individuals who maintained archives and made them available to Google, Hölzle explained.

“The Google Groups Usenet archive reveals a detailed view into two decades of history — that’s ten years’ worth of content that existed before the birth of the web,” said Sergey Brin, Google’s co-founder and president of Technology. Google’s Usenet archive, called Google Groups, was released from beta today.

Separately, Google has been quietly testing a feature that includes links to relevant news stories with some types of queries. “When testing this new service it was very enthusiastically received,” said David Krane, Google’s Director of Corporate Communications.

News links, when they are found, are returned at the top of a result page. Not all queries cause news links to be displayed. “We’re trying to make the coverage better while at the same time not decreasing relevance,” said Hölzle. “We’re also shortening time between when news happens and we have it.”

Hölzle said that Google’s news crawler is adaptive, and can respond quickly to breaking news, making it available on Google in as little as 15 minutes after a story has been posted.

While Hölzle declined to provide specifics about the news sources Google is crawling, he said it was “hundreds or even thousands” of sites. Most of the sources are identified automatically. “If it even remotely looks like a news site then it should be part of the search,” said Hölzle.

News junkies are probably salivating at the prospect of a new search resource. Don’t expect a specialized news search, or a “news” tab added to Google’s home page any time soon, though. Links to news stories, when they’re served, will be treated just like other search results.

Along with adding timely news, Google is now working harder to make its index fresher. “Part of the index gets refreshed every day,” said Hölzle. While news sites, which change frequently, are obvious candidates for daily indexing, other sites are being indexed daily as well. “They’re chosen by algorithm, not by hand. We concentrate on pages that are identified as important and relevant for updating,” said Hölzle.

“This week, it’s on the order of 3 million, but that’s a number that should rapidly increase over time in a relatively short time frame.” Hölzle noted that while 3 million pages are actually re-indexed each day, Google’s crawler visits many more than that looking for changes.

“We’re planning to scale this fairly rapidly over the next few months, with our goal to have unquestionably the freshest index on the web,” Hölzle said.

Although the announcements were made today, it will take some time for the changes to take effect at all of Google’s data centers. About 50% of Google’s data centers are currently updated, with the rest scheduled to be fully updated by Friday.
