Google Unveils More of the Invisible Web

A longer, more detailed version of this article is
available to Search Engine Watch members.
Click here to learn more about becoming a member

Google has quietly extended the scope of its web index, for the first time including a number of file formats that are all but ignored by other search engines. These file formats make up a small but important part of the Invisible web, and Google's effort to make them searchable is a noteworthy advance in search engine technology.

The new file types indexed by Google include Microsoft Word, Excel and PowerPoint formats, as well as Rich Text Format and PostScript files, according to Google spokesperson Cindy McCaffrey. Search engines traditionally have snubbed these file types in favor of those in the far more common HTML format, which is widely accepted as the universal standard for web pages.

Google was the first major search engine to tackle non-HTML web content in a big way, when it began indexing Adobe Portable Document Format (PDF) files in January and February of this year. Google's index now contains more than 22 million PDF files.

Result listings for the new file types look similar to PDF results, prefaced by a bracketed label to the left of the document title indicating its file type. These new labels are straightforward, simply using the document's extension to indicate type. The labels are [doc] for Word documents, [xls] for Excel spreadsheets, [ppt] for PowerPoint presentations, [rtf] for Rich Text Format documents, and [ps] for PostScript documents.
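
The labeling rule is simple enough to sketch in a few lines of Python. This is an illustration only, not Google's code: the function name `result_label` and the example URLs are hypothetical, and the set of extensions is just the five labels listed above.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# The five extensions named in the article; anything else gets no label here.
LABELED_EXTENSIONS = {"doc", "xls", "ppt", "rtf", "ps"}


def result_label(url: str) -> Optional[str]:
    """Return a bracketed label such as "[doc]" based on the URL's extension."""
    path = urlparse(url).path  # drops any query string or fragment
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    if extension in LABELED_EXTENSIONS:
        return f"[{extension}]"
    return None


if __name__ == "__main__":
    for url in (
        "http://example.com/reports/Budget.XLS",
        "http://example.com/slides/strategy.ppt",
        "http://example.com/index.html",
    ):
        print(url, result_label(url))
```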

For many types of searches, you may not see any results that include these file types, for a number of reasons. Google is gradually rolling out the capability to its data centers around the world. The new file formats are available at two of Google's data centers immediately, and will be fully accessible worldwide by early next week, according to Google spokesperson David Krane.

Another reason that the new formats may not show up in results is that relatively few files in these formats exist on the web, compared to the HTML files that make up the majority of the overall Google index.

Although Google declined to release specifics about how many documents in the new file formats it has indexed, informal testing suggests that the new formats represent just a fraction of the 1,610,476,000 total pages Google currently claims are accessible in its index. And since non-HTML files traditionally haven't been regarded by most users as web documents, it's highly unlikely that many of these documents have links pointing to them from other web pages -- a key factor in how Google determines relevance.

The best way to find information in the new document formats is to restrict your search to a particular file type, using Google's "filetype" operator. For example:

zamboni filetype:doc
"2000 census" filetype:xls
"investment strategy" filetype:ppt

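If you build searches like these from a script, the operator is simply appended to the query text. The following Python sketch shows one way to do it. The helper names `filetype_query` and `search_url` are hypothetical, and the base address `https://www.google.com/search` with its `q` parameter is an assumption that the article itself does not state.

```python
from urllib.parse import urlencode


def filetype_query(terms: str, extension: str) -> str:
    """Append a filetype: restriction to the given search terms."""
    ext = extension.lower().lstrip(".")  # accept "doc", ".doc" or "DOC"
    return f"{terms} filetype:{ext}"


def search_url(query: str, base: str = "https://www.google.com/search") -> str:
    """Encode the query as a URL (the base address is an assumption)."""
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    examples = [
        ("zamboni", "doc"),
        ('"2000 census"', "xls"),
        ('"investment strategy"', "ppt"),
    ]
    for terms, ext in examples:
        query = filetype_query(terms, ext)
        print(query, "->", search_url(query))
```

Quoted phrases pass through unchanged, so the phrase match and the file type restriction can be combined in a single query.
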
Google offers two methods for viewing the new file types. You can view a document in its native format by clicking its title in the result list. If you're running Internet Explorer, the document will open directly in your browser window. If you're running Netscape Navigator or another browser, a pop-up box will ask you whether you want to open the document or save it to disk. In either case, you run a possible security risk by opening documents that might be infected with a virus or worm.

To help users avoid this safety risk, results for these file types include a "View as HTML" option. Clicking this link opens a bare-bones copy of the document that has been stored on Google's servers, one that carries no risk of infecting your computer.

Don't expect the titles of documents in search results to always make sense. Although Word, Excel and PowerPoint documents have options for specifying titles, few document authors bother using them. If Google doesn't find a title in the document's properties, it tries to extract one from the first lines of the document. If Google can't sensibly determine a title, it simply uses the URL of the file.
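
That three-step fallback can be sketched as follows. This is a rough Python illustration under stated assumptions, not Google's implementation: the names `choose_title` and `looks_like_title` are hypothetical, and the test for what counts as a sensible title (short, with at least one letter) is invented here because Google has not described its own.

```python
from typing import Optional, Sequence


def looks_like_title(text: str) -> bool:
    """Hypothetical heuristic: non-empty, reasonably short, contains a letter."""
    return bool(text) and len(text) <= 80 and any(ch.isalpha() for ch in text)


def choose_title(
    properties_title: Optional[str],
    first_lines: Sequence[str],
    url: str,
) -> str:
    """Pick a display title: document properties, then first lines, then URL."""
    # 1. Prefer the title the author set in the document's properties.
    if properties_title and properties_title.strip():
        return properties_title.strip()
    # 2. Otherwise take the first opening line that passes the heuristic.
    for line in first_lines:
        candidate = line.strip()
        if looks_like_title(candidate):
            return candidate
    # 3. Fall back to the file's URL.
    return url


if __name__ == "__main__":
    print(choose_title(None, ["", "Zamboni Maintenance Guide"], "http://example.com/z.doc"))
    print(choose_title(None, ["-----", "12345"], "http://example.com/data.xls"))
```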

Although we're still a long way from being able to use search engines to find information in online databases -- the mother lode of the Invisible web -- Google's addition of these new file types is another welcome step along the path to finding information of virtually any type in the huge expanses of the web.

Google Does PDF & Other Changes
Google now includes listings of Adobe PDF files from across the web, a first for any major search engine and a long-overdue feature.

How Google Works
A detailed look under the hood at all aspects of Google's operation.

