Google has quietly extended the scope of its web index, for the first time including a number of file formats that are all but ignored by other search engines. These file formats make up a small but important part of the Invisible web, and Google's effort to make them searchable is a noteworthy advance in search engine technology.
The new file types indexed by Google include Microsoft Word, Excel and PowerPoint formats, as well as Rich Text Format and PostScript files, according to Google spokesperson Cindy McCaffrey. Search engines traditionally have snubbed these file types in favor of those in the far more common HTML format, which is widely accepted as the universal standard for web pages.
Google was the first major search engine to tackle non-HTML web content in a big way, when it began indexing Adobe Portable Document Format (PDF) files in January and February of this year. Google's index now contains more than 22 million PDF files.
Result listings for the new file types look similar to PDF results, prefaced by a bracketed label to the left of the document title indicating its file type. These new labels are straightforward, simply using the document's extension to indicate type. The labels are [doc] for Word documents, [xls] for Excel spreadsheets, [ppt] for PowerPoint presentations, [rtf] for Rich Text Format documents, and [ps] for PostScript documents.
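As a rough illustration of how a label can be derived from a document's extension, here is a minimal Python sketch. It is not Google's code: the function name `result_label` and the example URLs are hypothetical, and the set of extensions covers only the five new formats named above.

```python
import posixpath
from urllib.parse import urlparse

# The five new formats named in the article; anything else gets no label here.
NEW_FORMATS = {"doc", "xls", "ppt", "rtf", "ps"}

def result_label(url: str) -> str:
    """Return a bracketed label such as '[xls]' based on the URL's extension.

    Hypothetical helper for illustration only.
    """
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return f"[{extension}]" if extension in NEW_FORMATS else ""

print(result_label("http://example.com/reports/q3.XLS"))
print(result_label("http://example.com/index.html"))
```

The lookup is case-insensitive, so a file named q3.XLS is labeled the same way as q3.xls, and an ordinary HTML page gets no label at all.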
For many types of searches, you may not see any results that include these file types, for a number of reasons. Google is gradually rolling out the capability to its data centers around the world. The new file formats are available at two of Google's data centers immediately, and will be fully accessible worldwide by early next week, according to Google spokesperson David Krane.
Another reason that the new formats may not show up in results is that relatively few files in these formats exist on the web, compared to the HTML files that make up the majority of the overall Google index.
Although Google declined to release specifics about how many documents in the new file formats it has indexed, informal testing suggests that the new formats represent just a fraction of the 1,610,476,000 total pages Google currently claims are accessible in its index. And since non-HTML files traditionally haven't been regarded by most users as web documents, it's highly unlikely that many of these documents have links pointing to them from other web pages -- a key factor in how Google determines relevance.
The best way to find information in the new document formats is to restrict your search to a particular file type, using Google's "filetype" operator. For example:
"2000 census" filetype:xls
"investment strategy" filetype:ppt
Google offers two methods for viewing the new file types. You can view a document in its native format by clicking its title in the result list. If you're running Internet Explorer, the document will open directly in your browser window. If you're running Netscape Navigator or another browser, a pop-up box will ask you whether you want to open the document or save it to disk. In either case, you run a possible security risk by opening documents that might be infected with a virus or worm.
To help users avoid this safety risk, results for these file types include a "View as HTML" option. Clicking this link opens a bare-bones copy of the document that has been stored on Google's servers, one that carries no risk of infecting your computer.
Don't expect the titles of documents in search results to always make sense. Although Word, Excel and PowerPoint documents have options for specifying titles, few document authors bother using them. If Google doesn't find a document title in the document's properties, it tries to extract a title from the first lines of the document. If Google can't sensibly determine the title, it simply uses the URL of the file.
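The fallback order described above (document properties, then the first lines, then the URL) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the `looks_like_title` check is an invented stand-in, since Google has not said how it decides whether a line is a sensible title.

```python
from typing import Optional, Sequence

def looks_like_title(text: str) -> bool:
    """Invented heuristic: non-empty, reasonably short, contains a letter."""
    return bool(text) and len(text) <= 80 and any(ch.isalpha() for ch in text)

def choose_title(properties_title: Optional[str],
                 first_lines: Sequence[str],
                 url: str) -> str:
    """Pick a display title using the three-step fallback from the article."""
    # 1. Prefer the title stored in the document's properties.
    if properties_title and properties_title.strip():
        return properties_title.strip()
    # 2. Otherwise try the first lines of the document body.
    for line in first_lines:
        candidate = line.strip()
        if looks_like_title(candidate):
            return candidate
    # 3. If nothing sensible turns up, fall back to the file's URL.
    return url

print(choose_title(None, ["", "Quarterly Sales Report"], "http://example.com/q.doc"))
```

A document with an empty properties title and no usable opening lines ends up listed under its URL, which is why some results look cryptic.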
Although we're still a long way from being able to use search engines to find information in online databases -- the mother lode of the Invisible web -- Google's addition of these new file types is another welcome step along the path of being able to find information of virtually any type in the huge expanses of the web.
Google Does PDF & Other Changes
Google now includes listings of Adobe PDF files from across the web, a first for any major search engine and a feature long overdue for them to offer.
How Google Works
A detailed look under the hood at all aspects of Google's operation.
NOTE: Article links often change. In case of a bad link, use the publication's search facility, which most have, and search for the headline.