Google has quietly extended the scope of its web index, for the first time including a number of file formats that are all but ignored by other search engines. These file formats make up a small but important part of the Invisible web, and Google's effort to make them searchable is a noteworthy advance in search engine technology.
The new file types indexed by Google include Microsoft Word, Excel and PowerPoint formats, as well as Rich Text Format and PostScript files, according to Google spokesperson Cindy McCaffrey. Search engines traditionally have snubbed these file types in favor of those in the far more common HTML format, which is widely accepted as the universal standard for web pages.
Google was the first major search engine to tackle non-HTML web content in a big way, when it began indexing Adobe Portable Document Format (PDF) files in January and February of this year. Google's index now contains more than 22 million PDF files.
Result listings for the new file types look similar to PDF results, prefaced by a bracketed label to the left of the document title indicating its file type. These new labels are straightforward, simply using the document's extension to indicate type. The labels are [doc] for Word documents, [xls] for Excel spreadsheets, [ppt] for PowerPoint presentations, [rtf] for Rich Text Format documents, and [ps] for PostScript documents.
For many types of searches, you may not see any results that include these file types, for a number of reasons. Google is gradually rolling out the capability to its data centers around the world. The new file formats are available at two of Google's data centers immediately, and will be fully accessible worldwide by early next week, according to Google spokesperson David Krane.
Another reason that the new formats may not show up in results is that relatively few files in these formats exist on the web, compared to the HTML files that make up the majority of the overall Google index.
Although Google declined to release specifics about how many documents in the new file formats it has indexed, informal testing suggests that the new formats represent just a fraction of the 1,610,476,000 total pages Google currently claims are accessible in its index. And since non-HTML files traditionally haven't been regarded by most users as web documents, it's highly unlikely that many of these documents have links pointing to them from other web pages -- a key factor in how Google determines relevance.
The best way to find information in the new document formats is to restrict your search to a particular file type, using Google's "filetype" operator. For example:
"2000 census" filetype:xls
"investment strategy" filetype:ppt
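For readers who build such searches in a script, the sketch below assembles the same kind of query string. Only the filetype operator comes from the examples above; the function names, the base URL and the "q" parameter name are assumptions made for illustration.

```python
from urllib.parse import urlencode


def filetype_query(phrase, extension):
    """Return a quoted phrase restricted to one file type."""
    return f'"{phrase}" filetype:{extension}'


def search_url(query, base="https://www.google.com/search"):
    # The base URL and the "q" parameter name are assumptions for illustration.
    return f"{base}?{urlencode({'q': query})}"


query = filetype_query("2000 census", "xls")
print(query)
print(search_url(query))
```

Swapping "xls" for "ppt", "doc", "rtf" or "ps" restricts the search to the other new file types.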
Google offers two methods for viewing results. You can view a document in its native format by clicking its title in the result list. If you're running Internet Explorer, the document will open directly in your browser window. If you're running Netscape Navigator or another browser, a pop-up box will ask you whether you want to open the document or save it to disk. In either case, you run a possible security risk by opening documents that might be infected with a virus or worm.
To help users avoid this safety risk, results for these file types include a "View as HTML" option. Clicking this link opens a bare-bones copy of the document that has been stored on Google's servers, one that carries no risk of infecting your computer.
Don't expect the titles of documents in search results to always make sense. Although Word, Excel and PowerPoint documents have options for specifying titles, few document authors bother using them. If Google doesn't find a document title in the document's properties, it tries to extract a title from the first lines of the document. If Google can't sensibly determine the title it simply uses the URL of the file.
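The fallback order described above can be sketched roughly as follows. This is an illustration only, not Google's actual code: the function name, its parameters and the length cutoff standing in for "sensibly determine" are all hypothetical.

```python
def display_title(properties_title, first_lines, url, max_length=80):
    """Pick a title to show in results, following the order described above."""
    # 1. Prefer the title stored in the document's properties, if any.
    if properties_title and properties_title.strip():
        return properties_title.strip()
    # 2. Otherwise try the first non-empty line of the document.
    #    The length cutoff is a hypothetical stand-in for "sensible".
    for line in first_lines:
        candidate = line.strip()
        if candidate and len(candidate) <= max_length:
            return candidate
    # 3. Fall back to the URL of the file.
    return url


print(display_title("", ["", "Quarterly Budget Summary"], "http://www.example.com/budget.xls"))
print(display_title(None, [], "http://www.example.com/budget.xls"))
```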
Webmasters can ensure that titles are appropriately displayed by filling in document properties in Word, Excel and PowerPoint. These are set using the "properties" function in each respective program, which allows the author to specify a title, subject, author, keywords, and other metadata relevant to the document. "Webmasters are absolutely encouraged to put relevant keywords" in these fields, says Google's Krane.
Some webmasters may be surprised to find their non-HTML documents appearing in Google. Since these documents were never indexed before, webmasters may not have taken the same precautions to prevent indexing the files as they might have with HTML files.
Google respects the robots.txt protocol, so the most direct way to keep these files out of the index is to use a robots.txt file to specify which subdirectories the Googlebot crawler should skip when it visits your server. Be careful, though: a mistake in the file's syntax can prevent your entire site from being indexed by any search engine.
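To make the difference between a careful and a careless robots.txt file concrete, here is a minimal sketch that checks two sets of rules with the parser in Python's standard library. The directory name, the example.com URLs and the "OtherBot" crawler name are invented for illustration; the parser shows how a standards-following crawler would read the rules, not how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Careful rules: keep only Googlebot out of one subdirectory (hypothetical path).
CAREFUL_RULES = """\
User-agent: Googlebot
Disallow: /private-docs/
"""

# Careless rules: this shuts every crawler out of the whole site.
CARELESS_RULES = """\
User-agent: *
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


careful = load_rules(CAREFUL_RULES)
careless = load_rules(CARELESS_RULES)

spreadsheet = "http://www.example.com/private-docs/budget.xls"
home_page = "http://www.example.com/index.html"

print(careful.can_fetch("Googlebot", spreadsheet))  # blocked for Googlebot
print(careful.can_fetch("Googlebot", home_page))    # still crawlable
print(careless.can_fetch("OtherBot", home_page))    # blocked for everyone
```

Checking a draft file this way before uploading it is one way to catch a rule that blocks more than intended.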
Google also provides a mechanism for removing pages from its index. For urgent situations, it offers an automated means of removing large numbers of pages from its cache.
If you want your content removed from Google's index as quickly as possible, use Google's automatic URL removal system. To use this system, you must first register with Google. Using this system, you can remove either a single page or groups of pages, images or even subdirectories.
Google uses the robots.txt file to know which pages you want removed from its index, so you'll need to prepare or modify this file before using the automated removal system. More information on how to set up a robots.txt file can be found on The Web Robots Pages site.
Once you've submitted your request, your content will be removed from Google's server, generally within 24 hours. Google offers a status indicator allowing you to check the progress of your request online.
Although we're still a long way from being able to use search engines to find information in online databases -- the mother lode of the Invisible web -- Google's addition of these new file types is another welcome step along the path of being able to find information of virtually any type in the huge expanses of the web.
Remove Content from Google's Index
The Web Robots Pages
How Google Works
A detailed look under the hood at all aspects of Google's operation.
Google Does PDF & Other Changes
Google now includes listings of Adobe PDF files from across the web, a first for any major search engine and a feature long overdue for them to offer.