Going Beyond HTML Raises Security Concerns With Google

Now that Google is indexing a wide range of document types beyond HTML and plain text formats, potential security concerns are cropping up, both for searchers and webmasters.

From the searcher's point of view, the concern is that you might unwittingly open yourself up to viruses that are embedded in non-HTML files, such as Word macro viruses.

Until recently, search engines only delivered you to "safe" HTML or text files. Even these types of files could try to harm you, for example through JavaScript exploits. However, anyone who browses the web is already routinely exposed to such potential threats, and most people don't have problems.

In contrast, people do not routinely open data documents such as Word or Excel files from people they do not know. Google has changed this, because its search results now contain direct links to such files from across the web. These direct links mean that users might unwittingly open infected files.

For example, try a search for "clearcutting and fish populations in idaho." The second result is an oddly named document called "Clearcutting in." If you were to click on this link, the document would not load in your browser. Instead, your computer would launch Microsoft Word (assuming you have it installed).

This is because the link leads to a DOC file, a data file used by Microsoft Word. Such files can contain viruses, and if you open one without protection, you'd be exposed to any virus inside.

The safe alternative is to always view such results using the "View as HTML" link that Google provides. You'll see this link any time Google lists a file that is not in HTML or plain text format. By following it, you will be shown a safe HTML version of the document in your browser.
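As a rough illustration of the distinction above, here is a minimal Python sketch that flags result links likely to open in an outside application rather than in the browser. The function name and the extension list are assumptions made for this example; they are not anything Google publishes, and a file's extension is only a hint, since the server decides how a file is actually delivered.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Assumed, non-exhaustive list of extensions that launch an outside application.
RISKY_EXTENSIONS = {".doc", ".xls", ".ppt"}


def opens_outside_browser(url):
    """Return True if the URL path ends in one of the assumed risky extensions."""
    path = urlparse(url).path          # ignore any query string or fragment
    _, extension = splitext(path)      # e.g. "/files/report.DOC" -> ".DOC"
    return extension.lower() in RISKY_EXTENSIONS


# Hypothetical URLs, for illustration only.
print(opens_outside_browser("http://www.example.com/files/clearcutting.DOC"))
print(opens_outside_browser("http://www.example.com/files/clearcutting.html"))
```

A link flagged this way is a candidate for the "View as HTML" route instead of a direct click.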

Ideally, Google would switch things around. By default, I think the main link should bring up the safe HTML version while the "View as HTML" link would instead say something like "View Original File Type." That would greatly reduce the odds of searchers getting accidentally infected by a virus. Google says it's something they'll consider.

"We're going to continue to take a close look at this, because as you know, our users and their experience with Google is our number one priority," said spokesperson David Krane.

Krane also said that Google is noticing that when non-HTML content is offered, many users are opting to use the "View as HTML" choice. Aside from avoiding viruses, another good reason to do this is that the HTML versions are typically smaller than the actual data files, which means they load faster.

Another important point is that while the potential for viruses to hit searchers exists, this doesn't seem to have actually happened.

"We've yet to see email from any of our users complaining about computer viruses that they obtained via our search results," Krane said.

Meanwhile, some webmasters are reportedly shocked to discover that Word documents, Excel files and other material they make available through public web sites can now be found by searching at Google. There's even the further concern that some of these documents might contain sensitive information, such as credit card numbers or password information.

The reality is that Google hasn't "created" a security problem with these documents. It has simply exposed them. ANY document that is made available on an Internet server (be it web, FTP, Usenet, etc.) can be found by anyone. People can (and do) even create their own spiders to seek documents of particular types, such as email harvesters that roam the Internet in search of email addresses.

If a document is sensitive, don't place it on the Internet, period. What if you must expose it to the Internet, so that selected individuals outside your company or organization can access it? Then establish a password protection or "authentication" system for your web server, and make these documents only available to those who have a username and password.
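To make the idea of password protection concrete, here is a minimal sketch of the check a server might perform under HTTP Basic authentication, one common scheme. The user table and function name are hypothetical, and a real deployment would normally rely on the web server's built-in authentication, store hashed passwords, and use an encrypted connection.

```python
import base64
import hmac

# Hypothetical credential table; a real server would store hashed passwords.
USERS = {"partner": "s3cret"}


def is_authorized(authorization_header, users=USERS):
    """Check an HTTP Basic 'Authorization' header value against known users."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False  # no credentials sent, as with a search engine spider
    try:
        raw = base64.b64decode(authorization_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:  # covers malformed base64 and undecodable bytes
        return False
    username, separator, password = decoded.partition(":")
    if not separator or username not in users:
        return False
    # Constant-time comparison avoids leaking how much of the password matched.
    return hmac.compare_digest(password.encode(), users[username].encode())


header = "Basic " + base64.b64encode(b"partner:s3cret").decode("ascii")
print(is_authorized(header))  # a request with valid credentials
print(is_authorized(None))    # a request with no credentials
```

A request that fails this check would get a "401 Unauthorized" response instead of the document.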

Authentication systems will stop crawler-based search engines in their tracks. It's an even better solution than using a robots.txt file, because listing sensitive data that you don't want indexed by a spider in your robots.txt file is essentially a menu for any human who reads the file to find that information. An authentication system reveals nothing, and it has the added plus of keeping humans out, as well.
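To see why a robots.txt file can act as a menu, consider this minimal Python sketch, which pulls every path out of the file's Disallow lines. The sample file and the paths in it are invented for illustration, and the parsing is deliberately simplified (it ignores which User-agent each rule applies to).

```python
def disallowed_paths(robots_txt):
    """Return every non-empty path named on a Disallow line."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop trailing comments
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical robots.txt contents, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /private/payroll.xls   # meant to keep spiders away
Disallow: /internal/passwords.doc
Disallow:
"""

for path in disallowed_paths(SAMPLE):
    print(path)
```

Anyone can fetch a site's robots.txt file, so every path it lists is visible to a curious human even though well-behaved spiders skip those paths.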

Keep this in mind. None of the major search engine spiders will try to access authenticated information. However, a custom spider or a nefarious human may still try to hack their way in. Authentication is a barrier to them, but not absolute protection.


Google Unveils More of the Invisible Web
SearchDay, Oct. 31, 2001

In-depth review of new coverage of non-HTML files provided by Google. Search Engine Watch members -- use the link on this page to reach the members-only edition written for you, which covers issues about making sure the titles of your non-HTML documents make sense and how to prevent non-HTML documents from being indexed.

New internet search could turn up viruses
New Scientist, Nov. 28, 2001

Touches on issues in the story above, with more quotes from Google.

The Google attack engine
The Register, Nov. 28, 2001

Hackers might be able to use Google to attack servers, switches and routers, this article says.

Google, others dig deep--maybe too deep
News.com, Nov. 26, 2001

A long, in-depth look at the security concerns, with quotes from various analysts.