Search Engines Uncover Compromising Documents

Using a search engine and free software tools, it’s possible to dig up hidden — even deleted — information in documents posted to public web sites.

Many search engines allow you to restrict your search results to non-HTML documents, such as Microsoft Office documents, PDF files, and others. In addition to their visible text, these documents often contain other information that was never intended to be seen by readers.
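As a rough illustration of such a restricted search, the sketch below builds a query string limited to one file type. It assumes the engine understands a `filetype:` operator, which some engines support and others do not; the function names are hypothetical and exist only for this example.

```python
from urllib.parse import quote_plus


def build_filetype_query(terms: str, extension: str) -> str:
    """Return a query restricted to one file extension.

    Assumption: the search engine supports a `filetype:` operator.
    """
    return f"{terms} filetype:{extension.lstrip('.')}"


def encode_for_url(query: str) -> str:
    """URL-encode the query so it can be used as a query-string value."""
    return quote_plus(query)


query = build_filetype_query("resume", ".doc")
print(query)                  # resume filetype:doc
print(encode_for_url(query))  # resume+filetype%3Adoc
```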

This information includes metadata such as the author’s name, organization, and editing history. It can also include custom data, such as the names of document reviewers and the person the document was received from.

In addition to this metadata, many programs also store recently deleted text, allowing you to “undo” unwanted changes. Using simple, freely available software tools, much of this hidden metadata and seemingly deleted text can be converted into visible plain text.
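To see why leftover text is so easy to recover, consider a crude scan for runs of printable characters in a file's raw bytes, similar in spirit to the Unix `strings` utility. This is only a sketch of the general idea, not how any particular conversion tool works. It assumes text is stored either as single-byte ASCII or as UTF-16LE, and the minimum run length of 6 is an arbitrary choice.

```python
import re
from typing import List

# Runs of at least 6 printable ASCII bytes (shorter runs are mostly noise).
ASCII_RUN = re.compile(rb"[\x20-\x7e]{6,}")
# The same characters stored as UTF-16LE: each one followed by a zero byte.
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){6,}")


def extract_text_runs(data: bytes) -> List[str]:
    """Return printable text runs found anywhere in the raw bytes."""
    runs = [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]
    runs += [m.group().decode("utf-16-le") for m in UTF16_RUN.finditer(data)]
    return runs


# A made-up byte string standing in for a binary document.
sample = (
    b"\x00\x01"
    + b"Visible body text"
    + b"\xff\xfe\x00"
    + "deleted: 123-45-6789".encode("utf-16-le")
    + b"\x02\x03"
)
for run in extract_text_runs(sample):
    print(run)
```

Anything that survives in the file as readable characters, whether or not the authoring program still displays it, will show up in a scan like this.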

Simon Byers, an AT&T security researcher, used a search engine to find more than 100,000 Microsoft Word files on the web, including business documents and resumes. He then used the free software tools “antiword” and “catdoc” to convert them to plain text.
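A conversion like this can be scripted. The sketch below is not Byers’ actual procedure. It simply assumes that a converter such as antiword or catdoc is installed, and that running it with a file name as the only argument writes the converted text to standard output. The function name `doc_to_text` is hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def doc_to_text(path: str, tool: str = "antiword") -> Optional[str]:
    """Run `tool path` and return what it prints, or None if the tool is missing."""
    if shutil.which(tool) is None:
        return None  # converter not installed on this machine
    result = subprocess.run(
        [tool, path],
        capture_output=True,
        text=True,
        errors="replace",  # do not fail on bytes that are not valid text
        check=False,
    )
    return result.stdout


text = doc_to_text("resume.doc")  # "resume.doc" is a placeholder file name
if text is None:
    print("converter not found")
```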

Byers found deleted text and information including names, email headers, network paths and text from related documents — potentially compromising information that people publishing the documents to the web likely did not realize was included.
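Once a document has been converted to plain text, spotting this kind of information is largely a matter of pattern matching. The following sketch checks a string for email addresses, Windows network paths, and numbers shaped like Social Security numbers. The patterns are deliberately simple assumptions made for this example, and they will produce both misses and false positives on real documents.

```python
import re
from typing import Dict, List

PATTERNS = {
    # Something that looks like user@host.tld
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Windows network (UNC) paths such as \\server\share\file
    "network_path": re.compile(r"\\\\[A-Za-z0-9._$-]+(?:\\[^\\\s]+)+"),
    # Digits grouped as 3-2-4, the usual Social Security number layout
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_sensitive(text: str) -> Dict[str, List[str]]:
    """Return every match for each pattern, keyed by pattern name."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


# Made-up text standing in for the output of a document converter.
sample = r"From: jane@example.com saved to \\fileserver\hr\resumes\draft.doc SSN 123-45-6789"
for name, matches in find_sensitive(sample).items():
    print(name, matches)
```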

Byers suggested that job seekers, in particular, may not realize that even if they delete their Social Security number from a resume posted to the web, the number may still be stored in the file and accessible to someone intent on identity theft.

New Scientist has an excellent report on Byers’ research, which has been submitted for publication in the IEEE journal Security & Privacy.

If you post non-HTML documents to the web, how can you make sure potentially compromising information is not included?

A safer approach is to convert the document to plain text, then paste the text into a new document. Then use the “File, Properties” command to check what metadata the new document contains. This method isn’t foolproof. To be absolutely certain a document doesn’t contain information you don’t want revealed, publish it as a simple HTML file.
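Publishing as simple HTML can itself be done from the plain-text version, so that nothing from the original file format carries over. A minimal sketch, assuming paragraphs in the text are separated by blank lines; the function name and the sample text are made up for illustration.

```python
import html


def text_to_html(title: str, text: str) -> str:
    """Wrap plain text in a minimal HTML page, one <p> per paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    # Escape &, < and > so the text cannot be misread as markup.
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"{body}\n"
        "</body>\n</html>\n"
    )


page = text_to_html("Resume", "Jane Doe\n\nSkills: C & C++")
print(page)
```

Because the page is built only from the text you pass in, it contains exactly what you can see and nothing else.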
