Using a search engine and free software tools, it’s possible to dig up hidden — even deleted — information in documents posted to public web sites.
Many search engines allow you to restrict your search results to non-HTML documents, such as Microsoft Office documents, PDF files, and others. In addition to the visible text, these documents often contain other information that was never intended to be seen by readers.
This information includes metadata such as author name, organization, and editing history, and can also include custom data such as the names of document reviewers, the person the document was received from, and so on.
In addition to this metadata, many programs also store recently deleted text, allowing you to “undo” unwanted changes. Using simple, freely available software tools, much of this hidden metadata and seemingly deleted text can be converted into visible plain text.
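To illustrate why leftover text is so easy to recover, here is a minimal Python sketch that pulls runs of readable characters out of a file's raw bytes. It is an illustration only, not how dedicated converters work: the function name `extract_strings` and the six-character minimum are assumptions chosen for this example, and the sketch only looks for plain ASCII and UTF-16LE text.

```python
import re


def extract_strings(data: bytes, min_len: int = 6) -> list[str]:
    """Return runs of printable text found in raw bytes.

    Hypothetical helper for illustration: it looks for printable ASCII
    runs and for the same characters stored as UTF-16LE (each character
    followed by a zero byte). It does not parse any document format.
    """
    found = []
    # Runs of at least min_len printable ASCII characters.
    ascii_run = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # The same characters in UTF-16LE: printable byte, then a zero byte.
    utf16_run = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    for match in ascii_run.finditer(data):
        found.append(match.group().decode("ascii"))
    for match in utf16_run.finditer(data):
        found.append(match.group().decode("utf-16-le"))
    return found


if __name__ == "__main__":
    import sys

    # Usage: python extract_strings.py some_document.doc
    with open(sys.argv[1], "rb") as handle:
        for text in extract_strings(handle.read()):
            print(text)
```

Anything the authoring program left in the file uncompressed, whether or not it is displayed on screen, would show up in this kind of scan.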
Simon Byers, an AT&T security researcher, used a search engine to find more than 100,000 Microsoft Word files on the web, including business documents and resumes. He then used the free software tools “antiword” and “catdoc” to convert them to plain text.
Byers found deleted text and information including names, email headers, network paths and text from related documents — potentially compromising information that people publishing the documents to the web likely did not realize was included.
Byers suggested that job seekers, in particular, may not realize that even if they delete their Social Security number from a resume posted to the web, the number may still be stored in the file and accessible to someone intent on identity theft.
New Scientist has an excellent report on Byers’ research. The research itself has been submitted for publication in the IEEE journal Security and Privacy.
If you post non-HTML documents to the web, how can you make sure potentially compromising information is not included?
The safest way is to convert the document to plain text, then paste the text into a new document. Then, use the “File, Properties” command to check what metadata the new document contains. This method isn’t foolproof. To be absolutely certain a document doesn’t contain information you don’t want revealed, publish it as a simple HTML file.
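As an extra check before publishing, you could scan the finished file for specific values you know must not appear in it. The Python sketch below is one rough way to do that. The function name `find_leaks` and the sample values are hypothetical, and the check only catches terms stored as plain UTF-8 or UTF-16LE bytes, so it would miss anything compressed or otherwise encoded inside the file.

```python
def find_leaks(data: bytes, terms: list[str]) -> list[str]:
    """Return the terms that appear in the raw bytes of a file.

    Hypothetical helper for illustration: each term is searched for as
    UTF-8 bytes and as UTF-16LE bytes. Compressed or otherwise encoded
    content is not inspected, so an empty result is not a guarantee.
    """
    leaked = []
    for term in terms:
        needles = (term.encode("utf-8"), term.encode("utf-16-le"))
        if any(needle in data for needle in needles):
            leaked.append(term)
    return leaked


if __name__ == "__main__":
    import sys

    # Usage: python find_leaks.py resume.doc 123-45-6789 "Old Employer"
    with open(sys.argv[1], "rb") as handle:
        contents = handle.read()
    for term in find_leaks(contents, sys.argv[2:]):
        print("Still present in file:", term)
```

A clean result only means those exact strings were not found in readable form, so it complements the steps above rather than replacing them.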