SEO News

Deleted "Sensitive" Web Sites Still Available via Google


Heightened security concerns have led a number of organizations to remove "sensitive" information from their web sites, yet much of this information is still available, even to people with relatively modest searching skills.

Among the organizations removing content is the Federation of American Scientists (FAS), which removed diagrams and photos of U.S. intelligence facilities from its web site, according to an Associated Press article. A Washington Post story reported that the Centers for Disease Control and Prevention removed a "Report on Chemical Terrorism," which describes industry's shortcomings in preparing for a possible terrorist attack.

Both the CDC report and thousands of apparently deleted pages from the FAS web site are still available via Google's cache feature. Also available is a page with detailed information about where and how the FAS obtained some of the images and maps it formerly made available on its site.

A search on Google for the phrase "U.S. intelligence facilities," limited to the FAS web site, returned more than 4,500 results. Clicking most of the result links returned "404 - Not Found" messages. This is to be expected, since the pages were taken down relatively recently and Google has not yet recrawled the site and removed the now-dead links from its database.
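
You can reproduce this kind of search yourself with Google's site: operator; the query below is one way to express it, assuming the FAS site lives at fas.org:

"U.S. intelligence facilities" site:fas.org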

However, clicking the "cached" link for more than a dozen of these search results displayed the expected page. Most, if not all, of the deleted pages will remain available until Google recrawls the site and removes the dead links from its index.

Google isn't proactively removing content from its database, even for cases like the FAS site which was reported in major media outlets. "We're not doing anything special," said Google spokesperson David Krane. "The Net is changing so fast right now there's just too much to stay on top of."

Google provides a number of methods for removing pages from its cache, or preventing a page from being cached when the crawler first discovers it. For urgent situations, Google also provides an automated means for removing large numbers of pages from its cache.

The most direct way to prevent a crawler from indexing pages on your site is to use the robots.txt file to disallow automated retrieval of your content. But if you're not careful, the syntax used in this file can actually prevent your site from being indexed by any search engine.
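
As a rough sketch, a robots.txt file placed at the root of your site might contain entries like these (the directory name here is purely illustrative):

User-agent: *
Disallow: /sensitive-reports/

Be careful: the broader rule "Disallow: /" under "User-agent: *" tells every well-behaved crawler to skip your entire site, which is how a small slip in this file can keep you out of all the search engines.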

A more precise way to prevent content from being included in Google's cache is to use the NOARCHIVE meta tag.

If you want to prevent all robots (crawlers) from archiving content on your site, place this tag in the <head> section of your documents as follows:
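
<META NAME="ROBOTS" CONTENT="NOARCHIVE">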

If you want to allow other indexing robots to archive your page's content, preventing only Google's robots from caching the page, use the following tag:
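
<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">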

This tag only removes the "cached" link for the page. Google continues to index the page and display a "snippet," allowing users to visit the page on your site.

If you want your content removed from Google's index as quickly as possible, use Google's automatic URL removal system. To use this system, you must first register with Google. Using this system, you can remove either a single page or groups of pages, images or even subdirectories.

Google uses the robots.txt file to know which pages you want removed from its index, so you'll need to prepare or modify this file before using the automated removal system. More information on how to set up a robots.txt file can be found on The Web Robots Pages site.
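
For example, a robots.txt file along these lines (the directory name is again hypothetical) would tell the removal system to drop everything under that directory from Google's index:

User-agent: Googlebot
Disallow: /intelligence-images/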

Once you've submitted your request, your content will be removed from Google's server, generally within 24 hours. Google offers a status indicator allowing you to check the progress of your request online.

Remove Content from Google's Index
http://www.google.com/remove.html

The Web Robots Pages
http://www.robotstxt.org/wc/norobots.html

Some of the content reported by the AP and Washington Post as being deleted is not available via Google, or any other search engine, for that matter. Most of this information appears to have been stored in web-accessible databases, making it part of the Invisible Web.

An example is the Environmental Protection Agency's RMP*Info Risk Management Plan Database. This system allowed extensive searching of hazardous materials stored at thousands of locations by facility name, chemical name, or geographic location. The database's search form is still available via Google's cache, but attempting to search the database using the form simply returns a "404 - Not Found" message.

While the concerns of organizations that have removed content from the web are understandable, it's unfortunate that the web no longer has the degree of openness it once had. What's happening is not blanket censorship: many organizations that have yanked web content continue to make sensitive information available at their offices to people showing appropriate identification.

But it's just more difficult now. That said, clever searchers will find other ways to find the kind of detailed information no longer available on the FAS site. For example, TerraServer has a huge database of aerial and satellite images from more than 60 countries, allowing you to select and print high-resolution images sized as small as 1.96 square kilometers.

Gone are the days when you could find "anything" on the web. Chalk up another casualty of the "events" of September 11th.

Web Sites Pull Intelligence Data
http://news.excite.com/news/ap/011003/16/attacks-net-censorship
An Associated Press story reporting on pages removed by the Federation of American Scientists, the National Imagery and Mapping Agency, the U.S. Office of Pipeline Safety and others in the wake of the terrorist attacks on the U.S.

Agencies Scrub Web Sites Of Sensitive Chemical Data
http://www.washingtonpost.com/wp-dyn/articles/A2738-2001Oct3.html
The Washington Post reports on how some federal agencies have been removing documents from Internet sites to keep them away from terrorists.

Intelligence data pulled from websites
http://news.bbc.co.uk/low/english/sci/tech/newsid_1580000/1580863.stm
The BBC reports that sensitive documents and reports have been pulled from websites across the internet following the 11 September attacks.

TerraServer
http://www.terraserver.com/
TerraServer is a huge, freely accessible database of satellite imagery that's growing by an additional 5,000 square kilometers each day.

How Google Works
http://www.searchenginewatch.com/subscribers/google.html
A detailed look under the hood at all aspects of Google's operation.

