A longer version of this article looking more
At the end of last month, controversy erupted over the US White House preventing portions of its web site from being indexed by search engines. Was the White House doing this as a means to rewrite history unnoticed, or was it an innocent mistake?
At issue is the White House's robots.txt file, a long-standing mechanism used to tell search engines that certain material should not be indexed.
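For readers who haven't worked with the file directly, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching a page. It uses Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` agent name and the example.gov URLs are hypothetical, chosen only for illustration; they are not the White House's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not the White House's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url, agent="ExampleBot"):
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(may_crawl("http://www.example.gov/private/notes.html"))  # False
print(may_crawl("http://www.example.gov/news/index.html"))     # True
```

Note that the check happens entirely on the crawler's side. The file is a request, and nothing on the web server enforces it.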
Keith Spurgeon has an excellent analysis of what happened. He outlines how sometime between April and late October of this year, the White House changed its robots.txt file to block access to many pages with the word "Iraq" in their URLs.
Same Page; Multiple URLs
Sound suspicious? For its part, the White House says this was done to prevent duplicate pages from being indexed. To understand, consider these URLs:
All of these have the exact same material, a July 2003 transcript of US president George W. Bush speaking about Afghanistan. (I should say that the transcripts all look the same; I haven't compared them word for word.)
Why are there four different copies of this? The first uses a default White House "template," a look and feel that's applied to the page to make it appear part of the White House web site. This template does things like put the presidential seal in the top left-hand corner of the page and insert some navigational links on the left-hand side and top of the page.
The second URL is a text-only version of the same page. If you scroll to the bottom of the first URL shown -- the "normal" page -- you'll see a link saying "Text only." That link brings up the second URL.
Similarly, the third URL is a version of the first page designed to be printed, which you'd get if you clicked on the option saying "Printer-Friendly Version" that is shown on the original first page.
How about that fourth one? Select it, and you'll see that an "Iraq" template is used. This template has a look and feel for information apparently considered part of the White House's Iraq: Special Report section.
As you can see, the same page can be found in at least four different ways. There may be even more, though depending on the section of the White House site you are in, there could also be fewer.
The White House explained to 2600 News that it wanted to ensure that only one version of its pages got into the search engines, so that people wouldn't encounter multiple copies. That's a completely reasonable explanation. Search Engine Watch does the same thing itself, though by using the meta robots tag.
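The meta robots tag works page by page rather than through one site-wide file: a duplicate version of a page carries a tag asking not to be indexed. As a rough sketch of how a crawler might read that tag, the following uses Python's standard `html.parser`. The two sample pages and the `MetaRobotsParser` and `is_indexable` names are assumptions made up for this example, not markup taken from any real site.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for directive in attributes.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def is_indexable(html):
    """Return False if the page asks not to be indexed."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" not in parser.directives


# Hypothetical pages: the printer-friendly copy opts out of indexing.
NORMAL_PAGE = "<html><head><title>Transcript</title></head></html>"
PRINT_PAGE = ('<html><head><meta name="robots" content="noindex, follow">'
              "<title>Transcript</title></head></html>")

print(is_indexable(NORMAL_PAGE))  # True
print(is_indexable(PRINT_PAGE))   # False
```

One practical difference from robots.txt is that a crawler has to fetch the page to see the tag at all, so this approach keeps duplicates out of the index without hiding them from the spider.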
Really Just A Mistake?
While the White House explanation seems reasonable, the way the restrictions were implemented caused some content that doesn't actually exist to be blocked, while ALL versions of some other content were completely restricted.
Since the White House uses the word "iraq" to create pages with its Iraq template, it seems to have decided to add the word "iraq" to many preexisting restrictions in the robots.txt file. Again, Keith Spurgeon offers a good analysis of this.
As a result, that Iraq: Special Report section is an example of content that was completely blocked, since the word Iraq is in its URL:
That's now been corrected, but having blocked such important content is enough for some to disbelieve the White House's explanation.
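To see how this kind of over-blocking can happen, remember that a Disallow line matches by path prefix. A rule meant to catch template duplicates will also catch any content that lives only under that prefix. The sketch below checks which documents have no crawlable copy at all. Every path, the `DOCUMENTS` map and the `fully_blocked` helper are hypothetical, invented to illustrate the pattern; they are not the actual White House rules or URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: "iraq" variants added alongside an existing restriction.
ROBOTS_TXT = """\
User-agent: *
Disallow: /news/text
Disallow: /news/iraq
Disallow: /reports/iraq
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical map of each document to every path it can be reached at.
DOCUMENTS = {
    "speech": ["/news/2003/speech.html", "/news/iraq/2003/speech.html"],
    "special report": ["/reports/iraq/index.html"],
}


def fully_blocked(documents, agent="ExampleBot"):
    """Return the documents that have no crawlable version left."""
    return [
        name
        for name, paths in documents.items()
        if not any(parser.can_fetch(agent, path) for path in paths)
    ]


print(fully_blocked(DOCUMENTS))  # ['special report']
```

In this made-up case the speech is still reachable through its normal URL, which is the intended result. The special report exists only under a blocked prefix, so it is shut out entirely.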
What's Really Blocked?
It should be noted that even using the robots.txt file does NOT prevent material from showing up in a Google search, as the Democratic National Committee wrote.
Remember I said the robots.txt file was used to block some content that had the word "iraq" in the URL? If this kept content out of Google, then a search for all pages from the White House web site with the word "iraq" in the URL should bring up nothing. Instead, it gets 1,100 matches. Let's look at two of these, as they appear at Google:
These URLs haven't been spidered by Google, which is why you see no descriptions. Google calls them partially-indexed, but I prefer the term link-only listings. That's because Google only knows about these URLs from what other people say when linking to them, rather than from having read the pages themselves.
Even link-only data can be pretty powerful. Do a search for major combat operations. You'll see that www.whitehouse.gov/news/releases/2003/05/iraq/20030501-15.html is listed just as shown above, ranked fifth. Google clearly understands it's about that topic.
How? That's a pretty important White House release, covering remarks where Bush announced that combat operations had ended in Iraq. If that's not enough, this release was apparently changed to insert the word "major" to modify "combat operations" in the headline. As a result, there are plenty of people linking to the page (Google reports 440) and using those words in or near the link. That helps make the link relevant for those terms.
By the way, in about a month or less, many of these link-only URLs will change to be fully-indexed, now that the restrictions have been lifted. It takes Google that long to revisit some of the pages in its database.
What About The Cache?
Remember that press release where the title was changed by the White House? I've seen it suggested that the ability to pull up old copies using the Google caching feature or the Internet Archive made spotting this possible, though the account of the change that I've seen people point to makes no mention of that.
Certainly both features are handy ways to check for government revisionism of any type. But the Google caching feature is transient. At most, you're only likely to go back in time for about a month. The Internet Archive goes back further, but there can be several days or weeks between "snapshots" that it makes.
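If you do have two saved copies of a page, spotting a change like an inserted word is easy to automate. Here is a minimal sketch using Python's standard `difflib` to compare two snapshots word by word. The two headlines are illustrative stand-ins, not the verbatim text of the release, and `changed_words` is a name invented for this example.

```python
import difflib


def changed_words(old_text, new_text):
    """Return (operation, old_words, new_words) for each differing span."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
    return [
        (op, old_words[i1:i2], new_words[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Illustrative headlines only, standing in for an older and a newer snapshot.
old_snapshot = "President Announces Combat Operations Have Ended"
new_snapshot = "President Announces Major Combat Operations Have Ended"

print(changed_words(old_snapshot, new_snapshot))
# [('insert', [], ['Major'])]
```

The comparison is only as good as the snapshots behind it. If a change is made and reversed between two captures, a check like this will never see it.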
Another tool, though far less convenient, is likely the US Freedom Of Information Act or similar policies. They are supposed to give citizens access to public records. And in the case of the White House web site, I would hope that any change is recorded.
Hope isn't certainty, of course. In fact, guest author Marylaine Block in SearchDay last December explained that there seem to be no procedures for documenting changes at all. Given this, those really concerned about changes might consider crawling government web sites regardless of robots.txt commands, as I've already seen one person suggest.
It's helpful to go back to the origins of Martijn Koster's creation, the robots.txt file. It emerged originally not to block indexing but to keep "rogue" spiders under control. These were early spiders that hit web servers so hard for content that the web server collapsed under the load. (The book Bots, by Andrew Leonard, has a nice history of this.)
Robots.txt is a polite thing for crawlers to consider, but they aren't legally required to do so. Anyone can spider at will. Ultimately, it will be a court that will decide if they are doing something wrong.
eBay did win a "trespassing" case against Bidder's Edge, which spidered the eBay site for listings (see my Search Engines & Legal Issues: Crawling & Linking page for stories on this). However, it's difficult to imagine that a court might consider a public citizen or public interest group as trespassing by making copies of documents on a publicly-owned web server. Of course, spider too hard, and you might find an agency trying to suggest you are hacking the web server!
Finally, it's important to note that the White House apparently had asked the Internet Archive to ignore its robots.txt commands before this entire issue erupted, which lends support to the idea that this was all an honest mistake.