At the end of last month, controversy erupted over the US White House preventing portions of its web site from being indexed by search engines. Was the White House doing this as a means to rewrite history unnoticed, or was it an innocent mistake?
At issue is the White House's robots.txt file, a long-standing mechanism used to tell search engines that certain material should not be indexed.
Keith Spurgeon has an excellent analysis of what happened. He outlines how sometime between April and late October of this year, the White House changed its robots.txt file to block access to many pages with the word "Iraq" in their URLs.
Same Page; Multiple URLs
Sound suspicious? For its part, the White House says this was done to prevent duplicate pages from being indexed. To understand, consider these URLs:
All of these have the exact same material, a July 2003 transcript of US president George W. Bush speaking about Afghanistan (I should say, the transcripts all look the same. I haven't compared them each word for word).
Why are there four different copies of this? The first uses a default White House "template," a look and feel that's applied to the page to make it appear part of the White House web site. This template does things like put the presidential seal in the top left-hand corner of the page and insert some navigational links on the left-hand side and top of the page.
The second URL is a text-only version of the same page. If you scroll to the bottom of the first URL shown -- the "normal" page -- you'll see a link saying "Text only." That link brings up the second URL.
Similarly, the third URL is a version of the first page designed to be printed, which you'd get if you clicked on the options saying "Printer-Friendly Version" that are shown on the original first page.
How about that fourth one? Select it, and you'll see that an "Iraq" template is used. This template has a look and feel for information apparently considered part of the White House's Iraq: Special Report section.
As you can see, the same page has at least four different ways that it can be found. There may be even more, though depending on the section of the White House site you are in, there could also be fewer.
The White House explained to 2600 News that it wanted to ensure that only one version of its pages got into the search engines, so that people wouldn't encounter multiple copies. That's a completely reasonable explanation. Search Engine Watch does the same thing itself, though by using the meta robots tag.
Print-Only At Search Engine Watch
Specifically, Search Engine Watch offers both "normal" and "printer-friendly" pages. Just look at the URL of this page you are reading (for those not getting the article via email) and compare it to the second URL, which brings up a printer-friendly version:
To prevent our printer pages from being indexed, we make use of the meta robots tag. Any printer-only page carries this tag.
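As a rough illustration of what a crawler does with such a tag, here is a minimal Python sketch that scans a page for a meta robots tag and reports whether it asks not to be indexed. The sample page and the "noindex" value are assumptions made for the example; the exact tag Search Engine Watch uses is not reproduced here, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives found in any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # The content attribute is a comma-separated list, e.g. "noindex, follow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def is_noindex(html: str) -> bool:
    """Return True if the page carries a meta robots tag that includes 'noindex'."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Hypothetical pages: a printer-only page with the tag, a normal page without it.
PRINT_PAGE = (
    '<html><head><meta name="robots" content="noindex"></head>'
    "<body>Printer-friendly text</body></html>"
)
NORMAL_PAGE = "<html><head><title>Article</title></head><body>Text</body></html>"

if __name__ == "__main__":
    print("print page asks not to be indexed:", is_noindex(PRINT_PAGE))
    print("normal page asks not to be indexed:", is_noindex(NORMAL_PAGE))
```

Because the tag travels with the page itself, it works no matter where the page sits in the URL structure.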
Why'd we go the meta robots tag route, rather than using a robots.txt block? The reason is that robots.txt isn't flexible enough. All printer-only pages in our system will contain /print.php/ (a virtual directory) in their URLs, but the exact location of this virtual directory section will vary. That means we can't block by doing something like this with the robots.txt file:
Instead, we'd need to do something like this:
That will work for Google, because Google has its own extension of the robots.txt file that allows the "wildcard" * character to be used. But other search engines may not support this extension. That's why we went with the safer meta robots tag option. Our content management system makes it easy to insert this tag on every print-only page.
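To make the difference concrete, here is a minimal Python sketch of the two matching styles. The rule strings and URL paths are invented for illustration and are not the actual Search Engine Watch rules. The wildcard matcher is a simplified stand-in for how an engine that supports the extension might behave, not Google's real implementation.

```python
import re


def blocked_by_prefix(rule: str, path: str) -> bool:
    """Original-style matching: a Disallow rule blocks any path that starts with it."""
    return path.startswith(rule)


def blocked_by_wildcard(rule: str, path: str) -> bool:
    """Extension-style matching: '*' in the rule stands for any run of characters."""
    # Escape the literal pieces of the rule and join them with ".*".
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None


# Hypothetical paths: the virtual /print.php/ directory appears mid-URL.
PRINT_PATH = "/searchday/print.php/article-2"
NORMAL_PATH = "/searchday/article-2"

if __name__ == "__main__":
    # A plain prefix rule only catches print.php at the very start of the path.
    print(blocked_by_prefix("/print.php/", PRINT_PATH))
    # A wildcard rule catches it wherever the virtual directory shows up.
    print(blocked_by_wildcard("/*/print.php/", PRINT_PATH))
    # An engine without wildcard support reads the '*' literally and matches nothing.
    print(blocked_by_prefix("/*/print.php/", PRINT_PATH))
```

The last call is the reason for caution: a crawler that only does prefix matching treats the wildcard rule as a literal path, so the print pages stay open to it.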
By the way, if you look at the Search Engine Watch robots.txt file at the moment, you'll see that there is a line using a wildcard:
So what's up with that? It was inserted by our development team, apparently copied over from other sites they work on. It does nothing for the site and never got removed after I spoke with them about the issue of our print-only pages getting indexed earlier this year.
Doh! As you can imagine, I'll be dropping them a message about this. But as the reference has no meaning for our site, it's also had no impact on it. It also shows how easy it can be to misread what a site may be intending, just by looking at its robots.txt file.
Really Just A Mistake?
While the White House explanation seems reasonable, the way restrictions were implemented caused some content that doesn't actually exist to be blocked, while ALL versions of some other content were completely restricted.
Since the White House uses the word "iraq" to create pages with its Iraq template, it seems to have decided to add the word "iraq" to many preexisting restrictions in the robots.txt file. Again, Keith Spurgeon offers a good analysis of this.
As a result, that Iraq: Special Report section is an example of content that was completely blocked, since the word Iraq is in its URL:
That's now been corrected, but having blocked such important content is enough for some to disbelieve the White House's explanation.
What's Really Blocked?
It should be noted that even using the robots.txt file does NOT prevent material from showing up in a Google search, as the Democratic National Committee wrote.
Remember I said the robots.txt file was used to block some content that had the word "iraq" in the URL? If this kept content out of Google, then a search for all pages from the White House web site with the word "iraq" in the URL should bring up nothing. Instead, it gets 1,100 matches. Let's look at two of these, as they appear at Google:
These URLs haven't been spidered by Google, which is why you see no descriptions. Google calls them partially-indexed, but I prefer the term link-only listings. That's because Google only knows about these URLs from what other people say when linking to them, rather than from having read the pages themselves.
Even link-only data can be pretty powerful. Do a search for major combat operations. You'll see that www.whitehouse.gov/news/releases/2003/05/iraq/20030501-15.html is listed just as shown above, ranked fifth. Google clearly understands it's about that topic.
How? That's a pretty important White House release, covering remarks where Bush announced that combat operations had ended in Iraq. If that's not enough, this release was apparently changed to insert the word "major" to modify "combat operations" in the headline. As a result, there are plenty of people linking to the page (Google reports 440) and using those words in or near the link. That helps make the link relevant for those terms.
By the way, in about a month or less, many of these link-only URLs will change to be fully-indexed, now that the restrictions have been lifted. It takes Google that long to revisit some of the pages in its database.
But I See A Description!
Sometimes, you will see a description for pages that Google hasn't spidered. For instance, look at this search for iraq on Google. The Iraq: Special Report page that was blocked by the White House robots.txt file comes up 16th in the results. Change to iraq report, and it's number seven. Here's how it appears:
Whitehouse.gov: Operation Iraqi Freedom
Description: Provides a summary of the position of the US government, illustrated by presidential remarks and speeches...
Category: Society > Issues > ... > North America > United States
www.whitehouse.gov/infocus/iraq/ - Similar pages
Since there's a description, you might assume it had been crawled. If that were so, there'd be a little link saying "Cached" next to the URL. Clicking on that would have brought up a copy of the page as Google saw it, assuming that caching itself had not been blocked (there's no evidence the White House has done this).
So where's the description come from? Though Google hasn't crawled it, editors of the Open Directory have listed it in their human-compiled guide to the web. Google, which uses the Open Directory for its Google Directory, has seen this. Thus, it is able to give this link-only listing an actual description.
What About The Cache?
Remember that press release where the title was changed by the White House? I've seen it suggested that the ability to pull up old copies using the Google caching feature or the Internet Archive made spotting this possible, though this account of the change I've seen people point at makes no mention of that.
Certainly both features are handy ways to check for government revisionism of any type. But the Google caching feature is transient. At most, you're only likely to go back in time for about a month. The Internet Archive goes back further, but there can be several days or weeks between "snapshots" that it makes.
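For anyone who does keep their own copies, comparing two snapshots is simple to automate. Here is a minimal Python sketch using the standard difflib module. The two snapshot strings are invented placeholders, not the actual text of any White House release, and the function name is hypothetical.

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return only the removed ("- ") and added ("+ ") lines between two snapshots."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Invented snapshots of the same page, saved at two different times.
EARLIER = "Headline A\nBody text that stays the same"
LATER = "Headline B\nBody text that stays the same"

if __name__ == "__main__":
    for line in changed_lines(EARLIER, LATER):
        print(line)
```

A comparison like this only shows that something changed between the two copies you happen to hold. It says nothing about when the change was made or how many edits happened in between.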
Another tool, though far less convenient, is likely the US Freedom Of Information Act or similar policies. They are supposed to give citizens access to public records. And in the case of the White House web site, I would hope that any change is recorded.
Hope isn't certainty, of course. In fact, guest author Marylaine Block in SearchDay last December explained that there seem to be no procedures on documenting changes at all. Given this, those really concerned about changes might consider crawling government web sites regardless of robots.txt commands, as I've already seen one person suggest.
It's helpful to go back to the origins of Martijn Koster's creation, the robots.txt file. It emerged originally not to block indexing but to keep "rogue" spiders under control. These were early spiders that hit web servers so hard for content that the web server collapsed under the load (the book Bots, by Andrew Leonard, has a nice history on this).
Robots.txt is a polite thing for crawlers to consider, but they aren't legally required to do so. Anyone can spider at will. Ultimately, it will be a court that will decide if they are doing something wrong.
eBay did win a "trespassing" case against Bidder's Edge, which spidered the eBay site for listings (see my Search Engines & Legal Issues: Crawling & Linking page for stories on this). However, it's difficult to imagine that a court might consider a public citizen or public interest group as trespassing by making copies of documents on a publicly-owned web server. Of course, spider too hard, and you might find an agency trying to suggest you are hacking the web server!
Finally, it's important to note that the White House apparently had asked the Internet Archive to ignore its robots.txt commands before this entire issue erupted, which lends support to the idea that this was all an honest mistake.