Spotted via Threadwatch,
Keeper of Expired Web Pages Is Sued Because Archive Was Used in Another Suit from the New York Times discusses how the Internet Archive is being sued for crawling the web and making copies of web pages. A
copyright infringement case against a search engine, then? Not exactly, as we’ll see.
At issue is a court case over trademarks, where evidence of past usage was found through the Internet Archive. Healthcare Advocates said copies of its pages
were made without permission. In particular, Healthcare Advocates says that despite making use of a robots.txt file, there were 92 occasions when its pages still managed to be accessed.
In a further twist, the company claims the law firm that got those pages violated the Digital Millennium Copyright Act by "circumventing" the
robots.txt exclusion.
Time for a good laugh at that, honestly. As the article explains, robots.txt is a voluntary opt-out measure designed for crawlers. It has no legal
bearing. In addition, nothing in a browser prevents someone from viewing pages that have been blocked by robots.txt. In short, no one has to circumvent robots.txt to view a
page. It doesn’t try to block that at all.
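To see just how voluntary robots.txt is, consider a quick sketch using Python's standard-library `urllib.robotparser` (the file contents and URLs below are hypothetical, for illustration only). A well-behaved crawler asks the parser before fetching; a browser or any plain HTTP client never consults the file at all:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all crawlers to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler checks the rules before fetching a URL...
print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://example.com/public/page.html"))   # True

# ...but nothing enforces this. A browser simply requests the URL
# directly and never reads robots.txt, so there is nothing to
# "circumvent" in the first place.
```

The whole scheme rests on the crawler choosing to call something like `can_fetch()` before downloading; skip that check and the pages come down just the same.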
As for the copyright infringement, from what I can see, the Internet Archive itself is not being sued for copyright infringement. Instead, it’s being
sued for allowing those copies to be seen despite a robots.txt block. The article says this failure has the Internet Archive under fire for "breach of contract and fiduciary
duty, negligence and other charges."
Interesting. I’d say absurd, but you never know; maybe the case will convince a court that a search engine has some type of binding contract with the
company that runs a web site, solely on the basis of crawling it. As I said, robots.txt is a voluntary mechanism to keep pages out of a crawler. It’s not a legal requirement.
Moreover, while I haven’t seen the case yet (Gary will probably dig it up and post here, if so), red flags already go up about the robots.txt file
preventing "public viewing" of the pages.
Robots.txt traditionally removes pages entirely from an index. They don’t hang around. That’s certainly what the Internet Archive
says. If a robots.txt block was in place, then at some point, the pages should have been entirely removed from the Internet Archive.
For some further reading, my Google & Other Search Engines: The WMDs Of Copyright
Infringement and Forget Google Print Copyright Infringement; Search Engines Already Infringe articles
cover how search engines make copies of billions of documents each month without permission, relying on the opt-out, non-legal provisions of robots.txt to hopefully keep them out of trouble.
Postscript (from Gary): If you would like to read the actual complaint filed in the lawsuit, I’ve posted a copy (48 pages; PDF) here.
Postscript 2: Internet Archive DMCA Circumvention Lawsuit from Seth Finkelstein looks at how the robots.txt file with the Internet Archive doesn’t actually remove content but rather simply suppresses display. And our forum thread, Implications of the Internet Archive Lawsuit, also looks at this and the important impact it can have if a domain name changes ownership. What you thought was removed might very well show up again.