We've noticed that the publicly announced number of permanently archived web pages in The Internet Archive's Wayback Machine grew by 15 billion today, from 40 billion to 55 billion pages. (These are permanently archived pages, as opposed to a web search cache.)
The increased page total has now been updated on the Wayback home page and it's also clearly visible on The IA's home page. Is The Wayback Machine a complete archive of every page on the web? No, not at all. However, it's the largest one-stop permanent web archive out there and it's a very important tool for all web researchers. Material in The Wayback Machine dates back to 1996. Kudos and congrats to Brewster and his team. Now, let's hope keyword search capabilities come back soon. Btw, you can also find direct links to The Wayback Machine from Gigablast (look for the "Older Copies" link next to each snippet) and from Yahoo (look for The Wayback Machine link in the top box when reviewing a page found in the Yahoo cache).
Postscript: You can also find direct links to The Wayback Machine via Alexa.com. On a results page, click the "Site Info" link and then look for The Wayback Machine links in the left column.
Posted by Gary Price at 6:48 PM | Permalink
Via Boing Boing and News.com, interesting news that a case claiming the Google cache violates copyright has been ruled in Google's favor. Since Google makes it possible to prevent it from showing a cached page, the court ruled the publisher should have used that option. In short, if you don't block caching, Google and other search engines have an "implied license" to reproduce your material. More in the court documents on the EFF site here (PDF format), and an EFF write up is here.
Postscript: Caching Made Legal - Do You Agree? I Don't! at the Search Engine Watch Forums has more analysis of this by me and some of the major concerns it raises. Read more or comment yourself over there. There's also excellent discussion at WebmasterWorld here.
Posted by Danny Sullivan at 7:57 AM | Permalink
Brewster and crew at The Internet Archive have just debuted a new specialty collection that contains more than 25 million fully archived, full-text searchable web pages, creating "an historical record of the devastation caused by Hurricane Katrina and the massive relief effort which followed."
Material in this archive was crawled and compiled from September 4 to October 17. More in this announcement. A complete list of the URLs crawled is available here. I'm honored that our ResourceShelf collection of Katrina resources was included in the archive. Other specialty "web collections" compiled by The Internet Archive can be found in the middle of this page. Btw, the search on the Hurricanes Katrina & Rita Web Archive is powered by Nutch.
Posted by Gary Price at 7:00 PM | Permalink
A new and growing web archive of "social, historic and culturally significant web-based material from the UK domain" is now online. It's called the UK Web Archive and is being developed by the UK Web Archiving Consortium (UKWAC) that includes members from the Joint Information Systems Committee of the Higher and Further Education Councils (JISC), The National Archives, The National Library of Wales, the National Library of Scotland and the Wellcome Trust.
Here are a few fast facts about the project:
+ UKWAC Has Been Archiving Material for Six Months
+ Browse by Topic or Search (You're searching metadata)
+ Archive Currently Contains 299 Titles and 1090 Web Sites
+ 84 GB of Data Archived So Far
+ Using PANDAS software developed at the National Library of Australia
Another project, the UK Government Web Archive, which comes from The National Archives and The Internet Archive, might also be of interest. Finally, if you've never visited the PANDORA archive from Australia, here's a link.
Posted by Gary Price at 7:04 PM | Permalink
Since we all love to toss around total database size numbers, I just noticed that in the past few hours the Internet Archive has posted a total size update for The Wayback Machine. The database now contains more than 40 billion archived pages. The previous total size number that the IA provided was 30 billion pages. I'm assuming that the total page count reflects updates that have been made to the database during the past couple of years. Now, if we could only get the searchable full text access that was once available back online. (-:
Posted by Gary Price at 6:44 PM | Permalink