What will it take for Google or another search engine to truly assemble a library of all of the world’s information? A thought-provoking essay by Wired magazine’s “senior maverick” takes a fascinating look at the challenges.
The various book scanning projects underway throughout the world don’t snare as much media coverage as higher-profile products and services introduced by the search engines, but they’re nonetheless important initiatives. As Wired co-founder Kevin Kelly writes in a recent New York Times Magazine article, “The dream is an old one: to have in one place all knowledge, past and present. All books, all documents, all conceptual works, in all languages.”
Building a Universal Library is a huge undertaking, and not just because the physical effort of scanning tens of millions of books is in itself such a massive task. Once scanned, the books must be indexed and made searchable, all the while respecting the copyrights of books not yet in the public domain.
Kelly offers some interesting stats about the current progress of various large-scale book scanning projects that we’ve written about at Search Engine Watch, such as Google Print, the Yahoo and Microsoft-backed Open Content Alliance, The Internet Archive’s Million Books Project and others.
He says these projects are scanning about a million books a year. Although this sounds like an impressive pace, it amounts to just 5% of all books currently in print. Fortunately, much of the new information created by humans is now in digital format, so it can more easily be included in the Universal Library without the extensive physical effort of scanning books.
And let’s not forget the web. Although the search engines have become fairly proficient at creating comprehensive indexes of the surface web, they’re still missing massive amounts of content located in databases or other dynamic sources (the Invisible web)—not to mention web pages that have disappeared.
“The grand library naturally needs a copy of the billions of dead Web pages no longer online and the tens of millions of blog posts now gone—the ephemeral literature of our time.”
Including this “ephemeral literature” could prove to be a major challenge. Various studies have put the “half-life” of an average web page at just under two years, with the half-life of a typical web site being just over two years.
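To get a feel for what a two-year half-life means, here is a rough sketch. It assumes pages vanish at a steady exponential rate, which the studies may or may not support, and it rounds the half-life to 2.0 years; the function name is purely illustrative.

```python
def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of pages still online after `years`, ASSUMING exponential decay."""
    return 0.5 ** (years / half_life_years)


# Assumed half-life of 2.0 years, rounded from the "just under two years" figure.
for years in (1, 2, 5, 10):
    share = surviving_fraction(years, 2.0)
    print(f"after {years:>2} years: {share:.1%} of pages remain")
```

Under those assumptions, only about 3 percent of the pages posted ten years ago would still be online today.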
The most complete publicly accessible archive of the web, the Internet Archive, holds some 55 billion pages in all, yet that is still just a fraction of all the content that has been posted to the web.
But I think it’s a fair bet that Google and Yahoo haven’t thrown away the pages they’ve crawled through the years. And there’s a precedent for digital restoration on a massive scale: Google’s painstaking effort to build an archive of Usenet.
Assembling archives stored on magnetic tape, CD-ROM and other sources, Google restored a comprehensive archive of Usenet, dating back to 1981, and made this available to users in December 2001. Although still not totally complete, the renamed Google Groups now likely contains more than 99 percent of all Usenet postings ever made.
It’s not unthinkable that Google and Yahoo, the longest surviving crawler-based engines, could collaborate to restore a comprehensive archive of the web. Surely there are data archives from search engines now long-gone that could also be mined to build out an archive.
Apart from the challenges of simply creating the Universal Library and making it searchable, Kelly thinks the entire paradigm of how we consume information must change. He envisions the emergence of Wikipedia-like directories where fans of particular types of information can write reviews, or create pointers to obscure works for other fans. In essence, we will all become librarians in the Universal Library, helping each other navigate the vast amount of information that’s difficult for us to cope with today.
And, just as we do with our digital music now, we’ll be able to mix and mash content to create “playlists” (Kelly calls them “bookshelves”) to share with others.
Ah, but what about copyright? How can we create mashups without violating existing laws? Kelly spends a lot of time analyzing the current state of copyright law, and how it poses a major barrier to the creation and fluid operation of the Universal Library.
These are just a few of the topics Kelly touches on in Scan This Book!, his terrific intellectual romp through the issues surrounding a Universal Library. It’s a fascinating and thoughtful read, well worth the attention of anyone who spends a lot of time consuming digital information and is impatiently awaiting the arrival of the Universal Library.