There's nothing more frustrating than turning up a promising lead in search results only to see the aggravating "404 - Page Not Found" message when you click to fetch the page. Although some pages are simply removed from the web, the problem usually occurs when a webmaster thoughtlessly moves a web page in the name of "housekeeping" -- tidying up a server by putting files in more "logical" locations.
Unfortunately, until a search engine's crawler finds the file in its new location, the engine will continue to display a link to the old URL, which, of course, is now broken. The result: 404 - Page Not Found.
U.C. Berkeley Computer Science Professors Thomas A. Phelps and Robert Wilensky have proposed a solution they call "robust hyperlinks." A robust hyperlink is a URL augmented with a small "signature" computed from the document the URL points to.
"Robust hyperlinks exhibit a number of desirable qualities," they write. "They can be computed and exploited automatically, are small and cheap to compute (so that it is practical to make all hyperlinks robust), do not require new server or infrastructure support, can be rolled out reasonably well in the existing URL syntax, can be used to automatically retrofit existing links to make them robust, and are easy to understand."
In essence, robust hyperlinks use an idea similar to the keywords meta tag. They're created by extracting unique words on a page and embedding them in the document's URL. When a page moves, you can simply enter the signature into a search engine, and in theory retrieve the page.
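To make the idea concrete, here is a minimal sketch of how a signature might be computed and attached to a URL. It is not the authors' actual algorithm: the rarity scoring, the `doc_freq` table of how many documents contain each word, the signature size, and the `lexical-signature` parameter name are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import quote_plus


def lexical_signature(text, doc_freq, size=5):
    """Return the `size` words that best distinguish this page.

    `doc_freq` is a hypothetical table mapping a word to the number of
    documents that contain it; words missing from the table count as 1.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(word):
        # Frequent on this page, rare everywhere else -> high score.
        return counts[word] / doc_freq.get(word, 1)

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(counts, key=lambda w: (-score(w), w))
    return ranked[:size]


def robust_url(url, signature, param="lexical-signature"):
    """Append the signature to the URL as a query parameter (assumed name)."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}{param}={quote_plus(' '.join(signature))}"


if __name__ == "__main__":
    page_text = "robust hyperlinks fix the broken links the web"
    toy_doc_freq = {"the": 1000, "web": 500, "links": 200, "fix": 80,
                    "broken": 50, "robust": 5, "hyperlinks": 2}
    sig = lexical_signature(page_text, toy_doc_freq, size=3)
    print(robust_url("http://example.com/page.html", sig))
```

The old URL keeps working as long as the page stays put; the extra words only matter once the link breaks.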
In practice, robust hyperlinks don't always work. They won't be able to locate new content that hasn't yet been indexed by search engines, for example. The robust hyperlink for the web version of this newsletter page is "html hchen wilensky cyberatlas searchenginewatch dilutes unsubscribing banishing supersearch helpfulness." But you won't be able to find this page via its robust hyperlink until some time passes and the various engines index the page -- if they ever do.
This is more of a problem than it may seem. Search engines simply don't index every page on the web. AltaVista says its collection of 250 million pages came from an original set of 400 million. FAST says its 400 million page index was developed from a group of 700 million. Excite's 250 million pages were retained after reviewing 920 million. And Inktomi says its core index of 110 million pages was created after analyzing over 1 billion across the web.
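A quick calculation shows how much each engine keeps. The sketch below uses only the figures quoted above, in millions of pages; Inktomi's "over 1 billion" is rounded down to 1,000 million, so its real share is at most the number shown.

```python
# (pages indexed, pages reviewed), both in millions, as quoted above.
INDEXES = {
    "AltaVista": (250, 400),
    "FAST": (400, 700),
    "Excite": (250, 920),
    "Inktomi": (110, 1000),  # "over 1 billion" reviewed, so an upper bound
}


def retained_share(indexed, reviewed):
    """Fraction of reviewed pages that ended up in the index."""
    return indexed / reviewed


for engine, (indexed, reviewed) in INDEXES.items():
    print(f"{engine}: {retained_share(indexed, reviewed):.1%}")
# AltaVista: 62.5%
# FAST: 57.1%
# Excite: 27.2%
# Inktomi: 11.0%
```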
Another problem is that the algorithm that generates robust hyperlinks doesn't seem to take into account advertising or other promotional text on a page. Some of the keywords in the robust hyperlink for this page have nothing to do with this article, but rather are found in text links to other INT Media sites, like cyberatlas and searchenginewatch. Others are words used on every SearchDay page, such as "html" and "unsubscribing." Including these words in a robust hyperlink dilutes its uniqueness and may cause a search engine to return a "false positive" for a similar, but not identical, page.
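One way a generator could guard against this dilution is to drop words that show up on nearly every page of the same site before building the signature. The sketch below is an illustration, not a feature of the existing generator: the function names, the 80 percent threshold, and the sample page texts are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a page's text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def site_boilerplate(pages, threshold=0.8):
    """Words found on at least `threshold` of a site's pages.

    These are likely navigation, advertising, or template text rather
    than article content.
    """
    if not pages:
        return set()
    token_sets = [tokenize(page) for page in pages]
    vocabulary = set().union(*token_sets)
    cutoff = threshold * len(token_sets)
    return {word for word in vocabulary
            if sum(word in tokens for tokens in token_sets) >= cutoff}


def filter_signature(signature, boilerplate):
    """Keep only the signature words that are not site-wide boilerplate."""
    return [word for word in signature if word not in boilerplate]


if __name__ == "__main__":
    # Toy stand-ins for three pages that share the same template text.
    template = "html unsubscribing cyberatlas searchenginewatch "
    pages = [template + "robust hyperlinks wilensky",
             template + "invisible web databases",
             template + "meta search engines"]
    signature = ["html", "hchen", "wilensky", "cyberatlas",
                 "searchenginewatch", "dilutes", "unsubscribing",
                 "banishing", "supersearch", "helpfulness"]
    print(filter_signature(signature, site_boilerplate(pages)))
```

The trade-off is that the generator must see several pages from a site, not just one, before it can tell template text from content.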
This isn't to pick nits with the idea of robust hyperlinks, but rather to illustrate that relying on a search engine to locate a page by its signature is an iffy proposition at best. The real power of robust hyperlinks won't be seen until support for them is built into content development tools and web browsers, bypassing the need to use a search engine altogether.
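What might built-in support look like? A rough sketch, under stated assumptions: a browser or tool tries the URL as usual and, only if it gets a 404, looks the page up by its signature. The `fetch` and `search` callables, the `lexical-signature` parameter name, and the "take the first hit" rule are all hypothetical stand-ins, not part of any real browser or of the authors' software.

```python
from urllib.parse import parse_qs, urlsplit


def resolve(url, fetch, search, param="lexical-signature"):
    """Return a working address for `url`, or None if none is found.

    fetch(url)    -> (status_code, body)       supplied by the caller
    search(words) -> list of candidate URLs    supplied by the caller
    """
    status, _body = fetch(url)
    if status != 404:
        return url  # The link still works; nothing to repair.

    # Pull the signature words back out of the query string.
    values = parse_qs(urlsplit(url).query).get(param, [""])
    words = values[0].split()
    if not words:
        return None  # An ordinary link: no signature to fall back on.

    candidates = search(words)
    return candidates[0] if candidates else None
```

Passing `fetch` and `search` in as arguments keeps the sketch independent of any particular HTTP library or search index.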
Nonetheless, the idea of robust hyperlinks has merit, and could go a long way toward banishing those pesky 404s from the web forever. You can start using them today, if you'd like, with the online robust hyperlink signature generator at the link below. Alternatively, be a savvy webmaster: avoid the temptation to "tidy up" and just leave your pages where they are so they'll never "go 404" in the first place.
Robust Hyperlinks and Robust Locations
Professors Thomas A. Phelps and Robert Wilensky's page, with more information on robust hyperlinks and links to white papers, software downloads, and other similar projects.
Robust Hyperlink Signature Generator
Use this form to automatically compute a robust hyperlink for any page on the web.
404 Research Lab
The 404 Research Lab Mission Statement: "The 404 Research Lab is committed to improving the internet experience through the systematic eradication of ugly and confusing '404 Not Found' errors. The Lab acknowledges and respects the importance of 404 to the web, and works to actively improve the quality and helpfulness of 404s throughout the world."
The 404 Homepage
A tribute to those sites that either gracefully help you find the page you're searching for, or that offer amusing or otherwise unique 404 pages.