The End of 404s?

There's nothing more frustrating than turning up a promising lead in search results only to see the aggravating "404 - Page Not Found" message when you click to fetch the page. Although some pages are simply removed from the web, the problem usually occurs when a webmaster thoughtlessly moves a web page in the name of "housekeeping" -- tidying up a server by putting files in more "logical" locations.

Unfortunately, until a search engine's crawler finds the file in its new location, the engine will continue to display a link to the old URL, which, of course, is now broken. The result: 404 - File Not Found.

U.C. Berkeley Computer Science Professors Thomas A. Phelps and Robert Wilensky have proposed a solution they call "robust hyperlinks." A robust hyperlink is a URL augmented with a small "signature" computed from the document the URL points to.

"Robust hyperlinks exhibit a number of desirable qualities," they write. "They can be computed and exploited automatically, are small and cheap to compute (so that it is practical to make all hyperlinks robust), do not require new server or infrastructure support, can be rolled out reasonably well in the existing URL syntax, can be used to automatically retrofit existing links to make them robust, and are easy to understand."

In essence, robust hyperlinks use an idea similar to the keywords meta tag. They're created by extracting distinctive words from a page and embedding them in the document's URL. When a page moves, you can simply enter the signature into a search engine and, in theory, retrieve the page.
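To make the idea concrete, here is a minimal sketch of how a lexical signature might be computed. It scores each word by term frequency times inverse document frequency and keeps the top scorers; the scoring details and the toy `idf` table are assumptions for illustration, not Phelps and Wilensky's exact algorithm.

```python
from collections import Counter
import re

def lexical_signature(text, idf, k=5):
    """Pick the k terms that best distinguish this document.

    Sketch of the robust-hyperlink idea: score each term by
    term frequency times inverse document frequency (TF-IDF)
    and keep the top scorers. The `idf` table is a hypothetical
    stand-in for real corpus statistics.
    """
    words = re.findall(r"[a-z]+", text.lower())
    tf = Counter(words)
    scored = sorted(tf, key=lambda w: tf[w] * idf.get(w, 1.0), reverse=True)
    return scored[:k]

# Toy corpus statistics (hypothetical values for illustration):
# common words get tiny weights, distinctive words large ones.
toy_idf = {"the": 0.01, "and": 0.01, "hyperlinks": 4.0, "robust": 3.5,
           "signature": 3.0, "page": 0.5, "url": 2.0}

doc = ("Robust hyperlinks augment a URL with a signature computed "
       "from the page the URL points to. The signature makes "
       "robust hyperlinks easy to find again.")
sig = lexical_signature(doc, toy_idf)
print(" ".join(sig))
```

Run against this short sample, the highest-scoring terms are the distinctive ones like "hyperlinks" and "robust" rather than common filler words -- which is exactly the property that makes a signature useful as a search query.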

In practice, robust hyperlinks don't always work. They won't be able to locate new content that hasn't yet been indexed by search engines, for example. The robust hyperlink for the web version of this newsletter page is "html hchen wilensky cyberatlas searchenginewatch dilutes unsubscribing banishing supersearch helpfulness." But you won't be able to find this page via its robust hyperlink until some time passes and the various engines index the page -- if they ever do.

This is more of a problem than it may seem. Search engines simply don't index every page on the web. AltaVista says its collection of 250 million pages came from an original set of 400 million. FAST says its 400 million page index was developed from a group of 700 million. Excite's 250 million pages were retained after reviewing 920 million. And Inktomi says its core index of 110 million pages was created after analyzing over 1 billion across the web.

Another problem is that the algorithm that generates robust hyperlinks doesn't seem to take into account advertising or other promotional text on a page. Some of the keywords in the robust hyperlink for this page have nothing to do with this article, but rather are found in text links to other INT Media sites, like cyberatlas and searchenginewatch. Others are words used on every SearchDay page, such as "html" and "unsubscribing." Including these words in a robust hyperlink dilutes its uniqueness and may cause a search engine to return a "false positive" for a similar, but not identical page.

This isn't to pick nits with the idea of robust hyperlinks, but rather to illustrate that relying on a search engine to locate a page by its signature is an iffy proposition at best. The real power of robust hyperlinks won't be seen until support for them is built into content development tools and web browsers, bypassing the need to use a search engine altogether.

Nonetheless, the idea of robust hyperlinks has merit, and could go a long way toward banishing those pesky 404s from the web forever. You can start using them today, if you'd like, via the online robust hyperlink signature generator at the link below. Alternatively, being a savvy webmaster, simply avoid the temptation to "tidy up" and just leave your pages where they are so they'll never "go 404" in the first place.

Robust Hyperlinks and Robust Locations
~phelps/Robust/info.html
Professor Thomas A. Phelps and Robert Wilensky's page with more information on Robust Hyperlinks, with links to white papers, software downloads, and other similar projects.

Robust Hyperlink Signature Generator
Use this form to automatically compute a robust hyperlink for any page on the web.

404 Research Lab
The 404 Research Lab Mission Statement: "The 404 Research Lab is committed to improving the internet experience through the systematic eradication of ugly and confusing '404 Not Found' errors. The Lab acknowledges and respects the importance of 404 to the web, and works to actively improve the quality and helpfulness of 404s throughout the world."

The 404 Homepage
~isixtyfive/404page/404.html
A tribute to those sites that either gracefully help you find the page you're searching for, or that offer amusing or otherwise unique 404 pages.

Search Headlines

NOTE: Article links often change. In case of a bad link, use the publication's search facility, which most have, and search for the headline.

About the author

Chris Sherman is a frequent contributor to several information industry journals. He's written several books, including The McGraw-Hill CD ROM Handbook and The Invisible Web: Uncovering Information Sources Search Engines Can't See, co-authored with Gary Price. Chris has written about search and search engines since 1994, when he developed online searching tutorials for several clients. From 1998 to 2001, he was About.com's Web Search Guide.