In "Google Has the Largest Number of Dead and Old Pages," the Google Operating System Blog points to a video and some research from Google's Ziv Bar-Yossef that discuss how to grab a random sample of pages from the major search engines and extrapolate from that sample information about the search engines themselves. Such a sample can be used in a number of ways.
One interesting piece of information that you can determine with the method he discusses is the percentage of dead pages and of old pages that a search engine may contain. In a comparison of Google, MSN, and Yahoo! following these methods, Google appears to contain the largest number of dead pages. The video is from an August 17th Tech Talk at Google covering this research.
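To make the extrapolation step concrete, here is a minimal sketch in Python. It does not reproduce the sampling technique from the talk. It only shows how a dead-page percentage could be estimated once you already have a sample, and it assumes that the sample is an unbiased random draw from the index. The function name `estimate_dead_fraction` and the example numbers are hypothetical, made up for illustration.

```python
import math


def estimate_dead_fraction(is_dead_flags, z=1.96):
    """Estimate the share of dead pages in an index from a random sample.

    is_dead_flags: one boolean per sampled page, True if the page was dead.
    z: normal quantile for the interval (1.96 gives roughly 95% confidence).

    Assumption: the sample is an unbiased random draw from the index.
    Returns (estimate, lower_bound, upper_bound), each between 0 and 1.
    """
    flags = list(is_dead_flags)
    n = len(flags)
    if n == 0:
        raise ValueError("sample is empty")

    # Point estimate: the fraction of sampled pages that were dead.
    p = sum(flags) / n

    # Normal-approximation margin of error for a proportion.
    margin = z * math.sqrt(p * (1 - p) / n)

    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical sample: 100 pages checked, 5 of them dead.
sample = [True] * 5 + [False] * 95
estimate, low, high = estimate_dead_fraction(sample)
print(f"dead pages: {estimate:.1%} (roughly {low:.1%} to {high:.1%})")
```

With a sample this small the interval is wide, which is one reason the quality and size of the sample matter so much when comparing search engines this way.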
Besides what it reveals about search engines and this method of sampling them, the video also discloses that Ziv Bar-Yossef joined Google a couple of weeks before the footage was shot. He previously worked at the IBM Almaden Research Center, and was most recently at the Technion – Israel Institute of Technology.
I also wrote a little about this research at the SEO by the Sea Blog in "How Do You Estimate the Size of A Search Engine?", and included with that post a listing of some of the patents that he was involved in developing while at IBM. One of the more interesting was "Methods and apparatus for assessing web page decay."
Ziv Bar-Yossef brings a wealth of knowledge to Google. Another interesting recent paper he was involved with while at Technion looks carefully at "Different URLs with Similar Text," and at ways that search engines could identify such URLs more easily.