Andrei Broder, former vice president of research at AltaVista and, until recently, Distinguished Engineer and CTO at IBM Research, is joining Yahoo as research fellow and vice president of emerging search technology at Yahoo Research, according to this News.com article.
Broder has been involved in a wide range of research activities related to the web and information retrieval, including the famous “bow-tie” study of web size and connectivity, and the web archaeology project together with other well-known researchers Krishna Bharat (Google News) and Monika Henzinger (Google Research).
Postscript from Gary: Here’s a list of a few research papers and articles that Broder has authored or co-authored that might be of interest.
Title: A Taxonomy of Web Search
Author: Andrei Broder
Source: ACM SIGIR Forum
8 pages; PDF
Abstract: “Classic IR (information retrieval) is inherently predicated on users searching for information, the so-called “information need”. But the need behind a web search is often not informational — it might be navigational (give me the URL of the site I want to reach) or transactional (show me sites where I can perform a certain transaction, e.g. shop, download a file, or find a map). We explore this taxonomy of web searches and discuss how global search engines evolved to deal with web-specific needs.”
Title: Sampling Search-Engine Results
Authors: Aris Anagnostopoulos, Andrei Z. Broder, David Carmel
Source: WWW 14 Conference (2005)
12 pages; PDF.
From the abstract:
We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:
- Determining the set of categories in a given taxonomy spanned by the search results;
- Finding the range of metadata values associated to the result set in order to enable “multi-faceted search;”
- Estimating the size of the result set;
- Data mining associations to the query terms.
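To make the core idea concrete — approximating properties of a large result set from a small uniform random sample — here is a toy Python sketch. It is my own illustration, not code from the paper; the result set, categories, and the helper name are all made up:

```python
import random
from collections import Counter

def estimate_category_shares(results, sample_size, seed=0):
    """Estimate each category's share of the full result set from a
    small uniform random sample (hypothetical helper, not from the paper)."""
    random.seed(seed)
    sample = random.sample(results, min(sample_size, len(results)))
    counts = Counter(category for _, category in sample)
    return {cat: n / len(sample) for cat, n in counts.items()}

# Toy result set: (doc_id, category) pairs, 25% "news" / 75% "shopping".
results = [(i, "news" if i % 4 == 0 else "shopping") for i in range(10_000)]

# A 500-document sample approximates the true split closely.
shares = estimate_category_shares(results, sample_size=500)
```

The point of the paper is that such a sample can be obtained efficiently from the engine itself, so the approximation avoids materializing (or scoring) the full result set.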
Title: Sic Transit Gloria Telae: Towards an Understanding of the Web’s Decay
Source: WWW 13 Conference (2004)
Authors: Z. Bar-Yossef, A. Broder, R. Kumar and A. Tomkins
10 pages; PDF.
From the Abstract:
“The rapid growth of the web has been noted and tracked extensively. Recent studies have, however, documented the dual phenomenon: web pages have small half-lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In addition to just individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers, seeking a way out of these stale neighborhoods, back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration.”
Title: Towards the next generation of enterprise search technology
Authors: A. Z. Broder and A. C. Ciccolo
Source: IBM Systems Journal (2004)
Abstract: “Unstructured information represents the vast majority of data collected and accessible to enterprises. Exploiting this information requires systems for managing and extracting knowledge from large collections of unstructured data and applications for discovering patterns and relationships. This paper elucidates the differences between search systems for the Web and those for enterprises, with an emphasis on the future of enterprise search systems. It also introduces the Unstructured Information Management Architecture (UIMA) and provides the context for the unstructured information management (UIM) papers that follow.”
Title: A technique for measuring the relative size and overlap of public Web search engines
Authors: Krishna Bharat and Andrei Broder
Source: WWW 7 Conference
From the Abstract:
“Search engines are among the most useful and popular services on the Web. Users are eager to know how they compare. Which one has the largest coverage? Have they indexed the same portion of the Web? How many pages are out there? Although these questions have been debated in the popular and technical press, no objective evaluation methodology has been proposed and few clear answers have emerged. In this paper we describe a standardized, statistical way of measuring search engine coverage and overlap through random queries.”
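The statistical trick behind this paper is neat enough to sketch: sample pages from each engine, check whether each sampled page is indexed by the other, and the ratio of the two containment fractions estimates the engines’ relative sizes. The toy Python below uses made-up in-memory “indexes” and skips the paper’s actual sampling mechanism (random queries built from a web lexicon), so treat it purely as an illustration:

```python
import random

def containment(sampled_urls, other_index):
    """Fraction of pages sampled from one engine that the other engine also indexes."""
    hits = sum(1 for url in sampled_urls if url in other_index)
    return hits / len(sampled_urls)

# Toy stand-ins for two engines' indexes (illustrative only):
# engine A has 6,000 pages; engine B has 3,000 pages, all of which A also has.
engine_a = {f"page{i}" for i in range(0, 6000)}
engine_b = {f"page{i}" for i in range(3000, 6000)}

random.seed(1)
sample_a = random.sample(sorted(engine_a), 200)
sample_b = random.sample(sorted(engine_b), 200)

frac_a_in_b = containment(sample_a, engine_b)  # ≈ 0.5 in this toy setup
frac_b_in_a = containment(sample_b, engine_a)  # 1.0: every B page is in A
relative_size = frac_b_in_a / frac_a_in_b      # estimates |A| / |B| ≈ 2
```

In the real study, of course, neither index is directly observable — which is exactly why the authors sample via random queries and check containment with URL probes instead.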