A number of technical papers relating to indexing the web and searching were presented at the recent 10th International World Wide Web Conference, which was held from May 1 to 5 in Hong Kong.
Below, I've highlighted some presentations that seemed particularly interesting. Be warned -- many of these are highly technical documents.
WWW10 Proceedings: Table of Contents
A full list of papers presented at the conference.
Rank Aggregation Methods for the Web
Discusses techniques to rerank results when combining data from different data sources, which among other things may help improve the quality of meta search results and help eliminate spam from search listings.
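To make the idea of reranking combined results concrete, here is a minimal sketch that merges several ranked lists with a Borda-style count, a classic voting method. It is an illustration only, not necessarily the method the paper recommends; the function name `borda_merge` and the example hostnames are made up for this sketch.

```python
from collections import defaultdict

def borda_merge(rankings):
    """Merge several ranked lists into one using a Borda-style count.

    Assumption for this sketch: an item missing from a list simply
    earns no points from that list.
    """
    # Every item seen in any list competes in the merged ranking.
    universe = {item for ranking in rankings for item in ranking}
    n = len(universe)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            # The top position earns n points, the next n - 1, and so on.
            scores[item] += n - position
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(universe, key=lambda item: (-scores[item], item))

# Hypothetical result lists from three search engines.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "a.example", "d.example"]
engine_c = ["b.example", "c.example", "a.example"]

merged = borda_merge([engine_a, engine_b, engine_c])
print(merged)
```

In this toy data, a page that only one engine lists (`d.example`) sinks to the bottom of the merged list, which hints at how aggregation might dilute a listing that has gamed a single engine.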
A Case Study in Web Search using TREC Algorithms
TREC is a long-standing study designed to evaluate the effectiveness of information retrieval algorithms. However, do these tests evaluate web-wide search engines fairly? The authors of this paper try a new test, to measure how well an ideal TREC algorithm fares in a situation common in the real world of web search -- navigational requests. Results? Web search engines do better than the TREC algorithm. Well worth reading, especially to understand the great difficulties in assessing relevance.
When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics
Examines the "Hilltop" system, which can identify pages that can be classified as "experts" from within a large index. These expert pages are then associated with keyphrases, depending on their context. The result is an algorithm that produces high-quality search results for popular or broad topics from a much smaller set of web pages than used by the major crawler-based search engines.
Breadth-first search crawling yields high-quality pages
Finds that crawling links in the order that they are discovered -- "breadth-first" -- provides a fairly good collection of high-quality web pages when compared to a system that crawls in order of perceived page quality. Why care? Calculating page quality uses a lot of computational resources that breadth-first crawling doesn't require, which means breadth-first crawling may save money without greatly decreasing the quality of an index.
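Crawling in discovery order can be sketched with a simple first-in, first-out queue. The example below runs over a tiny in-memory link graph rather than the live web; the graph, the page names and the `breadth_first_crawl` function are all hypothetical and only illustrate the ordering, not the crawler used in the paper.

```python
from collections import deque

def breadth_first_crawl(link_graph, seed, limit):
    """Return pages in the order a breadth-first crawler would fetch them."""
    seen = {seed}          # pages already queued, so none is fetched twice
    queue = deque([seed])  # FIFO queue: links are crawled in discovery order
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)  # a real crawler would download the page here
        for target in link_graph.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical toy web: each page maps to the pages it links to.
toy_web = {
    "home": ["news", "about"],
    "news": ["story1", "story2", "home"],
    "about": ["team"],
    "story1": ["story2"],
}

crawl_order = breadth_first_crawl(toy_web, "home", limit=5)
print(crawl_order)
```

Note that no quality score is computed anywhere: the queue alone decides what is fetched next, which is where the savings in computation come from.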
Intelligent Crawling On the World Wide Web with Arbitrary Predicates
Covers the idea of making a crawling system that learns how to find the best pages on a particular subject by examining the context of URLs (when available) and learning relationships that mark the type of pages it should collect.
Building a Distributed Full-Text Index for the Web
Explores new techniques and methods for building large indexes of the web.
An Adaptive Model for Optimizing Performance of an Incremental Web Crawler
Examines how IBM's WebFountain crawler attempts to keep its index constantly updated with changes from the web.
Finding Authorities and Hubs From Link Structures on the World Wide Web
Compares various link analysis ranking algorithms to see how well they retrieve documents, and looks for ways to measure such algorithms against each other.
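For readers unfamiliar with the terms in the title, the sketch below shows one well-known style of hub and authority scoring, in which good hubs point to good authorities and good authorities are pointed to by good hubs. It is a generic illustration, not a reproduction of any specific algorithm compared in the paper; the function name, the iteration count and the toy link data are assumptions.

```python
import math

def hubs_and_authorities(links, iterations=20):
    """Iteratively score pages as hubs and as authorities."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical link data: two directory pages and three sites.
toy_links = {
    "directory1": ["siteA", "siteB"],
    "directory2": ["siteA", "siteB", "siteC"],
    "siteC": ["siteA"],
}

hub, auth = hubs_and_authorities(toy_links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
print(top_authority, top_hub)
```

Here the page with the most incoming links from strong hubs ends up as the top authority, and the directory that links to the most authorities ends up as the top hub.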
PicASHOW: Pictorial Authority Search by Hyperlinks on the Web
Describes a system that uses link analysis to bring back relevant image files in response to searches.
Placing Search in Context: The Concept Revisited
Examines a system to improve search requests by looking at the context of a web page that a searcher may be viewing.
On Integrating Catalogs
Explores a technique for automatically classifying documents.
You Are What You Link
Link analysis can provide a means of easily determining what someone is interested in and who they associate with, when run against a known set of home pages.
Scaling Question Answering to the Web
MULDER's not a character from the X-Files but instead a system that processes questions so that web search engines are more likely to respond with answers.
Clustering User Queries of a Search Engine
Examines how measuring clickthrough can reveal if two or more different queries are related to the same answer. This can help search engine editors identify both the documents that should be manually promoted and the queries those documents should appear for.
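One simple way to picture this is to compare the sets of documents that searchers clicked on for each query. The sketch below uses plain set overlap (Jaccard similarity) with an arbitrary threshold; the paper's actual similarity measure is not described here, so the measure, the threshold, the function names and the click log are all assumptions made for illustration.

```python
def click_similarity(clicks_a, clicks_b):
    """Share of clicked documents that two queries have in common."""
    union = clicks_a | clicks_b
    return len(clicks_a & clicks_b) / len(union) if union else 0.0

def related_queries(click_log, threshold=0.5):
    """Return pairs of queries whose clicked documents overlap enough."""
    queries = sorted(click_log)
    pairs = []
    for i, first in enumerate(queries):
        for second in queries[i + 1:]:
            overlap = click_similarity(click_log[first], click_log[second])
            if overlap >= threshold:
                pairs.append((first, second))
    return pairs

# Hypothetical click log: each query maps to the documents users clicked.
click_log = {
    "atomic bomb": {"doc1", "doc2", "doc3"},
    "manhattan project": {"doc1", "doc2"},
    "cheap flights": {"doc9"},
}

pairs = related_queries(click_log)
print(pairs)
```

The two history queries share no words, yet they are grouped because searchers clicked on the same documents for both.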
Learning Search Engine Specific Query Transformations for Question Answering
Examines a method to automatically detect question-oriented queries and rewrite them to improve the results returned by web search engines.
Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction
Examines a proposed technique for refining link analysis so that results are less prone to being influenced by link farms, bogus links and other attempts to inflate link relevance.
Geospatial Mapping and Navigation of the Web
Examines ways that web pages can be geocoded and introduces a tool for browsing the web geographically.
Retrieving and Organizing Web Pages by "Information Unit"
If a web page doesn't actually contain the words you are searching for, a crawler-based search engine might not list it in response to your query. However, this paper suggests a solution might be to have search engines that list "information units." These would be combinations of pages. The advantage is that when several similar pages are combined, they may be better defined and thus easier to find.
Towards a Highly-Scalable and Effective Metasearch Engine
Examines a method for better understanding the data sources a meta search engine can tap into and when it should actually access these.