Here's a new research paper that should make for interesting reading. It's available for free from IEEE Transactions on Knowledge and Data Engineering.
Title: Link Contexts in Classifier-Guided Topical Crawlers
16 pages; PDF.
Authors: Gautam Pant and Padmini Srinivasan (University of Iowa)
From the abstract:
Context of a hyperlink or link context is defined as the terms that appear in the text around a hyperlink within a Web page. Link contexts have been applied to a variety of Web information retrieval and categorization tasks. Topical or focused Web crawlers have a special reliance on link contexts. These crawlers automatically navigate the hyperlinked structure of the Web while using link contexts to predict the benefit of following the corresponding hyperlinks with respect to some initiating topic or theme. Using topical crawlers that are guided by a Support Vector Machine, we investigate the effects of various definitions of link contexts on the crawling performance. We find that a crawler that exploits words both in the immediate vicinity of a hyperlink as well as the entire parent page performs significantly better than a crawler that depends on just one of those cues. Also, we find that a crawler that uses the tag tree hierarchy within Web pages provides effective coverage.
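To make the abstract's two cues concrete, here is a minimal sketch (not the authors' code) that pulls out, for each hyperlink, the words in its immediate vicinity along with the words of the whole parent page. The names `LinkContextParser` and `link_contexts` and the `window` parameter are hypothetical choices for illustration. The paper's crawlers feed such contexts to a Support Vector Machine, which this sketch does not attempt.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects page words and the word positions covered by each <a href>."""

    def __init__(self):
        super().__init__()
        self.words = []    # every text word on the page, in document order
        self.links = []    # (href, start_index, end_index) for each hyperlink
        self._open = None  # (href, start_index) of the anchor currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        # Naive tokenization: lowercase and split on whitespace.
        self.words.extend(data.lower().split())


def link_contexts(html, window=5):
    """Return {href: (nearby_words, whole_page_words)} for each link.

    `window` (an assumed parameter) is how many words to keep on each
    side of the anchor text.
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    contexts = {}
    for href, start, end in parser.links:
        lo = max(0, start - window)
        nearby = parser.words[lo:end + window]
        contexts[href] = (nearby, parser.words)
    return contexts


if __name__ == "__main__":
    page = ('<p>Read about focused web crawlers in '
            '<a href="/paper">this new paper</a> from Iowa researchers.</p>')
    nearby, whole_page = link_contexts(page, window=2)["/paper"]
    print(nearby)
```

A crawler in the spirit of the abstract would score each link from both word lists rather than from either one alone.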
The paper's bibliography is also well worth a look.