Here's a new research paper that should make for interesting reading. It is being made available for free in IEEE's Transactions on Knowledge and Data Engineering.
Title: Link Contexts in Classifier-Guided Topical Crawlers
16 pages; PDF.
Authors: Gautam Pant and Padmini Srinivasan (University of Iowa)
From the abstract:
Context of a hyperlink or link context is defined as the terms that appear in the text around a hyperlink within a Web page. Link contexts have been applied to a variety of Web information retrieval and categorization tasks. Topical or focused Web crawlers have a special reliance on link contexts. These crawlers automatically navigate the hyperlinked structure of the Web while using link contexts to predict the benefit of following the corresponding hyperlinks with respect to some initiating topic or theme. Using topical crawlers that are guided by a Support Vector Machine, we investigate the effects of various definitions of link contexts on the crawling performance. We find that a crawler that exploits words both in the immediate vicinity of a hyperlink as well as the entire parent page performs significantly better than a crawler that depends on just one of those cues. Also, we find that a crawler that uses the tag tree hierarchy within Web pages provides effective coverage.
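To make the idea of a link context concrete, here is a minimal Python sketch of one possible definition: a fixed window of words on either side of each hyperlink, returned alongside the words of the entire parent page. This is not the authors' implementation. The names `LinkContextParser` and `link_contexts`, and the `window` parameter, are hypothetical and chosen only for illustration. The sketch does no punctuation stripping, keeps any script or style text, and does not cover the tag tree approach or the Support Vector Machine mentioned in the abstract.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects page words and the word-index span of each hyperlink."""

    def __init__(self):
        super().__init__()
        self.words = []   # every word of page text, in document order
        self.links = []   # (href, start_index, end_index) for each closed <a>
        self._open = []   # stack of (href, start_index) for unclosed <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Remember where the anchor text begins in the word list.
            self._open.append((dict(attrs).get("href"), len(self.words)))

    def handle_endtag(self, tag):
        if tag == "a" and self._open:
            href, start = self._open.pop()
            if href:  # ignore anchors that carry no href
                self.links.append((href, start, len(self.words)))

    def handle_data(self, data):
        # Naive whitespace tokenization (an assumption of this sketch).
        self.words.extend(data.lower().split())


def link_contexts(html, window=5):
    """Return ({href: context words}, all page words).

    The context is the anchor text plus up to `window` words on each side.
    If the same href appears more than once, the last occurrence wins.
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    contexts = {}
    for href, start, end in parser.links:
        lo = max(0, start - window)
        contexts[href] = parser.words[lo:end + window]
    return contexts, parser.words


if __name__ == "__main__":
    sample = '<p>A survey of topical crawlers is <a href="/survey">available here</a> for readers</p>'
    contexts, page_words = link_contexts(sample, window=3)
    print(contexts)
    print(page_words)
```

A crawler along the lines the abstract describes could build features for each link from both return values, the nearby words and the whole-page words, before passing them to a classifier. How the authors actually combine the two cues is described in the paper itself.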
The paper's bibliography is also well worth a look.