Here's a new and interesting research paper that might be of interest and make for some interesting reading. It's being made available for free in IEEE's Transactions on Knowledge and Data Engineering.
Title: Link Contexts in Classifier-Guided Topical Crawlers
16 pages; PDF.
Authors: Gautam Pant and Padmini Srinivasan (University of Iowa)
From the abstract:
Context of a hyperlink or link context is defined as the terms that appear in the text around a hyperlink within a Web page. Link contexts have been applied to a variety of Web information retrieval and categorization tasks. Topical or focused Web crawlers have a special reliance on link contexts. These crawlers automatically navigate the hyperlinked structure of the Web while using link contexts to predict the benefit of following the corresponding hyperlinks with respect to some initiating topic or theme. Using topical crawlers that are guided by a Support Vector Machine, we investigate the effects of various definitions of link contexts on the crawling performance. We find that a crawler that exploits words both in the immediate vicinity of a hyperlink as well as the entire parent page performs significantly better than a crawler that depends on just one of those cues. Also, we find that a crawler that uses the tag tree hierarchy within Web pages provides effective coverage.
The papers bibliography is also more than worth a look.
The Original Search Marketing Event is Back!
SES Denver (Oct 16) offers an intense day of learning all the critical aspects of search engine optimization (SEO) and paid search advertising (PPC). The mission of SES remains the same as it did from the start - to help you master being found on search engines. Early Bird rates available through Sept 12. Register today!