Want a detailed glimpse into the black boxes we call search engines? Mining the Web is a textbook that discusses everything from building your own crawler to the future of information finding on the web.
Search engines are designed to be simple to use. Type a few words into a query box, and voilà, you're presented with a set of results likely to match your information need.
This simplicity masks some heavy-duty complexity. Although we refer to a "search engine" in the singular, Google, Teoma, AlltheWeb and others are actually software systems made up of a number of components, each specialized and tuned to perform a specific function that contributes to the whole.
Mining the Web: Discovering Knowledge from Hypertext Data is one of the first books to describe, in detail, the parts of contemporary search engines and how they function. The author, Soumen Chakrabarti, is an assistant professor of computer science and engineering at the Indian Institute of Technology in Bombay, and the book offers a rare glimpse into the inner workings of our favorite search tools.
Most commercial search engines guard the details of their innermost operations closely, revealing casual hints here and offhand remarks there, but almost never offering complete information about the "secret sauce" underlying their operations.
That's what makes this book so interesting. If you really want to understand how search engines work, this book provides an excellent and fairly detailed explanation of the processes they all use, to one degree or another.
The book's not for the technically faint of heart, however. It assumes a good working knowledge of math, logic and computer science, and the book is dense with formulae and graphs. But don't let that scare you -- Dr. Chakrabarti writes clearly, and the book is well organized, progressing logically from topic to topic.
Even if you find technical language challenging, skimming past the details will leave you with a good fundamental understanding of search engine technology.
The book begins with an introduction to search engine technology. Subsequent chapters deal with crawling the web, search and information retrieval, and basic relevance algorithms. The second part of the book is dedicated to machine learning -- how search engines can be engineered to get "smarter" about processing queries and returning better results.
Part three shifts gears, focusing on practical techniques and applications of search engine technology. Here's where Dr. Chakrabarti really gives us a peek behind the curtain, discussing the differences between Google's PageRank algorithm and some of the techniques other commercial search engines use to set themselves apart.
The last chapter takes a look at the future of web mining, offering tantalizing glimpses of what we can expect over the next few years as new technologies are adopted and refined, gradually becoming part of the mainstream search tools that we use on a daily basis.
While it's not a book that you'll want to take to the beach this summer, Mining the Web offers an excellent armchair guide to the inner workings of the black boxes we know as search engines.
Mining the Web: Discovering Knowledge from Hypertext Data
by Soumen Chakrabarti
Morgan Kaufmann Publishers, $54.95
The link above goes to a page with more information about the book, including links to open source software for search engines and several tutorial slides and links to other Web mining resources prepared by Dr. Chakrabarti.