How to Build Your Own Search Engine

Want a detailed glimpse into the black boxes we call search engines? Mining the Web is a textbook that discusses everything from building your own crawler to the future of information finding on the web.

Search engines are designed to be simple to use. Type a few words into a query box, and voilà, you’re presented with a set of results likely to match your information need.

This simplicity masks some heavy-duty complexity. Although we refer to a “search engine” in the singular, Google, Teoma, AlltheWeb and others are actually software systems made up of a number of components, each specialized and tuned to perform a specific function that contributes to the whole.

Mining the Web: Discovering Knowledge from Hypertext Data is one of the first books that actually describes, in detail, the parts of contemporary search engines and how they function. The author, Soumen Chakrabarti, is an assistant professor of computer science and engineering at the Indian Institute of Technology in Bombay, and the book offers a rare glimpse into the inner workings of our favorite search tools.

Most commercial search engines guard the details of their innermost operations closely, revealing casual hints here and offhand remarks there, but almost never offering complete information about the “secret sauce” underlying their operations.

That’s what makes this book so interesting. If you really want to understand how search engines work, this book provides an excellent and fairly detailed explanation of the processes they all use, to one degree or another.

The book’s not for the technically faint of heart, however. It assumes a good working knowledge of math, logic and computer science, and it is dense with formulae and graphs. But don’t let that scare you: Dr. Chakrabarti writes clearly, and the book is well organized, progressing logically from topic to topic.

Even if you find technical language challenging, skimming past the details will leave you with a good fundamental understanding of search engine technology.

The book begins with an introduction to search engine technology. Subsequent chapters deal with crawling the web, search and information retrieval, and basic relevance algorithms. The second part of the book is dedicated to machine learning: how search engines can be engineered to get “smarter” about processing queries and returning better results.
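For readers who want a feel for how the indexing and relevance pieces fit together, here is a minimal sketch in Python. It is not taken from the book: the function names, the sample pages and the scoring rule (a simple sum of term counts) are all assumptions chosen for illustration, and a real engine would crawl live pages and use far more refined relevance measures.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_index(pages):
    """Build an inverted index: term -> Counter of {page_id: occurrences}."""
    index = defaultdict(Counter)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term][page_id] += 1
    return index


def search(index, query):
    """Score each page by summed query-term counts; return ids, best first."""
    scores = Counter()
    for term in tokenize(query):
        for page_id, count in index.get(term, {}).items():
            scores[page_id] += count
    # Sort by descending score, breaking ties alphabetically by page id.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [page_id for page_id, _ in ranked]


# Hypothetical stand-in for pages that a crawler would have fetched.
pages = {
    "a.html": "web mining finds knowledge in hypertext data",
    "b.html": "a crawler fetches web pages and follows links to more web pages",
    "c.html": "beach reading for the summer",
}

index = build_index(pages)
results = search(index, "web crawler")
print(results)
```

The inverted index is what makes lookup cheap: a query touches only the entries for its own terms rather than scanning every page.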

Part three shifts gears, focusing on practical techniques and applications of search engine technology. Here’s where Dr. Chakrabarti really gives us a peek behind the curtain, talking about the differences between Google’s PageRank algorithms and some of the techniques used by other commercial search engines to differentiate themselves from one another.

The last chapter takes a look at the future of web mining, offering tantalizing glimpses of what we can expect over the next few years as new technologies are adopted and refined, gradually becoming part of the mainstream search tools that we use on a daily basis.

While it’s not a book that you’ll want to take to the beach this summer, Mining the Web offers an excellent armchair guide to the inner workings of the black boxes we know as search engines.

Mining the Web: Discovering Knowledge from Hypertext Data
by Soumen Chakrabarti
Morgan Kaufmann Publishers, $54.95
ISBN: 1-55860-754-4

The link above goes to a page with more information about the book, including links to open source search engine software, several tutorial slides, and other Web mining resources prepared by Dr. Chakrabarti.
