A special report from the Search Engine Strategies 2001 Conference, August 12-14, San Jose, CA.
Search engines aren't just black boxes -- they are programs continually updated to improve indexing, search responsiveness and relevance ranking. Here's an insider's look at Google.
The Anatomy of a Search Engine panel offered insider views of two major crawler-based search engines: Google and FAST. The first speaker was Craig Nevill-Manning, a senior engineer at Google.
The mission of Google is "to organize the world's information and make it universally accessible and useful." (Notice that this says nothing about the web, crawling, PageRank or any of the other details -- these guys think big.)
Google today has several components:
- Crawling the web
- Building an index
- Serving search results
- User interface and design
- Google infrastructure
"Crawling" is the process of following links to locate pages, and then reading those pages to make the information on them searchable (this is sometimes known as robot spidering, gathering or harvesting). The Google crawler, known as GoogleBot, crawls all the URLs it knows about every few weeks. It checks that the page is still available, gets any updated information, and follows links to pages it hasn't seen before. Some sites, such as news sites, get crawled more frequently, so that the Google index has the most recent data -- they could be indexed daily or even hourly.
The robot crawler reads the pages just like a browser. If you wanted to, you could reproduce the process by opening your browser, starting with any URL, saving the page, following every link on that page, saving those pages, following every link on those pages, until there are no more links you have not followed yet.
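The follow-every-link process described above can be sketched as a breadth-first traversal. This is only an illustration, not GoogleBot's actual code: the `TOY_WEB` dictionary, the example.com URLs, and the `crawl` and `toy_fetch_links` names are all assumptions made up for the example, and a hypothetical in-memory link table stands in for real page downloads.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


def crawl(start_url, fetch_links):
    """Visit every reachable URL exactly once, in breadth-first order."""
    seen = {start_url}          # URLs already discovered
    queue = deque([start_url])  # URLs waiting to be read
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)     # the "save the page" step
        for link in fetch_links(url):
            if link not in seen:  # follow only links not seen before
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    for url in crawl("http://example.com/", toy_fetch_links):
        print(url)
```

The `seen` set is what makes the loop stop: once every link has been discovered, the queue drains and the crawl ends.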
These robots have to be sensitive to webmasters, so they limit the number of times they hit each site per minute. The software is very fast, so they can crawl many sites in parallel. The GoogleBot, like all the other major search engine crawlers, obeys the "robots.txt" directives, avoiding pages which the webmaster has designated as off limits (for more information, see the Robots Exclusion Protocol).
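As a rough sketch of these two courtesy rules, the example below checks a robots.txt file with Python's standard `urllib.robotparser` module and applies a simple per-site hit limit. The robots.txt content, the "ExampleBot" user agent, the `HostRateLimiter` class, and the limit of 6 hits per minute are all assumptions for illustration; the article does not say what limits GoogleBot uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a webmaster might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


class HostRateLimiter:
    """Allow at most `max_hits_per_minute` evenly spaced hits per host."""

    def __init__(self, max_hits_per_minute):
        self.min_interval = 60.0 / max_hits_per_minute
        self.last_hit = {}  # host -> time of the last allowed hit

    def allow(self, host, now):
        last = self.last_hit.get(host)
        if last is not None and now - last < self.min_interval:
            return False  # too soon; the crawler should work on another site
        self.last_hit[host] = now
        return True


if __name__ == "__main__":
    robots = build_robots_parser(ROBOTS_TXT)
    print(robots.can_fetch("ExampleBot", "http://example.com/index.html"))
    print(robots.can_fetch("ExampleBot", "http://example.com/private/x.html"))

    limiter = HostRateLimiter(max_hits_per_minute=6)
    print(limiter.allow("example.com", now=0.0))
    print(limiter.allow("example.com", now=5.0))
```

Because the limit is tracked per host, a crawler that is told to wait on one site can keep busy on others, which is how a fast crawler stays polite while working in parallel.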
Building Google's Index
A search engine index is much like the index listings in the back of a book: for every word, the system must keep a list of the pages that word appears in. It's quite hard to store these efficiently. For example, the word "flamingo" appears in about 492,000 out of about 2 billion pages known to Google.
With pages averaging about 1,000 words each, Google has to be very careful about how it stores this data, using techniques such as keeping information in RAM: it would take 8 months to check for that word if everything were on disk.
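The back-of-the-book idea can be illustrated with a tiny in-memory inverted index. This is a minimal sketch under simple assumptions (whitespace tokenizing, lowercase matching, made-up page IDs and text), not a description of how Google actually stores its index; the `build_index` and `lookup` names are hypothetical.

```python
from collections import defaultdict

# Hypothetical pages: page ID -> page text.
PAGES = {
    "p1": "pink flamingo in the zoo",
    "p2": "the zoo opens daily",
    "p3": "Flamingo facts",
}


def build_index(pages):
    """Map every word to the set of page IDs it appears in."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def lookup(index, word):
    """Return the sorted page IDs containing `word`, or an empty list."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    index = build_index(PAGES)
    print(lookup(index, "flamingo"))
    print(lookup(index, "zoo"))
```

Answering a query becomes a single dictionary lookup instead of a scan of every page, which is why the structure is worth the effort of storing it compactly.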
Google knows about 3 billion web documents, including images, PDF and other file formats, Usenet newsgroups and news.
Once Google has matched a word in the index, it wants to put the best document first. It chooses the best document using a number of techniques:
- Text analysis: evaluating the documents based on matching words, font size, proximity, and over 100 other factors
- Links & link text: external links are a somewhat independent guide to what's on a page.
- PageRank: a query-independent measure of the quality of both pages and sites. This is an equation that tries to indicate how often a truly random searcher, following links without any thought, would end up at a particular page.
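The "random searcher" idea behind PageRank can be sketched as a repeated calculation over a link graph. The version below follows the commonly published form of the algorithm and is not taken from the talk: the damping value of 0.85 (the chance the surfer follows a link rather than jumping to a random page), the four-page graph, and the `pagerank` function name are all assumptions for illustration.

```python
# Hypothetical link graph: page -> pages it links to.
# Every link target must also appear as a key.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def pagerank(links, damping=0.85, iterations=100):
    """Estimate how often a random surfer lands on each page."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal odds
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Otherwise the surfer follows one outgoing link at random.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links: the surfer jumps anywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    for page, score in sorted(pagerank(LINKS).items()):
        print(page, round(score, 4))
```

In this toy graph, page "c" collects links from three other pages and ends up with the highest score, while "d", which nothing links to, keeps only the small share that comes from random jumps. Note that the score depends only on the links, not on any query.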
Google has a "wall" between the search relevance ranking and advertising: no one can pay to be the top listing in the results, but the sponsored link spots are available.
Google has a distributed architecture consisting of a fleet of web servers that show the forms and the search results, index servers that store the searchable listings, and document servers that contain the full text of each page, to extract the "snippets" -- those bits of text surrounding the match words in the search results. The document servers also provide the cached pages and the HTML versions of Acrobat, Word and PowerPoint files.
The system is designed as a scalable, highly parallel distributed search, so each query is spread across multiple machines. Google chooses cost-effective, mid-quality commodity PCs running Linux. Of the 10,000 machines, several fail every day because they are worked so much harder than normal desktops, so search redundancy is designed in, on the assumption that some machines may fail at any time.
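To make the redundancy idea concrete, here is a small sketch in which the index is split into shards, each shard has more than one replica, and a query falls back to the next replica when one is down. Everything here is an assumption for illustration (the `make_replica` and `search_shards` names, the document IDs, and the use of `ConnectionError` to signal a dead machine); the shards are also queried one after another for simplicity, whereas the system described above works in parallel.

```python
def make_replica(shard_index, alive=True):
    """Build a hypothetical index server holding one shard of the index."""
    def replica(word):
        if not alive:
            raise ConnectionError("replica is down")
        return shard_index.get(word, [])
    return replica


def search_shards(shards, word):
    """Query every shard, trying each shard's replicas until one answers."""
    results = []
    for replicas in shards:
        for replica in replicas:
            try:
                results.extend(replica(word))
                break  # this shard answered; move on to the next shard
            except ConnectionError:
                continue  # machine failed; try the next copy
    return sorted(results)


# Two hypothetical shards, each covering a different slice of the documents.
SHARD_A = {"flamingo": ["doc1"]}
SHARD_B = {"flamingo": ["doc7"], "zoo": ["doc9"]}

if __name__ == "__main__":
    shards = [
        [make_replica(SHARD_A, alive=False), make_replica(SHARD_A)],
        [make_replica(SHARD_B)],
    ]
    print(search_shards(shards, "flamingo"))
```

The first copy of shard A is marked as failed, yet the query still returns matches from both shards because the second copy answers in its place.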
Google's User Interface and Design
The Google approach is to keep the user interface clean and simple. All changes are put through user studies, analysis, and testing. They are concerned both about simplicity and about server stability. The User Interface design is the responsibility of cross-functional teams, including psychologists, business analysts, and blue-sky researchers.
Tomorrow: a look inside FAST, the search engine that powers Lycos, AllTheWeb.com and numerous other regional portals.
Avi Rappoport is a Search Engine Consultant and maintains the Complete Guide to Search Engines for Web Sites and Intranets. Contact her at firstname.lastname@example.org