Anatomy of a Search Engine: Inside Google

by Avi Rappoport

A special report from the Search Engine Strategies 2001 Conference, August 12-14, San Jose, CA.

Search engines aren't just black boxes -- they are programs continually updated to improve indexing, search responsiveness and relevance ranking. Here's an insider's look at Google.

The Anatomy of a Search Engine panel offered insider views of two major crawler-based search engines: Google and FAST. The first speaker was Craig Nevill-Manning, a senior engineer at Google.

The mission of Google is: to organize the world's information and make it universally accessible and useful. (Notice that this says nothing about the web, crawling, PageRank or any of the other details -- these guys think big.)

Google today has several components:

- Crawling the web
- Building an index
- Ranking
- Serving search results
- User interface and design
- Google infrastructure

"Crawling" is the process of following links to locate pages, and then reading those pages to make the information on them searchable (this is sometimes known as robot spidering, gathering or harvesting). The Google crawler, known as GoogleBot, crawls all the URLs it knows about every few weeks. It checks that the page is still available, gets any updated information, and follows links to pages it hasn't seen before. Some sites, such as news sites, get crawled more frequently, so that the Google index has the most recent data -- they could be indexed daily or even hourly.

The robot crawler reads pages much like a browser does. If you wanted to, you could reproduce the process by opening your browser, starting with any URL, saving the page, following every link on that page, saving those pages, following every link on those pages, and so on, until there are no links left that you haven't already followed.
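
To make that loop concrete, here is a minimal sketch of such a crawler in Python, using only the standard library. The starting URL, the page limit and the parsing details are illustrative assumptions, not anything GoogleBot actually does; a real crawler also adds politeness delays, robots.txt checks and far more robust error handling.

# A toy breadth-first crawler: fetch a page, record it, follow its links.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=50):
    """Follow links outward from start_url until max_pages pages are fetched."""
    seen = {start_url}
    frontier = deque([start_url])
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages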

These robots have to be sensitive to webmasters, so they limit the number of times they hit each site per minute. The software is very fast, so they can crawl many sites in parallel. The GoogleBot, like all the other major search engine crawlers, obeys the "robots.txt" directives, avoiding pages which the webmaster has designated as off limits (for more information, see the Robots Exclusion Protocol).
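
The robots.txt check itself is easy to reproduce. The sketch below uses Python's standard urllib.robotparser module; the site URL and the "MyCrawler" user-agent string are placeholders for illustration.

# Checking the Robots Exclusion Protocol before fetching a page.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("http://www.example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

# A well-behaved crawler fetches a URL only if can_fetch() allows it.
if robots.can_fetch("MyCrawler", "http://www.example.com/private/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")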

Building Google's Index

A search engine index is much like the index listings in the back of a book: for every word, the system must keep a list of the pages that word appears in. It's quite hard to store these efficiently. For example, the word "flamingo" appears in about 492,000 out of about 2 billion pages known to Google.

With pages averaging around 1,000 words each, Google has to be very careful about efficiency, using techniques such as keeping the index in RAM: if everything were stored on disk, checking for that word would take 8 months.
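
The book-index analogy corresponds to what is usually called an inverted index: a map from each word to the list of documents that contain it. The toy sketch below shows the idea; the sample documents are invented and the structure is vastly simplified compared with anything a production engine stores.

# Building a tiny inverted index: word -> set of document IDs.
from collections import defaultdict

documents = {
    1: "the flamingo is a pink wading bird",
    2: "search engines index billions of pages",
    3: "a pink flamingo lawn ornament",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Look up every document containing "flamingo", just like a book index.
print(sorted(index["flamingo"]))  # -> [1, 3]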

Google knows about 3 billion web documents, including images, PDF and other file formats, Usenet newsgroups and news.

Relevance Ranking

Once Google has matched a word in the index, it wants to put the best document first. It chooses the best document using a number of techniques:

- Text analysis: evaluating the documents based on matching words, font size, proximity, and over 100 other factors

- Links & link text: external links are a somewhat independent guide to what's on page.

- PageRank: a query-independent measure of the quality of both pages and sites. It is an equation that estimates how often a truly random searcher, following links without any deliberate intent, would end up at a particular page (a small sketch of this calculation follows below).
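
That random-surfer description can be turned into a short calculation. The sketch below runs a standard power-iteration form of PageRank over a hypothetical four-page link graph; the 0.85 damping factor and the graph itself are illustrative assumptions, not Google's actual parameters.

# Power-iteration PageRank on a toy link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if not outgoing:  # a dangling page spreads its rank evenly
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank


# A hypothetical four-page web: C is linked to by everyone, so it ranks highest.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(toy_web).items(), key=lambda item: -item[1]):
    print(page, round(score, 3))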

Google has a "wall" between the search relevance ranking and advertising: no one can pay to be the top listing in the results, but the sponsored link spots are available.

Serving Results

Google has a distributed architecture consisting of a fleet of web servers that serve the search forms and results pages, index servers that store the searchable listings, and document servers that hold the full text of each page, which is used to extract the "snippets" -- those bits of text surrounding the matching words in the search results. The document servers also provide the cached pages and the HTML versions of Acrobat, Word and PowerPoint files.
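
Snippet extraction itself is easy to illustrate: find the query term in the stored full text and return the words around it. The function below is a deliberately simple stand-in for what the document servers do; the window size and the sample sentence are arbitrary.

# Extracting a snippet: the words surrounding the first match of a query term.
def snippet(text, term, window=8):
    """Return the words surrounding the first occurrence of term in text."""
    words = text.split()
    lowered = [word.lower().strip(".,;:") for word in words]
    if term.lower() not in lowered:
        return " ".join(words[: 2 * window])  # no match: fall back to the opening words
    i = lowered.index(term.lower())
    start = max(0, i - window)
    return " ".join(words[start : i + window + 1])


print(snippet("Flamingos are tall, pink wading birds found in warm, watery regions.", "wading"))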

The system is designed as a scalable, highly parallel, distributed search architecture, so each query is spread across multiple machines. Google chooses cost-effective, mid-quality commodity PCs running Linux. Of the roughly 10,000 machines, several fail every day because they are worked much harder than normal desktops, so redundancy is designed into the search system on the assumption that some machines may fail at any time.
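
As an illustration of that fan-out, the sketch below sends a query to every index shard in parallel, skips any shard that fails, and merges the survivors' best matches. The Shard class, its word-counting score and the thread pool are all hypothetical stand-ins for the architecture described above, not Google's actual interfaces.

# Fanning a query out to index shards in parallel and merging the results.
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """A hypothetical in-memory index shard holding part of the document set."""
    def __init__(self, documents):
        self.documents = documents  # doc_id -> full text

    def top_hits(self, query, k=10):
        # Toy ranking: score is how many query words appear in the document.
        words = query.lower().split()
        scored = [(sum(word in text.lower() for word in words), doc_id)
                  for doc_id, text in self.documents.items()]
        return sorted(scored, reverse=True)[:k]


def distributed_search(shards, query, k=10):
    # Query every shard at once; a shard that fails is simply skipped,
    # which is how redundancy tolerates a few machines being down.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(shard.top_hits, query, k) for shard in shards]
        results = []
        for future in futures:
            try:
                results.extend(future.result(timeout=5.0))
            except Exception:
                continue
    return sorted(results, reverse=True)[:k]


shards = [Shard({"a1": "pink flamingo facts"}), Shard({"b1": "flamingo habitat and diet"})]
print(distributed_search(shards, "flamingo"))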

Google's User Interface and Design

The Google approach is to keep the user interface clean and simple. All changes are put through user studies, analysis, and testing. They are concerned both with simplicity and with server stability. User interface design is the responsibility of cross-functional teams, including psychologists, business analysts, and blue-sky researchers.

Tomorrow: a look inside FAST, the search engine that powers Lycos, AllTheWeb.com and numerous other regional portals.

Google
http://www.google.com

Avi Rappoport is a Search Engine Consultant and maintains the Complete Guide to Search Engines for Web Sites and Intranets. Contact her at [email protected]

Search Headlines

NOTE: Article links often change. In case of a bad link, use the publication's search facility, which most have, and search for the headline.

Online portals news
Redesigned Journalism.org Website A Portal for Journalists, Public Alike...
URLwire Oct 30 2002 10:51AM GMT
Domain name news
China opens up .cn domain...
ZDNet Oct 30 2002 10:35AM GMT
Online search engines news
Inktomi Improves Enterprise Search...
SiliconValley.Internet.com Oct 30 2002 10:17AM GMT
Online marketing news
Verizon, Spam Co. Reach Settlement...
New York Times Oct 30 2002 7:32AM GMT
Technology features
How Intel Took Moore's Law From Idea to Ideology...
Fortune Oct 30 2002 7:26AM GMT
Online portals news
Vignette to acquire portal maker...
CNET Oct 30 2002 3:04AM GMT
Domain name news
Reuters: Hacking or public domain?...
ZDNet Oct 29 2002 8:49PM GMT
Online information news
Reuters targeted in Internet privacy case...
Nando Times Oct 29 2002 1:44PM GMT
Online marketing news
ANALYSIS: How the Web is changing election campaigns...
Nando Times Oct 29 2002 1:44PM GMT
Tech latest
Scientists try for a touchy-feely Net...
CNET Oct 29 2002 1:36PM GMT
Online portals news
No broadband could be the ruin of AOL...
Silicon.com Oct 29 2002 10:44AM GMT
Recycling campaign to AOL: You'll get mail...
ZDNet Oct 29 2002 9:59AM GMT
Online legal issues news
E-Commerce Patent Disputes Erupt...
PC Magazine Oct 29 2002 4:19AM GMT
powered by Moreover.com

