SEO News

Anatomy of a Search Engine: Inside FAST

by Avi Rappoport

A special report from the Search Engine Strategies 2001 Conference, August 12-14, San Jose, CA.

At the Anatomy of a Search Engine panel, Tim Mayer, vice president of Web Search at Fast Search & Transfer, offered a behind-the-scenes glimpse at the inner workings of FAST, the search engine that powers Lycos, AllTheWeb.com and other regional portals.

FAST has a powerful search engine infrastructure:

- 2 data centers (for redundancy)
- 150 racks
- 1100 Dell machines
- running mainly FreeBSD (and other platforms, including Windows 2000 and Sun's Solaris)
- Over 2.1 billion documents

The search engine responds to 40 million queries per day.
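
To put the daily figure in perspective, it can be converted to an average per-second rate. The short calculation below uses only the 40 million queries per day quoted above; it says nothing about peak load, which the talk did not quantify and which would presumably be higher than the average.

```python
QUERIES_PER_DAY = 40_000_000      # figure quoted in the talk
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds

# Average rate only; real traffic is not spread evenly over the day.
average_qps = QUERIES_PER_DAY / SECONDS_PER_DAY
print(f"average load: {average_qps:.0f} queries per second")
# prints "average load: 463 queries per second"
```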

Crawling

The FAST crawlers follow links to collect data from the web. They collect each document and send it to the indexer, which builds a searchable index. They also store a copy of each page so that query matches can be shown in context.

Challenges for crawlers include tracking updates and new documents as the web continues its exponential growth. Link structures change constantly as links are added and removed, so the index has to track both the pages themselves and the relationships between them. New content types bring new challenges: indexing the metadata and useful context for multimedia files, FTP archives, new file formats, structured (database) data, dynamic content, Flash, and 3,500 sources of real-time news.

FAST crawls in two ways. Batch crawling starts from scratch with an empty "local store" and never fetches the same page twice within an indexing cycle. Incremental crawling starts from the existing local store and tries to discover new documents. It tracks the average freshness of pages, revisits frequently changing sites accordingly, and updates the local store periodically.
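
The two crawl modes can be sketched roughly as follows. This is a minimal illustration, not FAST's implementation: the `fetch` and `extract_links` callables, the in-memory `local_store` dictionary, the `schedule` mapping, and the revisit rule (revisit a page once it is older than its average change interval) are all assumptions made for the example.

```python
from collections import deque


def batch_crawl(seeds, fetch, extract_links):
    """Start from an empty local store; never fetch the same URL twice."""
    local_store = {}                 # url -> page content
    queue = deque(seeds)
    seen = set(seeds)
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:             # dead link: nothing to store
            continue
        local_store[url] = page
        for link in extract_links(page):
            if link not in seen:     # guarantees one fetch per URL per cycle
                seen.add(link)
                queue.append(link)
    return local_store


def due_for_revisit(last_fetched, avg_change_interval, now):
    """Hypothetical freshness rule: revisit once the page is older than
    its average observed change interval (all values in seconds)."""
    return now - last_fetched >= avg_change_interval


def incremental_crawl(local_store, schedule, fetch, extract_links, now):
    """Start from the existing store: refresh stale pages, report new URLs.

    schedule maps url -> (last_fetched, avg_change_interval).
    """
    new_urls = []
    for url, (last_fetched, interval) in list(schedule.items()):
        if not due_for_revisit(last_fetched, interval, now):
            continue
        page = fetch(url)
        if page is None:
            continue
        local_store[url] = page          # update the stored copy
        schedule[url] = (now, interval)
        for link in extract_links(page):
            if link not in local_store and link not in new_urls:
                new_urls.append(link)    # discovered, not yet fetched
    return new_urls
```

In this sketch the batch crawl pays the full cost of fetching everything once, while the incremental crawl touches only pages judged stale and hands back newly discovered URLs for later fetching.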

The crawler farm is a set of independent machines in nodes, coordinated by a scheduler, each with its own segment of the web to crawl. They coordinate work and share some link information and meta information. The document scheduler provides a prioritized ranking of documents to crawl next, tracks the rules in robots.txt, and keeps the crawlers from sending too many requests at a time to any one server, since they are capable of requesting 1,000 documents per second.
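
A scheduler with those three duties (priority ordering, robots.txt rules, and per-server politeness) might look something like the sketch below. It is an illustration under stated assumptions rather than a description of FAST's scheduler: the class name `DocumentScheduler`, the `example-crawler` user agent, the numeric priorities and the fixed `min_delay` between requests to one host are all invented for the example. The robots.txt handling uses Python's standard `urllib.robotparser`.

```python
import heapq
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class DocumentScheduler:
    """Toy scheduler: priority order, robots.txt rules, per-host delay."""

    def __init__(self, user_agent="example-crawler", min_delay=1.0):
        self.user_agent = user_agent   # hypothetical crawler name
        self.min_delay = min_delay     # assumed seconds between hits on one host
        self._heap = []                # entries: (-priority, sequence, url)
        self._seq = 0                  # tie-breaker keeps insertion order
        self._robots = {}              # host -> RobotFileParser
        self._next_allowed = {}        # host -> earliest next request time

    def set_robots(self, host, robots_txt):
        """Register the robots.txt text already downloaded for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def add(self, url, priority):
        """Queue a URL unless robots.txt disallows it. Returns True if queued."""
        host = urlsplit(url).netloc
        parser = self._robots.get(host)
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return False
        heapq.heappush(self._heap, (-priority, self._seq, url))
        self._seq += 1
        return True

    def next_url(self, now):
        """Return the highest-priority URL whose host is ready, or None."""
        deferred = []
        result = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlsplit(entry[2]).netloc
            if self._next_allowed.get(host, 0.0) <= now:
                self._next_allowed[host] = now + self.min_delay
                result = entry[2]
                break
            deferred.append(entry)     # host still cooling down
        for entry in deferred:         # put skipped entries back
            heapq.heappush(self._heap, entry)
        return result
```

Keeping the politeness check inside `next_url` means a busy host never blocks the queue: lower-priority URLs on other hosts are handed out while the busy host cools down.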

Some crawls focus on finding new documents, others on freshness, that is, updating existing documents. Up to two-thirds of new URLs point to documents that duplicate content already found, and in a 7- to 11-day crawl cycle about 5% of URLs turn out to be dead links.
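
With so many new URLs leading to content that has already been seen, a crawler needs a cheap way to recognize duplicates. The talk did not say how FAST does this; one common approach, sketched here as an assumption, is to fingerprint each document's normalized text and keep a table of fingerprints already encountered. The `DuplicateFilter` class and the normalization (lowercasing and collapsing whitespace) are hypothetical choices for the example, and this only catches exact duplicates after normalization, not near-duplicates.

```python
import hashlib


def content_fingerprint(text):
    """SHA-256 of lowercased, whitespace-collapsed text (assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Remembers the first URL seen for each distinct content fingerprint."""

    def __init__(self):
        self._first_url = {}   # fingerprint -> first URL with that content

    def check(self, url, text):
        """Return the URL of an earlier identical document, or None if new."""
        fingerprint = content_fingerprint(text)
        original = self._first_url.get(fingerprint)
        if original is None:
            self._first_url[fingerprint] = url
        return original
```

For example, after `check("http://a/", "Hello  World")` returns `None`, a later call `check("http://b/", "hello world\n")` returns `"http://a/"`, flagging the second URL as a duplicate of the first.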

In addition to web crawling, FAST has a multimedia crawler (which also covers Flash sites) and a real-time news crawler. It has a special crawler farm dedicated to paid-inclusion pages, which checks for updates every 24 hours.

Document Indexing

The document processor retrieves the documents themselves, handles the various file formats, checks for document duplication and parses data. FAST analyzes the documents in a number of ways, converting them to XML, detecting the main language, checking for adult subject matter, classifying the content by topic, and calculating a static quality ranking. In addition to indexing for search, FAST provides alerting services, notifying users when new information on designated topics appears.

Searching

When a user searches AllTheWeb or Lycos, FAST performs some analysis on the query itself. It checks for language settings and looks for linguistic cues in order to match results to the language of the searcher. Additional phrasing processes recognize multiword terms, so it can search for "San Francisco" and "New York" as phrases rather than as unrelated words.
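
Phrase recognition of this kind can be illustrated with a greedy longest-match pass over the query against a list of known multiword terms. This is only a sketch of the general idea: the function name `segment_query`, the `known_phrases` set and the three-word limit are assumptions, and the talk did not describe how FAST actually builds or applies its phrase list.

```python
def segment_query(query, known_phrases, max_len=3):
    """Greedy longest-match split of a query into phrases and single words.

    known_phrases is a set of lowercase multiword terms (hypothetical data).
    """
    words = query.lower().split()
    segments = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, falling back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in known_phrases:
                segments.append(candidate)
                i += n
                break
    return segments


phrases = {"san francisco", "new york"}
print(segment_query("hotels in San Francisco and New York", phrases))
# prints ['hotels', 'in', 'san francisco', 'and', 'new york']
```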

Lycos
http://www.lycos.com

AllTheWeb
http://www.alltheweb.com

Avi Rappoport is a search engine consultant and maintains the Complete Guide to Search Engines for Web Sites and Intranets. Contact her at [email protected]


