A special report from the Search Engine Strategies 2001 Conference, August 12-14, San Jose, CA.
At the Anatomy of a Search Engine panel, Tim Mayer, vice president of web search at Fast Search & Transfer, offered a behind-the-scenes glimpse at the inner workings of FAST, the search engine that powers Lycos, AlltheWeb.com and other regional portals.
FAST has a powerful search engine infrastructure:
- 2 data centers (for redundancy)
- 150 racks
- 1100 Dell machines
- running mainly FreeBSD (and other platforms, including Windows 2000 and Sun's Solaris)
- Over 2.1 billion documents
The search engine responds to 40 million queries per day.
The FAST crawlers follow links to collect data from the web. They collect each document and send it to the indexer, which builds a searchable index. They also store a copy of each page so that query matches can be shown in context.
Challenges for crawlers include tracking updates and new documents as the web continues its exponential growth. Link structures change constantly as links are added and removed, so the index has to track both the pages themselves and the relationships between them. New content types bring new challenges: indexing the metadata and useful context for multimedia files, FTP archives, new file formats, structured (database) data, dynamic content, Flash, and 3,500 sources of real-time news.
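The indexing step described above can be illustrated with a toy inverted index that also keeps a stored copy of each page for showing matches in context. This is only a minimal sketch: the class name `TinyIndex`, the tokenization rule, and the snippet width are assumptions made for illustration, not details from the panel.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index with a stored copy of every page (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs containing it
        self.store = {}                   # URL -> stored page text

    def add(self, url, text):
        # Keep a copy of the page so matches can later be shown in context.
        self.store[url] = text
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(url)

    def search(self, term, width=20):
        # Return (url, snippet) pairs; the snippet is cut from the stored copy.
        term = term.lower()
        results = []
        for url in sorted(self.postings.get(term, ())):
            text = self.store[url]
            pos = text.lower().find(term)
            start = max(0, pos - width)
            results.append((url, text[start:pos + len(term) + width]))
        return results
```

A production index would store term positions, compress its postings, and shard across machines; none of that is attempted here.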
FAST crawls in two ways. Batch crawling starts from scratch, with an empty "local store", and never fetches the same page twice within an indexing cycle. Incremental crawling starts from the existing local store and tries to discover new documents. It also tracks the average freshness of pages, revisits frequently changing sites accordingly, and updates the local store periodically.
The crawler farm is a set of independent machines, grouped into nodes and coordinated by a scheduler, each with its own segment of the web to crawl. They coordinate work and share some link information and meta information. The document scheduler provides a prioritized ranking of documents to crawl next, tracks the rules in robots.txt, and keeps the crawlers from sending too many requests to a server at once, since they are capable of requesting 1,000 documents per second.
Some crawls focus on finding new documents, others on freshness, that is, on updating existing documents. Up to two-thirds of new URLs point to documents that duplicate content already found, and in a 7- to 11-day crawl cycle, about 5% of URLs are dead links.
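The difference between the two crawl modes can be sketched in a few lines. The function names, the `fetch` callback (which returns a `(text, links)` pair, or `None` for a dead link), and the `revisit` list are hypothetical; how FAST actually selects pages to revisit was not described beyond tracking average freshness.

```python
from collections import deque


def _crawl(queue, fetch, store, seen):
    """Shared loop: fetch each queued URL at most once and follow unseen links."""
    discovered = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        if page is None:  # dead link
            continue
        text, links = page
        if url not in store:
            discovered.append(url)
        store[url] = text
        # Only follow links to documents that are neither seen nor stored.
        queue.extend(l for l in links if l not in seen and l not in store)
    return discovered


def batch_crawl(seeds, fetch):
    """Start from an empty local store; no page is fetched twice."""
    store = {}
    _crawl(deque(seeds), fetch, store, set())
    return store


def incremental_crawl(store, revisit, fetch):
    """Start from an existing store, refresh `revisit`, and pick up new URLs."""
    updated = dict(store)
    new_urls = _crawl(deque(revisit), fetch, updated, set())
    return updated, new_urls
```

In this sketch the caller decides which URLs go into `revisit`; a freshness-driven crawler would derive that list from how often each page has changed in the past.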
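A scheduler of the kind described, combining a priority queue, robots.txt rules, and per-server rate limiting, might look roughly like the sketch below. The class name `DocumentScheduler`, the `min_delay` parameter, and the `example-crawler` user agent are illustrative assumptions. The robots.txt handling uses Python's standard `urllib.robotparser`.

```python
import heapq
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class DocumentScheduler:
    """Hypothetical scheduler: lowest priority number is crawled first."""

    def __init__(self, min_delay=1.0, user_agent="example-crawler"):
        self.min_delay = min_delay  # assumed seconds between hits on one host
        self.user_agent = user_agent
        self.heap = []              # entries are (priority, sequence, url)
        self.seq = 0                # tie-breaker that keeps insertion order
        self.robots = {}            # host -> parsed robots.txt rules
        self.next_allowed = {}      # host -> earliest time of the next request

    def set_robots(self, host, robots_txt):
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def add(self, url, priority):
        """Queue a URL unless robots.txt for its host disallows it."""
        parser = self.robots.get(urlsplit(url).netloc)
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return False
        heapq.heappush(self.heap, (priority, self.seq, url))
        self.seq += 1
        return True

    def next_url(self, now):
        """Return the best URL whose host is not rate-limited at `now`."""
        deferred, chosen = [], None
        while self.heap:
            item = heapq.heappop(self.heap)
            host = urlsplit(item[2]).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                self.next_allowed[host] = now + self.min_delay
                chosen = item[2]
                break
            deferred.append(item)  # host still cooling down; try it later
        for item in deferred:
            heapq.heappush(self.heap, item)
        return chosen
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the sketch deterministic and easy to test.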
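One common way to catch duplicate content arriving under different URLs is to fingerprint each document's normalized text. The panel did not say how FAST detects duplicates, so the following is only a sketch of that general technique, with hypothetical names and an assumed normalization (collapsing whitespace and lowercasing).

```python
import hashlib


def content_fingerprint(text):
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Remembers the first URL seen for each distinct content fingerprint."""

    def __init__(self):
        self.first_url = {}  # fingerprint -> first URL with that content

    def check(self, url, text):
        """Return the earlier URL with identical content, or None if new."""
        fingerprint = content_fingerprint(text)
        if fingerprint in self.first_url:
            return self.first_url[fingerprint]
        self.first_url[fingerprint] = url
        return None
```

An exact hash only catches documents that are identical after normalization; near-duplicates, such as pages that differ only in a date stamp, would need a similarity-based method instead.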
In addition to web crawling, FAST has a multimedia crawler (which also covers Flash sites) and a real-time news crawler. It has a special crawler farm dedicated to paid-inclusion pages, which checks for updates every 24 hours.
The document processor retrieves the documents themselves, handles the various file formats, checks for document duplication and parses data. FAST analyzes the documents in a number of ways, converting them to XML, detecting the main language, checking for adult subject matter, classifying the content by topic, and calculating a static quality ranking. In addition to indexing for search, FAST provides alerting services, notifying users when new information on designated topics appears.
When a user searches AlltheWeb or Lycos, FAST performs some analysis on the query itself. It checks language settings and looks for linguistic cues in order to match results to the language of the searcher. Additional phrasing processes recognize multiword terms, so that it can search for San Francisco and New York as phrases rather than as unrelated words.
Avi Rappoport is a Search Engine Consultant and maintains the Complete Guide to Search Engines for Web Sites and Intranets. Contact her at [email protected]
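Phrase recognition of this kind can be approximated by greedy longest-match lookup against a list of known multiword terms. The phrase list, the function name `segment_query`, and the `max_len` limit below are assumptions for illustration; the panel did not describe how FAST builds or applies its phrase data.

```python
# Hypothetical phrase dictionary; a real engine would derive this from data.
KNOWN_PHRASES = {("san", "francisco"), ("new", "york")}


def segment_query(query, phrases=KNOWN_PHRASES, max_len=3):
    """Split a query into search units, keeping known phrases together."""
    words = query.lower().split()
    units = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in phrases:
                units.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so the word stands alone.
            units.append(words[i])
            i += 1
    return units
```

For example, `segment_query("hotels in San Francisco and New York")` yields `["hotels", "in", "san francisco", "and", "new york"]`, so the two city names are searched as phrases.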