Though they're nothing more than simple pieces of software, there's an aura of mystery about web crawlers. These are the programs that canvass the web looking for pages to include in search engine indexes. They have cute names like Slurp, Scooter and Googlebot, and may evoke images of sprites scuttling along the luminous strands of cyberspace.
In reality, crawlers are relatively simple programs, though one that requests pages too quickly has the power to bring a web site to a standstill. They can also automatically and rapidly fetch material that a site owner may not want anyone to see. For this reason, most crawlers (also called "robots") abide by the "robots exclusion protocol," an informal set of rules that constrains their behavior.
Part of the protocol says that robots should identify themselves to the web servers they visit, offering both name and IP address, and preferably the name of a contact person who can check a misbehaving robot.
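To make the exclusion rules concrete, here is a minimal sketch in Python of how a well-behaved robot might consult a site's robots.txt file before fetching a page. It uses the standard library's urllib.robotparser module. The robots.txt contents, the robot names "ExampleBot" and "BadBot", and the example.com URLs are all invented for illustration; a real crawler would download the file from the site it is visiting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, made up for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

# Hypothetical identification string: a robot name plus a contact address,
# in the spirit of the "identify yourself" guideline described above.
USER_AGENT = "ExampleBot/1.0 (+http://www.example.com/bot.html; bot-admin@example.com)"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# ExampleBot falls under the catch-all "*" rules: everything except /private/ is allowed.
home_allowed = parser.can_fetch(USER_AGENT, "http://www.example.com/index.html")
private_allowed = parser.can_fetch(USER_AGENT, "http://www.example.com/private/report.html")

# BadBot has been told to stay away from the whole site.
badbot_allowed = parser.can_fetch("BadBot", "http://www.example.com/index.html")

print(home_allowed, private_allowed, badbot_allowed)
```

Note that nothing here forces a robot to obey; the protocol works only because crawler authors choose to run a check like this before each request.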
All of the major search engines have crawlers, like Slurp (Inktomi), Scooter (AltaVista) and Googlebot (Google). Some have multiple crawlers assigned to different tasks, such as finding new pages, checking on existing pages, and so on. It's relatively easy to discern when your site has been crawled just by looking at your server logs.
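As a rough illustration of what "looking at your server logs" can mean in practice, the Python sketch below scans a few log lines for the crawler names mentioned above and lists the pages each one requested. It assumes the server writes the common "combined" log format, in which the user-agent string is the last quoted field. The sample log lines, IP addresses and user-agent strings are invented for this example; real logs and real crawler strings will differ.

```python
import re

# Crawler names taken from the article; matched as case-insensitive substrings.
KNOWN_CRAWLERS = ("Slurp", "Scooter", "Googlebot")

# Assumed "combined" log format:
# ip ident user [time] "METHOD path protocol" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [24/Oct/2001:13:36:00 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.20 - - [24/Oct/2001:13:37:10 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "Scooter/3.2"',
    '192.0.2.30 - - [24/Oct/2001:13:38:45 +0000] "GET /about.html HTTP/1.1" 200 2048 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"',
    '192.0.2.10 - - [24/Oct/2001:13:39:02 +0000] "GET /products.html HTTP/1.0" 200 7300 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]


def pages_fetched_by_crawlers(lines):
    """Return a dict mapping each known crawler name to the list of paths it requested."""
    fetched = {}
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for name in KNOWN_CRAWLERS:
            if name.lower() in agent:
                fetched.setdefault(name, []).append(match.group("path"))
                break
    return fetched


visits = pages_fetched_by_crawlers(SAMPLE_LOG)
for crawler, paths in visits.items():
    print(crawler, paths)
```

Matching on a fixed list of names only finds the robots you already know about. To spot unfamiliar ones, a common trick is to list every user agent that requested /robots.txt, since ordinary browsers rarely ask for that file.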
But there are hundreds -- perhaps thousands -- of other robots loose on the web that may also be poking through your server. They have odd names ranging from AbachoBOT to ZyBorg. And not all robots work for search engines. Other types of robots include link checkers, page-change monitors, FTP clients -- even web browsers.
Dr. John A. Fotheringham has assembled an impressive list of the robots (crawlers and others) that have visited his software company's web server, together with the IP addresses they use and links to their home pages. Even if you're not technically oriented, this page provides a fascinating glimpse of the world of automated programs running around on the web, tirelessly gathering information that for the most part makes our life in cyberspace much easier.
Search Engine Robots
A huge list of active web robots.
All About Search Indexing Robots and Spiders
The title says it all -- an excellent article covering everything you ever wanted to know about search engine crawlers and how they work, with extensive links for more information.
If you have access to your server logs, you can ferret out which spiders or crawlers have visited your site, and exactly what they've fetched. Here's how to do it.
The ABCi/IAB Master Industry List Of Spiders And Robots
The Interactive Advertising Bureau (IAB) and ABCi announced that they will create and maintain the ABCi/IAB master industry list of spiders and robots. The list will be updated monthly and is available to IAB members and ABCi clients free of charge. They also offer an informative FAQ about robots at http://www.abcinteractiveaudits.com/abci_iab_spidersandrobots/faqs.html.