SpiderSpotting: When A Search Engine, Robot Or Crawler Visits

Search engines send out what are called spiders, crawlers or robots to visit your site and gather web pages. These robots leave traces behind in your access logs, just as an ordinary person does. If you know what to look for, you can tell when a spider has come to call. That can save you worrying that you haven't been visited. You can tell exactly what a robot has recorded or failed to record. You can also spot robots that may be making a large number of requests, which can affect your page impression statistics or even burden your server.

Searching for Spiders

How do you identify a spider? Those from the major search engines can sometimes be identified from their host names. These often incorporate part of the search engine's name or the company's name. For example, one of WebCrawler's host names is spidey.webcrawler.com.

A better way of spotting spiders is to look for their agent names, or what some people call browser names. Spiders have their own names, just like browsers. For example, Netscape identifies itself by saying Mozilla. Alta Vista's spider says Scooter, while HotBot's spider is named Slurp.

Some resources for getting a list of host and agent names for the major search engines are listed below. However, it's useful to know how to spot any robot, because names can change and new robots can appear. So, let's take a look at how to spot robots when you don't know what to look for.

Be aware that in the examples below, spider names are from when this article was originally written, in 1997. The principles of spotting spiders remain the same, however.

Your Best Clue: robots.txt

Start your search with a review of requests for the robots.txt file. This is a file that tells robots what they may and may not index within a site. Not all spiders follow the robots.txt convention, but most do. Anything requesting this file is almost certainly a spider, robot or an agent.

By reviewing the requests, you can usually spot spiders from the major search engines by their host names, which in turn tells you the latest agent names. You'll probably be surprised to see how many smaller search engines, personal agents and other robots are also accessing your site.

I call this review of the robots.txt file a crawler report. Here's an example of one:

Crawler Report

I created the report by using log analysis software to analyze three months of log activity. The report groups requests first by agent name, then by host name, and ranks them in order of visits.
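If you don't have log analysis software handy, a report along these lines can be approximated with a short script. The following is only a minimal sketch in Python, not the software used for the report above. It assumes the access log is in Combined Log Format with host names already resolved, and it counts robots.txt requests per agent and host pair, ranked by visits. The function name crawler_report, the x1-bbn.infoseek.com and dialup42.example.net hosts, and all the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout (Combined Log Format, host names already resolved):
# host ident user [time] "METHOD path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_report(lines):
    """Count robots.txt requests per (agent, host) pair, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        path = match.group("path").split("?", 1)[0]
        if path == "/robots.txt":
            counts[(match.group("agent"), match.group("host"))] += 1
    return counts.most_common()


# Made-up log lines for illustration; only the names quoted in the article are real.
SAMPLE_LOG = [
    'spidey.webcrawler.com - - [10/Feb/1997:04:12:01 -0800] '
    '"GET /robots.txt HTTP/1.0" 200 68 "-" "WebCrawler/3.0 Robot libwww/5.0"',
    'spidey.webcrawler.com - - [17/Feb/1997:04:15:44 -0800] '
    '"GET /robots.txt HTTP/1.0" 200 68 "-" "WebCrawler/3.0 Robot libwww/5.0"',
    'x1-bbn.infoseek.com - - [11/Feb/1997:09:30:10 -0800] '
    '"GET /robots.txt HTTP/1.0" 200 68 "-" "InfoSeek Sidewinder/0.9"',
    'dialup42.example.net - - [11/Feb/1997:09:31:00 -0800] '
    '"GET /index.html HTTP/1.0" 200 5120 "-" "Mozilla/3.01 (Win95; I)"',
]

for (agent, host), visits in crawler_report(SAMPLE_LOG):
    print(f"{visits:>4}  {agent}  {host}")
```

The ordinary browser request for /index.html is ignored, because only requests for robots.txt count as evidence of a robot here. If your server logs IP addresses rather than host names, you would need to resolve them first.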

As you can see, there are all sorts of robots visiting the site. Small search engines and even experimental search engines make visits, such as Stanford's BackRub (the predecessor to Google). Offline browsers also come to call, such as NetCarta's WebMapper. These are essentially personal spiders.

Naturally, the major search engines make appearances. It's easy to spot Infoseek as InfoSeek Sidewinder/0.9 and WebCrawler as WebCrawler/3.0 Robot libwww/5.0. But you need to know that HotBot is produced by Inktomi to match it to Slurp/2.0, or that Architext is the parent company to Excite to know that ArchitextSpider means that Excite has visited.

Even More Agent Names

Armed with agent names from the crawler report, you can go back and run a report for a specific search engine's spider. For example, I can look for Infoseek by searching for InfoSeek Sidewinder/0.9.

It's often a good idea to slightly broaden the search. By searching for *infoseek* with my log analysis software, I told the program to search the logs for any agent names containing "infoseek". I also told the program to list all the host names from these visits. Here's the report:

Infoseek Report

From the report, you can see the usefulness of broadening the search. Agent names can change, which is why InfoSeek Sidewinder and Infoseek Robot 1.17 are also listed.
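The same kind of broadened search can be sketched in Python with the standard fnmatch module, which understands the * wildcard. This is an illustration under assumptions, not how any particular log analysis package works. It assumes you already have (host, agent) pairs extracted from your log, and the helper name agent_report and the host names in the sample are hypothetical.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def agent_report(records, pattern):
    """List the hosts seen for each agent name matching a wildcard pattern.

    Matching ignores case, so "*infoseek*" finds both InfoSeek and Infoseek.
    """
    pattern = pattern.lower()
    hosts_by_agent = defaultdict(set)
    for host, agent in records:
        if fnmatchcase(agent.lower(), pattern):
            hosts_by_agent[agent].add(host)
    return {agent: sorted(hosts) for agent, hosts in hosts_by_agent.items()}


# Hypothetical (host, agent) pairs; the agent names are the ones quoted in the article.
VISITS = [
    ("x1-bbn.infoseek.com", "InfoSeek Sidewinder/0.9"),
    ("x2-bbn.infoseek.com", "InfoSeek Sidewinder/0.9"),
    ("x1-bbn.infoseek.com", "Infoseek Robot 1.17"),
    ("spidey.webcrawler.com", "WebCrawler/3.0 Robot libwww/5.0"),
]

for agent, hosts in agent_report(VISITS, "*infoseek*").items():
    print(agent, "->", ", ".join(hosts))
```

Searching for the exact string InfoSeek Sidewinder/0.9 would have missed the Infoseek Robot 1.17 visit, while the wildcard pattern catches both and still leaves WebCrawler out.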

Helpful Host Names

In addition, listing host names helps you spot any uses of the Infoseek crawler by someone other than Infoseek. For example, the seven.eccosys.com host requests are probably from a company that has licensed Infoseek's technology for its own purposes. Look at the HotBot report to see something similar.

HotBot Report

See the crawler1.anzwers.ozemail.net request? This is probably from OzEmail, HotBot's Australian partner. I want to exclude it from the report, so that I only get true HotBot requests.

I had another reason to want to know host names. I only began recording agent name information in mid-January 1997, but my statistics database went back to December. So, I needed to run reports filtered by host name in order to get a complete picture. In the case of Infoseek, it was easy to see that the host names all had the form of *-bbn.infoseek.com, so I entered this into my log analysis program in order to get only Infoseek requests. Here's a revised version of that earlier report:

Infoseek Report (from host names)

Actually, when I ran this report, I broadened the search to *infoseek.com, to ensure I didn't miss anything. That's why a single request from topgun.infoseek.com appears.

But by also listing the agent names, I could see that there was a single request from Netscape 3.x. This is no doubt a human being from Infoseek who came to the site. Narrowing the filter to *-bbn.infoseek.com, as shown above, eliminates this non-spider request.
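To make the difference between the two host filters concrete, here is a small Python sketch using wildcard matching on host names. It assumes (host, agent) pairs have already been pulled from the log. The helper name filter_by_host, the x1-bbn.infoseek.com host and the exact Netscape agent string are illustrative assumptions, and the sample also assumes the Netscape request is the one that came from topgun.infoseek.com.

```python
from fnmatch import fnmatchcase


def filter_by_host(records, pattern):
    """Keep (host, agent) records whose host name matches a wildcard pattern."""
    pattern = pattern.lower()
    return [(host, agent) for host, agent in records
            if fnmatchcase(host.lower(), pattern)]


# Hypothetical records; host name forms follow the ones described in the article.
RECORDS = [
    ("x1-bbn.infoseek.com", "InfoSeek Sidewinder/0.9"),
    ("topgun.infoseek.com", "Mozilla/3.01 (Win95; I)"),  # a person using Netscape 3.x
    ("spidey.webcrawler.com", "WebCrawler/3.0 Robot libwww/5.0"),
]

broad = filter_by_host(RECORDS, "*infoseek.com")        # includes the human visitor
narrow = filter_by_host(RECORDS, "*-bbn.infoseek.com")  # spider hosts only

print(len(broad), "requests with the broad filter")
print(len(narrow), "requests with the narrow filter")
```

The broad pattern is useful for a first look, since it shows everything coming from the company. Once you know which host names the spider uses, the narrow pattern gives a cleaner count.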

Some Final Clues

OK, now you know what to look for. However, there are a few things that can throw you off. For example, some search engines can go nuts now and then. Really. Go back and look at the HotBot report, above. Look way down on the report, where actual page requests are listed. You can see that for several days, all HotBot did was request the home page from the site, and nothing more.

During this period, HotBot was upgrading its crawler technology. It went back into normal mode in late February. You can see the change, because suddenly the search engine requests a large number of different pages from the web site.

That's what a good search engine will do. It will visit your site on a regular basis and request a large number of pages, perhaps spread out over a period of a few days. This is the search engine being "polite" and trying not to overburden your server all at once.

So, be sure to look at what the spiders are actually requesting, rather than just adding up the requests. It can be surprising.
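One way to check this is to compare, for each day, how many requests a spider made against how many different pages it asked for. The Python sketch below is a rough illustration. It assumes you have already filtered the log down to one spider's requests and reduced each to a (day, path) pair, with days written as YYYY-MM-DD so they sort correctly. The function name pages_per_day and the sample data are hypothetical.

```python
from collections import defaultdict


def pages_per_day(requests):
    """Map each day to (total requests, distinct paths requested)."""
    paths_by_day = defaultdict(list)
    for day, path in requests:
        paths_by_day[day].append(path)
    return {day: (len(paths), len(set(paths)))
            for day, paths in sorted(paths_by_day.items())}


# Made-up requests from a single spider, for illustration only.
SPIDER_REQUESTS = [
    ("1997-02-10", "/"),
    ("1997-02-10", "/"),
    ("1997-02-10", "/"),
    ("1997-02-25", "/"),
    ("1997-02-25", "/about.html"),
    ("1997-02-25", "/products.html"),
]

for day, (total, distinct) in pages_per_day(SPIDER_REQUESTS).items():
    print(f"{day}: {total} requests, {distinct} distinct pages")
```

In this made-up data, both days show three requests, but only the second day shows the spider actually working through the site. A long run of days with one distinct page is the pattern to watch for.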

Another oddity can be caused by "instant indexing" search engines, such as Alta Vista, Infoseek and WebCrawler. These will instantly spider any page submitted to them (though WebCrawler, unlike the other two, takes much longer to add that page to its index).

Sometimes, people see that spiders from these search engines have retrieved a page or two and mistakenly believe everything has been added. In reality, the spider still needs to return to properly catalog the site. Watch your logs, and you'll know when it has.

More Resources

How Search Engines Work
Spiders are just one part of a search engine. This page within the site puts it all in context for you.

ABCi and IAB Spiders and Robots
http://www.abcinteractiveaudits.com/abci_iab_spidersandrobots/

A list of spiders for IAB or ABCi clients (for more about this, see the Web Spiders List Planned article, below).

The Web Robots Database
http://www.robotstxt.org/wc/active.html

The web's oldest list of spiders and crawlers, based on self-reported data.

SpiderHunter.com
http://www.spiderhunter.com/

A huge collection of resources devoted to tracking spiders. You can pretend to be a spider, view a collection of spiders by name and IP addresses, read tutorial information and more.

Search Engine Spider IP Addresses
http://www.searchengineworld.com/spiders/spider_ips.htm

Comprehensive list of agent names and IP addresses.

Web Spiders List Planned
The Search Engine Report, Nov. 5, 2001

Search engine spiders and crawlers can skew page view statistics, which means that advertisers might pay for impressions that human beings never see. To solve this, the Interactive Advertising Bureau is now offering a list of robotic visitors.

BotWatch Configuration File
http://www.tardis.ed.ac.uk/~sxw/robots/index.html

Lists a wide range of robots, with agent and host names.

