Unless your website has been hiding under some virtual rock for an extended period of time, chances are you've got your work cut out for you analyzing your daily traffic stats. That goes double when you're concerned with SEO: you want to know when, if, and how the crawler-based search engines spider your pages, which search queries (a.k.a. keywords) are pulling visitors to your site, whether someone has linked to you out of the goodness of their heart, how frequently the engines are paying your pages a visit, and lots more.
In an ideal world, these crawlers would be quite open about what they're doing: just like anyone with a modicum of good manners, they'd say something like "hello, I'm Googlebot from Google.com and I've come to check out what exciting new stuff you've put up recently."
Regrettably, there's more than a slight disconnect between this scenario and what's going on in the real world: there's a veritable onslaught of search engine spiders being all sneaky about what they actually are. And it would be utterly naive to assume that they aren't doing it intentionally. These spiders are run by some of the smartest people in the world, so you can safely bet the farm on the assumption that they're doing it with a purpose.
So let's look at some of the most common techniques you'll encounter when taking a closer look at the Wild West that is spiderland these days.
Robotic Fakes: Search Engine Spiders Pretending to be Human Browsers
Here is the typical User Agent of one of the many run-of-the-mill Bing spiders: "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
Don't rely on their sticking to this format religiously, though: recently, we've detected an increasing number of MSN bots featuring a slightly modified User Agent with two extra characters, "._", tacked onto the end.
Conceivably, Bing is leveraging this rudimentary change to trick simplistic scripts of the "poor man's cloaking" kind. Such scripts conduct their cloaking activities based on a visitor's "User Agent" signature. There's nothing sophisticated about the trick: if a given website is on the lookout for an exact User Agent match, the two extra characters are enough to make the comparison fail, and the spider in question won't be recognized anymore.
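To see why an exact comparison is so brittle, here is a minimal Python sketch. It assumes the modified User Agent is simply the standard msnbot string with "._" appended; the function names and the looser pattern are hypothetical illustrations, not anything Bing or a particular cloaking script is known to use.

```python
import re

# The standard User Agent quoted above.
KNOWN_UA = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
# Assumption: the modified variant is the same string plus a trailing "._".
VARIANT_UA = KNOWN_UA + "._"


def exact_match(user_agent: str) -> bool:
    """Naive check: only a byte-for-byte identical string counts as the bot."""
    return user_agent == KNOWN_UA


# Hypothetical looser pattern: the token "msnbot/" followed by a version digit.
MSNBOT_PATTERN = re.compile(r"\bmsnbot/\d", re.IGNORECASE)


def token_match(user_agent: str) -> bool:
    """Looser check: the msnbot token may appear anywhere in the string."""
    return MSNBOT_PATTERN.search(user_agent) is not None


for ua in (KNOWN_UA, VARIANT_UA):
    print(f"{ua!r}: exact={exact_match(ua)} token={token_match(ua)}")
```

The looser check survives the trailing characters, but it is still only as trustworthy as the User Agent itself: a bot that sends a plain browser string slips past both checks.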
Nor is this the limit of Bing's shenanigans. There are plenty of MSN bots featuring an entirely unobtrusive Internet Explorer User Agent, which means, of course, that they cannot be identified as crawlers via their User Agent alone. If that's what you're going by (e.g. when analyzing your traffic stats), there's no way you can tell whether it's actually a human visitor you're dealing with or merely some bot. Some examples:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.2)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SLCC1; .NET CLR 1.1.4325; .NET CLR 2.0.40607; .NET CLR 3.0.04506.648)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; MS-RTC LM 8)
Bing deploys these bots to spider both your standard HTML pages and your ".js" and ".css" files.
There used to be a time when every man and his dog believed that things were perfectly straightforward with Google. After all, you'd know that Googlebot from Mountain View, California anywhere, right?
Wrong: there are plenty of Googlebots out there sporting an inconclusive browser User Agent. More specifically, they'll simply mimic a Firefox browser.
Here's a real world example:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.9) Gecko/20100315 Firefox/3.5.9 GTB7.1 (.NET CLR 3.5.30729)
For comparison, this is the regular Googlebot User Agent:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Here's another Google spider, this time pretending to be Internet Explorer:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Google uses this spider to crawl AdWords campaign Quality Pages.
Thus, if you're cloaking your Quality Page by looking out for a regular Googlebot User Agent, more likely than not you'll soon be found out and possibly penalized.
Human Visitors or Non-Search-Engine Spiders Pretending to Be Search Engine Crawlers
Frequently, you'll hit on entries in your webserver's log file that appear to be regular search engine spiders. Once you take a close, hard look at their respective IPs, however, it becomes obvious that things aren't quite what they seem to be.
Mozilla/5.0 (compatible; Yahoo! Slurp/3.0;
Without going off on a forensic tangent here, suffice it to say that you'll have to make use of specialized services such as DomainTools and OS-based tools such as nslookup or dig to detect that "crucialx.net" in this example is either a fake entry altogether or shares its IP with surf-anon.com. Doesn't look like Yahoo's Slurp after all, does it?
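The lookups you'd run by hand with nslookup or dig can also be scripted. What follows is a minimal Python sketch of a forward-confirmed reverse DNS check, not a definitive implementation: the function names are hypothetical, the ".crawl.yahoo.net" suffix is an assumption about where genuine Slurp hosts resolve, and the demo at the bottom uses stub resolvers and a documentation-range IP rather than real data so that it runs without touching the network.

```python
import socket
from typing import Callable, Iterable, List


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """A lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True only if the IP reverse-resolves to an allowed hostname AND
    that hostname resolves forward to the same IP again."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(tuple(allowed_suffixes)):
            return False
        return ip in forward(hostname)
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Offline demo with stub resolvers (made-up values, for illustration only).
SLURP_SUFFIXES = [".crawl.yahoo.net"]  # assumption, see lead-in
stub_reverse = lambda ip: "crucialx.net"
stub_forward = lambda host: ["192.0.2.10"]
print(is_verified_crawler("192.0.2.10", SLURP_SUFFIXES, stub_reverse, stub_forward))
```

Called without the stub arguments, the function performs live DNS lookups. The forward confirmation step matters because whoever controls an IP block can point its reverse DNS at any hostname they like, but they cannot make the genuine domain resolve back to their IP.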
Finally, here's a visitor from an entirely ordinary IP pretending to be Googlebot:
Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) S0106c80aa9952f80.cg.shawcable.net
Non-SE spiders hiding behind search engine User Agents can crawl loads of webpages and generally remain undetected, all the while eating up your bandwidth and server resources without giving anything back in return. Not good!
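If your log file already records resolved hostnames, a first-pass triage doesn't even need DNS: compare the crawler a visitor claims to be with the domain it actually came from. The Python sketch below is a rough illustration under assumptions. The suffix table reflects where these crawlers are commonly said to resolve and should be checked against each engine's current documentation, and the function names are hypothetical.

```python
# Assumed hostname suffixes per crawler token; verify before relying on them.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "slurp": (".crawl.yahoo.net",),
    "msnbot": (".search.msn.com",),
}


def claimed_crawler(user_agent):
    """Return the crawler token the User Agent claims, or None."""
    ua = user_agent.lower()
    for token in CRAWLER_DOMAINS:
        if token in ua:
            return token
    return None


def classify(user_agent, hostname):
    """Label a log entry by comparing its claimed crawler with its hostname."""
    token = claimed_crawler(user_agent)
    if token is None:
        return "no crawler claim"
    host = hostname.lower().rstrip(".")
    if host.endswith(CRAWLER_DOMAINS[token]):
        return "plausible " + token
    return "suspect " + token


# The entry quoted above: a Googlebot User Agent from a cable provider's host.
ua = "Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)"
host = "S0106c80aa9952f80.cg.shawcable.net"
print(classify(ua, host))  # prints "suspect googlebot"
```

Note what this cannot do: a genuine spider sending a plain browser User Agent lands in "no crawler claim", so this catches impostors but not the stealth crawlers discussed earlier.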
Perhaps you're thinking, "I'm not cloaking my pages, so why should I be concerned?" Well, consider this: even if you're not into black hat SEO, these spiders can muck up your traffic stats royally -- something that actually affects every webmaster under the sun.
Having your cloaked pages detected in this way is only an issue if you're operating with poor man's cloaking. (Your competitors can easily find you out in this manner and snitch on you to the search engines, let's not forget.) So for all serious, heavy duty deployment, you should always go with IP delivery rather than User Agent based cloaking. Like it or not, it's the only reliable approach.
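The core of IP delivery is that the decision hinges on the visitor's address rather than on anything the visitor says about itself. Here's a minimal Python sketch of that lookup, assuming you maintain your own list of spider networks. The ranges shown are placeholders from the reserved documentation blocks, not real crawler ranges, and the names are hypothetical.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real spider IPs.
# In practice this list would come from a maintained spider IP database.
SPIDER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def is_spider_ip(remote_addr: str) -> bool:
    """True if the address falls inside one of the listed networks.
    The User Agent is deliberately not consulted at all."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # malformed address: treat as an ordinary visitor
    return any(addr in net for net in SPIDER_NETWORKS)


for ip in ("192.0.2.77", "203.0.113.5"):
    print(ip, is_spider_ip(ip))
```

The check is only as good as the list behind it, which is why keeping that list current is the real work in any IP based setup.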
Not everything is what it pretends to be on the Web, with even the big respected players going for stealth tactics on a grand scale -- and your traffic stats will ill reflect what's really happening on your website unless you apply plenty of effort and expertise to analyzing them. Simplistic stats tools, whether free or commercial, won't normally be of much help and may arguably even make things worse by indicating certainty where some serious doubting would be far more to the point.