The Taming of the Bots

All savvy search engine marketers know how to create a search-friendly site that can be easily read by a search engine spider, or bot. But what happens when those bots go bad?

A special report from the Search Engine Strategies conference, August 7-10, 2006, San Jose, CA.

Many site owners tell horror stories of lost uptime, wasted bandwidth and stolen content that have resulted from rogue spiders invading their virtual domains. Several experts, including representatives from search engines Google, Yahoo and Become.com, sat on a panel to discuss the issues of "bot obedience."

Jon Glick of search engine Become.com started the session with a brief overview of the topic.

"Robots are great at finding links and content," said Glick. "But they are dumb. They can't process cookies or JavaScript."

Glick went on to highlight how to create bot-friendly sites using tactics such as hyperlink navigation, that is, navigation that uses "pure" links with no JavaScript or Flash elements. Glick also suggested that web authors avoid excessive parameters in dynamic URLs when building bot-friendly sites.
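Glick's advice on dynamic URLs can be checked mechanically. The sketch below counts query-string parameters in a URL and flags those above a limit. The limit of two and the function names are assumptions made for illustration; Glick did not give a number for what counts as "excessive."

```python
from urllib.parse import urlsplit, parse_qsl

# Assumed threshold for illustration; the panel did not name a figure.
MAX_QUERY_PARAMS = 2


def count_query_params(url: str) -> int:
    """Return the number of query-string parameters in a URL."""
    query = urlsplit(url).query
    # keep_blank_values=True so "?a=&b=" still counts as two parameters.
    return len(parse_qsl(query, keep_blank_values=True))


def has_excessive_params(url: str, limit: int = MAX_QUERY_PARAMS) -> bool:
    """Flag URLs whose parameter count exceeds the assumed limit."""
    return count_query_params(url) > limit


# Hypothetical URLs on example.com.
for url in (
    "http://example.com/about",
    "http://example.com/item?id=7",
    "http://example.com/list?cat=3&sort=price&page=2&sid=abc123",
):
    print(url, count_query_params(url), has_excessive_params(url))
```

A site owner could run a check like this over a list of internal links to find the pages most likely to give a spider trouble.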

Dan Thies of SEO Research Labs talked about one of the darker consequences of untamed bots: duplicate content. Thies said that unscrupulous bots regularly target high-performing web pages, scrape their content and place it on thousands of other pages throughout the web. This can confuse consumers, and the duplicate content is eventually filtered out of the search engine's main results and relegated to supplemental results at best. At worst, the content is booted out of the search engine index.

Thies said he has seen clients' sites lose "considerable revenue" after the sites were scraped and unauthorized duplicate content was posted throughout the web.

"My client saw a 50% drop in SEO referrals," said Thies. "This would be disastrous for most sites; however, since [the client was] good at PPC and other online advertising, the effect was noticeable but not devastating."

Bill Atchison of Crawlwall.com started off his presentation by getting the audience's attention.

"Most bots steal, steal, steal from me," said Atchison. "I've had bad bots hit my site so hard they took it down. Before I took action, my website was under constant attack."

Atchison said that before he took steps to alleviate the problem, 10% of his site's page views came from "bad bots," or bots that ignored all instructions from robots.txt files or meta information. Atchison claimed that copyrighted material was stolen from his web site and that his servers were overloaded and brought down.

"What was happening [with the bad bots] was unacceptable and had to be stopped," said Atchison.

Atchison said the first step to bot obedience is to differentiate good bots from bad bots. Below is a list of characteristics Atchison said he uses to identify whether a bot is good or bad.

Good Bot Characteristics:

  • Obey Internet standards like robots.txt

  • Don't crawl your server abusively fast

  • Return to get fresh content in a reasonable timeframe

  • Provide traffic in return for crawling your site

Bad Bot Characteristics:

  • Will go to any length to get your content

  • Ignore Internet standards like robots.txt

  • Spoof bot names used by major search engines

  • Change the user agent randomly to avoid filters

  • Masquerade as humans (stealth) to completely bypass filters

  • Crawl as fast as possible to avoid being stopped

  • Crawl as slowly as possible to slide under the radar

  • Crawl from as many IPs as possible to avoid detection

  • Return often to get your new content and get indexed first

  • Violate your copyrights and repackage your site

  • Hijack your search engine positions

  • Provide no value in return for crawling

Atchison outlined several failed approaches that are commonly used to control rogue bots on web sites, including the "opt-out" approach. The opt-out approach involves using robots.txt files, meta information and user-agent blacklists to tell bots where they can and cannot go and to keep blacklisted bots out. Atchison said this approach fails most of the time because rogue bots ignore this type of data and mask their true identity by falsifying their user-agent data. Rogue bots can also turn the tables on site owners by using the prohibited files information in robots.txt files as a road map for the places they want to go.
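To make the opt-out mechanism concrete, here is a minimal sketch of how an obedient bot might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the bot name "ExampleBot" and the example.com URLs are invented for illustration. `Crawl-delay` is a non-standard extension that only some crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set an obedient bot can consult."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is a made-up crawler name.
print(rules.can_fetch("ExampleBot", "http://example.com/index.html"))
print(rules.can_fetch("ExampleBot", "http://example.com/private/report.html"))
print(rules.crawl_delay("ExampleBot"))
```

Nothing in this exchange is enforced by the server. The bot performs the check on itself, which is why the approach only works against bots that choose to obey.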

Atchison said he employs an "opt-in" approach. In this approach, only bots known to be good are allowed into the site. All other bots are blocked. This approach is riskier, as some bots that do provide quality traffic may be blocked from accessing the site, but Atchison said the benefits outweigh the risk.
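The difference between the two policies can be sketched in a few lines. This is not Atchison's implementation, which the article does not describe. The list contents and function names are assumptions, and the sketch covers only requests that identify themselves as crawlers.

```python
# Hypothetical lists; the article does not say which bots Atchison admits.
BLOCKED_TOKENS = ("badbot",)            # opt-out: names known to be bad
ALLOWED_TOKENS = ("googlebot", "slurp")  # opt-in: names known to be good


def opt_out_allows(user_agent: str) -> bool:
    """Opt-out policy: admit every crawler except those on the blacklist."""
    ua = user_agent.lower()
    return not any(token in ua for token in BLOCKED_TOKENS)


def opt_in_allows(user_agent: str) -> bool:
    """Opt-in policy: admit only crawlers on the list of known good bots."""
    ua = user_agent.lower()
    return any(token in ua for token in ALLOWED_TOKENS)


# A scraper that simply picks a new name slips past the blacklist,
# but it is still refused under the opt-in policy.
for ua in ("BadBot/2.0", "NewScraper/1.0", "Mozilla/5.0 (compatible; Googlebot/2.1)"):
    print(ua, opt_out_allows(ua), opt_in_allows(ua))
```

Matching on the user-agent string alone does not stop a bot that spoofs a trusted name, so a real opt-in filter would need some further check on who is actually making the request.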

Both Rajat Mukherjee from Yahoo and Vanessa Fox from Google stated that the best way to control their respective bots is to use the robots.txt protocol.

"Slurp (the name for Yahoo's bot) is an obedient bot," said Mukherjee. "I always recommend reading the information at robotstxt.org in order to control Slurp."

Both Fox and Mukherjee said Yahoo and Google want to be contacted if their bots are found to be misbehaving.

Mukherjee also took some time to introduce the revamped Yahoo! Site Explorer program. Site Explorer allows site owners to authenticate their sites and get more detailed information about how their sites are being viewed by Yahoo. The program also allows users to manage site feeds (other than paid inclusion feeds) and export information in .CSV format.

During the Q&A section of the presentation, an audience member suggested that all responsible bots implement an identity program to certify the bot is actually what its user-agent says it is, a suggestion that was met with enthusiasm from the panelists as well as the audience.
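The panel did not describe how such an identity check would work. One widely used technique that approximates it is forward-confirmed reverse DNS: look up the hostname for the visiting IP address, confirm that it belongs to a domain the crawler's operator controls, then resolve that hostname and confirm it maps back to the same IP. The sketch below assumes that technique. The domain suffixes are illustrative only and should be taken from each search engine's own documentation.

```python
import socket
from typing import Callable, List, Sequence

# Illustrative suffixes only; consult each operator's documentation.
TRUSTED_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net")


def reverse_lookup(ip: str) -> str:
    """Return the primary hostname registered for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    suffixes: Sequence[str] = TRUSTED_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a visiting IP address."""
    try:
        host = reverse(ip).lower().rstrip(".")
    except OSError:
        return False  # no reverse record: cannot be confirmed
    if not host.endswith(tuple(suffixes)):
        return False  # hostname is outside the trusted domains
    try:
        return ip in forward(host)  # hostname must map back to the same IP
    except OSError:
        return False
```

The lookups are passed in as parameters so the logic can be exercised without network access. In production the defaults perform real DNS queries, so results would normally be cached.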

Tony Wright is vice president of client services for Kinetic Results, LLC.

About the author

Tony Wright is founder and CEO of WrightIMC, a Dallas-based digital consultancy dedicated to helping advertising agencies, PR firms, interactive shops and corporate in-house teams build successful and profitable search engine marketing teams, while also assisting select clientele with their interactive and search marketing needs.

With almost a decade of hands-on and strategic experience in interactive marketing and a background in traditional and interactive public relations and journalism, Wright has helped hundreds of Fortune 500 and mid-level clients successfully create or enhance their interactive advertising and search engine marketing teams.