The Taming of the Bots
A special report from the Search Engine Strategies conference, August 7-10, 2006, San Jose, CA.
Many site owners tell horror stories of lost uptime, wasted bandwidth and stolen content that have resulted from rogue spiders invading their virtual domains. Several experts, including representatives from search engines Google, Yahoo and Become.com, sat on a panel to discuss the issues of "bot obedience."
Jon Glick of search engine Become.com started the session with a brief overview of the topic.
"Robots are great at finding links and content," said Glick. "But they are dumb. They can't process cookies or JavaScript."
Glick went on to highlight how to create bot-friendly sites using tactics such as hyperlink navigation, meaning navigation that uses "pure" links with no JavaScript or Flash elements. Glick also suggested that web authors avoid excessive parameters in dynamic URLs when building bot-friendly sites.
Dan Thies of SEO Research Labs talked about one of the darker consequences of untamed bots: duplicate content. Thies said that unscrupulous bots regularly target high-performing web pages, scrape their content and place it on thousands of other pages throughout the web. This can confuse consumers, and the duplicate content is eventually filtered out of the search engine's main results and relegated to supplemental results at best. At worst, the content is booted out of the search engine index.
Thies said he has seen clients' sites lose "considerable revenue" after they were scraped and unauthorized duplicate content was posted throughout the web.
"My client saw a 50% drop in SEO referrals," said Thies. "This would be disastrous for most sites; however, since [the client was] good at PPC and other online advertising, the effect was noticeable but not devastating."
Bill Atchison of Crawlwall.com started off his presentation by getting the audience's attention.
"Most bots steal, steal, steal from me," said Atchison. "I've had bad bots hit my site so hard they took it down. Before I took action, my website was under constant attack."
Atchison said that before he took steps to alleviate the problem, 10% of his site's page views were from "bad bots," or bots that ignored all instructions from robots.txt files or meta information. Atchison claimed that copyrighted material had been stolen from his web site and that his servers had been overloaded and brought down.
"What was happening [with the bad bots] was unacceptable and had to be stopped," said Atchison.
Atchison said the first step to bot obedience is to differentiate good bots from bad bots. Below is a list of characteristics Atchison said he uses to identify whether a bot is good or bad.
Good Bot Characteristics:
- Obey Internet standards like robots.txt
- Don't crawl your server abusively fast
- Return to get fresh content in a reasonable timeframe
- Provide traffic in return for crawling your site
Bad Bot Characteristics:
- Will go to any length to get your content
- Ignore Internet standards like robots.txt
- Spoof bot names used by major search engines
- Change the User Agent randomly to avoid filters
- Masquerade as humans (stealth) to completely bypass filters
- Crawl as fast as possible to avoid being stopped
- Crawl as slow as possible to slide under the radar
- Crawl from as many IPs as possible to avoid detection
- Return often to get your new content and get indexed first
- Violate your copyrights and repackage your site
- Hijack your search engine positions
- Provide no value in return for crawling
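Several of the traits above concern crawl speed. As a rough sketch of how a site might flag a client that crawls "abusively fast," the following counts requests per IP address in a sliding time window. The class name, the thresholds and the example address are assumptions made for illustration; this is not Atchison's implementation, and a check like this would miss bots that crawl slowly or spread requests across many IPs.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags clients that make more than max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if this client is over the limit."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Discard timestamps that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Hypothetical limit: at most 3 requests in any 10-second window.
monitor = RateMonitor(max_requests=3, window_seconds=10.0)
burst = [monitor.record("203.0.113.5", t) for t in (0.0, 1.0, 2.0, 3.0)]
later = monitor.record("203.0.113.5", 20.0)
```

In this run the fourth request of the burst is the first to be flagged, and the request at 20 seconds is not, because the earlier timestamps have aged out of the window.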
Atchison said he employs an "opt-in" approach, in which only bots known to be good are allowed into the site and all other bots are blocked. This approach is riskier, as some bots that do provide quality traffic may be blocked from accessing the site, but Atchison said the benefits outweigh the risk.
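To make the opt-in idea concrete, here is a minimal sketch that sorts requests by their User-Agent header. The allowlist entries, the marker substrings and the function name are all assumptions for illustration, not details from Atchison's presentation. User-Agent strings can be spoofed, so a check like this alone would not stop the masquerading bots described in the list above.

```python
# Hypothetical allowlist of crawler name tokens the site owner trusts.
ALLOWED_BOT_TOKENS = ("googlebot", "slurp")

# Substrings assumed (for this sketch) to mark a self-identified crawler.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")


def classify_user_agent(user_agent: str) -> str:
    """Return 'allow' for allowlisted bots, 'block' for other bots, else 'human'."""
    ua = user_agent.lower()
    if any(token in ua for token in ALLOWED_BOT_TOKENS):
        return "allow"
    if any(marker in ua for marker in BOT_MARKERS):
        return "block"
    return "human"


known = classify_user_agent(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
unknown = classify_user_agent("ExampleScraperBot/1.0")
browser = classify_user_agent("Mozilla/5.0 (Windows NT 10.0) Firefox/120.0")
```

The trade-off Atchison describes shows up directly: any crawler not on the allowlist is blocked, including ones that might have sent useful traffic.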
Both Rajat Mukherjee from Yahoo and Vanessa Fox from Google stated that the best way to control their respective bots is to use the robots.txt protocol.
"Slurp (the name for Yahoo's bot) is an obedient bot," said Mukherjee. "I always recommend reading the information at robotstxt.org in order to control Slurp."
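As an illustration of the robots.txt mechanism the panelists recommend, the sketch below uses Python's standard urllib.robotparser module to evaluate a small rule set the way an obedient crawler would. The rules, the paths and the "ExampleBot" name are made up for this example; only the "Slurp" user-agent name comes from the session.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Slurp may crawl everything except /private/.
slurp_public = parser.can_fetch("Slurp", "/articles/bots.html")
slurp_private = parser.can_fetch("Slurp", "/private/report.html")

# Any other compliant bot falls under the catch-all rule and is disallowed.
other_public = parser.can_fetch("ExampleBot", "/articles/bots.html")
```

These rules only bind bots that choose to honor them, which is why the panel treated robots.txt as a tool for controlling good bots rather than stopping bad ones.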
Both Fox and Mukherjee said Yahoo and Google want to be contacted if their bots are found to be misbehaving.
Mukherjee also took some time to introduce the revamped Yahoo! Site Explorer program, which allows site owners to authenticate their sites and get more detailed information about how their sites are being viewed by Yahoo. The program also allows users to manage site feeds (other than paid inclusion feeds) and export information in CSV format.
During the Q&A section of the presentation, an audience member suggested that all responsible bots implement an identity program to certify the bot is actually what its user-agent says it is, a suggestion that was met with enthusiasm from the panelists as well as the audience.
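No such identity program is described in the session, but one existing way to test a bot's claimed identity is a forward-confirmed reverse DNS check: look up the hostname for the requesting IP, confirm it belongs to the search engine's domain, then resolve that hostname and confirm it maps back to the same IP. The sketch below assumes this technique; the function names are hypothetical, and the domain suffixes are assumptions that should be checked against each search engine's own documentation. The demonstration uses stubbed lookups with made-up records so it runs without network access.

```python
import socket
from typing import Callable, List, Sequence


def _reverse_dns(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(host: str) -> List[str]:
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Sequence[str],
    reverse_lookup: Callable[[str], str] = _reverse_dns,
    forward_lookup: Callable[[str], List[str]] = _forward_dns,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        # The hostname must end in one of the trusted domain suffixes.
        if not host.endswith(tuple(allowed_suffixes)):
            return False
        # The hostname must resolve back to the original IP.
        return ip in forward_lookup(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Assumed suffixes; verify against the search engine's documentation.
SUFFIXES = (".googlebot.com", ".google.com")

# Made-up DNS records for an offline demonstration.
FAKE_PTR = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "198.51.100.7": "scraper.example.net",
    "203.0.113.9": "crawl-fake.googlebot.com",  # spoofed reverse record
}
FAKE_A = {
    "crawl-192-0-2-10.googlebot.com": ["192.0.2.10"],
    "crawl-fake.googlebot.com": ["192.0.2.99"],
}


def fake_reverse(ip: str) -> str:
    if ip not in FAKE_PTR:
        raise OSError("no PTR record")
    return FAKE_PTR[ip]


def fake_forward(host: str) -> List[str]:
    if host not in FAKE_A:
        raise OSError("no A record")
    return FAKE_A[host]


results = {
    ip: is_verified_crawler(ip, SUFFIXES, fake_reverse, fake_forward)
    for ip in ("192.0.2.10", "198.51.100.7", "203.0.113.9", "192.0.2.200")
}
```

Only the first address passes. The second fails the suffix test, the third fails because its hostname does not resolve back to the same IP, and the fourth has no reverse record at all.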
Tony Wright is vice president of client services for Kinetic Results, LLC.
Biography
Tony Wright
Tony Wright is founder and CEO of WrightIMC, a Dallas-based digital consultancy dedicated to helping advertising agencies, PR firms, interactive shops and corporate in-house teams build successful and profitable search engine marketing teams, while also assisting select clientele with their interactive and search marketing needs.
With almost a decade of hands-on and strategic experience in interactive marketing and a background in traditional and interactive public relations and journalism, Wright has helped hundreds of Fortune 500 and mid-level clients successfully create or enhance their interactive advertising and search engine marketing teams.