Bye-bye, Crawler: Blocking the Parasites

Date published: 4 August 2010
Author: Ralph Tegtmeier

Have you ever had a horde of misbehaving, hyper-aggressive spiders hit your servers at request rates to the tune of several thousand per second?

Much as we may wish for all the world’s search engines to take notice of our Web equity, once they’ve actually managed to crash your system a few times you may be pardoned for having second thoughts. Equally, when they’re hitting your servers with such a load, your visitors may easily get the impression that viewing your pages is akin to plodding through treacle. That hurts sales and your company’s reputation, and it’s anything but a great user experience.

So what to do about it? Well, how about simply blocking them? After all, as always in business, it’s a question of what kind of a trade-off you’re actually stuck with here.

Not all spiders are created equal, and only your specific online business model should govern your decision either to bear with them accessing your pages regularly or to tell them to get lost. After all, bandwidth doesn’t come cheap and losing sales due to poor server performance isn’t particularly funny either.

Are you targeting the Russian market at all? If not, all that traffic created by Yandex search engine crawlers is something you may very well do without.

How about China? Japan? Korea? Chinese search engines such as Baidu, SoGou and Youdao will merrily spider your sites to oblivion if you let them. In Japan it’s Goo, and in South Korea it’s Naver that can mutate into performance torpedoes once they’ve started to fancy your website.

Nor is that all, because the search engines aren’t the only culprits in this field.

Are you happy with your competition sussing out your entire linking strategy (both incoming and outgoing)? A number of services out there will help them do exactly that. Fortunately, at least one major contender, namely Majestic-SEO, is perfectly open about things and lets you block their crawlers gracefully. (No such luck with most other setups…)

Beyond the link snoops, you may want to take a long hard look at a service like Copyscape that will blithely crawl your entire site* — to do what? Merely to allow your competition to harass you with copyright infringement suits should they find proof of it on any of your pages. Now don’t get me wrong: I’m positively not advocating violation of intellectual property rights here in any way, quite the contrary. (With a background in offline publishing and book retail and 30+ published books to my credit, I’m an ardent supporter of copyright protection.)

But if copyright violation is exactly what you’re not perpetrating, what’s the point of letting some self-serving commercial watchdog setup eat away at your bandwidth and server resources in the first place? It’s not as if you were getting anything in return from them, right?

At the end of the day, it’s solely your choice to block certain spiders or not. However, here’s how to do it if you do want or need to.

How to Prevent Specific Spiders From Crawling Your Pages

Let’s briefly discuss three different ways of blocking spiders. Before we start, however, you’ll need some fundamental data to work from in order to identify specific spiders reliably. These are mainly the User Agent header field (a.k.a., identifier) and, in the case of Copyscape, the spider’s originating IP address.

Basic Spider Data: User Agents

Yandex (RU)
Russian search engine Yandex features the following User Agents:

Mozilla/5.0 (compatible; YandexBlogs/0.99; robot; B; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexMedia/3.0; +http://yandex.com/bots)
YandexSomething/1.0

Goo (JP)
Japanese search engine Goo features the following User Agents:

DoCoMo/2.0 P900i(c100;TB;W24H11) (compatible; ichiro/mobile goo; +http://help.goo.ne.jp/help/article/1142/)
ichiro/2.0 (http://help.goo.ne.jp/door/crawler.html)
moget/2.0 (moget@goo.ne.jp)

Naver (KR)
Korean search engine Naver features the following User Agents:

Mozilla/4.0 (compatible; NaverBot/1.0; http://help.naver.com/customer_webtxt_02.jsp)

Baidu (CN)
China’s number-one search engine Baidu features the following User Agents:

Baiduspider+(+http://www.baidu.com/search/spider.htm)
Baiduspider+(+http://www.baidu.jp/spider/)

SoGou (CN)
Chinese search engine SoGou features the following User Agents:

Sogou Pic Spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou head spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou Orion spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou-Test-Spider/4.0 (compatible; MSIE 5.5; Windows 98)
sogou spider
Sogou Pic Agent

Youdao (CN)
Chinese search engine Youdao (which also spells itself “Yodao” on occasion) features the following User Agents:

Mozilla/5.0 (compatible; YoudaoBot/1.0; http://www.youdao.com/help/webmaster/spider/; )
Mozilla/5.0 (compatible;YodaoBot-Image/1.0;http://www.youdao.com/help/webmaster/spider/;)

Majestic-SEO
Link analysis service Majestic-SEO (http://www.majesticseo.com/) uses the distributed search engine Majestic-12:

Majestic-12
UA: Mozilla/5.0 (compatible; MJ12bot/v1.3.3; http://www.majestic12.co.uk/bot.php?+)

Copyscape
Copyscape Plagiarism Checker – Duplicate Content Detection Software
Site info: http://www.copyscape.com

Copyscape
User Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
IP: 212.100.254.105
Host: googlealert.com

Copyscape works in an underhanded manner, hiding its spider behind a generic User Agent and a domain name that gives you the entirely false impression of somehow being connected to Google while in reality it belongs to Copyscape itself.

This means that you cannot identify their sneaky spider via the User Agent header field. The only reliable way to block it is via their IP.
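
The identification data above can be folded into a small lookup, for example when sifting through access logs to see which spiders are hitting you. The following is a minimal Python sketch, not a definitive implementation: the function name `identify_spider` and the token table are made up for illustration, the tokens are simply substrings taken from the User Agent strings listed above, and the only Copyscape address it knows is the one given in this article.

```python
# Hypothetical helper: guess which spider sent a request.
# Tokens are lowercase substrings of the User Agent strings listed above.
SPIDER_TOKENS = {
    "yandex": "Yandex",
    "ichiro": "Goo",
    "moget": "Goo",
    "naverbot": "Naver",
    "baiduspider": "Baidu",
    "sogou": "SoGou",
    "youdaobot": "Youdao",
    "yodaobot": "Youdao",
    "mj12bot": "Majestic-12",
}

# Assumption: the single IP address quoted in this article.
COPYSCAPE_IPS = {"212.100.254.105"}


def identify_spider(user_agent, remote_addr):
    """Return the spider's name, or None if the request is not recognised."""
    # Copyscape sends a generic browser User Agent, so check the IP first.
    if remote_addr in COPYSCAPE_IPS:
        return "Copyscape"
    ua = user_agent.lower()
    for token, name in SPIDER_TOKENS.items():
        if token in ua:
            return name
    return None


if __name__ == "__main__":
    print(identify_spider(
        "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)",
        "192.0.2.1",
    ))
```

Note that the same generic browser User Agent is only attributed to Copyscape when it arrives from the listed IP. From any other address it is treated as an ordinary visitor.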

Blocking Spiders via robots.txt

For a general introduction to the robots.txt protocol, please see: http://www.robotstxt.org/

Search engines are expected to disclose which code to deploy in a given robots.txt file to deny their spiders access to a site’s pages. Moreover, the page outlining this process should be easy to find.

Regrettably, most spiders listed above document their robots.txt specs only in Chinese, Japanese, Russian, or Korean, which is not very helpful for your average English-speaking webmaster.

The following list features info links for webmasters and the code you should actually deploy to block specific spiders.

Yandex (RU)
Info: http://yandex.com/bots gives us no information on Yandex-specific robots.txt usage.

Required robots.txt code:

User-agent: Yandex
Disallow: /

Goo (JP)
Info (Japanese): http://help.goo.ne.jp/help/article/704/
Info (English): http://help.goo.ne.jp/help/article/853/

Required robots.txt code:

User-agent: moget
User-agent: ichiro
Disallow: /

Naver (KR)
Info: http://help.naver.com/customer/etc/webDocument02.nhn

Required robots.txt code:

User-agent: NaverBot
User-agent: Yeti
Disallow: /

Baidu (CN)
Info: http://www.baidu.com/search/spider.htm

Required robots.txt code:

User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /

SoGou (CN)
Info: http://www.sogou.com/docs/help/webmasters.htm#07

Required robots.txt code:

User-agent: sogou spider
Disallow: /

Youdao (CN)
Info: http://www.youdao.com/help/webmaster/spider/

Required robots.txt code:

User-agent: YoudaoBot
Disallow: /

Because the robots.txt protocol doesn’t allow for blocking IPs, you’ll have to resort to either of the two following methods to block Copyscape spiders.
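
Before deploying the robots.txt rules listed above, you may want to sanity-check them offline. The sketch below feeds them to Python's standard `urllib.robotparser` module and reports which agent tokens it considers blocked. Two caveats: the helper name `blocked_agents` is made up for illustration, and Python's parser applies its own token matching, so a real spider may interpret the same file differently (or ignore it altogether).

```python
from urllib.robotparser import RobotFileParser

# The rules from this section, combined into one file.
ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /

User-agent: moget
User-agent: ichiro
Disallow: /

User-agent: NaverBot
User-agent: Yeti
Disallow: /

User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /

User-agent: sogou spider
Disallow: /

User-agent: YoudaoBot
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="/"):
    """Return the agents that may NOT fetch url according to robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Pass the short product token, not the full User Agent header.
    print(blocked_agents(ROBOTS_TXT, ["YandexBot", "Baiduspider", "Googlebot"]))
```

Agents without a matching entry, such as Googlebot here, remain allowed because the file contains no catch-all `User-agent: *` group.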

Blocking Spiders via .htaccess and mod_rewrite

Seeing that not all spiders abide by the robots.txt protocol (compliance is entirely voluntary), it’s safer to block them via .htaccess and mod_rewrite on Apache systems.

Like robots.txt, the .htaccess file applies to single domains only. For a solution covering your entire Web server, please see the section on Apache’s httpd.conf below.

Here’s a simple example for blocking Baidu and Sogou spiders:

In your .htaccess file, include the following code:

RewriteEngine on
Options +FollowSymlinks
RewriteBase /

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sogou [NC]
RewriteRule ^.*$ - [F]

Explanation:

The various User Agents to be blocked from access are listed one per line.
The Rewrite conditions are connected via “OR”.
“NC”: “no case” – case-insensitive execution.
The caret “^” character stipulates that the User Agent must start with the listed string (e.g. “Baiduspider”).
“[F]” serves the spider a “Forbidden” instruction.

Thus, if you want to block Yandex spiders, for instance, you can use the following code:

RewriteCond %{HTTP_USER_AGENT} Yandex

In this particular case the block will be effected whenever the string “Yandex” occurs in the User Agent identifier.

As mentioned above, Copyscape can only be blocked via their IP. The specific code is:

RewriteCond %{REMOTE_ADDR} ^212\.100\.254\.105$
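
To get a feel for how anchoring, case sensitivity and dot escaping affect these conditions, here is a rough Python sketch that mimics the pattern test of a RewriteCond. It is an approximation only: Apache uses PCRE rather than Python's `re` module (the two agree on patterns this simple), and the helper name `cond_matches` is made up for illustration.

```python
import re


def cond_matches(pattern, value, nocase=False):
    """Mimic a RewriteCond test: a regex search, optionally case-insensitive."""
    flags = re.IGNORECASE if nocase else 0
    return re.search(pattern, value, flags) is not None


BAIDU_UA = "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
YANDEX_UA = "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

if __name__ == "__main__":
    # Anchored pattern: the User Agent must start with the string.
    print(cond_matches(r"^Baiduspider", BAIDU_UA, nocase=True))      # True
    # Without "NC", the lowercase "sogou spider" variant slips through.
    print(cond_matches(r"^Sogou", "sogou spider"))                   # False
    print(cond_matches(r"^Sogou", "sogou spider", nocase=True))      # True
    # Yandex agents start with "Mozilla/5.0", so the pattern is unanchored.
    print(cond_matches(r"^Yandex", YANDEX_UA))                       # False
    print(cond_matches(r"Yandex", YANDEX_UA))                        # True
    # An unescaped dot matches any character; escaped dots are literal.
    print(cond_matches(r"^212.100.254.105$", "212x100y254z105"))     # True
    print(cond_matches(r"^212\.100\.254\.105$", "212x100y254z105"))  # False
```

The last two lines show why it pays to escape the dots in an IP pattern: the escaped version matches that one address and nothing else.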

Blocking Spiders via the Apache Configuration File httpd.conf

An alternative method of blocking spiders can be executed from the Apache webserver configuration file by listing the pertinent User Agent header fields there. The main advantage of this approach is that it will apply to the entire server (i.e., it’s not limited to single domains). This can save you lots of time and effort, provided you actually wish to apply these spider blocks uniformly across your entire system.

Include your new directives in the following section of Apache’s httpd.conf file:

# This should be changed to whatever you set DocumentRoot to.
#

…
SetEnvIfNoCase User-Agent "^Baiduspider" bad_bots
SetEnvIfNoCase User-Agent "^Sogou" bad_bots
SetEnvIf Remote_Addr "^212\.100\.254\.105$" bad_bots

Order allow,deny
Allow from all

Deny from env=bad_bots
…
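
How these directives interact may be easier to see in code. The Python sketch below models the logic as I understand it: each matching SetEnvIf line sets the `bad_bots` variable, and under “Order allow,deny” a matching Deny wins over a matching Allow. It is an illustration under those assumptions, not Apache's actual implementation. The names `RULES` and `is_denied` are made up, and the IP pattern is written with escaped dots and anchors so that it matches exactly one address.

```python
import re

# (request attribute, pattern, case-insensitive?) triples standing in for
# the SetEnvIf / SetEnvIfNoCase lines; any match sets the bad_bots variable.
RULES = [
    ("User-Agent", r"^Baiduspider", True),
    ("User-Agent", r"^Sogou", True),
    ("Remote_Addr", r"^212\.100\.254\.105$", False),
]


def is_denied(user_agent, remote_addr):
    """Return True if the request would be refused under the rules above."""
    attrs = {"User-Agent": user_agent, "Remote_Addr": remote_addr}
    bad_bots = any(
        re.search(pattern, attrs[attr], re.IGNORECASE if nocase else 0)
        for attr, pattern, nocase in RULES
    )
    allowed = True     # "Allow from all" matches every request
    denied = bad_bots  # "Deny from env=bad_bots"
    # Order allow,deny: a matching Deny overrides a matching Allow.
    return denied or not allowed


if __name__ == "__main__":
    generic_ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    print(is_denied(generic_ua, "212.100.254.105"))
    print(is_denied(generic_ua, "192.0.2.1"))
```

Because every request matches “Allow from all”, only requests flagged as `bad_bots` are turned away. Everyone else gets through as before.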

*Copyscape’s CTO and founder contends that this statement is not correct.
