WebmasterWorld’s Brett Tabke Speaks On Rogue Spidering Woes, Plus The Need For Expanded Feeds

Brett Tabke from WebmasterWorld dropped me a note about a new thread where
he’s answering many questions about WebmasterWorld banning all spiders, while
Barry over at Search Engine Roundtable also has an interview with him. In both
places, you’ll learn of spiders being an increasing burden to the site, though I
still am very, very wary that others should follow the route that Brett’s taken.

Attack of the Robots, Spiders, Crawlers.etc at WebmasterWorld picks up from the
Lets try this for a month or three thread where Brett announced last week that
WebmasterWorld was banning all spiders by excluding them via robots.txt and
through other measures such as required logins.
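
For reference, a blanket robots.txt exclusion is only two lines. The sketch
below uses Python's standard urllib.robotparser to show how a polite crawler
would read such a file. The file contents and the example.com URL are
assumptions for illustration, not a copy of WebmasterWorld's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents for a site that excludes every crawler.
# This is illustrative, not WebmasterWorld's actual file.
BAN_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BAN_ALL.splitlines())


def allowed(user_agent: str, url: str) -> bool:
    """Return True if a polite crawler with this user agent may fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL; every well-behaved crawler gets the same answer.
    for agent in ("Googlebot", "Slurp", "msnbot"):
        print(agent, allowed(agent, "https://example.com/forum/thread.htm"))
```

Note that this only describes what a crawler that chooses to honor the file
will do. Nothing in it stops a crawler that never reads robots.txt.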

WebmasterWorld Bans Spiders From Crawling and WebmasterWorld Out Of Google & MSN
from the SEW Blog cover more about the fallout from the move, with
WebmasterWorld no longer being visible in two major search engines.

In his latest posts, Brett explains:

  • The flat file nature of WebmasterWorld makes it apparently more vulnerable
    to spiders.
     
  • Spider fighting has been taking a considerable and increasing amount of
    time.
     
  • Many efforts have been made to stop spiders, but cookie-based login is
    still seen as necessary.
     
  • Major search engines other than Google (Ask Jeeves, MSN and Yahoo) were
    all banned for more than 60 days before this latest move.

Brett Tabke Interviewed on Bot Banning from Search Engine Roundtable takes the
interview approach, where it is much easier to see what Brett’s thinking and
reacting to
than wandering through the forum posts. Beyond the points above, he addresses
not wanting to make use of non-standard extensions to robots.txt that Google,
MSN and some other search engines have added precisely because they aren’t
standard.

Overall, I can appreciate much of what Brett’s going through, but there still
have to be better ways for this to be addressed. His solution is simply not one
that the vast majority of sites will want to try, because it will simply wipe
out the valuable search traffic they gain.

To be clear, I’m NOT saying that any site should be entirely dependent on
search traffic. But neither should you cut yourself off from it. It’s a matter
of balance and moderation. To quote from what I posted in our forum thread on
the WebmasterWorld situation:

People would often ask how much of their traffic they get from search
engines. There is no right answer, but I’d often said that if you were looking
at 60, 70, 80 percent or higher, you might have a search engine dependency
problem. You want to have a variety of sources sending you traffic, so no one
single thing wipes you out.

But to suggest that a site is so successful that it doesn’t need search
traffic at all? That’s foolishness. I have absolutely no doubt that WMW will
survive. It’s a healthy community with plenty of alternative traffic. But
people seeking answers to things it has answers to give are no longer going to
be finding it.

Hmm, well, maybe those people aren’t good members, just generate noise
and so on. Yeah, maybe. But that also assumes that every single quality person
must be there already. That’s just not so. You always have good new people
coming onto the web.

Search engines are a way you build up loyal users. People often discover
you for the first time through search, then they keep coming back. It’s not a
dependency to have a small amount of your traffic bringing in new people this
way. But it is, in my view, a marketing screw-up to cut yourself off from that
potential audience.

Geez, it’s like the basic rule of SEO/SEM. Ensure your site is accessible
to search engines. If they can’t get in, you stand no chance of getting
traffic at all from them. And when people are paying by the click for search
traffic, why don’t you want that free publicity? Why wouldn’t you seek other
ways of retaining it but also restricting the bad bandwidth you don’t want?

Overall, WMW obviously can and will do what it wants, and perhaps there’s
some magical master plan that down the line will make us all say "Genius!"
Maybe. But this is a very, very bad model for any site to be considering, if
they’re having the same spidering problems that are the stated reason for why
WMW is doing this. It’s like saying you’re getting too many phone calls to
your business, so you’re going to pull out the phone entirely!

So what is a site owner to do, if they are suffering from rogue spiders? I’ll
share a bit from our own experience, plus point at what maybe the search engines
should be doing.

We’ve encountered rogue spiders. It was one reason why our own Search Engine
Watch Forums were down briefly last month, coincidentally the same time
WebmasterWorld and Threadwatch went
offline for different reasons. Rogue spiders aren’t just something unique to
Brett’s set-up. They can and do indeed cause problems even for less "flat file"
sites and URL structures.

In fact, want to have some fun? Check this out. That shows you all the people
on our SEW Forums at the moment you
click on the link, up to 200 visitors. Scan the IP Address column, and you’ll
see how Yahoo’s Slurp spider is in many, many different threads all at once.
That’s a burden on our server, though since we’re getting well indexed as a
result, it’s a burden we live with.

Our own solution has been for our developers to throttle or ban spiders at
the IP level that seem to be hitting us hard, in particular spiders that aren’t
identifying themselves as to their purpose. Good spiders often leave behind a
URL string in your logs so you know they are from Google, Yahoo or whatever. For
example, Yahoo points you here. Google points you here. No good identification?
Then we don’t worry that banning you is going to harm us seriously in some way.
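
As a rough illustration of that approach, here is a minimal Python sketch of
per-IP throttling that is stricter with spiders whose user-agent string carries
no URL. The class name, limits and window size are hypothetical choices for the
example, and this is not the code our developers actually run.

```python
import time
from collections import defaultdict, deque


class SpiderThrottle:
    """Hypothetical per-IP throttle: a sliding window of recent hits.

    Visitors whose user-agent string contains a URL (the way good spiders
    identify themselves) are only throttled. Unidentified visitors get a
    lower limit and are banned outright once they exceed it.
    """

    def __init__(self, max_hits=30, window_seconds=60.0, unidentified_max_hits=5):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.unidentified_max_hits = unidentified_max_hits
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self._banned = set()

    @staticmethod
    def identifies_itself(user_agent: str) -> bool:
        # Assumption: a URL in the user agent counts as identification.
        ua = user_agent.lower()
        return "http://" in ua or "https://" in ua

    def allow(self, ip: str, user_agent: str, now=None) -> bool:
        """Return True if this request should be served."""
        if ip in self._banned:
            return False
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        identified = self.identifies_itself(user_agent)
        limit = self.max_hits if identified else self.unidentified_max_hits
        if len(hits) >= limit:
            if not identified:
                self._banned.add(ip)  # no identification: ban, don't just slow
            return False
        hits.append(now)
        return True
```

A web server or application hook would call `allow(ip, user_agent)` on each
request and answer with an error status when it returns False. A user-agent
string is trivial to fake, so a check like this is a first filter only.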

What about improving the robots.txt system? Unfortunately, that’s not a
solution for rogue spiders. Brett’s right when he points out the real story is
moving to required logins. Rogue spiders aren’t paying attention to robots.txt.
Put in a ban against them, and they’ll ignore it. Robots.txt only works with
"polite" spiders.

Because robots.txt isn’t a solution, it also means that wishing that the
major search engines would come together to endorse new improved "standards" for
the protocol also isn’t a solution. Since rogue spiders are ignoring robots.txt,
it doesn’t then matter for there to be some type of universal agreement to have
a "crawl delay" feature or more wildcard support, for example.

Still, while improving robots.txt isn’t a solution to rogue spiders, there
are things it could do if improved, and I’m right with Brett in wishing that the
major search engines wouldn’t unilaterally make their own improvements, as I’ve
written before (and here).
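
To make the "crawl delay" idea concrete, here is a small sketch using Python's
standard urllib.robotparser, which can read a Crawl-delay line. The robots.txt
contents and the ExampleBot name are made up for illustration. The directive is
an extension rather than part of the original protocol, so whether any given
spider honors it is up to that spider.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is asked to wait 10 seconds
# between requests; everyone else is only kept out of /private/.
RULES = """\
User-agent: ExampleBot
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# What a polite crawler would learn from the file.
example_delay = parser.crawl_delay("ExampleBot")  # delay requested of ExampleBot
other_delay = parser.crawl_delay("OtherBot")      # None: no delay requested

if __name__ == "__main__":
    print("ExampleBot delay:", example_delay)
    print("OtherBot delay:", other_delay)
```

The parser only reports the request. Enforcing the wait is left entirely to
the crawler, which is why the directive does nothing against a rogue spider.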

So if we can’t depend on robots.txt, what is the solution? If more and more
sites face heavy spidering, we’ll likely have to see a shift toward feeding
content to search engines.

Feeding content isn’t a new idea. Yahoo’s paid inclusion program is pretty
well known as a way for site owners to feed not URLs into the search engine but
actual page content. Yahoo also has partnerships with some sites to take in
content on a non-paid basis. Google also takes in feeds of content through
things like Google Scholar or Google’s Froogle shopping feeds program.

To be absolutely clear, these types of program aren’t situations where you
feed URLs, as with Google Sitemaps or Yahoo’s bulk submit. These are programs
where you feed actual page content. The spider
doesn’t come to you and hunt and guess at what you’ve got. You tell the spider
what you’ve got.
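
The difference between the two kinds of feed can be sketched in a few lines of
Python. Everything here is hypothetical: the page data, the function names and
the XML element names are invented for the example and do not follow the format
of any actual search engine feed program.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages a site might want indexed.
PAGES = [
    {
        "url": "https://example.com/forum/1.htm",
        "title": "Spider trouble",
        "body": "How do you deal with rogue spiders?",
    },
    {
        "url": "https://example.com/forum/2.htm",
        "title": "Feeds",
        "body": "Feeding content instead of being crawled.",
    },
]


def url_feed(pages):
    """URL-style feed: only says where pages live; a spider must still fetch them."""
    return "\n".join(page["url"] for page in pages)


def content_feed(pages):
    """Content-style feed: carries the page text itself (made-up XML layout)."""
    root = ET.Element("feed")
    for page in pages:
        item = ET.SubElement(root, "page", url=page["url"])
        ET.SubElement(item, "title").text = page["title"]
        ET.SubElement(item, "body").text = page["body"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(url_feed(PAGES))
    print(content_feed(PAGES))
```

With the first form, the search engine still has to crawl every listed page.
With the second, the site hands over the text and no crawl is needed.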

Expanding feed programs to everyone would be a much more efficient way of
gathering content, with one exception. You can expect that some sites will abuse
feeds to send misleading content. Heck, it’s bad enough how ping servers are
already abused being wide open this way, as I wrote about on Matt Mullenweg’s
blog last month, when the future of ping servers was raised:

Whether we have an "independent" ping service almost seems beyond the point
when both Dave and Matt are talking about the ping spam problem they have
experienced. I’m actually surprised any of the open ping servers are surviving.
If they are open to anyone to ping, a small number of people will abusively
ping for marketing gains.

We’ve had 10 years of history knowing this with web search. Web search
engines could long ago have had instant add facilities. Indeed, Infoseek and
AltaVista even did for a short period of time. They found that without
barriers, a small number of people would flood them with garbage. That’s why
they don’t take content in rapidly. It’s not that they aren’t smart enough to
take pings or let website owners flow content in. Instead, it is that they’ve
learned you can’t leave a wide door open like that without being abused.

There’s absolutely no reason for anyone to have assumed that RSS/blog/feed
search services were going to be immune to the same problem. If the ping
outlook is bleak, it’s not because Verisign or Yahoo has purchased some
service. It’s because you simply can’t leave doors open on the web like this
for search, not for any search that’s going to attract significant traffic.
Blog search is gaining that traffic, and you can expect the spam problem will
simply get worse and worse until some barriers are put into place. You also
cannot expect that you’ll simply come up with some algorithmic way to stop
ping spam. Again, 10 years of web search engines diligently trying to stop
spam has simply found it’s a never ending arms race.

I don’t know what the solution is. I suspect that for the major search
players, the Googles & Yahoos, they’ll eventually move to a combination of
rapid crawling, trusted pings and open pings as a backup. Remember, they get
news content very fast. If they have a set of trusted sites, they can spider
and hammer those hard. They’ll know to keep checking Boing Boing, Scripting
and maybe 1,000 other major blogs that really, really matter, and that when
you check them, you quickly discover other links from blogs you may want to
fetch quickly.

So throwing feeds wide open to everyone without vetting isn’t the solution.
But certainly we’re overdue for feeds to be available to more people without
requiring payment, through some type of trusted mechanism.

WebmasterWorld is a perfect poster child for this. People want the content
there, and the search engines should want the content to be found via their
sites as well. Allowing the site to feed its content gets around the barriers
erected to stop rogue spiders very nicely.

But WebmasterWorld isn’t the only candidate in this class. Many others,
including myself, want the ability to feed actual content to the search engines.
Let’s see them move ahead with a way to make this more a reality, to establish
real "trusted feeds" that aren’t based on payment or whether your site falls
within an area that the business development teams think need more support.
Google Base may become Google’s means of doing this, but at the moment, that’s
not feeding into web search.

Want to comment or discuss? Visit our Search Engine Watch Forum thread,
WebmasterWorld Off Of Google & Others Due To Banning Spiders.
