WebmasterWorld's Brett Tabke Speaks On Rogue Spidering Woes, Plus The Need For Expanded Feeds

Brett Tabke from WebmasterWorld dropped me a note about a new thread where he's answering many questions about WebmasterWorld banning all spiders, while Barry over at Search Engine Roundtable also has an interview with him. In both places, you'll learn of spiders being an increasing burden to the site, though I remain very, very wary of others following the route that Brett's taken.

Attack of the Robots, Spiders, Crawlers, etc. at WebmasterWorld picks up from the Let's try this for a month or three thread where Brett announced last week that WebmasterWorld was banning all spiders by excluding them via robots.txt and through other measures such as required logins.
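For reference, the robots.txt side of such a blanket ban takes just two lines, and only polite spiders will honor them:

```
User-agent: *
Disallow: /
```

Required logins are the enforcement layer for everything that ignores this file.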

WebmasterWorld Bans Spiders From Crawling and WebmasterWorld Out Of Google & MSN from the SEW Blog cover more of the fallout from the move, with WebmasterWorld no longer visible in two major search engines.

In his latest posts, Brett explains:

  • The flat file nature of WebmasterWorld apparently makes it more vulnerable to spiders.
  • Spider fighting has been taking a considerable and increasing amount of time.
  • A ton of effort has gone into stopping spiders, but a cookie-based login is still seen as necessary.
  • Major search engines other than Google (Ask Jeeves, MSN and Yahoo) were all banned for more than 60 days before this latest move.

Brett Tabke Interviewed on Bot Banning from Search Engine Roundtable takes the interview approach, where it is much easier to see what Brett's thinking and reacting to than wandering through the forum posts. Beyond the points above, he addresses not wanting to make use of non-standard extensions to robots.txt that Google, MSN and some other search engines have added precisely because they aren't standard.

Overall, I can appreciate much of what Brett's going through, but there still have to be better ways for this to be addressed. His solution is simply not one that the vast majority of sites will want to try, because it will simply wipe out the valuable search traffic they gain.

To be clear, I'm NOT saying that any site should be entirely dependent on search traffic. But you don't cut yourself off from it entirely, either. It's a matter of balance and moderation. To quote from what I posted in our forum thread on the WebmasterWorld situation:

People would often ask how much of their traffic they should get from search engines. There is no right answer, but I'd often said that if you were looking at 60, 70, 80 percent or higher, you might have a search engine dependency problem. You want to have a variety of sources sending you traffic, so no single thing wipes you out.

But to suggest that a site is so successful that it doesn't need search traffic at all? That's foolishness. I have absolutely no doubt that WMW will survive. It's a healthy community with plenty of alternative traffic. But people searching for the answers it has to give will no longer be finding it.

Hmm, well, maybe those people aren't good members, just generate noise and so on. Yeah, maybe. But that also assumes that every single quality person must be there already. That's just not so. You always have good new people coming onto the web.

Search engines are a way you build up loyal users. People often discover you for the first time through search, then they keep coming back. It's not a dependency to have a small amount of your traffic bringing in new people this way. But it is, in my view, a marketing screw-up to cut yourself off from that potential audience.

Geez, it's like the basic rule of SEO/SEM: ensure your site is accessible to search engines. If they can't get in, you stand no chance of getting traffic from them at all. And when people are paying by the click for search traffic, why wouldn't you want that free publicity? Why not seek other ways of retaining it while restricting the bad bandwidth you don't want?

Overall, WMW obviously can and will do what it wants, and perhaps there's some magical master plan that down the line will make us all say "Genius!" Maybe. But this is a very, very bad model for any site to be considering, if they're having the same spidering problems that are the stated reason for why WMW is doing this. It's like saying you're getting too many phone calls to your business, so you're going to pull out the phone entirely!

So what is a site owner to do, if they are suffering from rogue spiders? I'll share a bit from our own experience, plus point at what the search engines themselves should perhaps be doing.

We've encountered rogue spiders. It was one reason why our own Search Engine Watch Forums were down briefly last month, coincidentally the same time WebmasterWorld and Threadwatch went offline for different reasons. Rogue spiders aren't just something unique to Brett's set-up. They can and do indeed cause problems even for less "flat file" sites and URL structures.

In fact, want to have some fun? Check this out. That shows you all the people on our SEW Forums at the moment you click the link, up to 200 visitors. Scan the IP Address column, and you'll see how Yahoo's Slurp spider is in many, many different threads all at once. That's a burden on our server, though since we're getting well indexed as a result, it's a burden we live with.
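You can spot that kind of load in your own logs by counting requests per client IP. A minimal sketch in Python, assuming Apache-style log lines where the IP is the first field (the threshold here is an arbitrary illustration, not a recommendation):

```python
from collections import Counter

def heavy_hitters(log_lines, threshold=100):
    """Return {ip: request_count} for IPs at or above the threshold.

    Assumes each line begins with the client IP, as in Apache's
    common/combined log formats.
    """
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip: n for ip, n in counts.items() if n >= threshold}

# Example with fake log lines from two IPs
sample = [
    '66.196.65.10 - - [21/Nov/2005:10:00:01] "GET /forums/t1 HTTP/1.1" 200 5000',
    '66.196.65.10 - - [21/Nov/2005:10:00:02] "GET /forums/t2 HTTP/1.1" 200 4800',
    '10.0.0.1 - - [21/Nov/2005:10:00:03] "GET / HTTP/1.1" 200 1200',
]
print(heavy_hitters(sample, threshold=2))  # {'66.196.65.10': 2}
```

Any IP that dominates such a report, and doesn't identify itself, is a candidate for throttling.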

Our own solution has been for our developers to throttle or ban spiders at the IP level that seem to be hitting us hard, in particular spiders that don't identify themselves or their purpose. Good spiders typically include a URL in their user agent string, so your logs show they are from Google, Yahoo or whatever. For example, Yahoo points you here. Google points you here. No good identification? Then we don't worry that banning you is going to harm us seriously in some way.
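Since user agent strings are trivially forged, a stronger identity check is a reverse-then-forward DNS lookup: resolve the IP to a hostname, check it falls under the engine's domain, then resolve that hostname back and confirm it maps to the same IP. A sketch in Python; the domain suffixes are assumptions for illustration, so check each engine's own documentation for the ones it actually uses:

```python
import socket

# Assumed crawler domains, for illustration only
TRUSTED_SUFFIXES = (".googlebot.com", ".crawl.yahoo.net")

def verify_crawler(ip, reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex,
                   suffixes=TRUSTED_SUFFIXES):
    """Reverse-resolve the IP, check the hostname suffix, then
    forward-resolve that hostname and confirm it maps back to the same IP."""
    try:
        host = reverse(ip)[0]
    except OSError:
        return False  # no reverse record: unidentified, safe to throttle or ban
    if not host.endswith(suffixes):
        return False
    try:
        return ip in forward(host)[2]  # forward confirmation defeats spoofed PTR records
    except OSError:
        return False
```

The resolver functions are injectable only so the logic can be exercised without network access; in production the `socket` defaults do the real lookups.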

What about improving the robots.txt system? Unfortunately, that's not a solution for rogue spiders. Brett's right when he points out the real story is moving to required logins. Rogue spiders aren't paying attention to robots.txt. Put in a ban against them, and they'll ignore it. Robots.txt only works with "polite" spiders.

Because robots.txt isn't a solution, wishing that the major search engines would come together to endorse new, improved "standards" for the protocol isn't a solution either. Since rogue spiders ignore robots.txt, it doesn't matter whether there's universal agreement on a "crawl delay" feature or more wildcard support, for example.
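For context, the non-standard extensions in question look like this; each is honored only by the engines that adopted it, and a rogue spider ignores the whole file anyway:

```
# Crawl-delay: honored by Yahoo's Slurp and MSNbot, not part of the original standard
User-agent: Slurp
Crawl-delay: 10

# Wildcard patterns: a Google extension, also non-standard
User-agent: Googlebot
Disallow: /*?sort=
```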

Still, while improving robots.txt isn't a solution to rogue spiders, there are things it could do if improved, and I'm right with Brett in wishing that the major search engines wouldn't unilaterally make their own improvements, as I've written before (and here).

So if we can't depend on robots.txt, what is the solution? If more and more sites face heavy spidering, we'll likely have to see a shift toward feeding content to search engines.

Feeding content isn't a new idea. Yahoo's paid inclusion program is pretty well known as a way for site owners to feed not URLs into the search engine but actual page content. Yahoo also has partnerships with some sites to take in content on a non-paid basis. Google also takes in feeds of content through things like Google Scholar or Google's Froogle shopping feeds program.

To be absolutely clear, these types of programs aren't situations where you feed URLs, as with Google Sitemaps or Yahoo's bulk submit. These are programs where you feed actual page content. The spider doesn't come to you and hunt and guess at what you've got. You tell the spider what you've got.
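The difference is easy to see side by side. A Google Sitemaps entry hands the engine nothing but a URL it still has to crawl; a content feed would carry the page itself. The first snippet below uses the real Sitemaps elements; the second is a hypothetical format, not any engine's actual schema:

```
<!-- URL feed (Google Sitemaps): the spider still has to fetch the page -->
<url>
  <loc>http://www.example.com/forums/thread123.htm</loc>
  <lastmod>2005-11-21</lastmod>
</url>

<!-- Hypothetical content feed: the page travels with the entry -->
<page>
  <loc>http://www.example.com/forums/thread123.htm</loc>
  <title>Example thread title</title>
  <body>Full text of the thread goes here...</body>
</page>
```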

Expanding feed programs to everyone would be a much more efficient way of gathering content, with one exception: you can expect that some sites will abuse feeds to send misleading content. Heck, it's bad enough how ping servers, being wide open this way, are already abused, as I wrote on Matt Mullenweg's blog last month, when the future of ping servers was raised:

Whether we have an "independent" ping service almost seems beside the point when both Dave and Matt are talking about the ping spam problem they have experienced. I'm actually surprised any of the open ping servers are surviving. If they are open to anyone to ping, a small number of people will abusively ping for marketing gains.

We've had 10 years of history knowing this with web search. Web search engines could long ago have had instant add facilities. Indeed, Infoseek and AltaVista even did for a short period of time. They found that without barriers, a small number of people would flood them with garbage. That's why they don't take content in rapidly. It's not that they aren't smart enough to take pings or let website owners flow content in. Instead, it is that they've learned you can't leave a wide door open like that without being abused.

There's absolutely no reason for anyone to have assumed that RSS/blog/feed search services were going to be immune to the same problem. If the ping outlook is bleak, it's not because Verisign or Yahoo has purchased some service. It's because you simply can't leave doors open on the web like this for search, not for any search that's going to attract significant traffic. Blog search is gaining that traffic, and you can expect the spam problem will simply get worse and worse until some barriers are put into place. You also cannot expect that you'll simply come up with some algorithmic way to stop ping spam. Again, 10 years of web search engines diligently trying to stop spam has simply found it's a never ending arms race.

I don't know what the solution is. I suspect that for the major search players, the Googles and Yahoos, they'll eventually move to a combination of rapid crawling, trusted pings and open pings as a backup. Remember, they get news content very fast. If they have a set of trusted sites, they can spider and hammer those hard. They'll know to keep checking Boing Boing, Scripting and maybe 1,000 other major blogs that really, really matter, and that when you check them, you quickly discover other links from blogs you may want to fetch quickly.

So throwing feeds wide open to everyone without vetting isn't the solution. But certainly we're overdue for feeds to be available to more people without requiring payment, through some type of trusted mechanism.

WebmasterWorld is a perfect poster child for this. People want the content there, and the search engines should want the content to be found via their sites as well. Allowing the site to feed its content gets around the barriers erected to stop rogue spiders very nicely.

But WebmasterWorld isn't the only candidate in this class. Many others, including myself, want the ability to feed actual content to the search engines. Let's see them move ahead with a way to make this more a reality, to establish real "trusted feeds" that aren't based on payment or whether your site falls within an area that the business development teams think need more support. Google Base may become Google's means of doing this, but at the moment, that's not feeding into web search.

Want to comment or discuss? Visit our Search Engine Watch Forum thread, WebmasterWorld Off Of Google & Others Due To Banning Spiders.