Yesterday I wrote about how several proxy servers used by those wishing to search and surf anonymously had apparently been blocked by Google, including the popular Tor service. Google's since explained why these were blocked and how human users can get around the barrier.
Google told me that someone or something was using the Tor system to hit them with an extremely large number of queries, which triggered the block on the network.
Couldn't Google have done this in a way that filters out the spiders but lets the humans through? Cory Doctorow, who wrote the Boing Boing post on the subject, especially felt Google was being too heavy-handed. In an email exchange we had, he wrote me:
Google has a lot of engineering talent, but it approached this problem with a fireax, not a scalpel.
Actually, Google is using both a fireax and a scalpel. It's just that some Tor users might not see the scalpel, if they have cookies disabled, from what I can tell.
A human user with a browser that accepts cookies would get a slightly different block page. This one would allow them to prove they weren't a spider via a CAPTCHA code.
Notice how the second example has a part that says:
To continue searching, please type the characters you see below
After this is a code, a CAPTCHA, a system to filter out robots that can't read the text in the image.
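To make the difference between the two block pages concrete, here is a minimal sketch of the decision as I understand it from the outside. It is only an illustration: the threshold value, the `Request` type, the `choose_page` function and the page names are all hypothetical, and Google hasn't published how its detection actually works.

```python
from dataclasses import dataclass

# Hypothetical threshold for "an extremely large number of queries".
# The real value, and how Google measures it, is not public.
QUERY_LIMIT = 1000


@dataclass
class Request:
    ip: str                 # address the query arrives from (e.g. a Tor exit node)
    accepts_cookies: bool   # whether the browser is set to accept cookies


def choose_page(request: Request, recent_queries_from_ip: int) -> str:
    """Return which page to serve, following the behavior described in this post."""
    # Normal traffic levels: just show the search results.
    if recent_queries_from_ip <= QUERY_LIMIT:
        return "results"
    # Suspicious traffic, but the browser accepts cookies:
    # show the block page with the CAPTCHA (the "scalpel").
    if request.accepts_cookies:
        return "captcha"
    # Suspicious traffic and no cookies: the plain block page (the "fireax").
    return "block"


if __name__ == "__main__":
    tor_user = Request(ip="192.0.2.10", accepts_cookies=False)
    print(choose_page(tor_user, recent_queries_from_ip=50_000))
```

In this sketch, the only thing separating the dead-end page from the CAPTCHA page is the cookie setting, which is why some Tor users never see the second one.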
Anyone set to accept cookies will see the CAPTCHA challenge, be able to fill it out and continue searching. But isn't accepting cookies defeating the purpose of using a system like Tor designed to keep you anonymous?
Not necessarily. For example, in Firefox, you could choose to have cookies cleared every time you close the browser. That means for your searching session, Google will only know that someone from an anonymous IP (it can't be traced back to you, remember) did a series of searches during a particular session.
Close your browser, come back to Google, and you'd get a new cookie (along with an entirely new IP address). There would be no way to associate your searches over a long period of time, which is the kind of association that led to one person being identified in the recent AOL data release case -- assuming that somehow, some way, someone got hold of all of Google's data over time.
It's unlikely -- though still possible -- that you could do enough searching within one session to give yourself away just based on your queries. For those still concerned about this, I suppose you could do a search, then clear your cookie and search again. Alternatively, don't search for anything that you think could potentially reveal who you are.
For more on protecting your search privacy, see my past posts Which Search Engines Log IP Addresses & Cookies -- And Why Care? and Protecting Your Search Privacy: A Flowchart To Tracks You Leave Behind.
Could Google do things better? Absolutely. Since many people using services like Tor might not be allowing cookies, Google should change the page that comes up for "robots" to say something like "if you're a human, please allow cookies, and then you'll get a code to let you in." Google could even take the further step of detailing how to set up cookies and clear them in popular browsers to better guide those concerned about privacy. And to be fair, all the search engines could do more on that front.
That page can definitely be more helpful in other ways. When I've heard of this happening in the past, it was typically because someone from a particular ISP or shared IP address was doing a lot of rank checking. That might cause the entire IP range to get closed.
Unfortunately, Google's current warning page doesn't give innocent users much guidance that things outside their control might be to blame. Instead, it leaves them thinking that maybe they've got a virus or spyware. I've seen that this has caused at least one person to waste time checking how to "fix" a problem they didn't have.
It would also be nice to see more help pages on Google about this in general. All these things are ideas Google said it will consider.
Postscript: Cory emailed me this:
Danny, I believe that they could solve this problem without requiring cookies -- for example, they could embed a RESTful, expiring GUID in the URL-line on the successful solution of a CAPTCHA: