When the AOL privacy case broke earlier this month, I wrote that the intention behind releasing the data was honorable, despite the inept way it was done. Those trying to research search behavior have been starved for decent data. Researchers Yearn to Use AOL Logs, But They Hesitate from the New York Times covers this in more detail, including how the existing data sets out there are nearly 10 years old.
Along the way, we discover researchers are debating whether they should use the data. I'd say you might as well. It's not like you'll be getting more any time soon. As long as the researchers aren't themselves republishing it in a way that violates someone's privacy, it's hard to see the harm. At this point, the data has spread so far and wide, and is accessible in so many ways, that it's difficult to see what the researchers think they'd be protecting by not studying it.
The story also touches on data releases from other search engines (Yahoo and Microsoft say they've done some controlled, limited releases; Google says they hand nothing out). It also highlights how the researcher who put the data out -- again with the best of intentions -- simply didn't realize that people could be tracked down through their search profiles.
Most interesting is the end of the story, which looks at whether there's a way to scrub the search stream so that data could be released and be untraceable. I've said I'd love to see that type of solution happen. But it would have to be foolproof, and I'm not sure how that can happen unless you have human review of profiles that might go out.
Meanwhile, the San Jose Mercury News in What do Google, Yahoo, AOL and Microsoft's MSN know about you? effectively redoes the same survey of how long data is kept that News.com did last February, in the wake of the US Department of Justice search privacy debate. I mentioned the story before, but let me highlight a key part of it:
While AOL is unique among the Big Four in that its users are easily identified by an AOL user name after they have logged in, people who frequent Google, Yahoo and MSN are also monitored by a combination of digital tracking systems.
Nope, AOL is not that unique. If you've logged into Google, Yahoo or MSN to use any of their services, chances are when you search, they'll also have you keyed to a particular profile that's more unique than just your IP address or a cookie. The story does explain this more, and my previous post Which Search Engines Log IP Addresses & Cookies -- And Why Care? goes into the explanation in more depth. In looking at that previous post, I also saw this:
[News.com]: Given a list of search terms, can you produce a list of people who searched for that term, identified by IP address and/or cookie value?
[AOL]: No. Our systems are not configured to track individuals or groups of users who may have searched for a specific term or terms, and we would not comply with such a request.
Despite the response, I'm 99 percent certain AOL does indeed log IP addresses and cookies along with search data. Searching on AOL creates a page request with the search terms embedded in the page's URL. That request will be logged. If it's logged, it can be analyzed. In fact, AOL later says they can give you a list of searches that were done by a particular IP address or cookied browser. If you can produce that list, you can also produce the reverse: a list of everyone who searched for a particular term.
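To make that last point concrete, here's a rough sketch in Python of how a log that can answer "what did this user search for?" can just as easily answer "who searched for this term?" The log records, the user keys and both function names are made up for illustration. This is not AOL's actual log format or system, just an assumption about what any log pairing an identifier (an IP address, cookie or anonymous number) with a query would look like.

```python
from collections import defaultdict

# Hypothetical log records: (user_key, query). The user_key stands in for
# an IP address, a cookie value or an anonymous user number.
LOG = [
    ("user-1", "cheap flights"),
    ("user-2", "knee pain"),
    ("user-3", "flights to denver"),
    ("user-1", "dog training"),
]


def searches_by_user(records):
    """Group queries by user key: the list AOL says it can produce."""
    by_user = defaultdict(list)
    for user_key, query in records:
        by_user[user_key].append(query)
    return dict(by_user)


def users_by_term(by_user, term):
    """Invert the grouping: every user key with a query containing the term."""
    term = term.lower()
    return sorted(
        user_key
        for user_key, queries in by_user.items()
        if any(term in query.lower().split() for query in queries)
    )


profiles = searches_by_user(LOG)
print(profiles["user-1"])
print(users_by_term(profiles, "flights"))
```

The second function needs nothing beyond what the first one produces, which is the point: once searches are stored against any kind of identifier, going from term to searcher is a matter of walking the same data in the other direction.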
Of course, we now know that it was indeed the case that you could take AOL's data, give it a search term and get a list of individuals who searched for it. Yes, the individuals were given anonymous numbers, so the AOL answer is technically correct. But the overall profile of what someone was searching for in some cases turned out to be personally revealing.
I'm planning a longer recap on some of the latest out of the AOL case, but in the meantime, I still keep coming back to this conclusion from an earlier post:
I think consumers will need more faith and control over how long search data is kept for them, plus the ability to opt-out or delete histories with a push of a button, perhaps the type of privacy/data control panel John Battelle has wished for. And as I've written, that has to include ISPs, many of which merrily sell search data that they monitor to third party companies.
I'm working on a longer look back at the fallout from the AOL release and ways forward. But a quick shout-out to Daniel Brandt of Google Watch is in order. Seth Finkelstein just gave him one, and I'll add to it. I've felt Brandt has often twisted things or focused on stuff that didn't matter much (such as Google's 30 year cookie, which most people won't really have last for more than a year or two, if that). But his long-standing call for regular data destruction -- something other privacy advocates have also pushed for -- seems the most secure solution going forward.