AOL Releases Search Data & Raises Privacy Concerns

Techmeme is reporting a huge amount of concern over AOL releasing, then pulling, logs of searches made by 500,000 users over three months. The release was meant to help search researchers better understand user behavior, and it was timed to coincide with SIGIR, an industry event for search researchers happening in Seattle. The data was posted on the AOL research site but has since been pulled.

Contrary to what TechCrunch suggests, this isn't private data, in the sense that no personally identifiable information has been released directly. Instead, actual usernames have been replaced with anonymous ones. However, every search by the same user still carries the same anonymous name, so it's possible to track the behavior of a particular user and potentially work out who they are if their searches contained personally identifiable information.
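To make that point concrete, here is a minimal sketch in Python of why swapping usernames for anonymous IDs doesn't stop profiling. The log rows, the ID numbers and the `build_profiles` helper are all invented for illustration; nothing here comes from the actual AOL file or reflects its real layout.

```python
from collections import defaultdict

# Hypothetical log rows: (anonymous_id, query). Invented for illustration only,
# not taken from the AOL data.
LOG = [
    (1001, "pizza delivery springfield"),
    (2002, "cheap flights"),
    (1001, "jane q. example 12 oak street"),
    (1001, "dentist near oak street"),
]


def build_profiles(rows):
    """Group queries by anonymous ID, keeping them in log order."""
    profiles = defaultdict(list)
    for anon_id, query in rows:
        profiles[anon_id].append(query)
    return dict(profiles)


profiles = build_profiles(LOG)

# Every search by "user 1001" now sits in one list, so a single query that
# contains a name or address exposes all the others along with it.
for query in profiles[1001]:
    print(query)
```

The ID itself says nothing about the person. The risk comes from the grouping, since one revealing query is enough to attach a real identity to everything else in that user's list.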

To understand this more, this page gives some examples gleaned from the new AOL data. Also see this example of someone who might be planning to murder his wife. Danny's earlier post, Private Searches Versus Personally Identifiable Searches, also covers the general difference between private data versus personally identifiable stuff.

How does what AOL did compare to what the Department of Justice asked for from search engines earlier this year? It actually goes further. The DOJ simply wanted searches, not any further information that would allow a group of searches to be linked with an individual, even if that individual was kept anonymous.

Danny may have more to say about this next week. He's at the SES San Jose conference this week and very busy with that, but he sent me some notes from a brief review of the AOL move to give perspective here as he sees it.

Postscript From Danny: Just a few quick thoughts and updates in the short time I have between sessions.

AOL: Dooooooh! from John Battelle and AOL apologizes for release of user search data both have AOL apologizing for the release, now said to be data involving about 658,000 individuals from March through May of this year. AOL says the release of the data wasn't properly vetted for privacy issues and that the intentions behind it were innocent.

I believe that. Make no mistake, this was a big screw-up. The researchers providing the data didn't think hard enough about the fact that making it possible to build a profile of individuals, even under anonymous names, might also make it possible to determine who those people are if they revealed enough information in their searches.

In addition, it's going to be very difficult for some law enforcement agency not to want to subpoena AOL for actual user names when they read about things that suggest a murder is being planned or may have happened, as covered above. I'm not saying they'll get it, but I think it's almost inevitable that someone will try. That will set off further privacy fireworks.

But yes, the original intention was innocent. I got an email about the research site last week (and with my traveling all last week, simply did not have a chance to check it out). Here's what a researcher involved with it emailed me:

Over the last few years I have witnessed a divide developing within Information Retrieval research - between the haves and have-nots. The ‘haves' are the companies like Google, Yahoo, MSN, and ourselves, with lots of resources and data. The ‘have-nots' are people without those resources such as academic researchers and smart guys at small companies. We want to be able to help anyone work on great ideas by giving them the data and infrastructure they need.

So we started building data sets and made them available for everyone to test their ideas with. Each data set features a dynamic view, which allows you to inspect the data without having to download it. We also built some APIs for news, video, audio and podcasts, which will save people time from having to do that themselves. We have tried to stay away from interfaces like web search as those are already around.

There's nothing evil in that. In fact, there's much to appreciate, intention-wise.

We all use search engines so much, and they are so important in our daily lives, yet they remain one of the most poorly researched media venues out there. Yes, we're getting new labs like the one from Yahoo at UC Berkeley. But most search behavior studies outside of the search engines have depended on ancient search logs from places like Excite from back in 2001 or so. Newer studies, if the search engines are doing them, simply don't come out often. So the intention to promote learning with this release was innocent, if not honorable. The execution was poor and inexcusable.

This is the second major milestone in raising awareness of search privacy issues this year. The first was the Department of Justice action, which rightly focused on whether we need more safeguards over what governments can request. Today's upset highlights the protections that are needed against corporate releases of data.

The good news is that perhaps this will spur better protections. Microsoft, Google & Others Call For Unified Federal Privacy Protection covers how the major search engines recently asked for better legal protections from the government. But perhaps the search industry itself will move forward to develop better privacy standards. I've hoped recently for some type of Search Privacy Bill Of Rights. Since I doubt the government will act quickly, perhaps the industry will move faster, before a third incident causes searchers to completely lose faith in them.

AOL's Jason Calacanis, who runs Netscape, is proposing that AOL not keep search records at all. That might sound like a nice idea, but it's not practical. Not keeping records makes it harder to deal with click fraud, and it gets in the way of the internal tracking a search engine uses to improve how it responds to queries. Putting better limits on how long data is kept might help, as might developing ways to somehow remove personally identifiable information that might get into search records.
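As a rough sketch of what those two ideas could look like in practice, the Python below drops records older than a cutoff and masks the most obvious identifiers in what remains. Everything in it is an assumption made for illustration: the 90-day window, the `scrub` and `apply_retention` helpers, and the two patterns are hypothetical, not anything AOL or another search engine is known to use.

```python
import re
from datetime import datetime, timedelta

# Deliberately simple, hypothetical patterns. They catch only obvious email
# addresses and US-style phone numbers, not names or street addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def scrub(query):
    """Mask obvious email addresses and phone numbers in a query string."""
    query = EMAIL.sub("[email]", query)
    return PHONE.sub("[phone]", query)


def apply_retention(records, now, max_age_days=90):
    """Keep only (timestamp, query) records newer than the cutoff, scrubbed."""
    cutoff = now - timedelta(days=max_age_days)
    return [(ts, scrub(query)) for ts, query in records if ts >= cutoff]


# Invented sample records, for illustration only.
records = [
    (datetime(2006, 3, 1), "old query"),
    (datetime(2006, 8, 1), "call 555-123-4567 or bob@example.com"),
]
kept = apply_retention(records, now=datetime(2006, 8, 7))
```

Pattern matching like this is the easy part. A name, a street address or an unusual combination of ordinary queries would pass straight through, which is why shorter retention probably matters at least as much as scrubbing.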

Then again, Ixquick recently tried a PR push on how it doesn't keep records. Perhaps that's going to be a way for some players to win new users. Just make sure you also use some tool like Anonymizer to keep your ISP from logging your actions. Otherwise, your data is still out there and being recorded in another way.

For more on search privacy issues, here's a big giant list of recent posts: