I've got some follow-up items about yesterday's story where AOL released user query records, including how anyone can now easily look at the data.
First, after Barry did a recap of the news, I added a postscript to the story with more of my thoughts. In case you missed it, here are the key parts below:
AOL: Dooooooh! from John Battelle and AOL apologizes for release of user search data from News.com have AOL apologizing for the release, now said to be data involving about 658,000 individuals from March through May of this year. AOL says the release of the data wasn't properly vetted for privacy issues and that the release intentions were innocent.
I believe that. Make no mistake, this was a big screw up. The researchers providing the data didn't think hard enough about how a profile built up for each individual, even one given an anonymous name, might make it possible to determine who that person is if they revealed enough information in their searches.
In addition, it's going to be very difficult for some law enforcement agency not to want to subpoena AOL for actual user names when they read about things that suggest a murder is being planned or may have happened, as covered above. I'm not saying they'll get it, but I think it's almost inevitable that someone will try. That will set off further privacy fireworks.
But yes, the original intention was innocent. I got an email about the research site last week (and with my traveling all last week, simply did not have a chance to check it out). Here's what a researcher involved with it emailed me:
Over the last few years I have witnessed a divide developing within Information Retrieval research - between the haves and have-nots. The 'haves' are the companies like Google, Yahoo, MSN, and ourselves, with lots of resources and data. The 'have-nots' are people without those resources such as academic researchers and smart guys at small companies. We want to be able to help anyone work on great ideas by giving them the data and infrastructure they need.
So we started building data sets and made them available for everyone to test their ideas with. Each data set features a dynamic view, which allows you to inspect the data without having to download it. We also built some APIs for news, video, audio and podcasts, which will save people time from having to do that themselves. We have tried to stay away from interfaces like web search as those are already around.
There's nothing evil in that. In fact, there's much to appreciate, intention-wise.
We all use search engines so much, and they are so important in our daily lives, yet they remain one of the most poorly researched media venues out there. Yes, we're getting new labs like the one from Yahoo at UC Berkeley. But most search behavior studies outside of the search engines have depended on ancient search logs from places like Excite from back in 2001 or so. Newer studies, if the search engines are doing them, simply don't come out often. So the intention to promote learning with this release was innocent, if not honorable. The execution was poor and inexcusable.
This is the second major milestone in raising awareness of search privacy issues this year. The first was the Department of Justice action, which rightly focused on whether we need more safeguards over what governments can request. Today's upset highlights the protections that are needed against corporate releases of data.
The good news is that perhaps it will spur even better protections. Microsoft, Google & Others Call For Unified Federal Privacy Protection covers how the major search engines recently asked for better legal protections from the government. But perhaps the search industry itself will move forward to develop better privacy standards. I've hoped recently for some type of Search Privacy Bill Of Rights. Since I doubt the government will act quickly, perhaps the industry will go faster before a third incident causes searchers to completely lose faith in them.
AOL's Jason Calacanis, who runs Netscape, is proposing that AOL not keep search records at all. That might sound like a nice idea, but it's not practical. Not keeping records raises issues with click fraud, and with the internal tracking a search engine uses to improve how it handles and responds to queries. Putting better limits on how long data is kept might help, as might developing ways to somehow remove personally identifiable information that might get into search records.
Then again, Ixquick recently tried a PR push on how it doesn't keep records. Perhaps that's going to be a way for some players to win new users. Just make sure you also use some tool like Anonymizer to keep your ISP from logging your actions. Otherwise, your data is still out there and being recorded in another way.
The postscript then goes on with a long list of links to stories about search privacy issues, so check it out, if you want to read more background about the issue.
Next, via TechCrunch, the AOL Search Database is a new site that has taken all the data and allows anyone to search through it. The site's up and down due to demand, so be forewarned. It also lacks documentation, but here's a very quick guide to what I've played with so far.
User ID: To see the searches done by a particular person, you enter the anonymous user number they've been given. The main problem is that I have no idea where the numbering sequence starts. For example, enter 1 into the box, and you get nothing. Enter 1083349, and that brings up the records for that user (well, it should -- when I tried, I got a database error because of a behind-the-scenes glitch).
Search Keywords: Enter a term here, and you'll see all the people who searched for that word. For example, entering [murder] gave me a list of everyone who looked for that word or phrases that include it (such as murder.com). I haven't tested to see if there's a way to do an exact match yet. This is also an easy way to obtain user numbers, if you want to then check out particular user records.
Date Of Search: I haven't tried it yet, but I assume this will give you all searches done on a particular day.
Website Results: Again, I didn't have a chance to play with this, but I assume if you enter a URL (say playboy.com), you'd see all the people who did a search, got that site listed and perhaps clicked through.
When you are done exploring, you can enter your findings into Valleywag's Find the scariest AOL user search record contest. So far, this isn't scary but funny: Scariest search records: AOL saves crew of Oceanic flight 815. Over at Consumerist, AOL User 231392 Illuminated is a little more scary.
Prefer to roll through the data on your own, or perhaps build a better interface for searching it? This mirror site offers the data that AOL pulled yesterday.
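For anyone taking the roll-your-own route, here is a minimal sketch of the four lookups described above (user ID, keyword, date and website), done directly against the raw text. It rests on assumptions that should be checked against the actual files: that they are tab-separated with a header row, and that the columns are named AnonID, Query, QueryTime, ItemRank and ClickURL. The function names are hypothetical, and the sample rows are invented purely to show the shape of the data.

```python
import csv
import io

def load_rows(handle):
    """Read tab-separated rows (header line assumed) into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

def by_user(rows, user_id):
    """All searches recorded for one anonymous user number."""
    return [r for r in rows if r["AnonID"] == str(user_id)]

def by_keyword(rows, term, exact=False):
    """Searches containing the term, or equal to it when exact=True."""
    term = term.lower()
    if exact:
        return [r for r in rows if r["Query"].lower() == term]
    return [r for r in rows if term in r["Query"].lower()]

def by_date(rows, day):
    """Searches whose timestamp starts with the given YYYY-MM-DD day."""
    return [r for r in rows if r["QueryTime"].startswith(day)]

def by_site(rows, domain):
    """Searches where the clicked URL mentions the given domain."""
    domain = domain.lower()
    return [r for r in rows if domain in (r["ClickURL"] or "").lower()]

# Invented sample rows in the assumed layout (not real records).
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "101\tmurder.com\t2006-03-01 10:00:00\t1\thttp://www.example.com\n"
    "101\tweather\t2006-03-02 11:00:00\t\t\n"
    "202\tmurder\t2006-03-01 12:00:00\t2\thttp://www.example.org\n"
)

rows = load_rows(io.StringIO(SAMPLE))
for row in by_keyword(rows, "murder"):
    print(row["AnonID"], row["Query"])
```

The `exact` flag covers the exact-match case that the search site may or may not support: without it, a search for [murder] also matches murder.com, just as on the site. To run this against a real file, pass an open file handle to `load_rows` instead of the in-memory sample.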