Researchers Debate Whether To Study The AOL Data

When the AOL privacy case broke earlier this month, I
wrote about
how the intention behind releasing the data was honorable, however inept the
execution. Those trying to research search behavior have been starved for
decent data.

Researchers Yearn to Use AOL Logs, But They Hesitate
from the New York Times
covers this in more detail, including how the existing data sets out there are
nearly 10 years old.

Along the way, we discover researchers are debating whether they should use the
data. I’d say you might as well. It’s not like you’ll be getting more any time
soon. As long as the researchers aren’t themselves republishing it in a way that
violates someone’s privacy, it’s hard to see the harm. At this point, the data
has been spread so far and wide, and is accessible in so many ways, that it’s
difficult to see what the researchers think they’d be protecting by refusing to
study it.

The story also touches on data releases from other search engines (Yahoo and
Microsoft say they’ve done some controlled, limited releases; Google says they
hand nothing out). It also highlights how the researcher who put the data out,
again with the best of intentions, simply didn’t realize that people could be
tracked down through their search profiles.

Most interesting is the end of the story, which looks at whether there’s a way
to scrub the search stream so that data could be released and be untraceable.
I’ve said I’d love
to see that type of solution happen. But it would have to be foolproof, and I’m
not sure how that can happen unless you have human review of the profiles that
might go out.

Meanwhile, the San Jose Mercury News in
What do Google, Yahoo, AOL and Microsoft’s MSN know about you?
does over the same survey of how long data is kept that
did last February, in
the wake of the US Department Of Justice search privacy debate. I mentioned the
story before, but let me highlight a key part of it:

While AOL is unique among the Big Four in that its users are easily
identified by an AOL user name after they have logged in, people who frequent
Google, Yahoo and MSN are also monitored by a combination of digital tracking

Nope, AOL is not that unique. If you’ve logged into Google, Yahoo or MSN to
use any of their services, chances are when you search, they’ll also have you
keyed to a particular profile that’s more unique than just looking at your IP
address or a cookie. The story does explain this more, and my previous post
Which Search
Engines Log IP Addresses & Cookies — And Why Care?
goes into the
explanation in more depth. In looking at that previous post, I also saw this:

[]: Given a list of search terms, can you produce a list of
people who searched for that term, identified by IP address and/or cookie

[AOL]: No. Our systems are not configured to track individuals or groups
of users who may have searched for a specific term or terms, and we would
not comply with such a request.

Despite the response, I’m 99 percent certain AOL does indeed log IP
addresses and cookies along with search data. Searching on AOL
creates a page request with the search terms embedded in the page’s URL. That
request will be logged. If it’s logged, it can be analyzed. In fact, AOL later
says they can give you a list of searches that were done by a particular IP
address or cookied browser. If you have that information, you have the

Of course, we now know that it was indeed the case that you could take AOL’s
data, give it a search term and get a list of individuals who searched for it.
Yes, the individuals were given anonymous numbers, so the AOL answer is
technically correct. But the overall profile of what someone was searching for
in some cases turned out to be personally revealing.
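
To make that concrete, here is a minimal sketch of why logged search URLs keyed to an anonymous number are still so revealing. Everything in it is an assumption for illustration: the sample records, the `search.example.com` URLs, the `q` parameter name and the function names are hypothetical, and this is not AOL’s actual log format or tooling. It only shows that once the query sits in a logged URL next to any stable identifier, both lookups are trivial.

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qs

# Hypothetical log records: (anonymous user number, requested URL).
# The real logs would look different; this shape is assumed for the sketch.
SAMPLE_LOG = [
    (101, "http://search.example.com/search?q=garden+tools"),
    (102, "http://search.example.com/search?q=flu+symptoms"),
    (101, "http://search.example.com/search?q=landscapers+in+springfield"),
    (103, "http://search.example.com/search?q=garden+furniture"),
]


def extract_query(url, param="q"):
    """Pull the search terms out of a logged URL (parameter name is assumed)."""
    values = parse_qs(urlparse(url).query).get(param, [])
    return values[0].lower() if values else ""


def build_indexes(log):
    """Build two lookups: term -> anonymous IDs, and anonymous ID -> queries."""
    users_by_term = defaultdict(set)
    queries_by_user = defaultdict(list)
    for user_id, url in log:
        query = extract_query(url)
        if not query:
            continue  # skip requests that carry no search terms
        queries_by_user[user_id].append(query)
        for term in query.split():
            users_by_term[term].add(user_id)
    return users_by_term, queries_by_user


users_by_term, queries_by_user = build_indexes(SAMPLE_LOG)

# Everyone (by anonymous number) who searched for a given term.
print(sorted(users_by_term["garden"]))

# The full search profile behind one anonymous number.
print(queries_by_user[101])
```

The first lookup is the “list of people who searched for a term” question. The second is the one that caused the trouble: the number is anonymous, but the collected queries behind it may not be.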

I’m planning a longer recap on some of the latest out of the AOL case, but in
the meantime, I still keep coming back to this conclusion from an earlier

I think consumers will need more faith and control over how long search
data is kept for them, plus the ability to opt-out or delete histories with a
push of a button, perhaps the type of privacy/data control panel John Battelle
has wished for. And
as I’ve written, that has to include ISPs, many of which merrily sell search
data that they monitor to third party companies.

I’m working on a longer look back at the fallout from the AOL release and
ways forward. But a quick shout-out to Daniel Brandt of
Google Watch is in order. Seth
just gave him one,
and I’ll add to it. I’ve felt Brandt has often twisted
things or focused on stuff that didn’t matter much (such as Google’s 30-year
cookie, which for most people won’t really last more than a year or two, if that).
But his long-standing

for regular data destruction — something other privacy advocates
have also pushed for — seems the most secure solution going forward.
