How The US Department of Justice May Analyze Search Data & Freedom Of Information Act Request For Disclosure

So what did the Department Of Justice ask for exactly from AOL, MSN, and Yahoo? What did they get, exactly? What negotiations went on? How will the DOJ analyze the data? How useful will that analysis be? And how about AOL, MSN and Yahoo disclosing what they released now, rather than waiting for a Freedom Of Information Act request to force the material into the public domain? A look at all of this and more in an update on the Google versus Department Of Justice story, below.

First, want to pretend you are the Department Of Justice and analyze search queries yourself? “What would the search engines be giving the feds?” from Shaun Ryan at SLI Systems gives you the chance. Shaun has been involved with search for years. He is currently one of the people behind Eurekster and was previously behind GlobalBrain, which Snap/NBCi later invested in, then sold back to Shaun and his brother. Shaun’s posted a day’s worth of searches from 2001 that happened on Snap/NBCi. Have fun analyzing. As Shaun says:

This isn’t exactly the format that the search engines would have given the data to the feds and although it does make interesting reading, it does underscore how difficult it would be to get any useful information out of it. Privacy issues aside this seems like a silly request from the Department of Justice. Can anyone see how they could analyze this file to support the child protection law?

I’ve actually been trying to get this exact question answered. Just what is the Department Of Justice’s plan? My guess is that they want to pull a random sample of queries from the query logs, then run those searches at each of the major search engines to see if they encounter pornography with porn filters switched on and off.
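To make that guess concrete, here is a minimal Python sketch of just the sampling step. It is only an illustration of my guess, not anything the Department Of Justice has described. The function name `sample_queries`, the sample size and the made-up queries are all assumptions, and it uses reservoir sampling so that a very large log never has to be loaded into memory at once.

```python
import random
from typing import Iterable, List


def sample_queries(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Pick k queries uniformly at random from a stream of log lines.

    Hypothetical helper: assumes one query per line and skips blank lines.
    Uses reservoir sampling (Algorithm R), so only k queries are held in memory.
    """
    rng = random.Random(seed)  # fixed seed so a sample can be reproduced
    reservoir: List[str] = []
    seen = 0  # number of non-blank queries read so far
    for line in lines:
        query = line.strip()
        if not query:
            continue
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k queries.
            reservoir.append(query)
        else:
            # Keep the new query with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = query
    return reservoir


# Made-up queries standing in for a real log file.
demo_log = ["weather\n", "cheap flights\n", "\n", "movie times\n", "news\n"]
print(sample_queries(demo_log, k=2))
```

With a real log you would pass an open file object instead of `demo_log`. Each sampled query would then have to be run by hand or by script at each search engine, with filtering on and off, which is the part no one has explained.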

OK, if that’s the plan, it’s a good enough way to go. But as I wrote before, they didn’t need to demand data to do this. They could have bought search queries from a place like WordTracker. They could have simply come up with their own list of terms to check, just as the US Government Accountability Office did last year when it wanted to test porn filtering in image search. “Bush Administration Demands Search Data; Google Says No; AOL, MSN & Yahoo Said Yes” from me last month covers both of these issues in more depth.

The weakness in either of those approaches is that the ACLU (which opposes the law the Department Of Justice is backing) might argue that the tests aren’t based on real query data. Sure. But then again, I’ve also explained in the article above that the raw search logs themselves don’t represent “real” queries. The Department Of Justice failed to ask for automated queries to be removed, for queries to be broken down by age, or for queries from outside the US to be removed. Any of those elements can be argued to skew the data and make it unrepresentative of actual user behavior.

Those failures are one sign that the Department Of Justice doesn’t really know enough about search engines to be producing any study that will stand up in court. But the amount of data they wanted was another sign. Remember, they initially asked for two months’ worth of queries, and hey, just put them all in a text file for us.

The file Shaun’s released is one single day’s worth of queries from a minor search engine and is a 19MB text document. It contains half a million terms. If Google really had given up the original two months’ worth of data requested, it would have been an incredibly large file, perhaps requiring its own computer in the same way that Bill Gates’s tax return does. It would have involved billions of queries, which is overkill to say the least for what the DOJ intends.
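As a rough back-of-the-envelope check, Shaun’s figures work out to about 38 bytes per query. The short Python sketch below scales that up. The total query counts are hypothetical, since “billions” is the only figure I have, and the calculation assumes that 19MB means 19 million bytes and that Google’s queries would average the same length as those in Shaun’s file.

```python
# Figures from Shaun's file: roughly 19MB for roughly half a million queries.
SAMPLE_BYTES = 19 * 1_000_000   # assumes 1MB = 1,000,000 bytes
SAMPLE_QUERIES = 500_000


def estimated_size_gb(total_queries: int) -> float:
    """Estimate text-file size in gigabytes for a given number of queries.

    Assumes the same average bytes per query as the sample file.
    """
    bytes_per_query = SAMPLE_BYTES / SAMPLE_QUERIES
    return total_queries * bytes_per_query / 1_000_000_000


# Hypothetical totals; the real number of queries is not public.
for billions in (1, 5, 10):
    size = estimated_size_gb(billions * 1_000_000_000)
    print(f"{billions} billion queries: about {size:.0f} GB of plain text")
```

Even at the low end of those assumed totals, that is tens of gigabytes of plain text for a test that needs only a small sample.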

Meanwhile, Google argues that one reason it shouldn’t have to hand over data is that by now, the Department Of Justice has enough from the other search engines. Indeed, it does. Moreover, every day that it delays in analyzing that data is another day it becomes less useful to them. That’s because query terms can change over time. Waiting to get the Google data means the other data is just getting older and older.

But back to the point. Where are we in the process? I mean, what does the Department Of Justice know already? I contacted them by phone on January 23 and was told by spokesperson Charles Miller that he didn’t know whether any of the data had been analyzed yet or exactly what process would be used. I then sent a follow-up email that same day. Here’s the key part of that:

I’d like to better understand exactly how the testing will happen, something the DOJ should be able to comment upon irregardless to the current legal action. There must have been some plan on how the data would be analyzed developed before any request would be done. That plan is what I’d like to know more about.

I can see from documents already released that Professor Stark intends to make use of the data to do some type of statistical analysis of the presence of HTMs within search results. But exactly how will this be done? My questions include things such as:

1) Will you randomly sample queries from the data received, or will you try to sub-divide the queries in various ways and test on more popular versus less popular ones?

2) How exactly will you test the queries? Will you use each query at each search engine that it came from, to see what comes up on the results? To what depth will you search (first page of results, second and so on)?

3) What is the criteria for determining if a page is HTM?

4) What’s the purpose of requiring a separate URL list from the query logs you are receiving?

My questions aren’t limited to those above. I’ll have more depending on what exactly the plan is. The easiest way forward would be to arrange an interview with someone knowledgeable about the testing planned, such as Professor Stark himself or other experts you may have engaged.

I’d also like to understand how the plan came about. What types of studies were considered and rejected before this one was settled upon?

In addition to the above, I’d also like to understand exactly what data has been provided by each search engine. Were they all served with the same original request as Google? Did they comply to provide, or did they negotiate to send an agreed upon smaller amount.

I’m also wondering if there were any non-legal requests made to the search engines before the subpoenas?

Finally, I’d like access to all correspondence sent on this matter between the Department Of Justice and the major search engines involved: Google, Yahoo, MSN, AOL and any others that may not have been named. Three documents have come out so far, but there is obviously much more. If required, I’ll make a Freedom Of Information Act request. However, I wanted to see if I could first voluntarily get this information released.

I never got a response from either Miller or the lead attorney on the case, Joel McElvain, who was cc’d on the email. I also cc’d Philip Stark, who is the government witness in this case. Stark actually responded promptly, unlike the US government itself. Stark said that he couldn’t personally comment on the case, which was perfectly reasonable.

This brings me to a Boing Boing challenge that was issued to the major search engines on January 30:

So, America Online, Microsoft, and Yahoo: will you please release the data publicly — or show us where it already exists online? This way, everyone who uses your services can take a look for themselves, and evaluate whether they believe the information shared was privacy-violating.

So far, none of these companies have done this. I think they should, and Google should release everything it sent to the US government in terms of negotiations, as well.

Go back up to my letter to the Department Of Justice, and you’ll see I plan to file a Freedom Of Information Act request for this type of material. I haven’t had a chance to do that yet, but I probably will next week. But anyone can do so: here’s the info. Go ahead and beat me to it! Maybe someone else already has. The downside is that these things typically take months to process, but they do get processed, and information does come out.

Heck, perhaps US Senator Patrick Leahy could do this. We’ve covered previously his demand that the Department Of Justice hand over more information about the subpoenas that were served. He wrote to the DOJ asking for various things, including:

Please identify the type(s) of information and/or data that the Department requested in its subpoenas for records issued to the Internet companies — including whether the Department requested, or obtained, any personal identifying information and/or data in connection with the subpoenas — and state how the Department intends to use this commercial information and/or data.

I see no reason why the information I’m seeking through a FOIA request won’t be released. There’s no national security impact here. It’s hard to see what confidential information of either party might be disclosed. If this is data gathered by the US government to be argued in an open court of law, then it is public data. You’ve got a US senator seeking the same. The search companies themselves can release it now voluntarily, or they can wait to have it released through a FOIA request.

Meanwhile, I’ll end with a mention of something John Battelle blogged about. One of his readers, Adam Fields, wanted to know if Google could come up with a list of people who search for a particular topic by IP address or cookie, as well as whether a profile of all searches done from a particular IP address or cookied browser could be produced. Google said yes.

That doesn’t surprise me at all. In fact, that’s just standard log data that any web server would record. Google’s said they’ve kept such data in the past; Yahoo has said the same. It’s fair to assume everyone has data like this.
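To show why this is unsurprising, here is a minimal Python sketch of both lookups. It is an illustration under stated assumptions, not a description of how Google or anyone else stores logs. The `LogEntry` fields, the function names, the cookie IDs and the sample records are all hypothetical, and the IP addresses come from ranges reserved for documentation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set


class LogEntry(NamedTuple):
    """One hypothetical search log record."""
    ip: str       # address the request came from
    cookie: str   # ID stored in the browser cookie
    query: str    # what was searched for


def who_searched_for(entries: Iterable[LogEntry], topic: str) -> Set[str]:
    """Return the cookie IDs whose queries mention the topic."""
    topic = topic.lower()
    return {e.cookie for e in entries if topic in e.query.lower()}


def searches_by(entries: Iterable[LogEntry], field: str) -> Dict[str, List[str]]:
    """Group queries by 'ip' or 'cookie' to build a per-identifier profile."""
    profiles: Dict[str, List[str]] = defaultdict(list)
    for e in entries:
        profiles[getattr(e, field)].append(e.query)
    return dict(profiles)


# Made-up records. The first three share an IP address but not a cookie.
LOG = [
    LogEntry("192.0.2.10", "cookie-A", "cheap flights"),
    LogEntry("192.0.2.10", "cookie-A", "hotel deals"),
    LogEntry("192.0.2.10", "cookie-B", "cheap flights to paris"),
    LogEntry("198.51.100.7", "cookie-C", "weather"),
]

print(who_searched_for(LOG, "flights"))
print(searches_by(LOG, "cookie"))
print(searches_by(LOG, "ip"))
```

Both questions reduce to a filter and a group-by over records a server already has. Note that grouping by cookie and grouping by IP address give different profiles for the same records.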

Whether you should freak out that you might be identified by IP address or cookie is another matter. Did you just change from Internet Explorer to Firefox? Congrats — you just got a new cookie, in all likelihood. Get a new computer? Same thing. Switch ISPs? Assuming someone could get search data from a search engine and IP data from ISPs, then yes, stuff can be tracked back to you.

Of course, it remains far easier for the government to just get that data from your ISP directly or from your home computer.

Postscript: The Department Of Justice has now responded to my request:

Unfortunately, due to the fact that this is ongoing litigation, we are unable to make anyone available at this time. I think the brief we filed in this matter speaks for itself.

I wrote back:

Are you saying the Department Of Justice never comments on ongoing litigation? You’ve already commented on this case, such as to reassure people that no private information has been requested.

The brief does not explain how you plan to analyze this data. That analysis will be a matter of public record. You must have some type of plan in place now, and that plan is going to come out at some point. It will either come out in a court hearing, or you’ll have to reveal it as part of any analysis you present.

Again, I’d like to talk more about the analysis you intend, so that people can cease speculating about it. At the very least, the DOJ should be able to answer whether any analysis of the data you’ve received so far has been done. Let me know if you can answer further. In the meantime, I’ll get started on the FOIA request.

And the response was the contact details for the department’s FOIA request officer. So I guess it’s the long road….
