News.com has a good report on how porn filtering at Google still has serious problems with blocking innocent material by mistake, despite much-publicized criticism last year.
Tens of thousands of random pages were tested. Exactly how this was done isn't clear, and I didn't get an answer from News.com investigator Declan McCullagh on the exact method used to process so many pages.
Nevertheless, McCullagh's report has plenty of disturbing examples of blocked material. It leads off with the situation of PartsExpress.com getting filtered out, because the word "sex" makes up part of its domain name.
This type of blocking bug has apparently plagued filtering systems for ages -- and it's one you wouldn't have expected Google to fall into.
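To make the bug concrete, here is a minimal sketch of how a substring check flags a domain like PartsExpress.com while a whole-token check does not. The blocklist and both function names are hypothetical, invented for illustration; nothing here is taken from Google's actual filter, whose workings the report does not describe.

```python
import re

# Hypothetical one-word blocklist, for illustration only.
BLOCKED_TERMS = {"sex"}


def naive_blocks(domain: str) -> bool:
    """Flag a domain if any blocked term appears anywhere inside it."""
    lowered = domain.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def token_blocks(domain: str) -> bool:
    """Flag a domain only if a blocked term is a whole label or word."""
    # Split on anything that is not a letter or digit (dots, hyphens).
    tokens = re.split(r"[^a-z0-9]+", domain.lower())
    return any(token in BLOCKED_TERMS for token in tokens)


if __name__ == "__main__":
    for name in ("PartsExpress.com", "sex.example"):
        print(name, naive_blocks(name), token_blocks(name))
```

The whole-token version is only one possible mitigation and has blind spots of its own, since it cannot see a blocked word that is run together with other words in a single label.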
How about the situation at chief Google competitor Yahoo? The story found it wasn't as hypersensitive as Google. McCullagh said this was based on some preliminary tests that he ran, rather than the full battery that Google was subjected to.
Problems Remain Despite Earlier Criticism
McCullagh said his story was originally intended to follow up on last year's revelations out of Harvard University that Google was blocking innocent sites such as the American Library Association. However, McCullagh said the blocking he found was even broader than that seen in the Harvard report.
I wondered if the problems in the original report had been addressed. McCullagh said to the best of his knowledge, Google hadn't fixed this, and that the Harvard report's author Benjamin Edelman agrees.
I did a quick look to see whether the US Library of Congress was still being filtered. That was probably the most embarrassing example out of last year's report.
Originally, I thought this had been fixed. I could easily locate the library's web site, as I said in an earlier version of this story. But Edelman contacted me after the story was posted and pointed out that it was the library's Thomas legislative search site that he originally found to be filtered by Google.
That's still the situation with Google (at Yahoo, this doesn't happen). To see this, compare this search to this search. The first link brings up filtered results. In the second, you should see the Thomas site (thomas.loc.gov) appear around the fourth listing. (It looks weird because the site actually bans search engines from spidering it, so Google shows only a partially indexed URL.)
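For anyone who wants to repeat this kind of side-by-side check on more than one query, the comparison can be sketched as a set difference over result hostnames. This is an illustrative sketch only: the function names are hypothetical, it assumes you have already collected the result URLs for each search by hand as full URLs including the scheme, and it does not query any search engine.

```python
from urllib.parse import urlsplit


def hosts(urls):
    """Collect the hostnames found in a list of full result URLs."""
    found = set()
    for url in urls:
        hostname = urlsplit(url).hostname
        if hostname:  # skip entries that have no parseable host
            found.add(hostname)
    return found


def dropped_by_filter(unfiltered_urls, filtered_urls):
    """Return hosts listed without filtering but absent once it is on."""
    return sorted(hosts(unfiltered_urls) - hosts(filtered_urls))
```

A host turning up in the returned list only shows that it was missing from the filtered results you collected. It does not show why it was missing.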
How about a less prominent site? I looked for that of Jackson County, said to be omitted as of April 2003. A year later, the problem with that site appears to have been fixed. I found it listed in the first page of results for a search on jackson government, even though filtering was enabled.
That's just two tests, and they are hardly conclusive. It could very well be that Google still has trouble with many of the sites listed in the original Harvard report. But there's certainly no doubt based on McCullagh's report that plenty of innocent sites are getting blocked. It's something Google itself confesses to in his article.
Test Them All!
Overall, I love this type of testing. I only wish the latest report, as well as the original report, had at least included a comprehensive test of one or two Google competitors. That way, we'd have a better and more quantitative idea of how badly Google is failing -- or whether its failures are actually indicative of an industry-wide search problem.
Meanwhile, my mantra about these types of reports remains the same. Use a porn filter, and you may miss important material. So if you need to do comprehensive research, push the kids out of the room and turn the filter off.