Cloaking By NPR OK At Google

Stefanie Olsen had a great News.com article out yesterday about how National Public Radio is turning its audio content into text transcripts in an effort to gain better visibility with search engines.

Unfortunately, the same technique appears to be putting NPR in the position of cloaking content, something that got WhenU thrown out of indexes at Google and Yahoo earlier this month with much publicity.

Here's an example that illustrates this clearly. At the end of April, I was looking for some past Overture financial filings using Google. By mistake, I did a news search rather than the web search I'd intended. One of the results of that search was this:

Google's IPO
NPR (audio) - Apr 27, 2004
... And with us now to discuss Google's financial standing is
... BATTELLE : And Overture is a public company, or was a
... We've seen scores of filings in the technology ...

Though this wasn't the content I was originally seeking, it sounded interesting. I clicked through to read the transcript. Unfortunately, no such transcript existed. Instead, I reached a page allowing me to listen to the audio broadcast for free or to purchase the transcript at a cost of $4.95.

Google's spider had indexed one page, which was listed in Google News, but clicking on that listing delivered me to an entirely different one. That's cloaking -- when a spider sees one thing but a human visitor requesting the same page sees something else. It's also something Google has long warned against:

The term "cloaking" is used to describe a website that returns altered webpages to search engines crawling the site. In other words, the webserver is programmed to return different content to Google than it returns to regular users, usually in an attempt to distort search engine rankings. This can mislead users about what they'll find when they click on a search result. To preserve the accuracy and quality of our search results, Google may permanently ban from our index any sites or site authors that engage in cloaking to distort their search rankings.

In particular, this was the page the web server was programmed to show Google:

http://npr.streamsage.com/google/programlist/feature.php?wfid=1853267

That page contained the text transcript of the show, produced for NPR by multimedia indexing company StreamSage. Only Google's spider gets to see that page. A regular user is redirected behind the scenes here:

http://www.npr.org/features/feature.php?wfId=1853267
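
To make the mechanism concrete, here is a minimal sketch of how a server could send a crawler and a human visitor to different places. It is purely illustrative and not NPR's or StreamSage's actual code: the names `is_crawler` and `resolve_request`, the user-agent check (the real setup might key on IP address instead) and the 302 status are all assumptions. Only the two URLs come from the example above.

```python
from typing import Tuple

# URLs taken from the example above.
TRANSCRIPT_URL = "http://npr.streamsage.com/google/programlist/feature.php?wfid=1853267"
PUBLIC_URL = "http://www.npr.org/features/feature.php?wfId=1853267"

# Assumption: the crawler is recognized by a token in its User-Agent header.
# A real deployment could just as well match on the crawler's IP address.
CRAWLER_TOKENS = ("googlebot",)


def is_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def resolve_request(user_agent: str) -> Tuple[int, str]:
    """Return a hypothetical (HTTP status, URL) pair for a transcript request.

    The crawler is served the transcript page directly; everyone else
    is redirected to the public page with the audio and purchase options.
    """
    if is_crawler(user_agent):
        return 200, TRANSCRIPT_URL
    return 302, PUBLIC_URL
```

In this sketch, a request identifying itself as Googlebot gets the transcript, while a browser's user agent is sent on to the npr.org page. That split is exactly what the cloaking definition describes.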

The content doesn't appear to be restricted to Google News. A search for URLs from npr.streamsage.com, the domain that hosts the transcripts, returns over 230 pages in the Google web index.

None of these pages can be viewed as they were spidered through the Google cache feature. Caching has been disabled with the noarchive meta tag that Google supports -- something that's often done by those who don't want their cloaked content to be seen by human visitors.
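
The directive Google documents for switching off its cached copy is `noarchive`, placed in a robots meta tag. As a rough illustration, the sketch below checks a page's HTML for that directive using Python's standard `html.parser` module. The class and function names are hypothetical, and the sample markup in the usage lines is invented for the example rather than copied from the NPR pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when the attribute has no value.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def blocks_cache(html: str) -> bool:
    """Return True if the page asks search engines not to show a cached copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


# Invented sample markup, for illustration only.
sample = '<html><head><meta name="robots" content="index, NOARCHIVE"></head></html>'
print(blocks_cache(sample))
```

A page without the directive, or with only something like `index, follow`, would return `False` here.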

Ironically, Yahoo has also indexed some of these pages -- over 100, from what I can see. Unlike Google, Yahoo hasn't been forbidden to show cached copies of the pages. However, the cached copies show exactly what a human visitor would see. This is likely because the cloaking is designed to show the actual transcripts only to Google's crawler.

I'm waiting for comment from Google about this and will update when that comes. The company confirmed a relationship with NPR in the News.com story, so it is likely aware of the cloaking that is happening.

As a searcher, I'm actually glad the method is being used. It does mean I'm more likely to find audio content of interest. Moreover, I can listen to that for free via the NPR site.

As a search engine marketer, I'm not so thrilled. I'm well aware that many other companies would like the ability to feed Google content in this manner. Their arguments are just as compelling as NPR's: they also have good content that isn't adequately indexed by the Google crawler. Unfortunately, they're denied the privilege of feeding relevant material just to Google's crawler.

What about Yahoo? Anyone can enjoy the same benefit that NPR has -- the ability to cloak content when relevant -- through Yahoo's content acquisition program. Non-profit organizations are offered this for free. Commercial organizations have to pay, through Yahoo's trusted feed program.

For more background on cloaking, and especially how paid inclusion programs have allowed it, see my article from last year, Ending The Debate Over Cloaking. For more on multimedia searching, including the challenges it poses to search engines, see our Multimedia Search Engines page.