Indexing Versus Caching & How Google Print Doesn't Reprint

I've written before that legal concerns about book indexing and Google Print may have repercussions for web indexing. Kevin Werbach and David Winer look at this again, afresh. Below, a look at their pieces, plus the crucial difference between indexing (making something searchable) and caching (reprinting content). Google's library scanning program makes things searchable in Google Print but not reprinted.

Breaking Apart at the Seams from Kevin stresses, as I've done, that indexing the words on a web page isn't that much different from indexing the words on a printed page. He wonders if a lawsuit preventing book indexing might cause a type of unraveling of sharing content online in general.

A turning point for the web? from Dave goes much longer to counter the notion that an opt-out approach is acceptable. Unfortunately, he gets some of the points involved wrong. Specifically:

If you publish a site, Google reads the whole site into its cache and then lets you find things in it. Generally people who publish sites know this, and want Google to do this.

Google's index and its cache are two different things, and it's critical -- absolutely critical -- they not be confused like this.

When any search engine visits a web page, it effectively makes a copy of that page which is stored in the index. But the index literally breaks apart the page. It stores where words were located, were they in bold, what other words were they near, were the words in a hyperlink and so on.

Nothing in the index is anything you as a human being could read. I've described the index in searching classes as being like a "big book of the web." But it's not, really. It's more like a giant spreadsheet, where all the words of a page are in one row of the spreadsheet, each word in a different column, then the next page in the row below that, and so on. It's not something a human being would read.
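To make the "broken apart" idea concrete, here is a minimal sketch of a toy index in Python. It is not how Google or any real search engine stores its data; the page names, the (word, is_bold) input format and the function names are all made up for illustration. It only shows that what gets stored is word locations and attributes, not a readable copy of the page.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to a list of (page_id, position, in_bold) postings."""
    index = defaultdict(list)
    for page_id, tokens in pages.items():
        for position, (word, in_bold) in enumerate(tokens):
            index[word.lower()].append((page_id, position, in_bold))
    return dict(index)

def search(index, word):
    """Return the sorted ids of pages that contain the word."""
    return sorted({page_id for page_id, _, _ in index.get(word.lower(), [])})

# Hypothetical pages: each page is a list of (word, is_bold) pairs.
pages = {
    "page1": [("Google", True), ("indexes", False), ("books", False)],
    "page2": [("Libraries", False), ("lend", False), ("books", False)],
}
index = build_index(pages)
```

Looking up a word with search(index, "books") tells you which pages match, but the index itself is only a set of word-to-location lists. Nothing in it reads like the original page.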

Aside from the index, Google, Yahoo, MSN and Ask Jeeves also make "cached" copies of pages available. You can see a copy of the exact page the search engine spidered. These cached pages are kept separate from the index. They are useful when a page is down or when a copyright holder wants to see if someone has stolen and cloaked their content to feed to a spider. But the legality of showing such cached pages is also in question. No one to date has challenged them in court. The reason seems to be that Google, which mainstreamed cached copies, lets site owners opt out of caching if they want.

All major search engines also let you opt out of being in their indexes, as well -- a completely different thing -- and another reason why the index shouldn't be confused with the cache. To take Google as an example, you can:

  • Have your page listed in the index (available to be found through searches) and have your page available as a cached copy
  • Have your page listed in the index but not cached
  • Have your page NOT listed in the index and thus also not cached.
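The three choices above can be sketched in a few lines of Python. This assumes the choices are expressed through the robots meta tag values `noindex` and `noarchive`, which is one common mechanism but not one the text above names; the function name `listing_state` is hypothetical, and real crawlers handle more directives and edge cases than this.

```python
def listing_state(robots_content):
    """Return (indexed, cached) for the 'content' value of a robots meta tag.

    Assumption: 'noindex' keeps the page out of the index and
    'noarchive' keeps it out of the cache. A page that is not
    indexed is treated as not cached either.
    """
    directives = {d.strip().lower() for d in robots_content.split(",") if d.strip()}
    indexed = "noindex" not in directives
    cached = indexed and "noarchive" not in directives
    return indexed, cached

# The three cases from the list above.
both = listing_state("")                 # indexed and cached
index_only = listing_state("noarchive")  # indexed, not cached
neither = listing_state("noindex")       # not indexed, so not cached
```

The point of modeling it this way is that the two switches are separate: turning off the cache leaves the index listing alone, which is why the two should not be confused.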

The ability to opt out of the index is another reason why we really haven't had a major search engine sued over web search indexing. In addition, site owners, as Dave notes, generally want to be indexed, so they can get traffic. In fact, the reason so many are upset over the current indexing update at Google is that they feel changes are causing them to lose traffic. But whether it is LEGAL to do this type of indexing (as opposed to caching) still really hasn't been tested.

So indexing and caching are NOT the same. Back to Dave's piece. He writes:

Google clearly does not have the right to make a copy of the book and republish it without the permission of or compensation to the copyright owner. The publishers appear to be on the right side of this one, and while I'm not a lawyer, I can't imagine that they won't prevail in court.

I'm not a lawyer either, but I can completely imagine that Google might win. Maybe not, but it's hardly far-fetched or doubtful, and even some lawyers feel Google may win.

Here's the thing. Google is NOT, repeat NOT, republishing copies of books that it scans out of libraries. This is a fundamental mistake that many people seem to be making.

Google is scanning books into an index, just as it spiders web pages and adds them to its index. It is making the books searchable by doing this, but that process does not republish the books in a way you can read.

Think about it in web search terms. You can find a matching book, but there's NO hyperlink to click on that will take you to an online version of the book itself. There's just a snippet -- maybe -- of the text surrounding the words matching what you looked for.

Want the actual book? Google Print won't give it to you. Instead, you have to go someplace and buy it or find it in a library. Google Print merely tells you the book may be what you're looking for.

The only exception to this is if a publisher OPTS-IN. Not opt-out. If a publisher chooses, then -- and only then for books that are in copyright -- will Google display some of the actual book. The exact amount is left up to the publisher.

So, I've covered that indexing means making a book (or web page) searchable while caching means making a page (or a book) viewable online, without having to go to the source material (the book or the page). Let's recap then how both systems work:

Search Type   Indexing   Caching   Snippets/Description
Web           Opt-Out    Opt-Out   Opt-Out
Books         Opt-Out    Opt-In    Opt-Out

As you can see, book search is actually more opt-in than web search is. Books themselves aren't cached or shown. But they are made searchable without permission.

That system has worked on the web, because of the aforementioned feeling that site owners want traffic. As for book publishers, Why Don't Book Publishers Object To Web Indexing? from me earlier covers how many seem not to mind getting traffic through an opt-out system on the web, as well.

It remains for a court to decide whether it should be workable when it comes to book indexing. If not, then absolutely, you might see search engines ponder if web indexing itself -- which really hasn't been legally tested -- is something they'll need to require an opt-in for. And if that's the case, web indexing will get pretty bad, since many publishers will simply fail to make the opt-in effort.

What's that third column, the snippets/description one? That's the place where I think book publishers might prevail, and certainly a change that Google should consider. Legal Experts Say Google Library Digitization Project Likely OK; Will It Revolve Around Snippets? covers how it's possible that in some cases, even the limited description that Google puts on pages might give away some of the value of a book and thus real harm might be proven to a publisher. Solution? Make showing descriptions an opt-IN thing.

Lastly, Dave makes a couple of other comments:

It's time to realize that Google is no longer the little company we used to love. They're now a huge company that pushes individuals around like a lot of other huge companies. They need some balance to their power. And it's ridiculous to blindly take their side on every issue. Sometimes they're wrong, and I believe this is one of those times. It's certainly worth considering the possibility that they're wrong.

Absolutely, Google is a big giant company, not some tiny lovable start-up. If anyone still has that idea, definitely get it out of your mind now. But whether you think they push others around or not may depend on what area we're talking about. And whether a company of any type should be hated because they're big is another issue, as well. Nor should it be assumed that Google is always right. They most definitely are not.

As for this:

This situation is much like the disagreement we had with Google a few months back, when they wanted to put ads on our sites without permission and without paying....and right now they're putting ads on your content without your permission, without compensating you. Now how do you feel about that?

Dave is talking about Google's AutoLink. I'd disagree that the links Google may insert if someone clicks on the right button in the Google Toolbar are ads, so don't freak out if you aren't familiar with AutoLink and are suddenly scanning your pages to find how Google got real AdSense ads on them. They didn't.

I would agree that Google should go the opt-out route with AutoLink, as I wrote before. But it's also a harder argument to have, when there's been the incredible popularity of GreaseMonkey for Firefox, which can insert links into pages. Plenty of people use CustomizeGoogle, which inserts links into Google's own pages. Fair turnabout, some who hate AutoLink would say. Yes, it is -- but then it also weakens the argument that Google itself can't let people put links into pages with its own tools.

Postscript: Ray Gordon writes to say he has filed a complaint arguing that web search on an opt-out basis is in violation of copyright. You can read the filings here. I've skimmed them, and he seems more concerned about usenet material (rather than web material) that can't be removed, apparently because others may have reprinted his own posts.

Postscript 2: Dan Thies writes that a search index is even less readable than a spreadsheet, and he's correct. I was trying to keep things simple yet familiar to illustrate the difference between words arranged on a page for reading and words indexed to make a search engine work. As Dan says, he understands I was keeping things simple -- but he also takes you deeper into how inaccessible to a "reader" a real index actually is.

About the author

Danny Sullivan was the founder and editor of Search Engine Watch from June 1997 until November 2006.
