HarperCollins To Digitize Own Books; Unclear How That Will Feed Into Book Search Engines

HarperCollins To Digitize And Control Its Book Content from the Wall Street Journal looks at HarperCollins saying it will digitize its active backlist of 20,000 titles and up to 3,500 books per year. Part of the idea is that by doing this itself, the publisher can give content to the search engines to index but keep the files themselves. That leads me to think HarperCollins doesn't understand how book indexing works. From the story:

Search companies such as Google will then be allowed to create an index of each book's content so that when consumers do a search, they'll be pointed to a page view. However, that view will be hosted by a server in the HarperCollins digital warehouse. "The difference is that the digital files will be on our servers," said Brian Murray, group president of HarperCollins Publishers. "The search companies will be allowed to come, crawl our Web site, and create an index that they can take away, but not the image of the page."

This would prevent such Internet companies from selling a digital copy of that book unless HarperCollins decided to partner with them as a retailer. "We'll own the file, and we'll control the terms of any sale," he added.

OK, in order to make a searchable index of a book, a search engine is essentially making a copy of the book, though it doesn't mean that it reprints that copy. Indexing Versus Caching & How Google Print Doesn't Reprint from me earlier explains this in more depth.
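To make that point concrete, here's a minimal sketch of a toy positional index. It is not how Google or any other search engine actually builds its index, and the function names and sample text are mine, purely for illustration. The assumption is simply that an index records where each word appears, which is what phrase searching needs. Once it does that, the index alone is enough to put the words back in order.

```python
from collections import defaultdict


def build_positional_index(text):
    """Map each word to the list of positions where it occurs (toy example)."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    return dict(index)


def reconstruct(index):
    """Rebuild the original word sequence using nothing but the index."""
    by_position = {
        pos: word for word, positions in index.items() for pos in positions
    }
    return " ".join(by_position[i] for i in range(len(by_position)))


# Hypothetical page text standing in for a scanned book page.
page = "call me ishmael some years ago never mind how long precisely"
index = build_positional_index(page)

print(index["ishmael"])    # positions where the word occurs
print(reconstruct(index))  # the page's words, recovered from the index
```

The page image never enters the picture, yet every word of the page is sitting in the index. That's the sense in which indexing means copying.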

So yep, the search engines won't have images of a book to display -- assuming they go along with this -- but they will have a copy of all the words in the books. And that's pretty much all Google is doing with the Google Library scanning project -- making an index of books, a card catalog, exactly what HarperCollins wants to replicate.

Interestingly, HarperCollins -- though not a party to the suit over Google Library -- says it supports that suit "economically and philosophically." Well, philosophically, it doesn't seem to understand that it's doing pretty much what Google's doing already.

Here's the especially tricky bit. Google and gang, if they are "allowed to come, crawl our web site," as HarperCollins puts it, are then going to have access to the same content the general public gets. In other words, whatever you put out for crawlers, anyone gets. So is HarperCollins going to put the full text of books online? Because then forget the part about selling digital copies (not that Google and gang are doing that now). The digital copies will be out for anyone to access.

Alternatively, the various search engines do have programs where site owners can submit content, such as Google's here. But you can't just send them some nondescript "index." They want PDFs, though the program doesn't require that the actual pages be shown, despite coming in as PDFs.

Aside from book search, there are programs such as Google Scholar or Yahoo Search Subscriptions that can effectively let content owners cloak material -- the general public sees abstracts while the search engine indexes the good stuff. But neither of these, to my knowledge, will work for book search.
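For anyone unfamiliar with the idea, here's a rough sketch of what that kind of cloaking amounts to on the publisher's side. I don't know how these programs actually identify a crawler, so the user-agent check below is an assumption for illustration only, and the function and variable names are hypothetical.

```python
# Substrings assumed to identify search engine crawlers (illustrative only).
CRAWLER_TOKENS = ("googlebot", "slurp")


def select_view(user_agent, abstract, full_text):
    """Return the full text to a recognized crawler, the abstract to everyone else."""
    agent = user_agent.lower()
    if any(token in agent for token in CRAWLER_TOKENS):
        return full_text
    return abstract


# A crawler gets the good stuff; a regular browser gets the abstract.
crawler_view = select_view(
    "Mozilla/5.0 (compatible; Googlebot/2.1)", "Abstract only", "Full text"
)
reader_view = select_view("Mozilla/5.0 Firefox/1.5", "Abstract only", "Full text")
```

A check this simple is trivial to spoof, which is one reason an arrangement like this generally depends on an agreement with the search engine rather than on the open crawl.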