HarperCollins To Digitize Own Books; Unclear How That Will Feed Into Book Search Engines

"HarperCollins To Digitize And Control Its Book Content"
from the Wall Street Journal looks at HarperCollins saying it will digitize its
active backlist of 20,000 titles and up to 3,500 books per year. Part of the idea
is that by doing this itself, the publisher can give content to the search
engines to index but keep the files themselves. That leads me to think
HarperCollins doesn’t understand how book indexing works. From the story:

Search companies such as Google will then be allowed to create an index of
each book’s content so that when consumers do a search, they’ll be pointed to
a page view. However, that view will be hosted by a server in the
HarperCollins digital warehouse. “The difference is that the digital files
will be on our servers,” said Brian Murray, group president of HarperCollins
Publishers. “The search companies will be allowed to come, crawl our Web site,
and create an index that they can take away, but not the image of the page.”

This would prevent such Internet companies from selling a digital copy of
that book unless HarperCollins decided to partner with them as a retailer.
“We’ll own the file, and we’ll control the terms of any sale,” he added.

OK, in order to make a searchable index of a book, a search engine is
essentially making a copy of the book, though it doesn’t mean that it reprints
that copy.
"Indexing Versus Caching & How Google Print Doesn’t Reprint," from me earlier,
explains this in more depth.

So yep, the search engines won’t have images of a book to display — assuming
they go along with this — but they will have a copy of all the words in the
books. And that’s pretty much all Google is doing with the Google Library scanning
project — making an index of books, a card catalog, exactly as HarperCollins
wants to replicate.
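
To make that point concrete, here is a minimal sketch of a positional index in Python. It is a toy for illustration only, not how Google or any real search engine builds its index; the function names and the two sample "pages" are made up. The thing to notice is that an index able to point a searcher to the right page also holds every word of that page, in order.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to a list of (page_number, position) pairs."""
    index = defaultdict(list)
    for page_number, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((page_number, position))
    return index


def search(index, word):
    """Return the sorted page numbers on which the word appears."""
    return sorted({page for page, _ in index.get(word.lower(), [])})


def reconstruct(index, page_number):
    """Rebuild a page's words, in order, from the index alone."""
    hits = [
        (position, word)
        for word, postings in index.items()
        for page, position in postings
        if page == page_number
    ]
    return " ".join(word for _, word in sorted(hits))


# Hypothetical sample pages, invented for this example.
pages = {
    1: "the whale surfaced near the ship",
    2: "the ship turned toward the harbor",
}
index = build_index(pages)

print(search(index, "ship"))
print(reconstruct(index, 1))
```

The search call finds both pages, and the reconstruct call returns the first page's words without ever touching the original file. No page image is stored, but the text is all there.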

Interestingly, HarperCollins, though not a party to that suit over Google
Library, says it supports it “economically and philosophically.” Well,
philosophically, it
doesn’t seem to understand it’s doing pretty much what Google’s doing already.

Here’s the especially tricky bit. Google and gang, if they are “allowed to
come, crawl our web site,” as HarperCollins puts it, are then going to have
access to the same content the general public gets. In other words, whatever you
put out for crawlers, anyone gets. So is HarperCollins going to put the full
text of books online? Because then forget the part about selling digital copies
(not that Google and gang are doing that now). The digital copies will be out
for anyone to access.

Alternatively, the various search engines do have programs where site owners
can submit content, such as Google’s program for publishers.
But you can’t just send them some nondescript “index.” They want PDFs, though
the program doesn’t require that actual pages be shown, despite the books
coming in as PDFs.

Aside from book search, there are programs such as
Yahoo Search Subscriptions
that can effectively let content owners cloak
material — the general public sees abstracts while the search engine indexes
the good stuff. But neither of these, to my knowledge, will work for book
