My mind is going a million miles a minute over the whole "Perfect Search" discussion that kicked off this week. Instead of getting it all out now, I plan to do it in small doses while, hopefully, sharing some cool resources along the way. Let me add a few comments to Danny's most recent post and offer a few additional views on other issues.
First, while the human-edited model might have scalability issues, that doesn't mean that these types of tools (for example, general web directories) from non-commercial organizations are now any less valuable to many searchers.
Sure, they're not the biggest in overall size (vs DMOZ), but the quality of the sources in these tools and their often meticulous maintenance are what matter to a web researcher. For example, take a look at the LII, Infomine, and the Resource Discovery Network.
Remember, a good library does not have everything in its collection. "Collection Development" is a major part of library education, and these directories are good examples of this concept brought to the web. They also show that having some human involvement from subject specialists, librarians, etc. can prove useful.
- The human-compiled card catalog looks only at book titles and short human-written descriptions of the books, maybe 25-75 words in all.
- The crawler-compiled card catalog will let you scan every word on every page of every book in the library.
Let's stop using the term "card catalogs." They haven't existed in years. In an overwhelming majority of cases, card catalogs are now electronic databases called Online Public Access Catalogs, or "OPACs" for short. Too long of a term? No problem. Consider them an "electronic library catalog" or the "library database". One thing is for sure: very rarely will you find paper cards. Yes, those good old paper cards had (and have) value, but today's OPACs also offer lots of features.
For example, some allow you to get new book announcements via RSS. By the way, many libraries make these databases searchable for free over the web. Services like RedLightGreen allow you the chance to search hundreds of library catalog databases simultaneously and then allow you to customize for your local library's holdings. Heck, RedLightGreen will even format your bibliography for you. More about this service here.
Also, OPAC records of today often contain much more than the 25-75 words that Danny writes about (though to be fair, he's talking about web directories versus web search, rather than library paper card catalogs versus electronic library catalogs). Frequently, you'll find tables of contents, book reviews, snippets, web links, and more. The Library of Congress has an entire department called the Bibliographic Enrichment Team doing work in this area. Yesterday, Syndetic Solutions released even more info that can be included in library catalogs.
Oh, how could I forget? Library book catalogs are not the only database tools available via the web (for free). Here's an article about some of what's out there. Lots of specialty databases (full text articles, too).
Second, Danny writes that the crawler-compiled OPAC will let you scan every page of every book in the library. Yes, in theory that's true and, well, could be a great thing. Here's the problem. More unstructured data (words) could mean more false drops, especially when you add in the fact that most people only enter a few words in a web engine and only look at the first few results.
This is true whether the material comes from scanned books or just plain old web pages. If I had searchable access to every word in every book and entered "Football," I'm going to get back millions and millions of hits and also have issues with precisely what the term means. American football? What most of the rest of the world calls football (aka soccer)?
Sure, the power searcher will have the skills to create a great search strategy from the outset and then refine as needed using the right tools. However, to this point, the typical open web searcher doesn't do anything like this and likely doesn't even know that they have some of the tools to do it. Who is going to show them?
What I'm trying to say is that a bigger database doesn't necessarily mean better and in fact often means less precise results, especially when you're dealing with primarily, but not entirely, uncontrolled content. Some electronic databases also attach subject headings, descriptors and the like to help the searcher focus. Folksonomies could help but, IMHO, the jury is still out on their use and application. One thing is for sure: scalability is an issue. I'm not denying that in the least.
Another part of a library school education is something called the reference interview. It involves a human working with a researcher, helping them determine specifically what they're looking for, and then providing the tools and search strategies to find the info. Good interviewing is a difficult skill to master. Perhaps what we need is automated Q&A technology to help the searcher determine what they're looking for and then help them find it. Regardless of how good it is, it still will not be a human.
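To make the "bigger isn't necessarily better" point concrete, here's a toy sketch in Python. Everything in it is made up for illustration: the records, the subject headings, the relevance judgments (I'm assuming a searcher who means soccer), and the function names. It isn't how any real engine or OPAC works. It just measures precision and recall for the one-word query "football" on a small collection, on a larger noisier one, and then again with a subject heading applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str       # full text that a crawler would index
    subject: str    # hypothetical human-assigned subject heading
    relevant: bool  # assumed judgment: useful to a searcher who means soccer?


def search(records, term, subject=None):
    """Return records whose text contains term, optionally limited to one subject heading."""
    term = term.lower()
    return [
        r for r in records
        if term in r.text.lower() and (subject is None or r.subject == subject)
    ]


def precision(hits):
    """Fraction of retrieved records that are relevant (0.0 when nothing is retrieved)."""
    return sum(r.relevant for r in hits) / len(hits) if hits else 0.0


def recall(hits, records):
    """Fraction of all relevant records in the collection that were retrieved."""
    total = sum(r.relevant for r in records)
    return sum(r.relevant for r in hits) / total if total else 0.0


SMALL = [
    Record("World Cup football tactics", "Soccer", True),
    Record("Youth football coaching drills", "Soccer", True),
    Record("NFL football playoff history", "American football", False),
]

# A larger, mostly uncontrolled collection: one more relevant item, but far more noise.
LARGE = SMALL + [
    Record("Football league transfer rules", "Soccer", True),
    Record("College football bowl games", "American football", False),
    Record("Fantasy football draft tips", "American football", False),
    Record("A novel whose hero once played football", "Fiction", False),
    Record("Table football repair guide", "Games", False),
    Record("Football stadium parking map", "Transportation", False),
]

if __name__ == "__main__":
    for name, hits, pool in [
        ("small, keyword only", search(SMALL, "football"), SMALL),
        ("large, keyword only", search(LARGE, "football"), LARGE),
        ("large, subject=Soccer", search(LARGE, "football", "Soccer"), LARGE),
    ]:
        print(f"{name}: {len(hits)} hits, "
              f"precision {precision(hits):.2f}, recall {recall(hits, pool):.2f}")
```

In this contrived data, the keyword-only search finds every relevant record in both collections, but the share of useful hits drops as the collection grows. Limiting by the subject heading brings it back up. Real collections are messier, of course, and somebody has to assign those headings, which is the scalability problem all over again.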
Of course, dynamic clustering (we can talk about that at another time) might also play a role, especially in the area of subject access and scalability. As Vivisimo says, its technology can quickly offer "selective ignorance" and help the searcher eliminate from a large results set what they don't want or need to see. In other words, it can increase precision with little work by the searcher while at the same time letting the page speak for itself.
Next, Danny writes, "It will find not just all the matching pages but often rank them so you are getting the very best ones." True in theory, but as databases grow larger and larger this will become more and more of a challenge (increased recall tends to lower precision), given that very few people take advantage of the tools already available that can produce better search results. Udi Manber from A9 said a few months ago that search engines (at least for now) are not in the mind-reading business and will have to invest in better thinking. He's right.
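For the curious, here's a very rough Python sketch of the "selective ignorance" idea. To be clear, this is not Vivisimo's algorithm, which I don't have any inside knowledge of. It's a naive stand-in that groups result titles by the word they share with the most other titles and then lets the searcher throw away whole groups. The result titles, the stopword list, and the function names are all invented for the example.

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def tokens(title, query):
    """Lowercased words in a title, minus stopwords and the query term itself."""
    query = query.lower()
    return [w for w in title.lower().split() if w not in STOPWORDS and w != query]


def cluster(titles, query):
    """Group titles under the word they share with the most other titles."""
    # How many titles each word appears in (counted once per title).
    doc_freq = Counter(w for t in titles for w in set(tokens(t, query)))
    groups = defaultdict(list)
    for t in titles:
        words = tokens(t, query)
        # Most widely shared word wins; ties go to the alphabetically first word.
        label = min(words, key=lambda w: (-doc_freq[w], w)) if words else "other"
        groups[label].append(t)
    return dict(groups)


def ignore(groups, unwanted):
    """'Selective ignorance': drop whole clusters the searcher does not want."""
    return [t for label, ts in groups.items() if label not in unwanted for t in ts]


RESULTS = [
    "NFL football scores",
    "NFL football draft",
    "World Cup football fixtures",
    "World Cup football history",
    "Football boots",
]

if __name__ == "__main__":
    groups = cluster(RESULTS, "football")
    for label, titles in groups.items():
        print(label, "->", titles)
    print("after ignoring 'nfl':", ignore(groups, {"nfl"}))
```

The point is only that the labels come from the results themselves rather than from a human-built taxonomy, and that one click on a label can remove a whole sense of "football" from view.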
I also want to comment about what my friend, Jim Lanzone, from Ask Jeeves said:
That is not how people search, and neither you or I or any number of Web Search Universities is going to change that for the vast majority of searchers.
Look, I've been a "faculty" member of Web Search University (Chris, too!) since the first WSU met in 2001 and fully realize that we're only reaching a small, very small, number of people who are primarily professional searchers. However, and I think Lanzone would agree with me, web search training or as Eszter Hargittai calls it "practice," especially for students and educators, can only be a good thing.
As I've said many times, a little goes a long way. Search engines (with the money to offer training) should think of it as both a way to attract new users (in an age where many think there is just one search tool) and also as a public service.
Hargittai wrote in 2003:
Results from a study I conducted on average users' ability to find information on the Web suggest that there is great variance in whether people can locate different types of content online and their efficiency in doing so. These findings imply that simply offering an Internet connection to those without access will not alleviate differences or the so-called "digital divide." Rather, providing training is a necessary component of making the medium a useful tool for everyone.
So, do we have a new digital divide forming? Those who can access info quickly and efficiently and those who can't. I wonder if Rheingold has commented on this?
Finally, one more issue (for another time) is not only the ability to find and access information efficiently and in a timely manner but also having the skills to analyze the content for accuracy, currency, bias, etc. These skills are just as important as being able to find what you want in the first place, especially in the web age.