Few people who have a deep understanding of search have the ability to write eloquently about it. Search engine pioneer Tim Bray is one of those people, and he has written an absolutely fabulous series of essays that should be essential reading for anyone wanting a thorough understanding of the technology.
Bray has worn many hats over his career. He is best known as a co-author of the XML specification. But in the very early days of the web, Tim was deeply involved in creating and running one of the first search engines, the long-gone Open Text Index. Open Text was also Yahoo’s original search partner.
These days, Tim is the CTO of Antarcti.ca, best known for its visual search interfaces.
His series of essays, On Search, the Series, is almost a virtual textbook on search engine technology. But the essays are also highly readable, and replete with Tim’s personal insights and opinions.
In his overview to the series, he says: “I’ve written these pieces because I care about search and because the lessons of experience are worth writing down; but also because I’d like to change this part of the world. In short, I’d like to arrange for basically every serious computer in the world to come with fast, efficient, easy-to-manage search software that Just Works.”
There are fifteen installments, as well as an informal overview/table of contents.
The series begins with a Backgrounder covering the business of search, as well as an excellent history. Choice quote: “…the fact of the matter is that there really hasn’t been much progress in the basic science of how to search since the seventies.”
Rather than diving head-first into search engine mechanics and engineering, the next essay considers what people search for. Analyzing user logs of searches on the Open Text Index between late 1994 and early 1996, Bray gained deep insight into the information needs of users, coming away with “two lessons that loom larger than all the others put together.”
And those lessons? I won’t spoil the suspense, but you’ll likely nod your head in agreement once you’ve read this essay.
Next up, Bray discusses search engine basics — the popular features of search engines and their costs and benefits. Things start to get a bit technical here, but it’s well worth the effort to understand the basic data structures and algorithms search engines use to provide “results” for you.
How do you measure the effectiveness of a search engine, tell whether one system is improving, or decide whether there are meaningful differences between two systems? One way is to measure “precision” and “recall,” the most common measures of search performance. The next essay explains both measures and, while granting that they are useful, also demonstrates their limitations as metrics.
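For readers who want the two measures pinned down, here is a minimal sketch using the standard textbook definitions: precision is the share of returned documents that are relevant, and recall is the share of relevant documents that were returned. The function name and the document IDs are made up for illustration and are not taken from Bray's essay.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the engine returned (hypothetical data)
    relevant:  document IDs a human judged relevant (hypothetical data)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four results returned, two of them relevant, out of three relevant documents overall.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Note that computing recall requires knowing every relevant document in the collection, which is rarely practical for something the size of the web.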
“Here’s the problem: searching for words isn’t really what you want to do. You’d like to search for ideas, for concepts, for solutions, for answers.” In the fifth essay, Bray considers keyword analysis, how search engines look at position, frequency and emphasis of words to try to distill meaning. This essay is somewhat bleak — Bray isn’t optimistic about the future of making search engines more “intelligent.”
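To make the idea of weighing position, frequency and emphasis concrete, here is a toy scoring sketch. It is not Bray's method or any real engine's formula: the function name, the weights and the sample documents are all assumptions chosen only to show how the three signals might be combined.

```python
def keyword_score(term, title, body, emphasized=()):
    """Toy relevance score for one term in one document (illustrative weights)."""
    term = term.lower()
    body_words = body.lower().split()
    score = 0.0

    # Frequency: fraction of the body's words that match the term.
    if body_words:
        score += body_words.count(term) / len(body_words)

    # Position: an earlier first occurrence earns a larger bonus.
    if term in body_words:
        first = body_words.index(term)
        score += 1.0 - first / len(body_words)

    # Emphasis: a match in the title or in emphasized (e.g. bold) words counts extra.
    if term in title.lower().split():
        score += 1.0
    if term in {w.lower() for w in emphasized}:
        score += 0.5

    return score


doc_a = keyword_score("search", "On Search", "search engines index the web")
doc_b = keyword_score("search", "Cooking tips", "many people also search for recipes")
print(doc_a > doc_b)  # the document that leads with the term and titles it scores higher
```

Nothing in a score like this knows what the word means, which is the limitation the essay dwells on.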
The sixth installment looks at “squirmy words.” Language is inherently complex and often ambiguous, and this is a major challenge for search engines. Interestingly, Bray concludes that this lexical chaos has surprisingly moderate consequences for search systems.
Next, a detour to describe an unusual search user interface Bray built just as the web was emerging, with philosophical observations on why it didn’t succeed at the time.
Returning to search mechanics, the next installment considers stop words: common words that “appear unreasonably often and carry unreasonably little information,” causing many search engines to ignore them.
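A minimal sketch of what ignoring stop words looks like in practice follows. The stop list here is a small hypothetical one made up for the example; real engines use their own lists, and the function name is likewise an assumption.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = frozenset(
    {"the", "a", "an", "of", "and", "or", "to", "in", "is", "be", "not"}
)


def index_terms(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]


terms = index_terms("The history of the search engine")
print(terms)

# The cost of the shortcut: a query made up entirely of stop words disappears.
lost = index_terms("to be or not to be")
print(lost)
```

The second call shows the trade-off: dropping stop words shrinks the index, but a phrase built only from common words leaves nothing to search for.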
In the essay on metadata, Bray may surprise some readers with his much broader (but quite accurate) definition of metadata and how successful players like Yahoo and Google use it to their advantage. “Neither has actually had better text search technology than the competition,” Bray writes, which may sound like heresy to some. So it’s well worth reading what metadata is, where it comes from, and how to use it.
Internationalization is the focus of the next essay. What happens when people write in languages other than English, using characters other than our familiar 26 letters of the alphabet? It’s a major issue for search engines, and one that will increase in importance over time.
Result ranking is the next topic to be considered by Bray’s critical eye. When you’ve got a lot of stuff in a big database (e.g. a search engine’s index of the web), how do you decide what goes on top of result lists? Bray concludes that the current state of result ranking (beyond the top few results for most searches) is not very good. But he also discusses some promising techniques that he believes are under-explored.
In the next essay, Bray literally thinks outside of the (search) box, describing the current state of search interfaces and once again proposing an alternative approach that he feels might provide a better user experience.
Given that Bray co-authored the XML specification, it’s no surprise that he includes an essay on XML searching. XML is gradually creeping into just about everything we do with computers, and Bray believes that it’s important to think about searching XML.
Next, a “tour through Robot Village,” looking at the crawlers, spiders and other critters that traverse the web, discovering information and bringing it back to the search engines for indexing.
To conclude the On Search series, Bray proposes a model and conceptual framework for where search should go in the future.
Most technical writing about search technology is jargon-laden, filled with arcane equations and tightly knit logic. On Search, the Series, offers a refreshingly different approach to explaining the search tools that we rely on every day. And as a bonus, it’s filled with comments and anecdotes from the personal experiences of someone who can genuinely lay claim to being a web search pioneer.