Hello Natural Language Search, My Old Over-Hyped Search Friend

This is a rant. It's a rant from over 10 years of watching people trot out natural language search as the "killer" solution to the current state of search, something that's happening once again with Powerset. That's a search engine you can't even use at the moment, but the hype will no doubt continue. To counteract that, here are my thoughts on natural language search, along with some history.

Natural language search makes a compelling pitch for those who really don't know search or haven't heard the natural language mantra before. I've seen the pitch time and time again. You:

  • Pick out an example that shows how "bad" search is on an existing search engine
  • Demonstrate how natural language search would work better on your service
  • Sit back and collect the press attention

What are the problems with this? First, most searches are two to three words long. You can't conceptually analyze them because there's simply not enough information to do so. They aren't sentences, where an ambiguous meaning becomes clear to a human who understands the context of the sentence.

Consider these popular searches out of the Google Zeitgeist or from Ask IQ or from Yahoo Buzz:

  • expedia
  • pirates of the caribbean
  • beach
  • game cheats
  • heidi klum
  • halloween wigs

What conceptual analysis are you going to do on them? What context from "beach" are you going to extract? Heck, take the longest query, "pirates of the caribbean." We can guess that a user might have these topics in mind:

  • the first movie
  • the second movie
  • a computer game
  • the ride at Disneyland
  • actual pirates that roamed the Caribbean

There are many more meanings and intentions. But run natural language analysis on the four words "pirates of the caribbean" all you want. You won't know exactly what concept someone has in mind from that analysis.

You can mine large sets of web data to guess at concepts. That's COMPLETELY DIFFERENT from natural language analysis. It's also something that's long been in practice with search engines. Consider this search for pirates of the caribbean at Clusty:

Query Refinement On Clusty

See how Clusty automatically analyzes the web results and produces "clusters" or topics? Click on those, and you refine your search results.
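To make the distinction concrete, here is a minimal sketch of refinement by clustering. It is not Clusty's actual method, which isn't described here. It assumes a simple approach: look at the words the result snippets share beyond the query itself, and offer the shared ones as topics. The snippets, the stopword list and the function names are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical result snippets, invented for illustration only.
SNIPPETS = [
    "pirates of the caribbean movie trailer and cast",
    "pirates of the caribbean movie review",
    "pirates of the caribbean game cheats",
    "pirates of the caribbean game walkthrough",
    "pirates of the caribbean ride at disneyland",
]

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"of", "the", "and", "at", "a"}


def cluster_results(query, snippets, min_size=2):
    """Group snippets under terms they share beyond the query itself."""
    query_terms = set(query.lower().split())
    # Count how many snippets mention each non-query, non-stopword term.
    doc_freq = Counter()
    for text in snippets:
        terms = set(text.lower().split()) - query_terms - STOPWORDS
        doc_freq.update(terms)
    # A term shared by at least min_size snippets becomes a cluster label.
    clusters = defaultdict(list)
    for text in snippets:
        for term in set(text.lower().split()):
            if doc_freq[term] >= min_size:
                clusters[term].append(text)
    return dict(clusters)


def refine(clusters, label):
    """Simulate clicking a cluster: keep only that topic's results."""
    return clusters.get(label, [])


clusters = cluster_results("pirates of the caribbean", SNIPPETS)
movie_results = refine(clusters, "movie")
```

Note that nothing here parses the query as language. The topics come entirely from the result data, which is the point: the searcher still types a few words, and the engine does the extra work.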

Here's the same thing, pirates of the caribbean, at Ask:

Query Refinement On Ask

That's actual conceptual search live, in action and on the web now. Nor is it new. Refinement features like this have been available on major search engines in various ways for years.

To date, they've never really taken off. I don't know why. Perhaps presentation could be better. Sometimes, the clustering isn't good enough. Alternatively, they might just be too weird for users. But I know this. The completely different type of "conceptual" searching, natural language analysis, has never caught on and probably won't for a very long time because of the limited data searchers provide.

Now let's do the history bit. This isn't as comprehensive as I would like, simply because many of the pitches I've gotten over the years haven't resulted in articles. That's because I already knew the services weren't going anywhere, so doing an article wasn't a good investment of time. Still, there's plenty left over.

1995: Excite is born with "intelligent concept extraction," the pitch being that Excite would know certain words were related to different concepts (Apple as a fruit versus being a computer company) and that this could be used to improve results. Excite still operates, but it's fair to say that ICE (the tech's acronym) neither kept it from being eclipsed by the likes of Google nor helped it overtake Yahoo, which predated Excite.

1998: Ask Jeeves is born. It continues to be mistakenly described as a former "natural language" search engine. Yes, Ask encouraged people to use long sentences to "ask" what they were looking for. But you could do the same with all the other major search engines, as well. Ask had more relevant results NOT from natural language processing but because it employed over 100 editors to ensure they hand-picked good answers to popular queries.

1998: The Electric Monk gets buzz about taking a natural language query and sending it off to AltaVista. Today, the domain is for sale :)

2000: FAST acquires a natural language interpretation company called Albert. That technology never significantly comes into play in FAST's web search business, which is later sold off to Overture and then passes to Yahoo when it buys Overture.

2001: iPhrase makes news, though it never migrates into web search. Instead, the company built a solid line of site and enterprise search partnerships and was acquired by IBM last year.

2003: BrainBoost makes a pitch about natural language search helping improve results. I was unimpressed, especially when I went away from the canned answers that "proved" its superiority. Answers.com bought the company last year, and it remains running. Three years old, it's failed to make a dent in any of the established players.

2004: Cringely gets all excited about MeaningMaster (scroll down here to see me sounding my usual skepticism). Let's look at his article for an example of the familiar natural language pitch.

MeaningMaster is generating a lot of buzz in Silicon Valley -- yet another overnight sensation that was 20 years in the making.

MeaningMaster is the brainchild of Kathleen Dahlgren, a computational linguistics PhD who has spent most of her career building a lexicon of the English language. This lexicon is a computer dictionary that is purported to understand the meanings of more than 200,000 English words IN CONTEXT....

"We model the way people interpret the meanings of a word -- through context," says Ms. Dahlgren, who is today CEO of MeaningMaster. "We search on meaning by using grammar and structure and semantics. Every word has associated with it a set of beliefs."

I asked Graham Spencer to take a look at MeaningMaster. Graham was the chief techie at Excite where he pioneered yet another search technique involving linguistic vector analysis that still offers some advantages, too.

"It looks interesting," said Graham, "but I found it to have some obvious gaps. The problem with any technology that tries to be explicitly 'smart' is that it has to be really close to perfect or else a human will notice.

Only time will tell if MeaningMaster annoys users or delights them, but if its real strength is for targeted advertising then the annoyance factor could be practically eliminated as long as advertisers were seeing improvements in converting clicks into sales. That's the REAL test.

I haven't heard much about MeaningMaster since. The company is still going. I notice that contextual ads are one of the solutions it pitches the technology toward. That's another familiar theme -- natural language technology originally designed to help searching ends up being an ad solution.

2004: Microsoft lets us know that natural language is going to be one of the ways it will kick some Google butt. From what I wrote about that pitch:

What about some recent statements by Gates about linguistic analysis as a way forward? They make a nice sound bite, which is why would-be search companies have said the same things in the past. But such efforts have gone nowhere. In my view, this has been primarily because linguistic analysis of pages or natural language processing isn't that important when dealing with the popular, short queries people conduct like "britney spears."

2004: Stochasto gets some attention for having "real natural language" search. Nevertheless, it's another company I haven't heard much about since.

2004: Kozoru gets lots of attention for a natural-language-like approach. When the attention runs out, the story turns into accusations that Google is still checking them out after refusing calls, and then into complaints about how the blogosphere is ignoring them after they were banned from using the Google API. In June, the company finally (after a year and a half) pushes out a live product for consumers, to get answers via instant messaging.

2004: Accoona gets Bill Clinton out for its launch party, is part-owned by former Compaq CEO Eckhard Pfeiffer and promises to revolutionize how we find information on the web. Nevertheless, its artificial intelligence/natural language searching technology hasn't helped the search engine attract searchers. Making no real gains in the US, it recently began focusing on Europe.

I really wish I had the time to dive even deeper into the actual press releases I've been sent over the years, the pitches, the promises and so on. Suffice to say this. Any time someone pitches that they have some "natural language" revolutionary technology, my eyes start to glaze over and my impression is that they know little about how web search actually operates currently. And that brings us to Powerset.

Powerset has all the things you want for a startup success story (unless you're pitching to me). It's in secret. They've raised $10 million in funding. They've got some prominent names behind them like Esther Dyson as an investor. More on Powerset, the secretive search engine from VentureBeat covers the back story.

I said these things don't work for me. That's because I've already seen prominent investors get involved with things that failed, since they really weren't that useful. I especially despise things that are in super-double-secret stealth mode. You want respect when you haven't already proven yourself in the search world? You roll out a product that people can play with fairly quickly. Otherwise, you're simply Kozoru 2.0 -- squandering your initial hype. I'm also already put off by the fact that "natural language search" will be the advanced technique that's going to have you beat Google, despite the fact you've got nothing to analyze and the field is littered with other companies who already tried this.

From Powerset itself, we've got Barney Pell explaining more in the Powerset and Natural Language Search post on his blog.

Pell does address things I've already covered -- that people don't enter many words and that natural language search hasn't worked before. But he argues that still going ahead doesn't defy conventional wisdom, since Google did the same thing in challenging the majors.

Not correct. Google succeeded because the changes it made REQUIRED NO CHANGES ON THE PART OF THE USER. The results simply got better, even though the method of searching (and number of words) remained the same. It also succeeded because all the other major players weren't paying attention to search. Believe me, they aren't ignoring the idea of natural language search. They know about it, and if it will be successful, you bet it will be implemented long before an upstart company can try to trounce them.

Back to Pell's post, he fails to explain how searchers are somehow magically going to go beyond "keywordese" to natural language searching. It's not that they use keywordese now because they've somehow been trained to do it. No one from Google sat the searchers down and said "only two words, and don't use conjunctions." People search however they want -- and right now, they use only a few words.

If Powerset's going to change those habits, good luck. Be sure to read my Why Search Sucks & You Won't Fix It The Way You Think post from last month, which graphically illustrates the challenge in changing habits with the user interface alone. Getting inside searchers' minds and whispering "type longer" isn't going to be fun.

Meanwhile, I have great sympathy for the fact that the major search engines could definitely do a much better job with query refinement. My Robert Scoble Wants What We Had -- Better Query Refinement. So Do I! post from last year drills deep into some of the things I've illustrated above, on how query refinement could improve -- and improve in a way that wouldn't cause more work for the searchers.

I'll leave off with this screenshot:

Direct Answers On Ask

That's the top of Ask's search results for the pirates of the caribbean search I mentioned earlier. Notice what they show before the regular search results. A big fat guess at what most people probably want, the info on the latest movie. Did they guess wrong? Notice the "Did you mean" link directing searchers to try a search on Xbox related info as a backup.

That's query refinement in action -- inline -- directly in front of the action area that most searchers interact with. It could be much, much better. But it's a start and you're going to see more of this, even at places like Google, which has used midpage query refinement for nearly two years now.
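The best-guess-plus-backup behavior in that Ask screenshot can be sketched in a few lines. This is only an illustration under assumptions, not how Ask does it: it supposes the engine has some tally of what searchers wanted for a query, shows the most common intent as the direct answer and offers the runner-up as the "Did you mean" link. The counts and the names INTENT_COUNTS and direct_answer are invented.

```python
# Hypothetical tallies of what searchers wanted, invented for illustration.
INTENT_COUNTS = {
    "pirates of the caribbean": {
        "latest movie": 70,
        "xbox game": 20,
        "disneyland ride": 10,
    },
}


def direct_answer(query, intent_counts=INTENT_COUNTS):
    """Return (best_guess, backup) for a query, or (None, None) if unknown."""
    counts = intent_counts.get(query.lower().strip())
    if not counts:
        return None, None
    # Rank intents from most to least common.
    ranked = sorted(counts, key=counts.get, reverse=True)
    best = ranked[0]
    # The runner-up plays the role of the "Did you mean" link.
    backup = ranked[1] if len(ranked) > 1 else None
    return best, backup


best, backup = direct_answer("Pirates of the Caribbean")
```

Again, the searcher's habits don't change. The query stays short, and the guess comes from data about what people usually want rather than from analyzing the words as language.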