Hello Natural Language Search, My Old Over-Hyped Search Friend

This is a rant. It’s a rant from over 10 years of watching people trot out
natural language search as the "killer" solution to the current state of search,
something that’s happening once again with Powerset. That’s a search engine you
can’t even use at the moment, but the hype will no doubt continue. To counteract
that, here are my thoughts on natural language search, along with some history.

Natural language search makes a compelling pitch for those who really don’t
know search or haven’t heard the natural language mantra before. I’ve seen the
pitch time and time again. You:

  • Pick out an example that shows how "bad" search is on an existing search
    engine
  • Demonstrate how natural language search would work better on your service
  • Sit back and collect the press attention

What are the problems with this? First, most searches are two to three words
long. You can’t conceptually analyze them because there’s simply not enough
information to do so. They aren’t sentences, where an ambiguous meaning becomes
clear to a human who understands the context of the sentence.

Consider these popular searches out of the Google Zeitgeist or from Ask IQ or
from Yahoo Buzz:

  • expedia
  • pirates of the caribbean
  • beach
  • game cheats
  • heidi klum
  • halloween wigs

What conceptual analysis are you going to do on them? What context from
"beach" are you going to extract? Heck, take the longest query, "pirates of the
caribbean." We can guess that a user might have these topics in mind:

  • the first movie
  • the second movie
  • a computer game
  • the ride at Disneyland
  • actual pirates that roamed the Caribbean

There are many more meanings and intentions. But run natural language analysis
on the four words "pirates of the caribbean" all you want. You won’t know
exactly what concept someone has in mind from that analysis.

You can mine large sets of web data to guess at concepts. That’s COMPLETELY
DIFFERENT than natural language analysis. It’s also something that’s long been
in practice with search engines. Consider this search for pirates of the
caribbean at Clusty:


Query Refinement On Clusty

See how Clusty automatically analyzes the web results and produces "clusters"
or topics? Click on those, and you refine your search results.

Here’s the same thing, pirates of the caribbean, at Ask:


Query Refinement On Ask

That’s actual conceptual search live, in action and on the web now. Nor is it
new. Refinement features like this have been available on major search engines
in various ways for years.

To date, they’ve never really taken off. I don’t know why. Perhaps presentation
could be better. Sometimes, the clustering isn’t good enough. Alternatively,
they might just be too weird for users. But I know this. The completely
different type of "conceptual" searching, natural language analysis, has never
caught on and probably won’t for a very long time because of the limited data
searchers provide.
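
To make the difference concrete, here’s a rough sketch of the clustering style
of refinement. It’s a toy in Python, not how Clusty or Ask actually do it: the
`refine` function, the stopword list and the sample snippets are all made up
for illustration. It never tries to "understand" the query. It just counts
which other words keep turning up in the results and offers those as
refinements.

```python
import string
from collections import Counter

# Hypothetical stopword list, far shorter than anything a real engine would use.
STOPWORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to"}

# Made-up result snippets for the query, purely for illustration.
SNIPPETS = [
    "Pirates of the Caribbean movie trailer",
    "Pirates of the Caribbean movie review",
    "Pirates of the Caribbean Disneyland ride",
    "Disneyland ride hours: Pirates of the Caribbean",
    "Pirates of the Caribbean game cheats",
    "Xbox game Pirates of the Caribbean",
]


def refine(query, snippets, max_clusters=4):
    """Group snippets under the non-query words that recur across results."""
    query_terms = set(query.lower().split())

    # Words in each snippet, minus punctuation, query words and stopwords.
    doc_terms = []
    for snippet in snippets:
        words = {w.strip(string.punctuation).lower() for w in snippet.split()}
        doc_terms.append(words - query_terms - STOPWORDS - {""})

    # Count how many snippets each remaining word appears in.
    doc_freq = Counter(term for terms in doc_terms for term in terms)

    # Keep words seen in at least two snippets, most frequent first,
    # breaking ties alphabetically so the output is stable.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    labels = [term for term, count in ranked if count >= 2][:max_clusters]

    return {
        label: [s for s, terms in zip(snippets, doc_terms) if label in terms]
        for label in labels
    }


for label, members in refine("pirates of the caribbean", SNIPPETS).items():
    print(label, len(members))
```

All of the signal here comes from the result text. The query itself contributes
nothing beyond the words to ignore, which is the point: a four-word query gives
you nothing to parse, but the pages that match it give you plenty to count.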

Now let’s do the history bit. This isn’t as comprehensive as I would like,
simply because many of the pitches I’ve gotten over the years haven’t resulted
in articles. That’s because I already knew the services weren’t going anywhere,
so doing an article wasn’t a good investment of time. Still, there’s plenty left
over.

1995: Excite is born with "intelligent concept extraction," the pitch being that
Excite would know certain words were related to different concepts (Apple as a
fruit versus being a computer company) and that this could be used to improve
results. Excite still operates, but it’s fair to say that ICE (the tech’s
acronym) didn’t help keep it from being eclipsed by the likes of Google nor help
it overtake Yahoo, which predated Excite.

1998: Ask Jeeves is born. It continues to be mistakenly described as a former
"natural language" search engine. Yes, Ask encouraged people to use long
sentences to "ask" what they were looking for. But you could do the same with
all the other major search engines, as well. Ask had more relevant results NOT
from natural language processing but because it employed over 100 editors who
hand-picked good answers to popular queries.

1998: The Electric Monk gets buzz about taking a natural language query and
sending it off to AltaVista. Today, the domain is for sale :)

2000: FAST acquires a natural language interpretation company called Albert.
That technology never significantly comes into play in FAST’s web search
business, which is later sold off to Overture and then passes to Yahoo when it
buys Overture.

2001: iPhrase makes news, though it never migrates into web search. Instead, the
company built a solid line of site and enterprise search partnerships and was
acquired by IBM last year.

2003: BrainBoost makes a pitch about natural language search helping improve
results. I was unimpressed, especially when I went away from the canned answers
that "proved" its superiority. Answers.com bought the company last year, and it
remains running. Three years old, it’s failed to make a dent in any of the
established players.

2004: Cringely gets all excited about MeaningMaster (scroll down here to see me
sounding my usual skepticism). Let’s look at his article for an example of the
familiar natural language pitch.

MeaningMaster is generating a lot of buzz in Silicon Valley — yet another
overnight sensation that was 20 years in the making.

MeaningMaster is the brainchild of Kathleen Dahlgren, a computational
linguistics PhD who has spent most of her career building a lexicon of the
English language. This lexicon is a computer dictionary that is purported to
understand the meanings of more than 200,000 English words IN CONTEXT….

"We model the way people interpret the meanings of a word — through
context," says Ms. Dahlgren, who is today CEO of MeaningMaster. "We search on
meaning by using grammar and structure and semantics. Every word has
associated with it a set of beliefs."

I asked Graham Spencer to take a look at MeaningMaster. Graham was the
chief techie at Excite where he pioneered yet another search technique
involving linguistic vector analysis that still offers some advantages, too.

"It looks interesting," said Graham, "but I found it to have some obvious
gaps. The problem with any technology that tries to be explicitly ‘smart’ is
that it has to be really close to perfect or else a human will notice."

Only time will tell if MeaningMaster annoys users or delights them, but if
its real strength is for targeted advertising then the annoyance factor could
be practically eliminated as long as advertisers were seeing improvements in
converting clicks into sales. That’s the REAL test.

I haven’t heard much about MeaningMaster since. The company is still going. I
notice that contextual ads are one of the solutions it pitches the technology
toward. That’s another familiar theme: natural language technology originally
designed to help searching ends up being an ad solution.

2004: Microsoft lets us know that natural language is going to be one of the
ways it will kick some Google butt. From what I wrote about that pitch:

What about some recent statements by Gates about linguistic analysis as a
way forward? They make a nice sound bite, which is why would-be search
companies have said the same things in the past. But such efforts have gone
nowhere. In my view, this has been primarily because linguistic analysis of
pages or natural language processing isn’t that important when dealing with
the popular, short queries people conduct like "britney spears."

2004: Stochasto gets some attention for having "real natural language" search.
Nevertheless, it’s another company I haven’t heard much about since.

2004: Kozoru gets lots of attention on a natural language-like approach. When
the attention runs out, it turns to accusations that Google’s still checking
them out after refusing calls, and then to complaints about how the blogosphere
is ignoring them after being banned from using the Google API. In June, the
company finally (after a year and a half) pushes out a live product for
consumers, to get answers via instant messaging.

2004: Accoona gets Bill Clinton out for its launch party, is part-owned by
former Compaq CEO Eckhard Pfeiffer and promises to revolutionize how we find
information on the web. Nevertheless, its artificial intelligence/natural
language searching technology hasn’t helped the search engine attract searchers.
Making no real gains in the US, it recently began focusing on Europe.

I really wish I had the time to dive even deeper into the actual press releases
I’ve been sent over the years, the pitches, the promises and so on. Suffice to
say this. Any time someone pitches that they have some "natural language"
revolutionary technology, my eyes start to glaze over and my impression is that
they know little about how web search actually operates currently. And that
brings us to Powerset.

Powerset has all the things you want for a startup success story (unless
you’re pitching to me). It’s operating in secret. They’ve raised $10 million in
funding. They’ve got some prominent names behind them like Esther Dyson as an
investor.

More on Powerset, the secretive search engine from VentureBeat covers the back
story.

I said these things don’t work for me. That’s because I’ve already seen
prominent investors get involved with things that failed, since they really
weren’t that useful. I especially despise things that are in super-double-secret
stealth mode. You want respect when you haven’t already proven yourself in the
search world? You roll out a product that people can play with fairly quickly.
Otherwise, you’re simply Kozoru 2.0, squandering your initial hype. I’m also
already put off by the fact that "natural language search" will be the advanced
technique that’s going to have you beat Google, despite the fact you’ve got
nothing to analyze and the field is littered with other companies who already
tried this.

From Powerset itself, we’ve got Barney Pell explaining more in the Powerset and
Natural Language Search post on his blog.

Pell does address things I’ve already covered: that people don’t enter many
words and that natural language search hasn’t worked before. But he argues that
still going ahead doesn’t defy conventional wisdom, since Google did the same
thing in challenging the majors.

Not correct. Google succeeded because the changes it made REQUIRED NO CHANGES
ON THE PART OF THE USER. The results simply got better, even though the method
of searching (and number of words) remained the same. It also succeeded because
none of the other major players were paying attention to search. Believe me,
they aren’t ignoring the idea of natural language search. They know about it,
and if it looks like it will be successful, you bet it will be implemented long
before an upstart company can try to trounce them.

Back to Pell’s post, he fails to explain how searchers are somehow magically
going to go beyond "keywordese" to natural language searching. They aren’t using
keywordese now because someone trained them to do it. No one from Google sat
the searchers down and said "only two words, and don’t use conjunctions." People
search however they want, and right now, they use only a few words.

If Powerset’s going to change those habits, good luck. Be sure to read my Why
Search Sucks & You Won’t Fix It The Way You Think post from last month, which
graphically illustrates the challenge in changing habits with the user interface
alone. Getting inside searchers’ minds and whispering "type longer" isn’t going
to be fun.

Meanwhile, I have great sympathy for the view that the major search engines
could definitely do a much better job with query refinement.
My Robert Scoble
Wants What We Had — Better Query Refinement. So Do I!
post from last year drills deep into some of the things I’ve illustrated above,
on how query refinement could improve, and improve in a way that wouldn’t cause
more work for the searchers.

I’ll leave off with this screenshot:


Direct Answers On Ask

That’s the top of Ask’s search results for the pirates of the caribbean search
I mentioned earlier. Notice what they show before the regular search results. A
big fat guess at what most people probably want, the info on the latest movie.
Did they guess wrong? Notice the "Did you mean" link directing searchers to try
a search on Xbox-related info as a backup.

That’s query refinement in action, inline, directly in front of the action area
that most searchers interact with. It could be much, much better. But it’s a
start and you’re going to see more of this, even at places like Google, which
has used midpage query refinement for nearly two years now.

