Behind the Scenes at Yahoo Labs, Part 3

Dr. Gary Flake is Principal Scientist & Head of Yahoo Research Labs. In part one of this wide-ranging interview, he talked about the daily work of researchers at Yahoo Labs, and what they’re doing to make search better. In part two, he talked about the challenges of indexing various types of information, and Yahoo’s efforts at realizing a current hot trend: personalized search. In this final installment, he touches on a wide range of other search-related topics.

Do you see a time when Yahoo will be able to deliver an answer on a results page? For example, instead of just giving links to sites about the Prime Minister of France, you get an answer with key facts, directory info, and links to articles? You’ve started something close to this with Yahoo shortcuts, but they usually involve a second or third click. Will Yahoo eventually become an answer engine?

Yahoo shortcuts have been growing and improving at a healthy pace. I think if you try some today, you’ll see that we give many answers right on the results page (e.g., weather, flight and tracking information, word definitions, etc.). Shortcuts work best when there is an unambiguously correct answer to a query. Many queries have no clearly perfect result, or may be ambiguously formed. As we improve the technology, we’ll get better at answering more things with fewer clicks or query refinements.

Local search is getting a great deal of attention these days. Last year, Overture made a local search demo publicly available. How did it go?

Swimmingly. We were very pleased with how the test went and the feedback we received from both the press and users. Local is a huge opportunity for all of Yahoo and our goal is to put together a plan that incorporates all of the different activities related to local going on throughout the entire company.

Very often, searchers don’t know precisely what they’re looking for. The search query needs to be molded and formed. That’s what a good reference librarian can do. Should reference librarians be worried that they’ll soon be out of jobs?

One of my favorite sayings is: “data is not information; information is not knowledge; knowledge is not wisdom.” Today, search engines can give you more data than you’ll ever need, and a lot of valuable information as well, but they aren’t even in the running when it comes to knowledge and wisdom. I think reference librarians will have job security for a long while still. When things change because of search engine improvements, I believe that the emphasis of the role of a reference librarian will shift along the continuum, dealing less with data and more with wisdom. We’ve seen this already, and it’s good news for everyone.

In what you might call a “traditional” research database, keeping duplicates to a minimum is often a goal. However, dupes and spam continue to be a problem for many web engines. What are Yahoo and Yahoo Research doing to help correct the problem? What other areas are problems for web crawlers?

We are working on both text-based and link-based approaches for performing duplicate and spam detection. Spam is such an interesting problem because it has a co-evolutionary flavor to it; we make spam detection algorithms, the spammers adapt, and the cycle continues. However, I think search engine spam will eventually be a solved problem.
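To make the text-based side of duplicate detection concrete, here is a minimal sketch using word shingles and Jaccard similarity. This is a common textbook technique, not a description of what Yahoo actually runs; the function names, the shingle size `k`, and the `threshold` value are all assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, threshold=0.8, k=3):
    """Hypothetical rule: flag two documents whose shingle overlap meets the threshold."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

Comparing every pair of pages this way would not scale to a web-sized index; production systems typically approximate the same idea with hashing or sketching so that candidate duplicates can be found without pairwise comparison.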

I think some of the most challenging problems for crawlers are load balancing, scheduling, and balancing freshness with “niceness.” Here’s the dilemma: the users want content to be fresh, and most webmasters say they want to be indexed frequently. However, no matter how we schedule pages to be downloaded, it’s hard to please everyone and someone will invariably be unhappy. To make things even more interesting, we get crawling efficiencies by crawling many sites in parallel. Putting all of this together is a fun but challenging problem.
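The tension between freshness and “niceness” can be sketched as a small scheduler: each URL has a time at which it is due for a refresh, and each host has a minimum gap between requests. The class below is only an illustration under those assumptions; `PoliteScheduler`, `min_host_delay`, and the example URLs are hypothetical names, not anything described in the interview.

```python
import heapq
from urllib.parse import urlparse


class PoliteScheduler:
    """Pick the most overdue URL whose host has not been hit too recently."""

    def __init__(self, min_host_delay=10.0):
        self.min_host_delay = min_host_delay  # seconds between hits to one host
        self._heap = []           # entries are (due_time, sequence, url)
        self._next_allowed = {}   # host -> earliest time we may fetch again
        self._seq = 0             # tie-breaker so equal due times stay FIFO

    def add(self, url, due_time):
        """Register a URL that should be (re)fetched at or after due_time."""
        heapq.heappush(self._heap, (due_time, self._seq, url))
        self._seq += 1

    def next_url(self, now):
        """Return a due URL that is polite to fetch at time now, else None."""
        deferred = []
        chosen = None
        while self._heap and self._heap[0][0] <= now:
            due, seq, url = heapq.heappop(self._heap)
            host = urlparse(url).netloc
            if self._next_allowed.get(host, 0.0) <= now:
                self._next_allowed[host] = now + self.min_host_delay
                chosen = url
                break
            deferred.append((due, seq, url))  # due, but host is cooling down
        for item in deferred:
            heapq.heappush(self._heap, item)
        return chosen
```

With two pages queued on one host and one page on another, the scheduler serves the first host once, moves on to the second host, and returns to the first only after the delay has elapsed. Running many such hosts side by side is where the parallel crawling efficiency comes from, and it is also why some pages end up staler than anyone would like.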

What search capabilities or search features does your competition offer that you wish you had thought of?

I like Google’s tilde operator and I think that Teoma does some impressive calculations at query time.

What has it been like for you during the past few years when one company seems to be getting all of the attention? Has it been a bit disconcerting?

As a company we’ve been focusing on building great products, and as a research lab we’ve been focusing on doing great science. As I mentioned, web search (and other Internet problems) are very long-term ventures. These are very early days, and what we will build in the future will make today’s search engines look like toys. So, to answer your question, it’s all good. Building a company that creates a new industry that others copy is its own reward ;-).

With the acquisition of AltaVista, Yahoo got an excellent image database. Does Yahoo have any plans to make text searching (inside the image, via OCR) available?

I can’t really comment on the short-term product plans. I think image search can be improved in a number of ways, and OCR is just one of many possible directions.

Those of us in the research world realize that most people don’t use any advanced search functionality. So, do you plan on developing more options that the searcher will be able to control? Is it worth your time?

I believe that the GUI (or feature set) that resonates with the user will be one that adapts to the user either individually or in the aggregate. We constantly learn from our users, and this informs the interfaces that we build. They, in turn, learn about new features. Thus, search interfaces have been co-adapting on a slow time scale (remember when multiple words meant an implied disjunction?).

Nonetheless, I also think that on this issue you can have your cake and eat it too. Sometimes a seemingly advanced feature is actually pretty easy to add to the search engine in terms of just pure functionality. If we only expose this feature to the advanced user who has explicitly asked for the larger feature set, then everyone gets to experiment.

In short, yes it is worth our time because adding new features is our primary means of collaborating with the user to improve the interface, and the rarely-used specialty features usually require little incremental cost to maintain.

What’s going to be the “next big thing” in web search?

I believe that the next big thing in web search will be a form of personalization that is simple, unobtrusive, intuitive, and almost without exception better than the non-personalized version of web search. Two ways of getting this wrong are to (1) keep the GUI as is, implicitly build a user model, and show personalized results all the time, or (2) expose many new GUI elements to the user to give a great deal of explicit control for personalization.

The sweet spot — the thing that works — will most likely be a slight modification to the GUI, say a single new GUI element, that gives the user the power to tell the search engine what they like or dislike. If done correctly, we will all wonder how we ever searched without it, and it will be as if we get the best of both worlds: more control with minimal complication and a search experience that seems tailored to our own needs.

At the beginning of 2009 what will Yahoo search look like?

As I said earlier, where search goes is a function of how we and our users co-evolve with respect to one another. This means no one really knows where it will be.

My hunch is that personalization will be so good that most users will look back to web search circa 2004 as ridiculously outdated. I also think that Yahoo will have nailed user intent to the point that we will be able to tailor the result set to focus on documents that satisfy the need behind the query, instead of returning results that merely contain the same words as in the query.

We will also be indexing and blending from many more sources.

In the end, there will be vastly more data and information behind Yahoo search, yet users will be able to find what they want and need far more effectively than they can today.

