Behind the Scenes at Yahoo Labs, Part 2

Dr. Gary Flake is Principal Scientist & Head of Yahoo Research Labs. In part one of this wide-ranging interview, he talks about the daily work of researchers at Yahoo Labs, and what they’re doing to make search better. In this second of three parts, he talks about the challenges of indexing various types of information, and Yahoo’s efforts at realizing a current hot trend — personalized search.

What’s your feeling about trying to place structured data like a library catalog/bibliographic record or an indexed article into an unstructured database? Asked another way, what’s the role of structured data in an unstructured web world? How can we bring both types of resources together and still allow users to take advantage of all of the additional access points that a structured database and its retrieval mechanism make available?

The beautiful thing about a relational database is that its structure tells you a lot about what is important. Database designers have been brilliant at optimizing databases (both the organization of the information and the algorithms) to best exploit this regularity. When you flatten out a database, those paths towards optimization often aren’t available.

A middle ground, which is not perfect but adds a lot of utility, is to convert structured data into a semi-structured form. Today, we treat documents as a big bag of words and index those words. In this semi-structured approach, we take structured information (say, the value of specific fields) and synthesize fake words that represent the fact that “document X has field Y with value Z.” Now, clearly I can’t run a SQL query on this representation; but at least I can search for documents with specific field:value pairs.
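As a rough illustration of the idea described above, the following Python sketch indexes a document's ordinary words together with synthetic "field:value" words in one inverted index. It is not Yahoo's implementation; the function names, the token format, and the sample records are all assumptions made up for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace: a simple bag-of-words view."""
    return text.lower().split()


def synthetic_tokens(fields):
    """Turn each structured field into one fake word, 'field:value'.

    The 'field:value' format is an assumption for this sketch.
    """
    return [
        f"{name.lower()}:{str(value).lower().replace(' ', '_')}"
        for name, value in fields.items()
    ]


def build_index(docs):
    """Map every token (real or synthetic) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in tokenize(doc["text"]) + synthetic_tokens(doc["fields"]):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query token (AND semantics)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical records, invented purely for illustration.
DOCS = {
    1: {"text": "Protein folding in yeast", "fields": {"author": "Smith", "year": 2003}},
    2: {"text": "Protein structure prediction", "fields": {"author": "Jones", "year": 2003}},
    3: {"text": "A history of yeast", "fields": {"author": "Smith", "year": 1999}},
}
INDEX = build_index(DOCS)
```

With this index, a query such as `search(INDEX, "protein author:smith")` mixes a free-text word with a field:value pair. As the answer notes, this is far weaker than SQL: there are no joins, ranges, or aggregates, only exact matches on the synthesized words.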

I’d like to tell you that we will be able to make an unstructured database as powerful as a structured database; but that simply is not the case. Nonetheless, the fusion of structured and unstructured data and approaches will add a lot of utility to the lives of most users.

In parallel to the above, we have started on a different approach through the launch of our Content Acquisition Program, working with such partners as NPR and the Library of Congress, as well as with universities such as Northwestern, UCLA and University of Michigan — all so we can bring their structured data to a larger audience.

Is there much implied data on a web page that goes unused today?

I would claim that there is more implied data (or inferable meta-data) than “raw” data on the web, and that we are barely scratching the surface of it. Today, all search engines are scraping for some simple forms of implied data: language, locality, etc. What’s missing from this list is a nearly infinite collection of relationships that are obvious to almost any human reader but extremely difficult to infer from a single document. Implied data is so hard to identify because, in the aggregate, it forms our collective cultural wisdom. Let me ground this with an example:

I could be reading a very technical document about protein folding, which is an exciting area within molecular biology. This document may make references to some chain of amino acids with a chemical formula. The formula is actually written in another language that chemists use for describing molecules, yet a specification for this language is never given in the document that I am reading; instead, it is assumed that the reader would know this language. That’s an obvious example, but there are even more subtle relationships to be found.

Within the same document, we could see many more forms of implied data and information. For example, a document about protein folding may never actually use the word “biology,” yet the document clearly falls underneath that topical umbrella. Humans reading the document could rattle off several obvious truths: “it’s written for a specialist,” “it makes reference to physics in a non-trivial way,” etc. An expert would be able to tell you even more implied facts: “the article may be out-dated by now,” “the author is considered an authority in this domain,” or “there’s an expectation that diseases will be curable if these advances continue,” etc.

In total, all of the implied data amounts to the stuff that all of us carry in our heads but no one bothers to write down; yet these factoids are essential to understanding and meaning. Some people in AI have been trying to codify these factoids for decades (and in many forms, from ontologies to databases of common sense). We are now starting to scrape the web for these subtle relationships. The key insight is that it is not enough to look at words, concepts, or documents; one must also look at how all of these things relate to one another.

Right now Yahoo has numerous tabs available (news search, product search, image search) on its interface. Do you think at some point an interface can have too many tabs?

Absolutely. The 7 plus or minus 2 rule applies here. Namely, most people can only fit about 5-9 items in short-term memory. If you add more tabs than this, some users scanning the tab list will forget one tab’s name before completing the list. The trick is to have the right tabs (or maybe even none at all).

During 2003, Yahoo launched the ability to personalize some product search results with SmartSort. Can you tell us how this was developed? Are more personalization tools on the way? Do you envision a time when searchers can literally create their own web databases by sending out their own crawlers to build them?

The bulk of Yahoo’s product search engine (including SmartSort) was built by an international team of IR research scientists and engineers, and it is an example of how R&D within Yahoo is both centralized and distributed. Many of the key people in product search live within a business unit, but are also affiliates of Yahoo Research Labs.

Personalization is one of the top priorities of Yahoo and new personalization features will be available over many of our properties.

Do I envision personal crawlers and indices? In the sense that I have personally done this for myself, yes. In the sense that most or even a significant minority of users do this, no. Here’s why: most people don’t bother to use search engine operators. Most HTML authors don’t bother to properly use meta-tags. Most people don’t want to use this level of granularity in a user interface, and that’s okay.

Nonetheless, we will do something almost as good or maybe even better. I believe that in the future, we will be able to tailor relevance/ranking functions in web search for individuals. The challenge is to do this in such a way that everyone can do it with minimal investment.

Do meta tags have a role in a large general web engine? I’m not talking about material you get from “trusted” members of a paid inclusion program but rather for the general web developer and searcher.

Meta tags have a role, but I am not so sure what that role is. Unfortunately, meta tags have been abused so often that it is really hard to separate the good from the bad from the ugly. Sometimes the abuse is intentional, as in the case of spammers. Other times, the problem is simply laziness on the part of the HTML authors.

