Yahoo Researcher Seeks to Combine Semantic Search Methods

Yahoo researcher Peter Mika has written an extensive article on semantic search. He begins with the limitations of syntax-based search:

  • It is almost impossible to return search results that relate to the secondary sense of a term, especially if a dominant sense exists. For example, try searching for George Bush the beer brewer as compared to the President.
  • The capabilities of computational advertising, which is largely also an IR problem (for example, retrieving matching ads from a fixed inventory), are clearly impacted because of the sparsity of advertisements.
  • When no clear key exists, search engines are unable to perform queries on descriptions of objects. For example, try searching for the author of this article with the keywords ‘semantic web researcher working for yahoo.’
  • Current search technology is unable to satisfy any complex queries requiring information integration such as analysis, prediction, scheduling, etc. An example of such integration-based tasks is opinion mining regarding products or services. While there have been some successes in opinion mining with pure sentiment analysis, it is often the case that users like to know what specific aspects of a product or service are being described in positive or negative terms and to have the search results appear aggregated and organized. Information integration is not possible without structured representations of content.
  • Multimedia queries are also difficult to answer, as multimedia objects are typically described with only a few keywords (tagging) or sentences. This is typically too little text for the statistical methods of IR to be effective.

Mika says there are two approaches to semantic search: Natural Language Processing (NLP) and the Semantic Web.

Natural Language Processing “builds on the automatic analysis of text.” Semantic search company hakia is an example of the NLP approach. Interestingly, hakia uses Yahoo search technology, including Yahoo’s recently announced BOSS (Build your Own Search Service). Powerset, which was recently acquired by Microsoft, is another example. These NLP semantic search providers “extract entities from text, disambiguate them against large-scale background knowledge sources (PowerSet uses Freebase, Hakia has its own ontology), and then record the relationships as found in the text.” Users can query by asking full questions, though many still use keywords.
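The three steps in that quote (extract, disambiguate, record) can be illustrated with a toy sketch. This is not how hakia or Powerset actually work; the `KNOWLEDGE` table, the sense identifiers, the cue words, and the `co_occurs_with` relation are all made up for illustration, and a real system would rely on a large knowledge source and a proper parser.

```python
import re

# Hypothetical background knowledge: surface form -> candidate senses.
# The first candidate is treated as the dominant sense.
KNOWLEDGE = {
    "George Bush": [
        {"id": "person/george_bush_president", "cues": {"president", "election"}},
        {"id": "person/george_bush_brewer", "cues": {"beer", "brewer", "brewery"}},
    ],
    "Yahoo": [
        {"id": "org/yahoo", "cues": set()},
    ],
}


def extract_entities(text):
    """Step 1: return the known surface forms that occur in the text."""
    return [name for name in KNOWLEDGE if name in text]


def disambiguate(name, text):
    """Step 2: pick the sense whose cue words overlap most with the text.

    On a tie, max() keeps the first candidate, i.e. the dominant sense.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    best = max(KNOWLEDGE[name], key=lambda sense: len(sense["cues"] & words))
    return best["id"]


def annotate(text):
    """Step 3: record a naive relation between every pair of resolved entities."""
    ids = [disambiguate(name, text) for name in extract_entities(text)]
    relations = [
        (a, "co_occurs_with", b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    ]
    return ids, relations
```

With this table, a sentence mentioning George Bush alongside "beer" or "brewery" resolves to the brewer sense, while a sentence with no cue words falls back to the dominant sense, which mirrors the secondary-sense problem in Mika's first bullet.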

The Semantic Web “aims to make the web more easily searchable by allowing publishers to expose their metadata.” Mika says most publishers are willing to share their data if it results in increased traffic. The Semantic Web also lets publishers avoid the costs and quality issues associated with NLP. Last year, however, Yahoo researcher Mor Naaman declared the Semantic Web dead. Naaman’s argument rested on the limitations of microformats, but Mika says the new RDFa standard offers greater capabilities.
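To give a rough idea of what "exposing metadata" means in practice, here is a minimal sketch that reads RDFa-style `property` attributes out of a page fragment using Python's standard `html.parser`. It is not a conforming RDFa processor: it ignores subjects, prefix resolution, and nesting rules. The markup in `SNIPPET`, including the `/articles/semantic-search` path, is a hypothetical example rather than anything a real publisher serves.

```python
from html.parser import HTMLParser

# Hypothetical publisher markup annotated with Dublin Core properties.
SNIPPET = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="/articles/semantic-search">
  <span property="dc:title">Semantic Search Arrives at the Web</span>
  <span property="dc:creator">Peter Mika</span>
  <meta property="dc:language" content="en">
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect property -> value pairs from `property` attributes."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._pending = None  # property still waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # An explicit content attribute takes precedence over element text.
            self.properties[prop] = attrs["content"]
        else:
            self._pending = prop

    def handle_data(self, data):
        if self._pending and data.strip():
            self.properties[self._pending] = data.strip()
            self._pending = None

    def handle_endtag(self, tag):
        # Give up on a pending property once an element closes.
        self._pending = None


def extract_properties(html):
    parser = PropertyCollector()
    parser.feed(html)
    parser.close()
    return parser.properties


print(extract_properties(SNIPPET))
```

The point of the sketch is that a crawler gets the title, creator, and language directly from the publisher's markup, without having to infer them from free text as an NLP pipeline would.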

What Mika wants to do is integrate the best of NLP and the Semantic Web. He says Yahoo’s SearchMonkey platform makes this integration possible.

To dig into all the technical nitty-gritty, check out Mika’s full article, “Semantic Search Arrives at the Web.”
