I had the chance recently to interview Ramez Naam of Microsoft. It was an interesting discussion focused on some of the key issues that impact core search algorithms. It turns out that there is an enormous amount of complexity in basic processing of search queries.
For example, one core concept in search is the notion of “stop words”. These are words like “the”, “and”, “is”, and others like them that search engines strip from user queries to simplify the process of finding the best results. Consider what happens when you enter the search query “the”, for example. Microsoft returns 9.68 billion results, Google returns 1.57 billion results, and Yahoo returns 15.6 billion results.
Of course, this is because nearly every web page in the world has the word “the” on it. So stripping stop words is a smart move that simplifies query processing. But what do you do when the query is “the office”? If the search engine strips off the word “the”, the query becomes “office”.
In addition to there being millions of offices across the United States, let alone the entire world, this is also the brand name of a pretty well known software suite (yes, I am being a bit flip). But if the person is searching for the TV show “The Office”, stripping the word “the” from the query will hurt the relevance of the results.
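To make the trade-off concrete, here is a minimal sketch of stop word handling with an exception list. It is only an illustration of the idea, not how Microsoft or any other engine actually does it: the names `STOP_WORDS`, `PROTECTED_PHRASES`, and `normalize_query` are hypothetical, and a real engine would presumably decide which phrases to protect from much richer signals than a hand-written set.

```python
from typing import List

# Illustrative stop word list; real engines use their own, larger lists.
STOP_WORDS = {"the", "and", "is", "a", "of"}

# Hypothetical set of queries where the stop word carries meaning.
PROTECTED_PHRASES = {"the office"}


def normalize_query(query: str) -> List[str]:
    """Return the query terms, stripping stop words unless that would hurt."""
    terms = query.lower().split()

    # Keep the whole query when it matches a known phrase such as a show title.
    if " ".join(terms) in PROTECTED_PHRASES:
        return terms

    kept = [term for term in terms if term not in STOP_WORDS]

    # If every term was a stop word (for example "the"), keep the original
    # terms rather than searching for nothing.
    return kept or terms


print(normalize_query("The Office"))
print(normalize_query("history of the internet"))
```

With this sketch, “The Office” keeps both words, while “history of the internet” is reduced to “history” and “internet”. The hard part, of course, is knowing which phrases belong on the protected list.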
Dealing with stop words, and deciding when not to strip them from the query, was one of the many enhancements to its core search that Microsoft announced in late September. Microsoft addressed many issues of this type in that announcement, most of which are discussed in my interview with Ramez, and it is a reminder of how complex search truly is.