Google has been doing a series of posts about search quality. Today, the latest post in the series discusses how evaluation enters into the process.
Scott Huffman, Engineering Director, offered four insights into why search evaluation is so difficult:
- First, understanding what a user really wants when they type a query -- the query's "intent" -- can be very difficult. For highly navigational queries like [ebay] or [orbitz], we can guess that most users want to navigate to the respective sites. But how about [olympics]? Does the user want news, medal counts from the recent Beijing games, the IOC's homepage, historical information about the games, ... ? This same exact question, of course, is faced by our ranking and search UI teams. Evaluation is the other side of that coin.
- Second, comparing the quality of search engines (whether Google versus our competitors, or Google versus Google a month ago) is never black and white. It's essentially impossible to make a change that is 100% positive in all situations; with any algorithmic change you make to search, many searches will get better and some will get worse.
- Third, there are several dimensions to "good" results. Traditional search evaluation has focused on the relevance of the results, and of course that is our highest priority as well. But today's search-engine users expect more than just relevance. Are the results fresh and timely? Are they from authoritative sources? Are they comprehensive? Are they free of spam? Are their titles and snippets descriptive enough? Do they include additional UI elements a user might find helpful for the query (maps, images, query suggestions, etc.)? Our evaluations attempt to cover each of these dimensions where appropriate.
- Fourth, evaluating Google search quality requires covering an enormous breadth. We cover over a hundred locales (country/language pairs) with in-depth evaluation. Beyond locales, we support search quality teams working on many different kinds of queries and features. For example, we explicitly measure the quality of Google's spelling suggestions, universal search results, image and video searches, related query suggestions, stock oneboxes, and many, many more.
I'm not sure I'm buying that Olympics example. Google didn't do a great job with the Beijing Olympics, and surely their algorithm could have served up more relevant results during the period surrounding the event.
I'm not saying that evaluating search query intent is easy, just that the Olympics query is not quite as problematic as Google is making it out to be.
The rest of the points are things we've been hearing from Google for a long time. We know they're making progress on universal search and personalization, all in their famous quest to create the best user experience.
So, what methods does Google employ to address these evaluations? Huffman offered up the following:
- Human evaluators. Google makes use of evaluators in many countries and languages. These evaluators are carefully trained and are asked to evaluate the quality of search results in several different ways. We sometimes show evaluators whole result sets by themselves or "side by side" with alternatives; in other cases, we show evaluators a single result at a time for a query and ask them to rate its quality along various dimensions.
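To make the "side by side" setup concrete, here is a minimal sketch of how such ratings might be aggregated into a preference summary. This is not Google's actual tooling; the function name and the "A"/"B"/"same" labels are my own illustrative assumptions.

```python
# Hypothetical sketch: evaluators see result sets from system A and system B
# side by side and record which one they prefer (or "same" for a tie).
def preference_summary(ratings):
    """Summarize side-by-side judgments as preference fractions.

    ratings: list of "A", "B", or "same" strings, one per rated query.
    Returns a dict mapping each label to its fraction of all judgments.
    """
    counts = {"A": 0, "B": 0, "same": 0}
    for r in ratings:
        counts[r] += 1
    total = len(ratings)
    return {label: count / total for label, count in counts.items()}

# Example: 4 judgments, half preferring system A
print(preference_summary(["A", "A", "B", "same"]))
```

In practice you would also want per-locale breakdowns and confidence intervals before declaring one system better, since (as the post notes) almost any change helps some queries and hurts others.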
- Live traffic experiments. We also make use of experiments, in which small fractions of queries are shown results from alternative search approaches. Ben Gomes talked about how we make use of these experiments for testing search UI elements in his previous post. With these experiments, we are able to see real users' reactions (clicks, etc.) to alternative results.
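The live-traffic approach amounts to diverting a small, fixed slice of queries to the alternative system and comparing user reactions. As a rough sketch (again, not Google's actual implementation; the bucketing scheme and names here are hypothetical), deterministic hashing is one common way to do that assignment:

```python
import hashlib

def assign_bucket(query_id: str, experiment_fraction: float = 0.01) -> str:
    """Deterministically assign a query to "experiment" or "control".

    Hashing the id means the same query/session always lands in the same
    bucket, so the experiment slice stays a stable fraction of traffic.
    """
    digest = int(hashlib.sha256(query_id.encode("utf-8")).hexdigest(), 16)
    return "experiment" if (digest % 10_000) / 10_000 < experiment_fraction else "control"

def click_through_rate(clicks: int, impressions: int) -> float:
    """The kind of reaction metric the post mentions: clicks per impression."""
    return clicks / impressions if impressions else 0.0
```

With assignments logged per query, you can then compare click-through rates (or other signals) between the two buckets to see how real users react to the alternative results.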
What do you think of Google's search evaluation? What evaluations would you like to see them conduct? Discuss in the comments.