Last month, Search Engine Watch published the results of our "Perfect Page" test, the first of many relevancy tests that we plan to do on an ongoing basis. In this article, we look at that feedback, while the broader issue of why the search engine industry desperately needs what I call "The Relevancy Figure" is dealt with in the sidebar piece, "In Search Of The Relevancy Figure."
We only got significant feedback from two services, AltaVista and Overture. Not surprisingly, these were the two that received failing marks. And with apologies to Overture, we never meant to score it alongside the others.
In our test, we wanted to see how the primary, unpaid "editorial" results presented by the various major search engines compared. Overture was not intended to be part of the comparison group, because unlike the others, its primary listings are paid.
That's not to say we didn't test Overture. We did, because we were also interested to know how a paid listings service would perform. As noted in the story, this was something we did "out of curiosity," so that we could mention the findings as an aside. We didn't mean to list Overture's performance alongside the others, though unfortunately, that's exactly what we did.
When we finished our testing, we made a list of all the services in order of the letter grade they received, including Overture. That was for our own purposes, to see where they stood at a glance. This listing was then copied into the final article, and the last line with Overture was intended to be cut, just as we cut the column detailing Overture's performance in our "Criteria and Detailed Results" sidebar to the Perfect Page article.
Despite the cuts, people would still be able to learn how Overture did, but this would have been done by reading a paragraph explaining the unique situation with Overture, rather than in an at-a-glance chart. By mistake, we failed to cut the line with Overture's grade. Anyone just skimming the story and reading the itemized list might have assumed Overture was a formal part of the test. Certainly many people at Overture did.
I know this firsthand, because exactly one week after the results were published, I was visiting Overture's offices in Pasadena to get an update on happenings with the company. In my very first meeting, one of the people I was meeting with asked about the test, saying that many "noses were out of joint" that the company had been included. This was the first I learned that Overture had been inadvertently included on the itemized list, and it was something I corrected immediately the next day.
Overture, which had been silently fuming over the report for a week, also complained during my visit that certain search engines were able to "opt out" of the test. In reality, no one was invited to participate or to opt out. Chris Sherman and I drew up the list of search engines to test. We didn't test WiseNut and HotBot, but this wasn't because those search engines demanded to be dropped. Instead, it was because we didn't think it was useful to test them given the impending changes we knew were coming. Ultimately, we decide what to test, and the article has now clarified this point.
Overture also asked why Google's AdWords listings weren't tested separately from its editorial listings. Again, the test this time was of editorial listings. However, I think the real question was, "Why single out Overture for a look at how paid listings perform?"
The answer is simple. There are many large and small sites that run pure Overture results, such as Go.com, whereas no site that I know of currently uses only Google's AdWords listings and not also its editorial results. Since people may end up at Overture-powered sites intending to do general-purpose searches, we wanted to know how those results performed for this particular test. The answer remains: poorly. If you had gone to a place like Go.com, which is powered by Overture, you wouldn't have found most of the sites we felt were important.
To conclude, our look at Overture and its poor performance on this particular test does not mean that Overture's results are not relevant. It simply shows that in this particular test, Overture failed to bring up the target sites as well as the other search engines we examined. That's perhaps useful to know if you are one of the people who actually goes directly to the Overture site itself to search (but few people do this). We think the more important point to take away is that it illustrates why, as we pointed out in the story, a good general-purpose search engine will provide a consistent blend of editorial AND advertising results, regardless of who provides their advertising results.
Moving to AltaVista, the company's chief scientist Jan Pedersen sent feedback saying that he felt the Perfect Page test was inadequate to rate search engine relevancy overall, and we agree. No one should assume the letter grades earned in this particular test define a search engine's overall relevancy. Here's what Pedersen sent:
"Correctly assessing the relevance of an Internet search engine is methodologically tricky. All too often people resort to anecdotal tests based on a small number of favored queries --- for example, someone's name. This can be misleading, since no one query is a good predictor of overall relevance and small samples produce unstable and unreliable results. In contrast, truly informative tests, such as those sanctioned by the Information Retrieval community, and such as the one we run internally at AltaVista, average over a large number of diverse queries, typically several hundreds, to produce statistically significant results. We use a random sample of queries to avoid the judgment bias that can creep into a test using a 'representative' query set.
The notion of relevance is itself rather elusive since it depends on a number of uncontrolled factors, such as the unstated intention of the user, and their tastes. For some queries one can identify an unambiguously correct authoritative page (for example www.yahoo.com for the query 'yahoo') that should rank at the very top of the results. These are often referred to as navigational queries.
More often, however, there are many potentially good matches and a search engine presents a selection of these in its top-10 results. In this situation two results lists can contain relatively few pages in common yet still both be, in the user's perception, a good response to the query. In fact, our measurements indicate that, on average, only 1/3 of pages are common across search engines in the top-10 results.
The most reliable method for determining if a result list is, in fact, relevant is to ask a user, not to require that it contain a particular page. This is typically achieved through an A/B test where a real user is asked to judge whether a list of results from system A is better, equivalent, or worse than a list of results from system B. We use an external panel of thousands of users assessing hundreds of queries in a formal A/B test to gain a real quantitative sense of the relevance of our ranking algorithm.
The 'Perfect Page Test' described in your recent article uses a small set of non-random queries and gauges relevance by the appearance of sanctioned 'perfect pages'. Some of these 'perfect pages' are indeed authoritative, but many others are not. For example, although www.cyndislist.com is a good site, it is not clearly better than the many other similar sites on genealogy and hence requiring its appearance offers a questionable basis for determining result set relevance. So, we consider this test an unreliable measure of relevance and hardly a good basis for a quantitative 'grading' of search engines. However, as an indicative measure it does seem to suggest that the variation in relevance across search engines is perhaps less than users might be inclined to believe, a finding which is supported by our internal relevance assessments."
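Pedersen's overlap figure, that on average only about one third of pages are shared between two engines' top-10 results, is simple to compute for any pair of result lists. Here is a minimal sketch of that calculation; the result lists and domain names below are hypothetical placeholders, not output from any actual search engine:

```python
# Sketch: fraction of URLs shared between two engines' top-10 result
# lists, illustrating the overlap statistic Pedersen describes.
# The example lists are hypothetical, not real search results.

def top10_overlap(results_a, results_b):
    """Fraction of the top-10 results that appear in both lists."""
    a, b = set(results_a[:10]), set(results_b[:10])
    return len(a & b) / 10

# Two made-up engines whose top-10 lists share three URLs.
engine_a = [f"site{i}.example.com" for i in range(10)]       # site0..site9
engine_b = [f"site{i}.example.com" for i in range(7, 17)]    # site7..site16

print(top10_overlap(engine_a, engine_b))  # 3 shared URLs -> 0.3
```

An overlap around 0.33, as Pedersen reports, means two engines can each give a perfectly reasonable answer to the same query while agreeing on only three or so of their top ten pages, which is why requiring one specific "perfect page" is a coarse yardstick.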