Validity of Bing It On Challenge Results Puts Bing on Defense

According to Bing, in blind testing, users prefer Bing over Google. But a new survey has ignited a huge debate by questioning the validity of the Bing It On Challenge results based on how the challenge was conducted.

If you live in the U.S., you’re probably aware of the “Bing It On” challenge, where Bing claims that in blind testing, users prefer Bing over Google 2-to-1. Considering Google holds the dominant market share, currently clocking in at 67 percent in the latest comScore data, while Bing has a mere 18 percent, there has been plenty of speculation over how accurate that claim really is. Bing has confirmed around 5 million people have taken the challenge online.

But Ian Ayres, writing on the Freakonomics blog, questions the validity of the Bing It On Challenge and how Microsoft decided to run it; he and his co-authors have also detailed their analysis in a much longer paper.

First off, the statistic that users chose Bing over Google was based on a mere 1,000 users, not a very large pool at all. And Bing doesn’t specify how those participants were selected, other than that they were 18 and up and from across the U.S. Were they people signed up for paid online surveys, a sizable market in its own right? Or were they selected in some other way?

Ayres, along with four Yale Law students, decided to run a survey of their own. What they found was rather interesting.

When users selected search terms from the ones Bing suggests, the results shown seemed much more favorable to Bing, leading to speculation that those queries were either hand curated or just happened to be ones for which previous challenges had shown a marked preference for the Bing results.

When Bing-suggested search terms were used, the two engines statistically tied (47% preferring Bing vs. 48% preferring Google). But when the subjects in the study suggested their own searches or used the web’s most popular searches, a sizable gap appeared: 55-57% preferred Google while only 35-39% preferred Bing. These secondary tests indicate that Microsoft selected suggested search words that it knew were more likely to produce Bing-preferring results.
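Some quick arithmetic shows why a one-point split counts as a statistical tie while a twenty-point split does not. The sketch below assumes samples of about 1,000 respondents, the size both studies reportedly used; it is a back-of-the-envelope margin-of-error check, not the exact analysis Ayres performed:

```python
from math import sqrt

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion p with n respondents."""
    return z * sqrt(p * (1 - p) / n)

n = 1000  # assumed sample size, per the article

# Bing-suggested terms: 47% Bing vs. 48% Google.
print(f"±{margin_of_error(0.48, n):.1%}")  # ~±3.1%: a 1-point gap is inside the noise

# Self-chosen or popular terms: 55% Google vs. 35% Bing.
print(f"±{margin_of_error(0.55, n):.1%}")  # ~±3.1%: a 20-point gap is far outside it
```

In other words, at this sample size only gaps larger than about three points can be distinguished from sampling noise, which is exactly the line separating the two sets of results.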

Ayres wasn’t the only one who had concerns with how Bing It On was conducted. Matt Cutts, distinguished engineer at Google, also raised similar concerns in a Google+ post about Bing’s claims.

Freakonomics looked into Microsoft’s “Bing It On” challenge. From the blog post: “tests indicate that Microsoft selected suggested search words that it knew were more likely to produce Bing-preferring results. … The upshot: Several of Microsoft’s claims are a little fishy. Or to put the conclusion more formally, we think that Google has a colorable deceptive advertising claim.”

I have to admit that I never bothered to debunk the Bing It On challenge, because the flaws (small sample size; bias in query selection; stripping out features of Google like geolocation, personalization, and Knowledge Graph; wording of the site; selective rematches) were pretty obvious.

After the Freakonomics post appeared, Bing’s first response was a comment given to Slate by the company’s behavioral scientist, Matt Wallaert:

The professor’s analysis is flawed and based on an incomplete understanding of both the claims and the Challenge. The Bing It On claim is 100% accurate and we’re glad to see we’ve nudged Google into improving their results. Bing It On is intended to be a lightweight way to challenge people’s assumptions about which search engine actually provides the best results. Given our share gains, it’s clear that people are recognizing our quality and unique approach to what has been a relatively static space dominated by a single service.

It is an impressive claim that Bing is suggesting Google improved its results because of this challenge. As for Bing’s share gains, recent comScore data show that Bing’s share increased by 2 percentage points between June 2012 and June 2013, while in the same period Google’s share increased by 0.2 percentage points.
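One caveat on units: comScore measures share in percentage points, so Bing’s gain of two points of share is a much larger relative jump than it sounds next to Google’s 0.2. The sketch below uses hypothetical before-and-after shares (the article reports only the changes, not the underlying figures) to show the difference between a point change and relative growth:

```python
# Hypothetical before/after shares; the article reports only the changes,
# so these absolute figures are illustrative, not comScore's actual numbers.
def describe(name: str, before: float, after: float) -> None:
    points = after - before               # absolute change, in share points
    relative = (after - before) / before  # growth relative to the starting share
    print(f"{name}: {points:+.1f} points, {relative:+.1%} relative")

describe("Bing", 16.0, 18.0)    # Bing: +2.0 points, +12.5% relative
describe("Google", 66.5, 66.7)  # Google: +0.2 points, +0.3% relative
```

By relative growth, Bing’s year-over-year gain is far larger than Google’s, which is the comparison Wallaert leans on when he points to share gains.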

However, Wallaert also comments on the Freakonomics post to remind people that Bing’s search reach extends well beyond Bing.com itself.

There is just one more clarifying point worth making: you noted that only 18% of the world’s searches go through Bing. This is actually untrue; because Bing powers search for Facebook, Siri, Yahoo, and other partners, almost 30% of the world’s searches go through Bing. And that number is higher now than it was a year ago. So despite your assertions, I’m happy to stand by Bing It On, both the site and the sentiment.

And Wallaert brings up some valid points on why Bing now suggests keywords for people taking the Bing It On challenge, instead of having them enter their own searches.

Here is what I can tell you. We have the suggested queries because a blank search box, when you’re not actually trying to use it to find something, can be quite hard to fill. If you’ve ever watched anyone do the Bing It On challenge at a Seahawks game, there is a noted pause as people try to figure out what to search for. So we give them suggestions, which we source from topics that are trending now on Bing, on the assumption that trending topics are things that people are likely to have heard of and be able to evaluate results about.

It is worth noting that, ironically, Bing It On uses Google’s 2012 Zeitgeist list for its “web’s top queries” suggestions.

Later, Michael Archambault interviewed Wallaert about how he viewed many of the issues Ayres raised in the Freakonomics piece.

The first issue was how Ayres found the 1,000 people for his test: through the website Mechanical Turk, an audience Wallaert argues is predisposed to prefer Google. Wallaert, however, fails to reveal exactly how the people were found for Bing’s own study, other than that it was conducted by a third-party firm.

“Ayres used Mechanical Turk to recruit subjects, a site that is known to very few people on the web. While he measured things like gender, age, and race, and showed that his sample was representative of the internet-using population, one strong possibility is that those aren’t the relevant variables along which people pick search. For example, it may be that the more likely you are to use Mechanical Turk, the more technology-inclined you are, and that being technology-inclined is correlated with a preference for Google results over Bing results.”

If you have used or seen Mechanical Turk, you know it has many highly technical members, so a search engine bias could well appear to a greater degree than it would in a random selection of people across the U.S. Its members could also be more likely to recognize the slight nuances between Google and Bing results that the average searcher might not notice in a Bing It On challenge.

In comments Wallaert made on The Verge (where he made sure to include a disclaimer that he works at Microsoft), he brings up another way the Mechanical Turk audience is more technical, and how it is possible its members also search differently from the average searcher, for example by entering rare, specific “tail” queries rather than the popular “head” queries most people type.

Let’s pretend, and I have no idea if this is true, but let’s pretend Bing does better at head queries than Google does and Google does better at tail queries than Bing does. The average MTurk person could be more likely to enter a tail query than the average American.

Another issue is the fact that Bing ran two studies, which is causing some confusion. The initial study produced the 2-to-1 claim, while the subsequent study showed a preference for Bing, albeit not 2-to-1.

The first issue is Ayres’ challenging Microsoft’s older “2 to 1” study. If you visit the campaign’s website today, you will notice that Microsoft has changed their headline to “people prefer Bing over Google for the web’s top searches.” Wallaert explained that Microsoft started with a study in which users could pick any search query they wished – this study is the basis of the “2 to 1” claim and it was reported back in September of 2013.

Microsoft then performed a new study in which they used Google’s top queries instead of user-dictated ones. You might expect Bing not to perform as well, as these are the searches Google handles most often. The results were surprising: while Google did gain some ground, Bing still handled Google’s top searches better.

He also states that all of Bing’s claims go through the company’s lawyers, while Ayres’ claims have not undergone any validation at all.

Bing’s response tried to devalue many of the points Ayres brought up, including noting that while Ayres criticized Bing for using such a small sample size, he used the same sample size of 1,000 people in his own testing.

Ayres also took issue with the fact that Bing has not released any data from its online “taste test” at BingItOn.com. Bing responded by saying it doesn’t keep any of that data at all, because retaining the information would be unethical.

Next, Ayres is bothered that we don’t release the data from the Bing It On site on how many times people choose Bing over Google. The answer here is pretty simple: we don’t release it because we don’t track it. Microsoft takes a pretty strong stance on privacy and unlike in an experiment, where people give informed consent to having their results tracked and used, people who come to BingItOn.com are not agreeing to participate in research; they’re coming for a fun challenge. It isn’t conducted in a controlled environment, people are free to try and game it one way or another, and it has Bing branding all over it.

So we simply don’t track their results, because the tracking itself would be incredibly unethical. And we aren’t basing the claim on the results of a wildly uncontrolled website, because that would also be incredibly unethical (and entirely unscientific).

Many find it rather astounding that Bing isn’t tracking this information, even stripped of identifiable information and specific search queries, simply so it could report the percentage of people choosing Bing over Google. It is understandable, then, that people speculated Bing isn’t releasing the information because it wasn’t favorable to Bing.

As for the claim that the suggested queries produce Bing-favorable results, Bing falls back on the “we don’t track this” line here as well, citing privacy considerations, so unfortunately we can’t get further online test data from the Bing It On website.

First, I think it is important to note: I have no idea if he is right. Because as noted in the previous answer, we don’t track the results from the Bing It On challenge. So I have no idea if people are more likely to select Bing when they use the suggested queries or not.

He then repeats his earlier explanation of the suggested queries, quoted above, and continues:

Which means that if Ayres is right and those topics are in fact biasing the results, it may be because we provide better results for current news topics than Google does. This is supported somewhat by the second claim; “the web’s top queries” are pulled from Google’s 2012 Zeitgeist report, which reflects a lot of timely news that occurred throughout that year.

In comments on an article on The Verge about the situation, Wallaert responds to many readers’ comments, but he also makes some interesting claims about Ayres, the author of the Freakonomics post, that were understandably omitted from the official Bing post but raise questions about Ayres’ credibility.

Also, not to cast stones, but this is an academic who has admitted to plagerism (sic). If his motive was entirely about making sure the truth was known, he could have easily just asked me for the supporting data first, which is what academics generally do to other academics. As a matter of fact, it is a rule for many publications in my field (psychology) that once a paper is published in a peer-reviewed journal, anyone can request the data and it must be provided.

I can tell you, I’m a pretty easy guy to find, and Ayers never asked me about anything.

(Note: I work at Microsoft)

Does plagiarism make Ayres’ study less valid? Not necessarily, but I can see how his previously admitted plagiarism could raise suspicions.

It does make the entire Bing It On challenge something that should be looked into further, ideally with both sides releasing all the data from their testing, including the search terms used. That would reveal the differences between the specific queries used in each party’s testing and allow a comparison of the types of searches done in each, which would be most interesting for the Mechanical Turk participants, to see whether they really do search differently from most searchers.

Ayres has not responded to Bing’s rebuttal of the claims he made in his Freakonomics post, or to any of the comments made on Freakonomics. I suspect we haven’t seen the last of this yet.
