Lies, Damned Lies, and Statistics in Landing Page Optimization

The statistics branch of mathematics has a poor reputation among the public, even though much of modern science and economics is fundamentally based on it.

So is public policy. Because public policy is a matter of priorities and heated debate about the allocation of government budgets, statistics has been pulled into the fray to support or undermine various political positions. Unscrupulous or ignorant people have corrupted it for their own purposes.

While there's nothing wrong with statistics itself, there are many common misuses. Let's look at some of the implications for landing page optimization.

Throwing Away Part of the Data

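Statistical studies report their answers at a confidence level (commonly 95 percent). At 95 percent, roughly one comparison in 20 between two truly identical alternatives will still look like a real difference. So when you conduct a large number of experiments, even two identical effects can seem different based simply on a statistical streak.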

For example, if you flipped a coin five times, you might be surprised to see it come up heads every time, and might even suspect that it could be loaded. However, based simply on random chance, we would expect this result about 3 percent of the time. So if we repeated this experiment 100 times, we'd expect a series of all-heads to come up about three times.

Unscrupulous people might rerun the experiment many times, and report a single all-heads result as proof that the coin was loaded. By discarding the remaining experiments that don't support their desired conclusion, they're misrepresenting the results.
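The coin-flip arithmetic above is easy to check for yourself. The following is a minimal Python sketch, not part of any testing tool; the function names are made up for illustration. It computes the chance of five heads in a row, the number of all-heads runs you'd expect in 100 repetitions, and the chance of getting at least one such run to cherry-pick.

```python
import random

def prob_all_heads(flips: int) -> float:
    """Probability that a fair coin lands heads on every one of `flips` tosses."""
    return 0.5 ** flips

def prob_at_least_one_streak(flips: int, repeats: int) -> float:
    """Probability of at least one all-heads run in `repeats` independent experiments."""
    return 1.0 - (1.0 - prob_all_heads(flips)) ** repeats

def simulate_streaks(flips: int, repeats: int, seed: int = 0) -> int:
    """Count the all-heads experiments in a simulation of `repeats` runs of a fair coin."""
    rng = random.Random(seed)
    return sum(
        all(rng.random() < 0.5 for _ in range(flips))
        for _ in range(repeats)
    )

p_single = prob_all_heads(5)               # 1/32, about 3 percent
expected = 100 * p_single                  # about three all-heads runs per 100 experiments
p_any = prob_at_least_one_streak(5, 100)   # chance that at least one run exists to report

print(f"single run: {p_single:.3%}")
print(f"expected all-heads runs in 100 experiments: {expected}")
print(f"chance of at least one all-heads run: {p_any:.1%}")
print(f"simulated all-heads runs: {simulate_streaks(5, 100)}")
```

With a perfectly fair coin, someone who repeats the experiment 100 times is very likely to find at least one "suspicious" result to report, which is why discarded experiments matter as much as reported ones.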

Traffic Filtering

In landing page testing, you generally want to get the widest possible range of traffic sources. That way, your test traffic is more likely to represent your visitor population as a whole.

Generally, traffic sources should be recurring, controllable, and stable. If your traffic doesn't have these characteristics, it may be very hard to tune.

You may want to remove unstable sources (such as some of your larger but highly variable affiliates) from your testing mix. Also, you should generally remove nonrecurring e-mail traffic that arrives in spiky and sporadic "drops."
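One rough way to spot unstable sources is to compare how much each source's daily volume swings relative to its average. The sketch below is only an illustration: the coefficient of variation is one possible yardstick among several, the 0.5 cutoff is an arbitrary assumption, and the source names and visit counts are made up.

```python
from statistics import mean, pstdev

def coefficient_of_variation(daily_visits):
    """Standard deviation of daily visits divided by their mean.

    A source with no traffic at all is treated as maximally unstable.
    """
    avg = mean(daily_visits)
    if avg == 0:
        return float("inf")
    return pstdev(daily_visits) / avg

def stable_sources(visits_by_source, max_cv=0.5):
    """Return the sorted names of sources whose daily volume varies by at most `max_cv`.

    The 0.5 default is an assumed cutoff for illustration, not a recommended value.
    """
    return sorted(
        name
        for name, visits in visits_by_source.items()
        if coefficient_of_variation(visits) <= max_cv
    )

# Made-up daily visit counts for one week.
visits_by_source = {
    "ppc": [980, 1010, 1005, 990, 1020, 760, 740],
    "affiliate_x": [50, 4200, 30, 10, 3900, 20, 15],
    "email_drop": [0, 0, 9000, 400, 0, 0, 0],
}

print(stable_sources(visits_by_source))
```

In this made-up data, only the steady PPC traffic passes the cutoff; the spiky affiliate and the one-time e-mail drop are flagged for removal from the testing mix.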

Sequential Testing

Another type of sampling bias can be introduced by sequential testing. For example, you may test your original design for a month, and then replace it with a new design the following month. It's hard to reach any kind of conclusion after this kind of experiment.

Any number of external factors may have changed between the two testing periods. For example, perhaps there was a holiday with common family vacations, some major breaking news affected your industry, or you made a major public relations announcement.

The point is, you're comparing apples to oranges. In landing page testing, you should always try to collect data from your original version and your tested alternatives in parallel. This will allow you to control for (or at least detect and factor in) any changes in the external environment. Only use sequential testing as a last resort.
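Collecting data in parallel means every visitor is assigned to one version at the moment of arrival, so all versions see the same holidays, news, and announcements. Here is a minimal sketch of one common way to do that, assuming you have some stable visitor identifier such as a cookie value. The `assign_variant` function and the variant names are hypothetical, not taken from any particular testing tool.

```python
import hashlib

def assign_variant(visitor_id: str, variants=("original", "challenger")) -> str:
    """Map a visitor ID to one variant, returning the same variant on every visit.

    Hashing the ID spreads visitors roughly evenly across the variants
    without having to store each assignment.
    """
    digest = hashlib.sha256(visitor_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Made-up visitor IDs: check that the split is roughly even.
counts = {"original": 0, "challenger": 0}
for i in range(10_000):
    counts[assign_variant(f"visitor-{i}")] += 1

print(counts)
```

Because the assignment depends only on the visitor ID, a returning visitor keeps seeing the same version, and both versions collect data over exactly the same time period.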

Short Data Collection

Even if you run your tests by splitting the available traffic and showing different versions of your site design in parallel, you may still run into biased sampling issues related to short data collection periods. Experiments involving very high data collection rates may be especially prone to this.

For example, let's assume you're testing two alternative versions of your page and measuring click-throughs to a particular target page as your conversion action. Because of the high traffic to your landing page, you collect about 10,000 conversion actions in the first hour of your test.

This data shows you that one of your versions outperforms the other to a very high level of statistical confidence. Many people would conclude the test at this point and immediately install the best performer as the new landing page.

But what if I told you that the data was collected in the middle of the night? You might correctly conclude that people visiting your site during the day are a different population, or at least that they behave differently. The same is true of weekday traffic (accessing the Internet from work) versus weekend traffic (accessing the Internet from home).

Regardless of your data rate, collect data for at least one week (or multiple whole-week increments if your data rate is low). This allows you to eliminate short-term biases. Of course, this still doesn't address the question of longer-term seasonality.
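The whole-week rule is simple enough to check before you call a test finished. This is a small sketch with a hypothetical helper name and made-up dates; it insists on an exact multiple of seven days, which is stricter than you may need in practice.

```python
from datetime import datetime, timedelta

ONE_WEEK = timedelta(weeks=1)

def covers_whole_weeks(start: datetime, end: datetime) -> bool:
    """True if the collection window is at least one week and an exact multiple of 7 days."""
    duration = end - start
    return duration >= ONE_WEEK and duration % ONE_WEEK == timedelta(0)

start = datetime(2009, 6, 1)

one_hour = covers_whole_weeks(start, start + timedelta(hours=1))   # high-traffic test stopped early
ten_days = covers_whole_weeks(start, start + timedelta(days=10))   # overweights some weekdays
two_weeks = covers_whole_weeks(start, start + timedelta(weeks=2))  # every weekday counted twice

print(one_hour, ten_days, two_weeks)
```

A ten-day test fails the check because it counts some days of the week twice and others only once, which reintroduces the weekday-versus-weekend bias described above.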

Overgeneralization

Overgeneralization is the erroneous extension of your test conclusions to a setting where the original results no longer apply.

For example, let's say that I set up an experiment to count the ants in my kitchen and tracked it for a full week during a record winter cold spell. My finding was that there were no ants in my kitchen at all during the study period. However, it would probably be incorrect to assume that the same would hold true during a summer heat wave.

Often it isn't the original researcher who makes the overgeneralization, but those who subsequently summarize or cite the results.

A common overgeneralization in landing page testing is to assume that traffic sources that weren't part of your original test will behave in the same way as the tested population. For example, if you see a particular effect with your PPC traffic, don't assume it will hold up when you expose the new landing page to your in-house e-mail list.
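Before extending a result to a new traffic source, it helps to break your conversion rates down by source instead of looking only at the blended number. The sketch below uses made-up records and a hypothetical `conversion_rate_by_source` helper; it shows a challenger page that wins on PPC traffic but not on e-mail traffic.

```python
from collections import defaultdict

def conversion_rate_by_source(visits):
    """Return {(source, variant): conversion rate} from (source, variant, converted) records."""
    totals = defaultdict(lambda: [0, 0])  # key -> [conversions, visitors]
    for source, variant, converted in visits:
        totals[(source, variant)][0] += int(converted)
        totals[(source, variant)][1] += 1
    return {key: conversions / visitors for key, (conversions, visitors) in totals.items()}

def records(source, variant, conversions, visitors):
    """Build made-up visit records with the given number of conversions."""
    return ([(source, variant, True)] * conversions
            + [(source, variant, False)] * (visitors - conversions))

# Made-up data: 1,000 visitors per source and variant.
visits = (
    records("ppc", "original", 30, 1000)
    + records("ppc", "challenger", 45, 1000)
    + records("email", "original", 80, 1000)
    + records("email", "challenger", 78, 1000)
)

rates = conversion_rate_by_source(visits)
for (source, variant), rate in sorted(rates.items()):
    print(f"{source:6s} {variant:11s} {rate:.1%}")
```

In this made-up example, the challenger lifts PPC conversions from 3.0 to 4.5 percent but does slightly worse with the e-mail list, so a conclusion drawn from PPC traffic alone would not carry over.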

Avoid these common issues with improper use of statistics and you'll be much more likely to find real conversion improvements in your landing page tests.
