I first heard the term big data about two years ago, when IBM and HP started running big data print ads as a conditioning tactic to prepare the marketplace for the term's (then) soon-to-be widespread use (an SEO tactic reserved for Fortune 500 brands).
The big data premise is simple and logical – in an age where every pixel can be tracked and measured, the challenge isn't having the data or accessing it, but making sense of it all. And for companies interested in learning what's working and what's not, sorting through mountains of data in order to see those insights is a promise that's hard to ignore.
But before you get sucked in and hire your next manager of meaning, here are five data analysis pitfalls you should be aware of and try to avoid.
1. Confirmation Bias
You have a hypothesis in mind but you are only seeking data patterns that support it – ignoring all data points that reject it.
You analyze the results of a campaign you believe performed well and find that the conversion rate on the landing page was really high (better than your average). You use that as the sole data point to prove your hypothesis, completely ignoring the fact that none of those leads was qualified or that the traffic to the landing page was sub-par.
Awareness alone is not enough: never approach data exploration with a specific conclusion in mind. Most professional data analysis methods are built so that you try to reject a hypothesis rather than prove one (you test whether the data lets you reject the null hypothesis).
Another way to do it is to assign a "devil's advocate": someone who always presents an opposing perspective. It's hard to do – and even harder to be that person – so it's good to rotate that role so you don't start ignoring this person.
2. Irrelevancy and Distraction
Focusing on data that is irrelevant to the problem you are trying to solve, or being distracted by data that isn't directly connected to your analysis goal. In the age of Big Data, this is bound to happen more and more.
You're trying to assess the effectiveness of a Twitter campaign in generating leads. In your analysis you notice that the campaign generated a lot of followers so you declare the campaign effective even though it generated almost no leads.
Clearly frame the data analysis boundaries. Simple framing parameters are time, resources and analyzed metric.
For example, suppose you want to know the conversion rate on a landing page during a social media campaign in order to assess the campaign's success. Define the time frame of the campaign (e.g., January 1 to January 14), count only the visits and leads from the exact landing page in question (e.g., only visits to the landing page and only leads that converted on that landing page), and include only traffic from that social media campaign (e.g., exclude all traffic not from the campaign). Then analyze only the metric that will answer the question you are trying to answer (e.g., conversion rate).
By using these parameters, you can make sure you're not getting distracted by irrelevant data like time on page or bounce rate.
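The framing parameters above can be sketched in a few lines of Python. This is only an illustration: the visit log, the year 2024, the page path, the source labels and the function name `framed_conversion_rate` are all made up for the example.

```python
from datetime import date

# Hypothetical visit log: (visit date, page, traffic source, converted to lead?)
visits = [
    (date(2024, 1, 3), "/landing", "social-campaign", True),
    (date(2024, 1, 5), "/landing", "social-campaign", False),
    (date(2024, 1, 9), "/landing", "social-campaign", False),
    (date(2024, 1, 12), "/landing", "social-campaign", True),
    (date(2024, 1, 10), "/landing", "organic", True),           # other traffic source
    (date(2024, 1, 20), "/landing", "social-campaign", True),   # outside the time frame
    (date(2024, 1, 7), "/pricing", "social-campaign", False),   # different page
]

def framed_conversion_rate(rows, start, end, page, source):
    """Conversion rate limited to one time frame, one page and one traffic source."""
    in_frame = [
        converted
        for day, row_page, row_source, converted in rows
        if start <= day <= end and row_page == page and row_source == source
    ]
    if not in_frame:
        return None  # nothing falls inside the frame, so there is nothing to measure
    return sum(in_frame) / len(in_frame)

rate = framed_conversion_rate(
    visits, date(2024, 1, 1), date(2024, 1, 14), "/landing", "social-campaign"
)
print(f"Framed conversion rate: {rate:.0%}")
```

Only four of the seven rows fall inside the frame. The other three are exactly the kind of data that would otherwise distort the answer.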
3. Causation vs. Correlation
Mistaking correlation for causation. If one action causes another, then they are most certainly correlated. But just because two things occur together doesn't mean that one caused the other, even if it seems to make sense.
Revenue and website traffic. You might find a high positive correlation between high website traffic and high revenue, but that doesn't mean that high website traffic is the cause for high revenue. There might be a common cause to both or an indirect causality that makes high revenue more likely to occur when high website traffic occurs.
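To see how a common cause can produce a high correlation, here is a minimal sketch with made-up numbers. It assumes a hypothetical third factor, monthly ad spend, that drives both traffic and revenue. In this toy data, traffic never feeds into revenue directly, yet the two still move together.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical common cause: monthly ad spend (made-up figures).
ad_spend = [10, 20, 30, 40, 50, 60]

# Traffic and revenue are each computed from ad spend plus some made-up noise.
# Neither one is computed from the other.
traffic = [1000 + 50 * s + noise
           for s, noise in zip(ad_spend, [30, -20, 10, -40, 25, -5])]
revenue = [5000 + 200 * s + noise
           for s, noise in zip(ad_spend, [-300, 150, -100, 250, -50, 100])]

r = pearson(traffic, revenue)
print(f"Correlation between traffic and revenue: {r:.3f}")
```

The correlation comes out very close to 1, even though the only link between the two series is the shared dependence on ad spend.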
Testing the suspected cause. One way to test for causation is to remove or change the variable that you suspect is causing the phenomenon and see whether the outcome changes with it.
For example, if you found high correlation between number of leads and number of opportunities (a classic B2B data question), you might infer that a high volume of leads will lead to a high number of opportunities.
To test whether a high volume of leads causes a high volume of opportunities, try to eliminate the variable (note: the variable is not leads, but a high volume of leads). Change the volume drastically, keeping everything else as constant as you can, and see if it has an impact on the number of opportunities. You might find that the number of opportunities doesn't change, which will, in turn, allow you to cut costs on lead generation.
4. Statistical Significance
Using data sets that are too small to suggest a trend or comparing results that are not different enough to have statistical significance.
Comparing conversion rates of two landing pages and claiming one of the landing pages is the winner just because it has a higher conversion rate, even though the difference between the results is small.
Use a statistical significance calculator when you compare results, and always ask what the data set size (the "n") was.
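For readers who want to see what such a calculation involves, below is a minimal sketch of one common approach, a two-proportion z-test. It is not necessarily the method any particular calculator uses. The visit and lead counts are made up, and the normal approximation it relies on assumes reasonably large samples.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates.

    Assumes each page has visits, and that the combined data contains at least
    one conversion and one non-conversion (otherwise the standard error is zero).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Same 10% vs. 13% conversion rates, two different sample sizes (made-up counts).
z_small, p_small = two_proportion_z_test(20, 200, 26, 200)
z_large, p_large = two_proportion_z_test(200, 2000, 260, 2000)

print(f"n=200 per page:  z={z_small:.2f}, p={p_small:.3f}")
print(f"n=2000 per page: z={z_large:.2f}, p={p_large:.4f}")
```

With 200 visits per page, the three-point gap is not significant at the conventional 0.05 level. With 2,000 visits per page, the same gap is. That is why the "n" matters as much as the rates themselves.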
5. Action vs. Intent
Inferring the wrong intent from the actions recorded in the data, without checking what the audience actually meant by those actions.
You analyze your website traffic after you change the messaging on your homepage and find a higher than usual bounce rate. You assume that the message didn't resonate well with the audience and that's why they bounced, but you neglect to consider the exact opposite – the message was so well-stated that the unqualified audience bounced while the qualified audience stayed longer and converted to leads.
Qualitative research (surveys, questionnaires, etc.) can help with understanding intent. In addition, creating more explicit calls-to-action and content can help with deciphering intent better.
Beware: More Pitfalls Ahead!
Here are five more things to watch out for when doing data analysis:
- Apples and oranges: Comparing unrelated data sets or data points and inferring relationships or similarities.
- Poor data hygiene: Analyzing incomplete or "dirty" data sets and making decisions based on the analysis of that data.
- Narrow focus/not enough data: Analyzing data sets without considering other data points that might be crucial for the analysis (for example, analyzing email click-through rate but ignoring the unsubscribe rate).
- Bucketing: The act of grouping data points together and treating them as one. For example, looking at visits to your website and treating unique visits and total visits as one, which inflates the number of visitors and understates your true conversion rate.
- Simple mistakes and oversight: "It happens to the best of us."
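The bucketing example in the list above comes down to simple arithmetic. Here is a small sketch with made-up numbers, assuming conversion rate is defined as leads divided by unique visitors.

```python
# Made-up figures for one month of website traffic.
leads = 30
unique_visitors = 600
total_visits = 1000  # includes repeat visits by the same people

# Conversion rate measured against people (the definition assumed here).
rate_by_visitors = leads / unique_visitors

# Conversion rate when total visits are treated as if they were visitors.
rate_by_visits = leads / total_visits

print(f"Per unique visitor: {rate_by_visitors:.1%}")
print(f"Per visit (bucketed): {rate_by_visits:.1%}")
```

Treating every visit as a visitor makes the audience look larger than it is and makes the same 30 leads look like a weaker result.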