Did Nicolas Cage cause swimming pool drownings?

A lesson in the dangers of confirmation bias and equating correlation with causation when analyzing big data.


Are your analytics telling you what you want to find, or are they actually telling you what’s going on?

We’ve all heard that “correlation doesn’t equal causation” and are aware of the risks of confirmation bias (using evidence to confirm already held beliefs). But how do you avoid those pitfalls to ensure your data provides a realistic picture of a situation?

Let’s start with the easy one first: equating correlation with causation. There are two ways to avoid making that mistake, only one of which actually establishes whether the connection between two things is real:

1. Common sense.

2. Testing.

The first step, using common sense, sounds obvious, but you have to be careful. In some cases, a seemingly common-sense judgment may discard a valid correlation simply because it sounds nonsensical (itself an example of confirmation bias). But generally, common sense is a good first step that will save you some time and effort. The wonderful website Spurious Correlations has some great examples of statistics that correlate but really have nothing to do with each other, such as the number of people drowning in pools and the number of films Nicolas Cage appeared in:

Figure: Graph correlating the number of Nicolas Cage films with pool drownings. Credit: Spurious Correlations

My personal favorite spurious correlation is the one connecting Internet Explorer (IE) usage and the U.S. murder rate:

Figure: Bar graph correlating IE use with the murder rate. Credit: Spurious Correlations

Now I think IE can be annoying, but we can all agree that it doesn’t influence the murder rate.

So the first check — common sense — is important, because it removes nonsense and saves you time. However, it can be entertaining to demonstrate the dangers of equating correlation with causation in data science by creating your own spurious correlations and presenting them as “facts” to make your point: “Sales of our cat food correlate strongly with point spreads in Champions League football games.”
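To see how easily such a “fact” can be manufactured, here is a minimal Python sketch. The two series are invented for illustration (the names cat_food_sales and point_spread_index are hypothetical, not real data); both simply drift upward over time. Comparing period-to-period changes instead of raw levels is one common sanity check, added here as an assumption and not something the Spurious Correlations site prescribes.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def changes(series):
    """Period-to-period differences, which strip out a shared upward drift."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Made-up numbers: two unrelated quantities that both happen to grow over time.
cat_food_sales = [11, 11, 15, 15, 19, 19, 23, 23]
point_spread_index = [51, 54, 55, 58, 63, 66, 67, 70]

level_r = pearson(cat_food_sales, point_spread_index)
change_r = pearson(changes(cat_food_sales), changes(point_spread_index))

print(f"correlation of levels:  {level_r:.2f}")
print(f"correlation of changes: {change_r:.2f}")
```

The levels correlate strongly (above 0.9) only because both series trend upward; once the trend is removed, the relationship is weak. A shared trend is one of the most common sources of spurious correlation.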

Testing correlations

The second step, testing correlations, can actually be quite hard. To test a correlation, you must first understand which of the two findings might be the lever and then determine how to test it.

Let’s take another silly example: Back in 2012, some researchers looked at the impact of chocolate consumption on Nobel Prize winning, because chocolate is apparently associated with cognitive functions (who knew?). They found a decent correlation between chocolate consumption per capita and Nobel Prizes per capita — which, they claimed, could have some basis in reality.

Now how would you test that? Take one group with chocolate consumption but no education, another with education but no chocolate, a third with both and a fourth with neither, and then compare the results? Hardly feasible. So, in effect, we have a potential correlation that is untestable in reality.

To test a correlation we need to do four things:

1. Test which indicator is the “leader” and which is the following indicator.

2. Test what happens if we ignore the correlation.

3. Test what happens if we go against the correlation.

4. Actually be able to test.
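For the first item, one simple way to get a hint about which indicator leads is a lagged correlation: shift one series against the other and see which shift lines them up best. The sketch below is illustrative only; the series names (promo_spend, sales) and the numbers are hypothetical, and a lead in time suggests which variable to treat as the lever in an experiment without proving that it causes anything.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lagged_correlation(a, b, lag):
    """Correlation between a[t] and b[t + lag].

    A positive lag pairs each value of a with a later value of b,
    so a strong result at a positive lag suggests a moves first.
    """
    if lag > 0:
        a, b = a[:-lag], b[lag:]
    elif lag < 0:
        a, b = a[-lag:], b[:lag]
    return pearson(a, b)

def best_lag(a, b, max_lag):
    """Lag in [-max_lag, max_lag] with the highest correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: lagged_correlation(a, b, lag))

# Made-up data: sales repeats promo_spend two periods later.
promo_spend = [1, 3, 2, 5, 4, 6, 3, 7, 5, 8]
sales = [2, 1] + promo_spend[:-2]

lead = best_lag(promo_spend, sales, max_lag=3)
print(f"best lag: {lead} periods")
```

Here the best alignment is at a lag of two periods, with promo_spend moving first. That identifies the candidate lever; items 2 and 3 still require changing it deliberately and watching what happens.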

It’s extremely important to verify a correlation under multiple, different conditions on similar groups before the business relies on it. Otherwise, it’s just spurious noise. This is where A/B testing (and A/B/C/D/E and so forth) comes in: you can compare one set of assumptions against another to see which holds up best in reality.
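As a rough sketch of what such a comparison can look like, the Python below runs a two-proportion z-test on two similar groups: group A ignores the correlation and group B acts on it. The test itself is a standard statistical technique chosen here for illustration, not a method the article specifies, and the conversion counts are hypothetical.

```python
from math import erfc, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Made-up results: group A ignores the correlation, group B acts on it.
z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With these invented counts the p-value comes out below the conventional 0.05 threshold, which would count as evidence that acting on the correlation made a difference. With more than two variants, the same comparison would need a correction for multiple tests.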

Testing a correlation is fundamentally what turns data analytics into data science. It means proving the analytics rather than assuming they are correct.

This article is published as part of the IDG Contributor Network.