Correlation Does in Fact Imply Causation


Essays · 01

Statistics · Scientific Reasoning

Perhaps the single most-quoted phrase about statistics[1] is that 'correlation does not imply causation.' It's a phrase I've spoken hundreds of times, even after the ideas that resulted in this essay were broadly developed. It's often a useful educational tool for beginner-level students, and it's a convenient shorthand for a disturbingly common failure of scientific reasoning: just because A correlates with B, it doesn't mean that A causes B. The classic example is that ice cream sales correlate with violent crime rates, but that doesn't mean ice cream fuels crime. Of course this is true, and anyone still making such basic errors is well served by the catchphrase 'correlation does not imply causation'.

The thing is, our catchphrase is wrong: correlation does in fact imply[2] causation. More precisely, if things are correlated, there exists a relatively short causal chain linking those things, and the smaller the p-value of the correlation, the stronger the evidence for that chain. (The p-value is the chance of seeing a correlation at least this strong through coincidence alone, so a small one makes coincidence a poor explanation.) Far too many smart people think the catchphrase is literally true, and end up dismissing correlation as uninteresting. It's of course possible for things to be correlated by chance, in the same way that it's possible to flip a coin and get 10 heads in a row[3], but as sample size increases this becomes less and less likely. That's the whole point of calculating the p-value when testing for correlation. In other words, there are only two explanations for a correlation: coincidence or causation.

Let's return to the ice cream example. It doesn't take long to guess what's really going on here: warm weather brings more people into public spaces and makes them more irritable, which leads to spikes in violent crime, and it also brings on a craving for a cold treat. So no, ice cream does not cause violent crime. But the two are causally linked, through a quite short causal pathway. There are three possible architectures for the pathway: A causes B, B causes A, or C causes both, either directly or indirectly[4].

I would hate to push anyone back to the truly naive position that A correlating with B means A causes B, but let's not say false things: correlation does in fact imply causation[5]; it just doesn't show you which direction that causation flows.

Why do I care about correcting this phrase? Two reasons: it is bad for a community to have catchphrases that are factually false, and "correlation does not imply causation" can be, and has been, used for dark arts. Rather famously, Ronald Fisher spent the last years of his life arguing that there was insufficient evidence to conclude that smoking causes lung cancer, because correlation does not imply causation. The tobacco industry was grateful. Meanwhile, the correlation was telling us exactly what we should have been doing: not dismissing it, but designing studies to determine which of the three causal architectures explained it. The answer, of course, was the obvious one. Correlation was trying to tell us something, and we spent decades pretending it wasn't allowed to.

Notes

1. This one strikes closest to my heart as a longtime XKCD fan. Randall is almost gesturing at the point I make in this essay, but not quite. At the risk of thinking too hard about a joke (surely not a sin for this particular comic), the key flaw here is the tiny sample size: this isn't even a correlation, since the p-value is 0.5. If 1,000 people take a statistics class and we survey them before and after, then we could get a meaningful, statistically robust correlation here, and unfortunately it would probably be the case that taking the class makes people more likely to believe this phrase.
2. I'm using 'imply' in an empirical rather than logical sense: it's not that correlation proves causation the way a mathematical proof does, but that it provides evidence for causation, with strength that grows with sample size.
3. p ≈ 0.00195 (that is, 2 × 0.5^10), being generous and taking the two-tailed value.
4. That "indirectly" is pointing at a fourth option, actually an infinite set of options: C causes A and D which causes B, C causes A and E which causes D which causes B, etc. I'm not including these because it's natural to consider those as variants on C causes both. As an analogy: if one pushes over the first domino, did that cause the last domino to fall? A pedant might argue the actual cause of the last domino falling was the penultimate domino falling on it, and in some cases that precision can be useful, but most of the time it's natural to just say the person who pushed the first domino caused the last one to fall over. In practice the causal chain is probably pretty short, because interesting correlations tend to be well below one, and after a few intermediates the correlation strength drops below the noise threshold of detection.
5. With the evidentiary strength you would expect based on the p-value of the correlation. Coincidence is always a possibility, but becomes pretty unlikely for correlations with a large sample size.