Many readers (thanks!) sent links to Tyler Vigen’s Spurious Correlations, whose motto is “Discover a new correlation – an interesting spurious correlation each day!”
My favorite is the one linking the yearly number of people who drowned by falling into a swimming pool with the number of films Nicolas Cage appeared in. The correlation is, ominously, 0.666, meaning that the more we see of Cage the more widespread and Satanic the death toll. Surprised?
But that correlation is trivial compared to the one between per capita consumption of cheese in the States and the number of people who died by becoming tangled in their bed sheets, which reaches a whopping, and statistically significant, 0.947.
Before you wander off giggling, answer this: what makes Mr Vigen’s entries different from those offered at “data journalism” sites like Vox and FiveThirtyEight? Or indeed different from learned journals, such as Stroke, which recently published the “significant” discovery that “Geomagnetic Storms Can Trigger Stroke”?1
Consider that we have dispassionate data culled from government sources, and therefore as pure as can be. No rules of statistics have been violated. No miscalculations have been made. True, “statistical significance” should join chiropractic, communism, and perpetual motion machines on the scrap heap of beguiling but baneful, baleful beliefs. But switching to another quantitative probability interpretation won’t fix anything.
Like, for instance, a Bayesian technique. Turning these correlations into Bayesian parametric posteriors would change little. Even going whole hog and speaking of posterior predictive distributions, where the uncertainty of the parameters is “integrated out” and the model speaks purely in terms of observables, though it would be an improvement, wouldn’t do much good.
That’s because the quantitative “signals” identified by Vigen, and by many “researchers” in their papers, are real—they really are there. That cheese–bed sheet correlation really is absurdly high. But could eating cheese cause somebody to strangle himself with his bed sheets? Maybe. Cheese is very binding, as my Grandmother used to say.
Hilarious puns aside, we suspect these correlations because we can’t think of plausible (efficient) causal connections. We have no proof of lack of causality, not in the “formal data” anyway. But that is because probability is far more than its formal quantification.
Statistics is shockingly limited. It never, or at least not natively, asks about causation. Instead, it asks about correlations. “Given this what is the probability of that,” is what it is good at, not “What caused that?”
This would not be problematic except that everybody, unless they’re forewarned, mistakes statistical correlations for causality—and even when forewarned the error is made. That stroke paper says the sun’s rays are causing apoplexy. How? Who knows? But the sun’s rays are a form of radiation and radiation, as our culture affirms every chance it gets, is bad. Strokes are bad, too. Therefore, the sun might cause strokes.
And, hey, maybe it does. The statistics say it might; there are even wee p-values. But statistics also suggests Nicolas Cage causes drownings.
The key difference between the stroke paper and the cheese–bed sheet connection is that the authors of the former work took care to build a plausible causal story, while Vigen’s site offers none, and even invites you to consider that there could be none. The formal quantitative result is the same in both cases.
This is where statistical practice becomes schizophrenic. Everybody knows that there is more to the evidence than that which is formally quantified. But if the formally quantified evidence is pleasing (wee p-values, etc.) it is taken as proof of the speculated causation, as if, that is, it were the complete evidence. Read the discussion section of any paper which relies on statistics to see this, particularly in the so-called soft sciences or those claiming the horrors which await us once global warming finally strikes (soon, soon).
The opposite also holds. Consider that if we knew, really knew, the causal process by which solar rays caused stroke, it wouldn’t matter what the statistical evidence said. Non wee p-value? Well, that could be a faulty observation, the wrong population, something.
Part of the problem is the intense drive to quantify and to leave everything non-quantifiable behind. You can’t stick ideas into formulas. Another part is the mysticism which accompanies classical statistical measures. Wee p-values are magic.
So what to do? Ah, that’s the hard part. One quick example. If we want to map the uncertainty of the flight of a bullet from a Smith & Wesson six-shooter, we ask a physicist about the equations of motion related to ballistics. Because why? Because those equations quantify our understanding of the causality. So much force in such and such a way results in the bullet being caused to land over there. That’s a pure causal model.
Except it won’t work in practice, not perfectly. Because that causal model won’t nail the precision of the landing past some point. We may be able to say, via the causal model, that the bullet will land somewhere on some target, but that’s it. To say more, we can add a probability model to the causal model which gives us probabilities the bullet lands in specific locations on the target.
We do that because we don’t understand all the forces acting on the flight of the bullet. Those parts which we don’t understand are “random”, i.e. unknown; those parts which we do understand are the (gross) causes, and are modeled accordingly.
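The split between the causal part and the probabilistic part can be sketched in a few lines. The numbers below (muzzle velocity, angle, scatter) are illustrative assumptions, not real Smith &amp; Wesson specifications; the point is only the structure: physics predicts where the bullet lands, and a probability model stands in for the forces left unmodeled.

```python
import math
import random

random.seed(1)

# Causal part: textbook ballistics with assumed, illustrative numbers.
v0 = 250.0                 # muzzle velocity, m/s (assumed)
angle = math.radians(0.5)  # elevation above horizontal (assumed)
g = 9.81                   # gravitational acceleration, m/s^2

# Flat-ground range formula: the physics says the bullet lands here.
predicted_range = v0**2 * math.sin(2 * angle) / g

# Probabilistic part: wind, grip, powder variation -- everything we did
# not model -- lumped into Gaussian scatter around the prediction.
sigma = 1.5  # metres of spread, an assumed value

trials = 100_000
hits = sum(
    abs(random.gauss(predicted_range, sigma) - predicted_range) < 1.0
    for _ in range(trials)
)
print(f"physics says {predicted_range:.1f} m; "
      f"P(within 1 m of that point) is about {hits / trials:.2f}")
```

The deterministic formula answers “where will it land?”; the scatter answers “how sure are we?”. Neither alone is the model: the model is the pair.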
In other words, the best models are mixtures of physics and probability. About these, more later.
1We might look at this paper, which was discovered by K.A. Rodgers, in depth later.