I made this picture for my talk at the first ever public Broken Science event (videos coming soon):

Everybody has heard the saying “correlation doesn’t imply causation.” Taken loosely, it seems wrong, because when there is causation there is also correlation. And scientists sometimes investigate problems in which they suspect causation is present.

The difficulty is with *imply*. In the loose, vulgar sense, it means something like *goes with*. So we’d have “correlation goes with causation”, which is true, more or less.

The original slogan is meant to be logical, where there is a strict definition of *imply*. As in *logically follows*. This means another way of saying the slogan is “causation does not follow from correlation.” Or “correlation could be a coincidence and not causation.”

There are also difficulties with *coincidence*. Here I mean it in the sense of Diaconis and Mosteller: events that seem surprising and unusually close together in essential aspects, yet have no causal link, with the stress on *no causal link*. (We also unfortunately sometimes use *coincidence* to mean there *is* a causal link, and that it is suspicious.)

You have the idea.

In any case, every single time, with *no* exceptions, no, not even if you are the greatest scientist in the world, when we say “Here is a correlation, thus it is causation”, it is a fallacy. A strict logical fallacy. Every time. Each time. No wiggling out. Even if your grant is large.

From this it does not follow that causation is never present when correlation is. It may be. The more wisely you choose your correlations the more likely causation is there, somewhere, maybe.

But it remains a simple logical inescapable fact that moving from correlation to causation without understanding the causation, without putting the premises of the causation into your argument, is a fallacy.

Every use of a p-value is a fallacy.

The p-value says, “Assume the *null*, that the correlation is only a coincidence, is true; here is the probability of something that happened; therefore, my correlation is causation.” This is false, a fallacy.

Again, every use of a p-value is a fallacy. *Every* as in *every*. Each one. Every time. No exceptions. All p-values are logical fallacies. This is so no matter what. It is so even if you say, “Yes, p-values have problems, but there are some good uses for them.” Whether or not that is true, and I say it is not, each time a p-value is used a logical fallacy has been invoked.

If, that is, a causation has been imagined, hinted at, teased, whispered, or outright declared when the p is wee.

Which always happens. That, after all, is why the p-value is used. To “reject” the hypothesis of correlation—which is to say, of coincidence. To reject the “null” is to say “The correlation I have observed is not a coincidence. Something is going on here.” And that something is causation.

Perhaps not *direct* causation, as in the parameter associated with the wee p is not thought in itself to be the cause, but is itself caused by something else, perhaps not measured. Still, this is causation. So this use of the p-value is a strict logical fallacy, too.

Coincidences abound. There are, if not an infinite number of interesting correlations, then so many that it is near enough infinity. Which means there is an infinity, or near enough, of data sets that will evince wee p-values, but where no causality is present. And if we merely dress up the data sets in some pretty science language, we can make our coincidences sound like science.

Which is what happens.

I’ve used this example countless times, but it hasn’t worn out its usefulness yet. The site Spurious Correlations—*spurious* indicating lack of cause—has a graph of US spending on science, space and technology plotted against suicides by hanging, strangulation and suffocation. Nearly perfect correlation, a cute coincidence. Spurious.

But that correlation would give a wee p-value. Which I use to reject the rejection of the “null” hypothesis, because why? Because it’s obvious there is no causal connection between the two.
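A toy sketch of that situation (all numbers invented; any two series sharing a trend will do): two unrelated upward-drifting series, their correlation, and a permutation p-value, which comes out wee despite there being no causation anywhere in sight.

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)

# Two made-up series that merely trend upward over 15 "years";
# neither has anything to do with the other.
years = range(15)
spending = [100 + 5 * t + random.gauss(0, 3) for t in years]
suicides = [5000 + 120 * t + random.gauss(0, 80) for t in years]

r_obs = pearson_r(spending, suicides)

# Permutation "p-value": how often does shuffling one series give a
# correlation at least as large in magnitude as the one observed?
shuffled = list(suicides)
hits, trials = 0, 2000
for _ in range(trials):
    random.shuffle(shuffled)
    if abs(pearson_r(spending, shuffled)) >= abs(r_obs):
        hits += 1
p = hits / trials

print(f"r = {r_obs:.3f}, p = {p:.4f}")  # a wee p; no causation in sight
```

The shared trend does all the work: the machinery happily “rejects the null” for series that were generated independently.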

Which judgement—our premise of supposing there is no causal connection—proves we often bring in information outside the p-value paradigm (sounds like an ’80s cold war spy novel). So why not bring it in *every time*?

Indeed, why not. That’s kinda sorta what Bayesians do, except they obsess over unobservable parameters to which they also assign bizarre causal powers, which is a story for another time.



This is excellent and I wonder how I’ve missed this simple explanation for what is wrong with p-values in the years I’ve been reading this blog. It’s possible I’ve just been blind. It certainly hasn’t escaped my notice that Briggs denounces p-values at every opportunity — but I have somehow missed this concise explanation of why they are a fallacy: because they skip any attempt at trying to understand cause and just wave a wand saying, given a magic number, we can assume cause when we actually can’t. Thank you, Briggs.

Thanks, this makes clear to me the wee-p problem. I’ve read your blog for a while now and bought “Uncertainty” (Springer, full retail) and I never quite grokked your problem with the use of p-values. I believe I see clearly now. These people use it thinking they’ve found the needle, when all they’ve found is a haystack.

I’m looking forward to videos of the “Broken Science” conference.

[…] Because it’s obvious there is no causal connection between the two.[…]

‘Obvious’: Not clear whether it is distinguishable from prejudice.

One thing I’ve said for a long time is “The only use of statistics is to prove what we already know to be true.”

What I mean is that people will agree that statistical methods do not imply causation in a great many situations. But they do not attribute this to a lack in the method, but rather in using statistics where there is “obviously” no causation.

For example, you can show them data that has an extremely high correlation coefficient for spurious calculations like those that you mention, like the rate of UFO sightings being correlated with deaths by drowning in home swimming pools or whatever. People will see those and say the same thing that you have said here: it’s obvious that there is no connection between the two things, thus there is nothing to worry about. But they never follow the conclusion through fully (unlike you), i.e. that if things can be correlated but unrelated in THIS instance, then the same could happen in ANY instance. Oh sure they will give lip service to the idea, but the second they have a high correlation between two things that they like, the causation language will be everywhere.

The same is true for p-values, except worse. This is because most people who use p-values have absolutely no idea how they are defined, mathematically. They will give you the “it’s the chance that things are due to chance” definition, which is of course nonsense, and such a probability would be impossible to calculate. You can show them data which has a wee p but obviously doesn’t have any specific cause, and they will say that there is no problem since anyone can see that there is no causation there. But they expect some magic to mean that in the cases where they want to use p-values, the values will actually work to show causation.

So the rule effectively becomes: You can use whatever crazy statistical method you want, but only to prove things that are actually true. If statistical methods “prove” something which is false, then that is just a coincidence, but if the conclusion is “known to be true” then the methods are beyond doubt. Thus statistics can only be used to prove things that we already know.

(Of course I am referring to how statistics is used in the wild, not to the actual theory of statistics.)

I’m finally beginning to get it. Thanks, Briggo!

I’m trying to decide if a certain pair of dice are loaded. I throw them ten thousand times. I test the results, and the high p-value for each possible result, from snake-eyes to double-boxcars, leads me to conclude they are not loaded.

Please explain why my logic is incorrect, and describe what alternative measure you would use to determine if the dice are loaded.

Thanks as always for your most fascinating blog,

w.
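For what it’s worth, the procedure Willis describes, if implemented with a chi-square goodness-of-fit test (my assumption; he doesn’t name his test), might be sketched like this:

```python
import random

random.seed(42)

# Simulate ten thousand throws of a pair of (fair) dice.
throws = [random.randint(1, 6) + random.randint(1, 6) for _ in range(10_000)]

# Observed counts for each sum, snake-eyes (2) to double-boxcars (12).
observed = {s: throws.count(s) for s in range(2, 13)}

# Expected counts under the "fair dice" hypothesis:
# ways to roll each sum out of 36 equally likely outcomes.
ways = {s: sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == s)
        for s in range(2, 13)}
expected = {s: 10_000 * ways[s] / 36 for s in range(2, 13)}

# Chi-square statistic; the 0.05 critical value for 10 degrees of
# freedom (11 categories minus 1) is about 18.31.
chi2 = sum((observed[s] - expected[s]) ** 2 / expected[s] for s in range(2, 13))
print(f"chi-square = {chi2:.2f}  (0.05 critical value ~ 18.31)")
```

Even granting the arithmetic, the statistic only measures surprise under the fair-dice model; by the argument in the post, declaring the dice “loaded” or “not loaded” is a causal claim the calculation itself cannot supply.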

Willis,

How does it lead you to conclude that? Specifically, I mean.

And, of course, what flaws do you find in the argument above? Specifically, I mean.

Playing devil’s advocate: I do not see, in general, that causation is the principle underlying P-value measurement.

Take the commonly observed case of the comparison of two groups of data: pharmaceutical treatment vs no treatment.

In it’s basic form the purpose of P-Values is simply to attempt to quantify the mathematical statistical significance between two sets of data using established statistical methods without respect to causation. It is simply an exploratory usage of an experiment, I would assert. I believe that most statisticians would agree that the comparison has no relationship to causation, or repeatability, or reliability. At least this would be the stance of the American Statistical Association. Additional investigation would be required to establish causation or effectiveness.
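A minimal sketch of that two-group comparison, using a permutation p-value rather than a t-test (the group sizes and all numbers are invented):

```python
import random

random.seed(7)

# Invented measurements: treated vs untreated groups.
treated   = [random.gauss(10.8, 2.0) for _ in range(30)]
untreated = [random.gauss(10.0, 2.0) for _ in range(30)]

def mean(xs):
    return sum(xs) / len(xs)

obs_diff = mean(treated) - mean(untreated)

# Permutation p-value: how often would a difference in means at least
# this large in magnitude appear if the group labels were shuffled?
pooled = treated + untreated
hits, trials = 0, 4000
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:30]) - mean(pooled[30:])
    if abs(diff) >= abs(obs_diff):
        hits += 1
p = hits / trials

print(f"difference = {obs_diff:.2f}, p = {p:.3f}")
```

Nothing in the calculation mentions causation; it only quantifies how unusual the observed difference would be under shuffled labels, which is the “exploratory” reading described above.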

The mistake is made when commercial interests assert or claim that more can be justified from the analysis: i.e., that the finding is a measure of effectiveness. That is when it becomes problematic.

Actually it is not a mistake, it is a deliberate manipulation or scam that works well for their bottom line and this is the reason we are not going to see Wee-P’s replaced by big-Pharma and big-Phinance any time soon.

“Correlated” is one of those annoyingly overloaded words. The informal casual usage gets conflated with various measures of the same name defined by equations. For example, consider a certain string of ciphertext that is “caused” by a certain string of plaintext (plus an encryption algorithm, of course). Casual usage would assert the ciphertext is certainly correlated to the plaintext, but if a mathematician ran the two sequences through a correlation equation, they would assert the sequences are uncorrelated.

Similar situations can arise with confounding variables – I’ve seen many engineers, including myself, waste a lot of time chasing correlation ghosts around in circles. One classic tail-chase follows this pattern: observe a problem, change x -> problem still present, change y -> problem still present -> change z -> problem gone! -> revert y -> problem still gone, revert x -> problem still gone -> go to lunch, come back -> problem is back. One engineer I worked with approached these types of problems more successfully than the rest of us by using some sort of statistical techniques I didn’t understand, despite his attempts to explain it to me; it seemed like he was spending too much time collecting data, but it paid off in the end, without the hair-pulling. Every old engineer has stories about the mysteries they struggled with but never solved; sometimes you end up having to whack the problem with a sledgehammer and move on. I like to think that all my personal engineering mysteries will be made clear to me some day in Engineer Heaven, assuming they let me in.

A hundred years ago, a person eats an edible wild mushroom, has no immediate symptoms, but gets sick and mysteriously dies a week later. If there is a correlation, it isn’t obvious to anyone. This continues to happen occasionally. Eventually someone gets curious and looks into it, takes educated guesses as to the cause of these deaths, and checks for correlations. The curious person eventually, after many guesses that didn’t yield a firm correlation, notices that a common local edible mushroom has an uncommon lookalike, and starts digging through the trash bins of people who have died in a similar manner, finds a strong correlation with the presence of remnants of that uncommon lookalike, and warns everyone. Many people dismiss this – they have been eating this wild mushroom for years and no one is going to tell them not to, dammit. Deaths continue, albeit with decreasing frequency. As time passes, various graduate students study the deaths using the latest equipment, isolate possible chemical agents, study various organ tissues from the deceased, and run experiments that can cause similar organ damage. As technology advances, graduate students run simulations that show a possible chain of chemical reactions that could provide more detail. Even today, there is still much poorly-understood detail left for graduate students to study, and they continue to develop theories and devise experiments.

It seems to me that many things we now believe to be causal went through a similar process, where correlation slowly became accepted as causation. If you run off the road in your car, hit a tree and immediately die, we accept cause as obvious, we don’t entertain the possibility of any sort of coincidence. But when we are looking for cause for anything non-obvious, at what point does correlation become strong enough to call it causation?

While that spurious correlations site is certainly funny, I tire of it pretty quickly. By my thinking, anything described by a sequence does one of three things short-term: increases, decreases, or stays the same. Same thing applies for a second sequence. So right off the bat there is a good chance that two unrelated sequences will have a spurious correlation if they are short enough. It would be interesting to detrend the sequences in those spurious correlation graphs and see if the humor suffers or improves (which I haven’t done, beyond quick visualization attempts).
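The detrending experiment suggested above is easy to sketch (series and numbers invented): correlate two unrelated trending series raw, then correlate their first differences.

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

random.seed(3)

# Two unrelated series that both drift upward.
t = range(40)
a = [2.0 * k + random.gauss(0, 4) for k in t]
b = [0.5 * k + random.gauss(0, 1) for k in t]

raw_r = pearson_r(a, b)

# First differences remove the shared trend.
da = [a[i + 1] - a[i] for i in range(len(a) - 1)]
db = [b[i + 1] - b[i] for i in range(len(b) - 1)]
diff_r = pearson_r(da, db)

print(f"raw r = {raw_r:.2f}, detrended r = {diff_r:.2f}")
```

With the shared trend removed, the “nearly perfect” correlation typically collapses toward zero, which is one quick way to test whether the humor survives.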

“In it’s basic form the purpose of P-Values is simply to attempt to quantify the mathematical statistical significance between two sets of data using established statistical methods without respect to causation.”

My heart soars to learn that the American Statistical Association would never allow itself (pardon me, according to the commenter, the correct usage is “it’sself”) would never allow it’sself to be sullied with language such as “rejecting the null” in any discussion of p. Despite the fact that this language appears in many statistics texts written by eminent members of the American Statistical Association, and despite the fact that this phrase harkens back to the very beginnings of p.

We shall ignore the high-sounding gibberish purporting to show the REAL purpose of p, which is… what? To do a mere and perfectly pure calculation using “established statistical methods”? What for? To employ statisticians?

Unfortunately, as a professional PhD statistician has just pointed out, the pure as the driven snow p calculation only exists within a context, a context in which a thing called the “null” exists, and this null in fact has, has always had, and will always have exactly the meaning he said: it creates a logical fallacy in its (excuse me again) in it’s, very being.

The p is a calculation that is founded on a logical fallacy. That is the point! NO calculation founded on a logical fallacy is worth anything, ever. However evil Pharma may be, it (sorry: i’t) is not responsible for a calculation that has from inception been founded on a logical fallacy.

Think of p as dividing by zero. It sounds like you should be able to do it, but only if you don’t understand the underlying system, in which the calculation simply cannot make any sense.

Matt is trying, mostly failing, but trying, to show others that calculations founded on logical flaws, or in this case, an out and out fallacy, don’t make sense and should never be done, let alone used for any purpose, nefarious or heavenly.

An L-shaped ruler? A sale? LOL

Eagerly awaiting your response.

I wonder if you are familiar with “Tactics of Scientific Research” (1960) by Murray P. Sidman. While the title is general, it refers specifically to research on behavior, not to more broad scientific methodologies. But it promotes no inferential statistics and is incredibly hard-nosed in how it insists that any claim of causality must be made. I would be interested to know what you make of it.

The entire field of probability and statistics is just another branch of mathematics.

Like all mathematical techniques, it’s (aha!) necessary to understand the assumptions and limitations.

P-values, regression, goodness-of-fit, anova, distribution, simulation, etc. At face value none of these techniques can be said to be evidence of causation; which is something else altogether.

But they are still useful exploratory techniques. The issue is not the mathematical technique, but the skill and knowledge of those who use it.

In the end, the proof is in the repeatability and predictive capability that emerges.

There are many, many relationships in engineering codes that are theoretically flawed but work just fine in practice, because they have been found to be reliable.

Robin,

There is a big problem with saying:

“In it’s basic form the purpose of P-Values is simply to attempt to quantify the mathematical statistical significance between two sets of data using established statistical methods without respect to causation.”

Here it is: what do you mean by significance? If you mean something related to causation, we run into the same problems as usual. If you say that significance means having a wee p-value, then we have a tautological measure: the p-value measures what the p-value measures. So what’s the point?