# Spurious Correlations Proves Hypothesis Testing Should Be Abandoned

This week traditionally is a slow week on the blog, so let me have a go at explaining something I’ve explained a few hundred times before, a thing which has not yet stuck. Maybe the pace of the day will help us.

I enjoy collecting statistics (I use that word in its old-fashioned sense) like this:

Well, you tell me: do sunspots cause war? Would a “null hypothesis” significance test produce a wee p, thus “proving” the cause? Or not proving the cause, because we all know correlation isn’t causation, so proving the correlation instead. Which didn’t need proving, because the correlation is there, and is proof of itself. A large p doesn’t make what is there disappear.

So the wee p doesn’t prove causation and it doesn’t prove correlation. What does it prove?

Nothing.

As I have quoted more times than I can recall, De Finetti in 1974 shouted “PROBABILITY DOES NOT EXIST.” Most of his audience didn’t understand him, and most still don’t.

If probability doesn’t exist, any procedure or mechanism or measure that purports to tell you about probability as if it is a thing that exists must therefore be wrong. Right?

P-values are measures, hypothesis tests are procedures, and some obscure machines (used to measure statistics) all purport to tell you about probability as if it is a thing that exists. Since probability does not exist, they are all therefore wrong.

Right?

That’s the true, valid, and sound proof against p-values. But it doesn’t stick. Many still have the vague belief that, yes, probability does exist, therefore there are “right ways” or “good ways” to use p-values. This is false. So let’s use a deeper proof. Here is where you have to pay attention.

We’ve seen the site Spurious Correlations many times. Here’s one of his plots, “US spending on science, space, and technology” with “Suicides by hanging, strangulation and suffocation.”

It’s clear that this correlation would give a wee p, or that it’s easily extended so that it will. But we all recognize it’s a silly correlation.
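To see how little it takes, here is a minimal sketch. It is not the site's data: both series are made up, each being nothing more than "time plus noise". The helper names, the eleven observations, and the Fisher z approximation used for the p-value are all my assumptions for illustration.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value from the Fisher z approximation (not an exact t test)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(0)
years = range(11)  # eleven yearly observations, an arbitrary choice
# Made-up series: each is just a time trend plus its own independent noise.
spending = [1.0 * t + rng.gauss(0, 0.5) for t in years]
suicides = [2.0 * t + rng.gauss(0, 0.5) for t in years]

r = pearson_r(spending, suicides)
p = approx_p_value(r, len(spending))
print(f"r = {r:.3f}, approximate p = {p:.2g}")
```

Any two things that both drift upward over the same years will behave like this. The wee p arrives with no help from causation.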

Why?

Because we bring in outside information that says there is no way one of these things would cause the other. Simple!

Problem is, there is no way to fit in “outside” information in frequentist theory. In old-fashioned, real old-school, non-frequentist, non-subjective probability, probability as logic, we gather all information we believe is relevant—and even more importantly, we ignore all evidence we believe is irrelevant—and then use it to judge the probability of some proposition. Here the proposition is that spending causes suicides, or suicides cause spending.

In notation: Pr(Y|X), where Y is the proposition and X all information we deem relevant—and the absence of all information we deem irrelevant.
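A toy sketch of that notation, under a loud assumption: the premises X do nothing except list the possibilities, and say nothing to favor one over another, so the probability is found by counting. The function `pr` and the three premise sets are hypothetical names of mine.

```python
from fractions import Fraction


def pr(proposition, premises):
    """Pr(Y|X) by counting: X is given as the list of outcomes it leaves possible."""
    favourable = [o for o in premises if proposition(o)]
    return Fraction(len(favourable), len(premises))


def shows_six(outcome):
    """Y: the side showing is a six."""
    return outcome == 6


x1 = [1, 2, 3, 4, 5, 6]             # X1: a six-sided object, one side will show
x2 = [o for o in x1 if o % 2 == 0]  # X2: X1 plus "the side showing is even"
x3 = [6]                            # X3: X1 plus "the side showing is six"

print(pr(shows_six, x1))  # 1/6
print(pr(shows_six, x2))  # 1/3
print(pr(shows_six, x3))  # 1
```

Same Y, three different X, three different probabilities. Nothing about the object changed, only the information we chose to condition on.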

What should surprise you is that there is no mechanism to do this in frequentism, which is the probability theory that treats probability as if it is a real thing.

In frequentism, the criterion for including or excluding evidence is the “significance” test. Stick with me here, because here’s the subtle part. There are always (as in always) an infinite number of premises or evidence that we can add to the “X”, in frequentism or in logical probability.

There is no guarantee, of course, that in any logical probability assessment we have chosen the “right” X. Indeed, the right X is the one that gives the full cause of Y (so that Pr(Y|right X) = 1). But we are free (and this is the semi-subjective nature of probability and decision) to pick what we like. When we like. Before taking data, after, whenever. And then ask: what is Pr(Y|X)?

But there is no way to pick the “right” X for frequentism without subjecting every potential element of X to a significance test. Since there are an infinite number of X for any Y, and since hypothesis testing acknowledges spurious correlations can appear as legitimate (if the p is wee), then either every problem if carried out in strict accordance with frequentist theory will have an infinite number of spurious correlations, or the test will “saturate” at some point and the analysis has to end incomplete.
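A rough simulation of the first horn, with a finite stand-in for the infinite list. Everything here is an assumption for illustration: the thousand candidates, the thirty observations, the 0.05 cutoff, and the Fisher z approximation in place of an exact test. Every candidate X is pure noise, generated with no connection to Y.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value from the Fisher z approximation (not an exact t test)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(1)
n = 30               # observations per series (arbitrary)
n_candidates = 1000  # a finite stand-in for the endless list of possible X

y = [rng.gauss(0, 1) for _ in range(n)]
wee = 0
for _ in range(n_candidates):
    x = [rng.gauss(0, 1) for _ in range(n)]  # pure noise, unrelated to y
    if approx_p_value(pearson_r(x, y), n) < 0.05:
        wee += 1

print(f"{wee} of {n_candidates} unrelated X gave p < 0.05")
```

Roughly one in twenty of the noise series should clear the bar, and the count grows in step with the length of the list. Test enough X and "significant" ones are sure to turn up.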

By “saturate” I mean the significance testing math breaks down with finite n: after adding too many premises, the testing can no longer be computed. I won’t here explain the math of this, but all statisticians will understand this point.
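For readers who want at least a hint, here is one common reading of saturation, assuming the testing is done inside an ordinary least-squares regression. That setting is my assumption, as are the symbols: $M$ is the $n \times (k+1)$ matrix holding an intercept and $k$ candidate X, and $\mathrm{RSS}$ is the residual sum of squares.

```latex
\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - k - 1},
\qquad
t_j = \frac{\hat{\beta}_j}{\hat{\sigma}\,\sqrt{\left[(M^\top M)^{-1}\right]_{jj}}},
\qquad
\text{degrees of freedom} = n - k - 1 .
```

Once $k = n - 1$ the fit passes through every point, so $\mathrm{RSS} = 0$ and $\hat{\sigma}^2$ becomes $0/0$. Past that, $M^\top M$ cannot be inverted at all. Either way there are no degrees of freedom left, no $t_j$, and no p-value to be wee or otherwise.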

There thus can be no genuine frequentist result, not one entirely consistent with the theory, unless the theory is abandoned at some point. And that point—which always comes—is when the analyst acts like a logical probabilist and excludes certain X.

As it is, the “frequentist” (the scare quotes indicate there is no true or pure such creature, but only approximations) will conduct “tests” on such X he picks, and ignore the tests for X he excludes. All “frequentists” are only partial frequentists. And even more partial than you think.

Here, with spending and suicides, even with a wee p, most “frequentists” would again reject the rejection of the “null” (that is, toss out the wee p) because other X are brought in from which it is deduced the correlation is silly.

The “frequentist” will counter they only have to test the X they believe are relevant. But then they have to say how they winnowed down the infinite list to choose the “relevants”. How did they do this? It can’t be by testing. So there must exist procedures that are not testing that allow picking Xs and excluding Xs.

Read that over and over until it sinks in.

So why not use these other procedures all the time? Or you have to show us exactly what this strange inherent measure or procedure is that says when to use testing and when not to.

These questions are never answered. At most, we will hear “Well, there are still some good uses for p-values.” To which I answer: no there aren’t, and we just proved it.



1. Vincent Capuano

A similar point is made by Henry Veatch in Two Logics (1969): while “relating logic” is mathematical and precise, it can’t say what things are. “What logic” is less precise but can’t be replaced by “relating logic”.

2. Robin

SSgt Briggs: Superb essay. Again.

Is it safe to now declare that Reverend Bayes was correct after all?

3. Hagfish Bagpipe

Briggs: “As I have quoted more times than I can recall, De Finetti in 1974 shouted “PROBABILITY DOES NOT EXIST.”

Yeah, I remember that — De Finetti would wander about the streets of the West Village wearing a bathrobe, barefoot, shouting, “PROBABILITY DOES NOT EXIST.” at passersby. I found it charming.

4. I’m convinced!

Frequentist statistics, as taught in social science grad school, appeared, to me, confusing, but logical once I accepted the premises, or drank the kool-aid. But there was always a nagging feeling of incompleteness, or a gut feeling that the whole thing was a semi-fraud.

Can the frequentist testing approach be salvaged by applying logical selections of hypothesized cause/effect? That is, insert a prerequisite step before the correlation testing that is a logic-based, non-math explanation of why we’ve chosen an X.

Ex: We chose “eating ice cream” as a potential cause of “gaining weight” because ice cream is high in fat and extra calories. Whereas we did not choose “walking by the corner of 5th and Main” as a potential cause of gaining weight because there is no known logical reason the act of walking by a certain corner would cause weight gain. (Even though the act of “walking by…” could be shown to correlate with weight gain, because the ice cream store is at that corner, and each time you eat ice cream you walk by that corner.)

Would that extra step of an explicit rationale for selecting Xs tighten up the frequentist testing approach?

5. Douglas W Skinner

I would introduce one quibble. I would say that probability DOES EXIST, because the results we calculate and call probabilities exist (we have them in our hand), only that it is NOT A PROPERTY. I say this because, in the physics community I come from, there is the strongly rooted notion that probability is the “basis” for quantum mechanics and statistical mechanics. So, for example, it is believed that the half-life of radium, which is observable, is tied to a probability distribution which inheres in each radium atom.

Similarly with games of chance it is believed that probability arises from the explicit listing of possibilities. The problem here is that for results to be obtained the atomic probabilities have to be assumed or specified. As these relate to actual physical objects, there is clearly an infinite number of possibilities. A coin may not be “fair” in an infinite number of ways. And when we talk about a “fair” coin we really mean to what extent does an actual coin approximate our conception of fairness (i.e., p{heads} = p{tail} = 1/2).

Still, I am not quite comfortable with your idea that the use of p-values and hypothesis tests should be abandoned. After all, a lot of useful work has been done with them and I’m pretty sure in the best applications no conclusion that established causation from correlation would be overturned. As I see it being a “partial frequentist” is not a bad thing as long as you’re a good one.

6. Johnno

Douglas, I believe the main thrust of the matter is that what is “probable” doesn’t depend on the charts.

It begins in what the plotter chooses to include or exclude a priori. Which, in the end is philosophical or entirely ad hoc. Therefore, it doesn’t exist in the scientistic sense.

You have already decided that a probability exists where it might never, thus assume correlations, then assume causations. It is 3-step idiocy. Quite different from rolling a die, where knowing all six sides that may show up gives you a contained and controlled environment in which you have already set the laws and rules of the board game.

You may as well plot a correlation between the economy and how many people play Monopoly, accounting for what fraction of people together in an average game get to be the banker and the likelihood of duplicate CHANCE cards.

P values work best in controlled places where you have eliminated what is probable to what you actually know is direct causation – rolling the dice limited to six sides moves your piece forward that equal number of places on the board.

Every step beyond that, we are guessing. And we guess what is probable and we are probably wrong.

7. Robin

American Statistical Association, March 2016:

1. P-values can indicate how incompatible the data are with a specified statistical model.

2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf

In my view it’s about repeatability and predictability. And, let me see the data please.

8. Rudolph Harrier

In one of my stats classes the teacher, following the book, included a cautionary tale about a spurious correlation (I think it was between the number of UFO sightings and the price of some commodity). He said something like “even though there is a low p-value in this case, we can clearly see that there is no real relation and should dismiss the relationship. Probabilities are not absolute.”

And that was the last he said on it. When discussing other p-value testing applications he never said something like “Now the reason we know that these results are not spurious is…” Everything was said with very certain language, when the p’s were low enough. And there were some examples where the relationship was far from clear (ex. stuff like number of telephone lines in a county vs. number of cancer cases.)

Eventually I realized what the unspoken rule was: Correlations are not spurious when we “know” the result is true, and they are when we “know” the result is false. That is, the trick to using hypothesis testing is to know what the correct result is before you start and then to interpret the data to support that result. Even if you get a high p-value for something you “know” is true (this happened in an example with secondhand smoke exposure and lung cancer) you simply say (as the teacher did) “the lack of support of a relationship in the data does not mean that we should abandon our hypothesis, instead we should get more data with more care to eliminate any source of error.”

Nearly every single person who uses statistics uses them this way (both the professional and the layman) though very few admit it.

9. Robin

I love reading these articles. Here is a take on probability from ET Jaynes’ paper of 1988. At its core is his idea of the Mind Projection Fallacy:

“It is very difficult to get this point across to those who think that in doing probability calculations their equations are describing the real world. But that is claiming something that one could never know to be true; we call it the Mind Projection Fallacy. The analogy is to a movie projector, whereby things that exist only as marks on a tiny strip of film appear to be real objects moving across a large screen. Similarly, we are all under an ego-driven temptation to project our private thoughts out onto the real world, by supposing that the creations of one’s own imagination are real properties of Nature, or that one’s own ignorance signifies some kind of indecision on the part of Nature.

The current literature of quantum theory is saturated with the Mind Projection Fallacy. Many of us were first told, as undergraduates, about Bose and Fermi statistics by an argument like this:

“You and I cannot distinguish between the particles; therefore the particles behave differently than if we could.” Or the mysteries of the uncertainty principle were explained to us thus: “The momentum of the particle is unknown; therefore it has a high kinetic energy.” A standard of logic that would be considered a psychiatric disorder in other fields, is the accepted norm in quantum theory. But this is really a form of arrogance, as if one were claiming to control Nature by psychokinesis.”

From “Clearing Up Mysteries, the Original Goal”, Jaynes, 1988:

https://bayes.wustl.edu/etj/articles/cmystery.pdf