Susan Holmes has done us a service by writing clearly the philosophy of the p-value in her new paper “Statistical Proof? The Problem of Irreproducibility” in the Bulletin of the American Mathematical Society (Volume 55, Number 1, January 2018, Pages 31–55).
The thesis of her paper, about which I am in the fullest possible support, is this: “Data currently generated in the fields of ecology, medicine, climatology, and neuroscience often contain tens of thousands of measured variables. If special care is not taken, the complexity associated with statistical analysis of such data can lead to publication of results that prove to be irreproducible.”
About how to fix the problem we disagree. I say it won’t be any kind of p-value, or p-value-like creation.
Here from the opening are the clear words:
Statisticians are willing to pay “some chance of error to extract knowledge” (J. W. Tukey) using induction as follows.
If, given (A => B), then the existence of a small ε such that P(B) < ε tells us that A is probably not true.
This translates into an inference which suggests that if we observe data X, which is very unlikely if A is true (written P(X|A) < ε), then A is not plausible. [A footnote to this sentence is pasted next.]
We do not say here that the probability of A is low; as we will see in a standard frequentist setting, either A is true or not and fixed events do not have probabilities. In the Bayesian setting we would be able to state a probability for A.
I agree with her definition of the p-value. In notation, the words (of the third paragraph) translate to this:
(1) Pr(A|X & Pr(X|A) = small) = small.
The argument behind this equation is fallacious. To see why, first convince yourself the notation is correct.
I also agree—with a loud yes!—that under the theory of frequentism “fixed events do not have probabilities.”
But in reality, of course they do. Every frequentist acts as if they do when they say things like “A is not plausible”. Not plausible is a synonym for not likely, which is a synonym for of low probability. In other words, every time a frequentist uses a p-value, he makes a probability judgement, which is forbidden by the theory he claims to hold.
Limiting relative frequency, as we have discussed many times, and often in Uncertainty, is an incorrect theory of probability. But let that pass. Believe it if you like; say that singular events like A cannot have probabilities (which does follow from the theory), and then give A a (non-quantified) probability after all. Let’s pretend we do not see the inconsistency.
Let’s instead examine (1). It helps to have an example. Let A be the theory “There is a six-sided object that when activated must show one of the six sides, just one of which is labeled 6.” And, for fun, let X = “6 6s in a row.” Then Pr(X|A) = small, where “small” is much weer than the magic number (about 2×10^-5). So we want to calculate
(1) Pr(A|6 6s on six-sided device & Pr(6 6s|A) = 2×10^-5) = ?
Well, it should be obvious there is no (direct) answer to (1). Unless we magnify some implicit premises, or add new ones entirely.
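The figure of about 2×10^-5 in (1) can be checked with a few lines of arithmetic. This is only a sketch: it assumes A is read as making each of the six sides equally likely on every activation, and that activations are independent. The variable names are illustrative.

```python
from fractions import Fraction

# Assumption: under A, each of the six sides is equally likely on every
# activation, and successive activations are independent.
p_six = Fraction(1, 6)

# Pr(X | A), where X = "6 6s in a row".
p_six_sixes = p_six ** 6

print(p_six_sixes)         # exact value as a fraction
print(float(p_six_sixes))  # decimal value, roughly 2.14e-05
```

The exact value is 1/46656, which rounds to the 2×10^-5 used in the text.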
The right-hand side (the givens) tells us that if we accept A as true, then 6 6s are a possibility; and so when we see 6 6s, if anything, it is evidence in favor of A’s truth. After all, something A said could happen did happen!
Another implicit premise might be that in noticing we just rolled 6 6s in a row, there were other possibilities. We also notice we can’t identify the precise causes of the 6s showing, but understand the causes are related to standard physics. These implicit premises can be used to infer A.
We now come to the classic objection, which is that no alternative to A is given. A is the only thing going. Unless we add new implicit premises that give us a hint about something besides A. Whatever this premise is, it cannot be “Either A is true or something else is”, because that is a tautology, and in logic adding a tautology to the premises is like multiplying an equation by 1. It changes nothing.
Not only that, if you told a frequentist that you were rejecting A because you just saw 6 6s in a row, and that therefore “another number is due”, he’d probably accuse you of falling prey to the gambler’s fallacy. Again, we cannot expect consistency in any limiting relative frequency argument.
But what’s this about the gambler’s fallacy? That can only be judged were we to add more information to the right hand side of (1). This is the key. Everything we are using as evidence for or against A goes on the right hand side of (1). Even if it is not written, it is there. This is often forgotten in the rush to make everything mathematical.
In our case, to have any evidence of the gambler’s fallacy would entail adding evidence to the RHS of (1) that is similar to, “We’re in a casino, where I’m sure they’re real careful about the dice, replacing worn and even ‘lucky’ ones, and the way they make you throw the dice makes it next to impossible to control the outcome”. That’s only a small summary of a large thought. All evidence that points to A.
But what if we’re over on 34th street at Tannen’s Magic Store and we’ve just seen the 6 6s, or even 20 6s, or however many you like? The RHS of (1), for you in that situation, changes dramatically, adding possibilities other than A.
In short, it is not the observations alone in (1) that get you anywhere. It is the extra information you add that works the magic, as it were. And whatever you add to (1), (1) is no longer (1), but something else. If you understand that, you understand all. P-values are a dead end.
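One way to make the casino versus magic-store contrast concrete is to write the extra premises down as numbers and see what they do. The sketch below is not from Holmes's paper or the argument above. It assumes a single hypothetical alternative T, "the device always shows 6", and invents two background weights for T, one per setting. Every number is made up for illustration.

```python
# All numbers below are hypothetical, chosen only for illustration.
LIK_A = (1 / 6) ** 6   # Pr(6 6s | A), assuming equally likely, independent sides
LIK_TRICK = 1.0        # Pr(6 6s | T), where T = "the device always shows 6"


def posterior_a(prior_trick):
    """Pr(A | 6 6s, background), when A and T are the only options considered."""
    prior_a = 1.0 - prior_trick
    numerator = prior_a * LIK_A
    return numerator / (numerator + prior_trick * LIK_TRICK)


# Same observations, same Pr(X | A); only the added background premise differs.
casino = posterior_a(prior_trick=1e-9)      # background: carefully policed casino dice
magic_store = posterior_a(prior_trick=0.5)  # background: a shop that sells trick dice

print(f"casino:      {casino:.5f}")
print(f"magic store: {magic_store:.2e}")
```

With these invented weights, A stays nearly certain in the casino and becomes very improbable in the magic store, although Pr(X|A) is identical in both. The difference comes entirely from what was added to the right-hand side.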
Bonus argument
This similar argument I wrote appears in many places, including in a new paper about which more another day:
Fisher said: “Belief in null hypothesis as an accurate representation of the population sampled is confronted by a logical disjunction: Either the null is false, or the p-value has attained by chance an exceptionally low value.” Something like this is repeated in every elementary textbook.
Yet Fisher’s “logical disjunction” is evidently not one, since his either-or describes different propositions, i.e. the null and p-values. A real disjunction can however be found. Re-writing Fisher gives: Either the null is false and we see a small p-value, or the null is true and we see a small p-value. Or just: Either the null is true or it is false and we see a small p-value. Since “Either the null is true or it is false” is a tautology, and is therefore necessarily true no matter what, and because prefixing any argument with a tautology does not change that argument’s logical status, we are left with, “We see a small p-value.” The p-value thus casts no light on the truth or falsity of the null. Everybody knows this, but this is the formal proof of it.
Frequentist theory claims, assuming the truth of the null, we can equally likely see any p-value whatsoever, i.e. the p-value under the null is uniformly distributed. To emphasize: assuming the truth of the null, we deduce we can see any p-value between 0 and 1. And since we always do see any value, all p-values are logically evidence for the null and not against it. Yet practice insists small p-values are evidence the null is (likely) false. That is because people argue: For most small p-values I have seen in the past, I believe the null has been false; I now see a new small p-value, therefore the null hypothesis in this new problem is likely false. That argument works, but it has no place in frequentist theory (which anyway has innumerable other difficulties). It is the Bayesian-like interpretation.
The decisions made using p-values are thus an “act of will”, as Neyman criticized, not realizing his own method of not-rejecting and rejecting nulls had the same flaw.
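The claim above that the p-value is uniformly distributed when the null is true can be checked by simulation. The sketch below assumes a particular setup not given in the text: a two-sided z-test of "mean = 0" on 20 draws from a standard normal with known variance, so the null holds by construction. The sample size, seed, and number of simulations are arbitrary choices.

```python
import math
import random

random.seed(0)  # arbitrary seed, for repeatability only


def null_p_value(n=20):
    """Two-sided z-test p-value for n draws from N(0, 1), so the null (mean 0) is true."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(xs) / n) * math.sqrt(n)          # known sd = 1
    return math.erfc(abs(z) / math.sqrt(2))   # Pr(|Z| >= |z|) for standard normal Z


N_SIMS = 20_000
p_values = [null_p_value() for _ in range(N_SIMS)]

# Share of simulated p-values landing in each tenth of [0, 1].
counts = [0] * 10
for p in p_values:
    counts[min(int(p * 10), 9)] += 1
shares = [c / N_SIMS for c in counts]

print([round(s, 3) for s in shares])
```

Each tenth of the interval should hold roughly a tenth of the p-values. In particular, values below 0.1 turn up about as often as values in any other tenth, even though the null is true in every simulated data set.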