Some two and a half years ago I posted this article: “An Infinity of Null Hypotheses — Another Anti-P-Value Argument”. The title is unfortunate; or, rather, the subtitle is. It should have read “The Anti-P-Value Argument.”
It is “the” and not “another” because no other argument is needed.
Here it is in brief:
The P-value is used only sometimes and arbitrarily for deciding between probability models, whereas other times decisions are made using logical probability, but logical probability and P-value theory are incompatible: to be consistent, P-values should be used for every decision, but this is impossible; therefore, P-values should never be used.
Unless a probability model can be deduced from definite premises, then it must be itself uncertain, and if it is itself uncertain, the model should be decided by P-value, which never happens, and is indeed impossible, since the number of possible models is infinite.
Therefore, some other mechanism must be used to choose uncertain models. It cannot be P-values, but it can be probability.
By models deduced from definite premises, I mean models deduced from premises like these: There are n_1 white balls, n_2 yellow balls, and n_3 blue balls in this bag, and only one will be drawn out. We deduce the probability model: the chance a white ball is drawn is n_1/n, where n is the sum of n_1, n_2, and n_3.
There are no real disputes about choosing models of this kind, even if not everybody agrees on what the probabilities deduced in the model mean. By choosing I mean people do not really sit down and seriously consider picking between the model we deduced and, say, some parameterized multinomial or large-sample normal approximation or whatever.
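As a minimal sketch of what “deduced” means here, the probabilities follow from the stated counts and nothing else. The counts 3, 5, and 2 below are illustrative assumptions standing in for n_1, n_2, and n_3; they are not values from the post.

```python
from fractions import Fraction

def deduced_urn_model(counts):
    """Return the probability of each colour on a single draw,
    deduced from nothing but the stated counts."""
    n = sum(counts.values())
    return {colour: Fraction(k, n) for colour, k in counts.items()}

# Hypothetical counts standing in for n_1, n_2, n_3.
model = deduced_urn_model({"white": 3, "yellow": 5, "blue": 2})
print(model["white"])  # 3/10
```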
But there are disputes galore about models that cannot be deduced. Consider a model of weight loss between males and females who are made to adhere to some diet.
How shall that be modeled? We cannot deduce (a strict word!) a model, but it’s simple to propose one, such as a normal distribution model of the weight loss, with one parameter indicating the difference by sex in the normal’s central parameter.
But that is only one possible model. A gamma distribution model can be used instead. Or a T-distribution model, or on and on and on, and so forever.
There is a veritable endless sea of potential models. And it doesn’t make one whit of difference whether any one of them fits better than another, or one makes superior predictions according to some utility function, or whatever.
The point is rather that all these models, except for one, or a small handful, have been rejected by a mechanism that is not a P-value. Indeed, they have been rejected using logical probability.
In order to select a single model, or perhaps some subset, to be consistent with frequentist theory we need to test a series of “null hypotheses”, each of which says that certain parameters that select a model, or set its chance, are zero, and we must do this for an infinite number of models.
We must do this because the model itself is uncertain. It is not deduced. It has uncertainty.
Yes, we can, and people do, all the time, plop down any old model, justified by custom or symmetry, or other considerations that are far removed from deduction, but that only means they do not take P-values seriously. That they only invoke them when they help their case.
For instance, inside the normal model mentioned above, on the parameter indicating sex. Which really means the P-value is used on just two models. In somewhat crude notation, we can write our ad hoc model M:
where each αi can only take the values 0 or 1. The “null” is that α1 = 1, and by fiat all the other models are disallowed. Usually, of course, the “null” is written that βsex=0 in only the second model. But that notation is tricky and misleading, because, as I hope is now clear, there are two models, and not just one.
(Skip this paragraph if you have to.) The “β0” and “β1” (and σs) are not the same, though they are usually taken to be in the way these models are written. I differ because I say any change in a model gives a different model. And consider that the values chosen (“estimated”) for β0 and β1 will not be the same.
I hope now the context of the original post makes more sense. For that equation we just wrote can, and must, be expanded for an infinite number of other parameters that are left out without the benefit of hypothesis testing.
Notice that model choice is no problem at all in logical probability. Because probabilities are only formed on premises assumed. If we cannot deduce a model, we can with perfect consistency say “I am entertaining this limited class of models because of experience”, which sets a probability of zero on all unused models.
Since most models are not deduced, these probability judgments are local and not universal truths. And that is how we go about in daily life, forming (usually unquantified) probabilities, making judgments and decisions.
P-values are silly. Give them up.
Thanks as always for a thought-provoking post.
If I grab up ten dice and throw them, and they all come up with the number “3”, should I conclude that the dice are heavily weighted? I say yes, because of the P-value.
Statistics was invented by gamblers for this very purpose. I think you’re over-egging the pudding here. Your basic point is correct … but not for every single situation.
Best regards to you and yours,
“Your basic point is correct … but not for every single situation.”
His point is valid for 99.44% of cases in which probability is used.
You say yes because you change your model based on old evidence, that the new probabilities are not equal. Not the p-value.
See also New Paper! Everything Wrong With P-values Under One Roof.
Wow, interesting coincidence. I’m just taking a break from my Monte Carlo analysis. I’m using only one die (or one dice, or whatever). This analysis is pretty important, lives are in the balance, maybe I should do a quick check on the die? I wouldn’t want anyone to die because of my bad die. Here we go … 1 … 4 … 1 … 4 … uh-oh, is that a pattern emerging? Keep going … 2 … 1 … 3 … 5 … 6 … 2 … 3 …
Ok, that’s enough, no obvious pattern here, nothing off about that die, right?
Back to work.
“should I conclude that the dice are heavily weighted? I say yes, because of the P-value.”
You should make a second toss.
I was looking through my experimental statistics text book from a grad course I attended in 1981. In the margin was a hand-written note about P values being unreliable and misleading and that they should be used with caution. We were told this by our lecturer.
Even the American Statistical Association put out a surprisingly strongly worded statement against the use of P Values in 2016. And Briggs has done outstanding work in proving just how worthless the method is.
But until we can provide an alternative then “wee Ps” will continue to be widely used by science, industry and finance.
I would argue that the pharmaceutical industry uses them precisely because they are flawed and so easily gamed. Gaming P-Values has become an art in pharmaceutical trials; they know it and they use it for profit. As a result it has become almost impossible to dislodge the practice.
I’ve also noticed, after having read and watched hundreds of COVID papers and seminars, that many (if not most) in the life sciences clearly lack numeracy and logic skills, so they seek simple formulaic models and P-Values suit this purpose well.
No one makes the determination in that situation because of the P-value. They make their reasoning on broader models and logical statements. For example the statement “when dice are rolled they usually come up with different values.”
You might object that this statement is the same as using the p-value. But which p-value am I using? Certainly no test statistic is mentioned, and for that matter no probability distribution is mentioned. Note that you never specified that we are looking at six sided dice. If we had four sided, eight sided, etc. dice then the “natural” probability distributions would be different. And for that matter we never said that the sides included the numbers 1 through 6; some games include other combinations of numbers.
But let’s suppose that we have six sided dice, that they have the numbers 1 through 6 on them and we accept that in the “fair” probability distribution each side has a 1 in 6 chance of coming up (and that the dice are independent and identically distributed; note that outside of gambling these assumptions are often hard to justify!) Even in that situation we STILL don’t have “the” p-value. The test statistic closest to my statement would be a boolean value which comes up 0 if the dice have the same value and 1 if they have different values. And indeed in this situation we do have a wee-p ((1/6)^9, since any of the six faces could be the repeated one). But note that nothing in the definition of a p-value prevents my test statistic from setting 0 when the values are different and 1 when they are the same, in which case my p-value is 1.
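The two orientations of that boolean statistic can be checked directly. This is a small sketch assuming ten fair, independent six-sided dice, with the p-value taken as the chance of a statistic no larger than the one observed; the helper name `p_all_same` is made up for illustration. The chance that all ten dice match is 6 × (1/6)^10 = (1/6)^9, since any face can be the repeated one, while (1/6)^10 is the chance of ten 3s in particular.

```python
from fractions import Fraction

def p_all_same(n_dice, n_sides=6):
    """Chance that n_dice fair dice all show the same face."""
    return Fraction(n_sides, n_sides ** n_dice)

n_dice = 10

# Statistic A: 0 if all dice match, 1 otherwise. The observed value is 0,
# so the lower-tail p-value is P(A <= 0) = P(all match).
p_a = p_all_same(n_dice)

# Statistic B: 1 if all dice match, 0 otherwise. The observed value is 1,
# so the lower-tail p-value is P(B <= 1), which covers every outcome.
p_b = Fraction(1)

print(float(p_a))  # roughly 9.9e-08
print(p_b)         # 1
```

Same dice, same model, same data; only the labelling of the statistic changed.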
Of course I could have made other choices. I could have had my p-value derived from the number of distinct values chosen (in this case leading to a wee-p). This value is a bit more “robust” than the boolean choice because it can distinguish situations where most of the dice are the same number, but one is not. But consider if I had six dice and they show 1, 2, 3, 4, 5, 6 in order. This is just as unlikely as rolling six 3’s, but with a test statistic that measures the distinct numbers the p-value would be 1.
Or I could have used the sum of all values on the dice, which would easily lead to wee-p’s when we roll all 1’s but not all 6’s (equivalently we could use the average value of the dice.) Or I could have chosen any other of infinitely many other possible test statistics, each leading to a different p-value, and in each case some p-values would be wee and some would not.
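To make the dependence on the test statistic concrete, here is a brute-force sketch for six fair six-sided dice (six rather than ten only to keep full enumeration fast). It assumes a lower-tail p-value, meaning the chance of a statistic no larger than the observed one; `exact_p_value` and the statistic names are illustrative, not standard functions.

```python
from fractions import Fraction
from itertools import product

def exact_p_value(observed, statistic, n_sides=6):
    """Lower-tail p-value under fair, independent dice, by enumerating every roll."""
    t_obs = statistic(observed)
    hits = total = 0
    for roll in product(range(1, n_sides + 1), repeat=len(observed)):
        total += 1
        if statistic(roll) <= t_obs:
            hits += 1
    return Fraction(hits, total)

def distinct_faces(roll):
    """Number of different faces showing."""
    return len(set(roll))

def total_pips(roll):
    """Sum of the faces showing."""
    return sum(roll)

six_threes = (3,) * 6
straight = (1, 2, 3, 4, 5, 6)

print(exact_p_value(six_threes, distinct_faces))  # 1/7776
print(exact_p_value(straight, distinct_faces))    # 1
print(exact_p_value((1,) * 6, total_pips))        # 1/46656
print(exact_p_value((6,) * 6, total_pips))        # 1
```

Six 3s and the 1-through-6 straight get opposite verdicts from the distinct-count statistic, and all 1s and all 6s get opposite verdicts from the sum.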
So there are two major defects in your reasoning. The first is that NO ONE thinks in terms of p-values until they are forced. That is, they don’t think “p is low, therefore we reject”; they think “this seems unlikely so the data may be fishy.” The second defect is that the latter type of reasoning cannot be turned into reasoning about “the” p-value because there is no such thing as “the” p-value.
And that gets back to the point of the original post. Since there is no unique p-value, you must choose which one to take when using your reasoning. But if your argument is that we make statistical choices via p-values (even if we do not know it), which p-value do we use to select which p-value to use? Of course no one ever uses a p-value to select a p-value. Instead we use logic and previous observations to select the “appropriate” p-value. But the same type of reasoning can be used to analyze the data itself without a p-value, so why use a p-value at all?
Sure, there are ℵ1 models. Almost all of them are bad.
But what about the work to identify the best models? We have BLUP, BLUE, MVUE, the Lehmann–Scheffé theorem, the Rao–Blackwell theorem, and the Gauss–Markov theorem.
Also, your examples are always low dimension. Ok, it’s a blog. However, guidance in low dimensions often leaves people puzzling over what to do in 4 and higher dimensions [one pro tip: avoid going higher than 3 if possible].
Also, in my work as an engineer and a statistician, I’ve never come across an “Urn problem.” How weird is that?
Silly self-defeating argument IMO since, of course, p-values are functions of a test statistic and a reasonable model, and the test statistic is often just what we actually observe or some function of it, like standardizing it. So arguing against p-values is like arguing against what we observe.
For example, if we assume a fair coin model, and for 100 flips observe 87 heads, and say we repeat this well-designed experiment two more times and observe 92, and then 89 heads, we’d conclude that the coin is not fair. You can use a p-value to make this conclusion or use the number of observed heads: same thing. Except in some situations, say, we’d want to be really certain before concluding the coin is unfair, so maybe we want the decision to be based on whether the number of observed heads is over 95 instead of over 80. P-values standardize this basic process across any field.
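For reference, the correspondence this comment describes between a head-count cutoff and a p-value can be sketched with an exact binomial tail. This assumes 100 independent flips of a fair coin and a one-sided test for too many heads; `upper_tail` is an illustrative helper, not a library function.

```python
from fractions import Fraction
from math import comb

def upper_tail(k, n=100):
    """P(X >= k) for X ~ Binomial(n, 1/2), computed exactly."""
    return Fraction(sum(comb(n, j) for j in range(k, n + 1)), 2 ** n)

# One-sided p-values for the three observed head counts in the comment.
for heads in (87, 92, 89):
    print(heads, float(upper_tail(heads)))

# "Over 80" and "over 95" heads are two cutoffs on the same tail: each
# corresponds to a different chance of rejecting a coin that is in fact fair.
size_over_80 = upper_tail(81)
size_over_95 = upper_tail(96)
print(float(size_over_80), float(size_over_95))
```

Note that this equivalence holds only once the model (fair coin, independent flips) and the statistic (number of heads, upper tail) have already been fixed.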
P-values worked just fine when Briggs used them in the ESP stuff he looked at long ago.
“Also, in my work as an engineer and a statistician, I’ve never come across an “Urn problem.” How weird is that?”
You have, but just not phrased in that “here is an urn” manner. Tons of fundamental distributions can be conceived as ‘urn problems’.
“But the same type of reasoning can be used to analyze the data itself without a p-value, so why use a p-value at all?”
You can use any statistic you like. P-values caught on because they work well, are standardized, and easily compare what we actually observe to what we expect to observe under a model.
“You can use any statistic you like. P-values caught on because they work well, are standardized, and easily compare what we actually observe to what we expect to observe under a model.”
This is dumb. Dumb, dumb, dumb, dumb. It is nothing more than mystical worship of p-values.
You concede to me that p-values depend on the test statistic. And thus, by varying the test statistic we can get different p-values. If we allow any function it’s always possible to get whatever p-value we want, from 0 to 1. (And that’s without even getting into possible changes of models, which also change the p-value!) Even if we constrain ourselves to “natural” test statistics given the things we are interested in, we are still likely to see a wide variety of possible p-values.
So if we have many possible statistics to use to generate our p-values, how do we make the determination of which one we use? You say any is fine. But then statistics become arbitrary ritual. You arbitrarily use a test statistic and get p < .05 and declare that you have found some cause (even though the logic used to set up p-values doesn't work like that.) I use my test statistic and find p = .58. But if you had arbitrarily used my statistic instead you would have said there was no cause.
The only way that we can resolve the situation is by looking at something beyond p-values. And in fact this is what everyone does when challenged with these sorts of examples. They say things like "well, we need to see which p-value leads to better predictions!" But then why not just see which models have better predictions in the first place, without using p-values? And so on with other styles of objections; they all point to some standard for how we know things independent of p-values which is better than p-values and which could be used without p-values.
You did hit on the real reason why p-values are popular: because they are “standardized.” That is, someone without much mathematical or philosophical knowledge can get a stats package to output a p-value and think that he has done some real reasoning. More to the point, he can get a paper published. It is much more difficult to see the accuracy of predictions, or to logically justify the assumptions used to form a model, etc. But no one wants to admit that he is only using p-values because they are easy, so their proponents instead attach a mystic value to them (much like they do to the term “random”) even though they can never clearly articulate what makes p-values in particular valuable (in contrast to other statistical methods.)
It’s all a conspiracy to deny that the world is “Abby Normal”
I only discovered this website yesterday, while looking for something unrelated on YaCy search. Considering content, no wonder even DuckDuckGo won’t link here. xD
Anyway, William, I have written a blog post that attacks the same target as you do, but from a different direction. You might find it interesting, it’s on https://becomingbelte.rs/blog/8/the-valueless-experiment Apologies for it not being as well-worded as your work. The basic argument, reworked for your purposes, is that the likelihood of experiment outcomes is determined by experimenters, and therefore experiments are themselves worthless. The only thing which has “worth” is the interaction of the experimenter – his/her idea of causality – and the Universe. Specifically, the difference in experiment outcomes, conditional on the experimenter’s manipulation of causes, is the only thing that can give experiments worth.