Thou Shalt Not Seek The Wee P

Because some readers may think the crusade against the wee p is mine alone, sinner that I am, or that I somehow represent an obscure, shady, suspect, intellectually insufficient movement, I present to you quotations from three prominent individuals who recognize that the time for p-ing is over.

I do not, of course, necessarily agree with the solutions proposed by these fine gentlemen. For my preferred solution, which avoids all problems that can be avoided (many cannot), see Chapters 8-10 of Uncertainty: The Soul of Probability, Modeling & Statistics. See also the book page for details, and the on-going classes where examples are given. I recommend the predictive approach, which is pure probability from start to finish, and in which each user makes their own decisions.

Valentin Amrhein, Professor of Zoology, University of Basel, writing that Inferential Statistics is not Inferential:

Statistical significance and hypothesis testing are not really helpful when it comes to testing our hypotheses.

But I have increasingly come to believe that science was and is largely a story of success in spite of, and not because of, the use of this method. The method is called inferential statistics. Or more precisely, hypothesis testing.

The method I consider flawed and deleterious involves taking sample data, then applying some mathematical procedure, and taking the result of that procedure as showing whether or not a hypothesis about a larger population is correct…

In 2011, researchers at CERN worked on the so-called OPERA experiment and sent neutrinos through the Alps to be detected in central Italy. The neutrinos were found to be faster than light, even when the experiment was repeated. This was surprising, to say the least, and the p-value attached to the observation was smaller than the alpha level of p=0.0000003 that is required to announce a discovery in particle physics experiments involving collision data.

Although the researchers made clear that they were still searching for possible unknown systematic effects that might explain the finding, the news hit the media as: “Was Einstein wrong?”

A few months later, the researchers announced the explanation for the surprising measurements: a cable had not been fully screwed in during data collection.

Bad ps found in bad plumbing?

Frank Harrell, Statistician, Vanderbilt, A Litany of Problems With p-values:

In my opinion, null hypothesis testing and p-values have done significant harm to science. The purpose of this note is to catalog the many problems caused by p-values. As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.

The American Statistical Association has done a great service by issuing its Statement on Statistical Significance and P-values. Now it’s time to act. To create the needed motivation to change, we need to fully describe the depth of the problem….

A. Problems With Conditioning

p-values condition on what is unknown (the assertion of interest; [null hypothesis]) and do not condition on what is known (the data).

This conditioning does not respect the flow of time and information; p-values are backward probabilities.

I cut Harrell off at the p. He has many, many, many objections.

John P. A. Ioannidis, Physician, Stanford, The Proposal to Lower P Value Thresholds to .005:

P values and accompanying methods of statistical significance testing are creating challenges in biomedical science and other disciplines. The vast majority (96%) of articles that report P values in the abstract, full text, or both include some values of .05 or less. However, many of the claims that these reports highlight are likely false…

P values are misinterpreted, overtrusted, and misused. The language of the ASA statement enables the dissection of these 3 problems. Multiple misinterpretations of P values exist, but the most common one is that they represent the “probability that the studied hypothesis is true.”…Better-looking (smaller) P values alone do not guarantee full reporting and transparency. In fact, smaller P values may hint to selective reporting and nontransparency. The most common misuse of the P value is to make “scientific conclusions and business or policy decisions” based on “whether a P value passes a specific threshold” even though “a P value, or statistical significance, does not measure the size of an effect or the importance of a result,” and “by itself, a P value does not provide a good measure of evidence.”

It goes p-p-p-ing along like this for some length.
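
The claim in that quotation that a P value "does not measure the size of an effect" is easy to check with a little arithmetic. What follows is a minimal Python sketch, not anything taken from Ioannidis or the ASA statement. It assumes a two-sided z-test of a sample mean against zero with a known standard deviation, and the function name and all the numbers are invented for illustration.

```python
import math


def two_sided_p(effect, sd, n):
    """Two-sided z-test p-value for an observed mean `effect` (tested
    against zero), assuming a known standard deviation `sd` and n samples."""
    z = effect / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical study A: a trivially small effect, but a million observations.
p_small_effect = two_sided_p(effect=0.01, sd=1.0, n=1_000_000)

# Hypothetical study B: an effect fifty times larger, but only ten observations.
p_large_effect = two_sided_p(effect=0.5, sd=1.0, n=10)

print(f"effect 0.01, n = 1,000,000: p = {p_small_effect:.3g}")
print(f"effect 0.50, n = 10:        p = {p_large_effect:.3g}")
```

Under these assumptions the trivial effect earns a very wee p and the effect fifty times larger does not pass the .05 threshold. The p tracks the sample size at least as much as it tracks the effect, which is why a threshold on p alone says little about importance.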

I have the solution (it’s not mine: it’s old). A glance at it is in the JASA paper A Substitute for P-values; Uncertainty has all the proofs and philosophical arguments; and I’ll have more papers coming out soon with expansions and clarifications.


  1. Kip Hansen

    Briggs: John P. A. Ioannidis — not plural, just has an “s”.

  2. Douglas Skinner

    I looked up p-value in Casella and Berger’s “Statistical Inference”. They say “The p-value for the sample point x is the smallest value of alpha for which this point will lead to rejection of H0…The smaller the p-value, the stronger the sample evidence that H1 is true.” Are they right? I’ve been following this debate for a while and I’m still confused as to what the fuss is all about.

  3. Kalif

    @ Douglas.

    No, they are not right. One could write a whole essay on what P values are NOT, but very little of what they ARE. P values are absolutely correct computationally, but utterly useless.

    To answer your particular question, imagine you conduct a single experiment and obtain a p value of, say, .037. One of the two things happened. Either you are correct and there is an effect, or you committed a type one error. There is no one to tell you which one happened for your particular experiment. Your finding does not mean that there is a 0.37 probability of anything.

    Only if you repeated the same study, under the same conditions, with the same no. of subjects and assuming the Ho is true (big and wrong assumption as there’s no such thing as true Ho) over and over again, until the end of time (in theory), would you find that in 3.7% (roughly) of those hypothetical cases you obtained a result at least as extreme as the one you did. All based on the central limit theorem. But for your single study that obtained .037, all the above means nothing. You either got it right or not, regardless of how tiny your p value is.

    The only thing we should be looking at are basic descriptive stats and various effect sizes.

    I find it important to note that as p values go out of fashion, so should the power analysis. Dr. Ioannidis wrote about under-powered studies, but power analysis is only relevant in the NHST framework that is ultimately flawed.

  4. Douglas Skinner

    “Your finding does not mean that there is a 0.37 [sic] probability of anything.” I’m not sure what you mean here. BTW: I am not necessarily disagreeing with you. I came across this blog because I’ve been concerned about whether probability is a quality at all, of anything. I have also been inclined to think that descriptive statistics is the most solid part of the subject and that inference is very shaky at best. If that’s what you’re saying, I do agree. I also agree with your comments on power analysis. I think it begs more questions than it answers. For one thing we simply don’t know anything about alternative probability distributions–again, if that’s what you mean. A lot of seductive graphs and some loose mathematics of a very general kind. So, I’m really interested in your elaborating.
