Making P-values Weer To Achieve Significance Won’t Help

Mini-paper out in JAMA by Matt Vassar and pals: “Evaluation of Lowering the P Value Threshold for Statistical Significance From .05 to .005 in Previously Published Randomized Clinical Trials in Major Medical Journals”. Thanks to Steve Milloy for the tip.

Authors scanned JAMA, Lancet, and NEJM for wee ps, and then asked how many studies’ ps stayed wee after dividing the magic number by 10. Seventy percent was their answer. Meaning 30% of official findings would have to be tossed for not achieving super significance.

Somewhat amusingly, and unnecessarily, they computed regression models on the results and reported 95%—and not 99.5%—confidence intervals.

Never mind. Making ps weer does not solve any of the logical and philosophical difficulties of p-values, as is partly explained in this peer-reviewed (and therefore perfectly true and indisputable) paper: Manipulating the Alpha Level Cannot Cure Significance Testing.

As a bonus, here is just one of a dozen or two criticisms of p-values that will appear in a new peer-reviewed (and therefore true and indisputable) paper in January. This is not the strongest criticism, nor even in the top five. But it alone is enough to quash their use.

(I’m leaving it in LaTeX format so you can get a hint about the citations.)


P-values are Not Decisions

If the p-value is wee, a decision is made to reject the null hypothesis, and vice versa (ignoring the verbiage “fail to reject”). Yet the consequences of this decision are not quantified using the p-value. The decision to reject is just the same, and therefore just as consequential, for a p-value of 0.05 as for one of 0.0005. Some have the habit of calling especially wee p-values “highly significant”, and so forth, but this does not accord with frequentist theory, and is in fact forbidden by that theory because it seeks a way around the proscription of applying probability to hypotheses. The p-value, as frequentist theory admits, is not related in any way to the probability the null is true or false. Therefore the size of the p-value does not matter. Any level chosen as “significant” is, as proved above, an act of will.

A consequence of the frequentist idea that probability is ontic and that true models exist (at the limit) is the idea that the decision to reject or accept some hypothesis should be the same for all. Steve Goodman calls this idea “naive inductivism”, which is “a belief that all scientists seeing the same data should come to the same conclusions,” \cite{Goo2001}. That this is false should be obvious enough. Two men do not always make the same bets even when the probabilities are deduced from first principles, and are therefore true. We should not expect all to come to agreement on believing a hypothesis based on tests concocted from {\it ad hoc} models. This is true, and even stronger, in a predictive sense, where conditionality is insisted upon.

Two (or more) people can come to completely different predictions, and therefore different decisions, even when using the same data. How to incorporate decisions in the face of the uncertainty implied by models is only partly understood. New efforts along these lines using quantum probability calculus, especially in economic decisions, are bound to pay off, see e.g. \cite{NguSri2019}.

A striking and in-depth example of how using the same model and same data can lead people to {\it opposite} beliefs and decisions is given by Jaynes in his chapter “Queer uses for probability theory”, \cite{Jay2003}.


  1. Stan Young

    I took the time to carefully count out ~50 papers. The median number of questions/models was on the order of 10,000. Moving the cut-off from 0.05 to 0.005 does not address the issue of multiple testing and multiple modeling.

    One very simple step brings p-values up to a reasonable replication standard. Adjust the p-value to reflect the number of questions at issue. If you have 4 p-values, then .05/4 is your cut-off value. Resampling can be used to take correlations into account. All this and more is standard technology as given in Westfall and Young, 1993, and part of SAS since about that time. The real problem is that some, authors and editors, desire a published paper more than making a claim that has a good chance of replicating.

    There is a lot of good statistical technology; it is very often the intention of the user that matters.

  2. Repeat after me, class: Correlation is not causation. No matter how wee the P.

  3. Ray

    I believe that most people have not taken a statistics course that correctly defined the confidence interval. People believe there is a 95% probability that the parameter lies between the upper and lower limits, but this is false. It is not true that confidence intervals can be read as a measure of certainty that the interval contains the true value.
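
Stan Young's simple adjustment in the first comment (divide the cut-off by the number of questions at issue) can be sketched in a few lines of Python. This is only an illustration of the plain division he describes, not the resampling method of Westfall and Young; the function names and the four example p-values are made up for the purpose.

```python
def adjusted_cutoff(alpha, n_questions):
    """Divide the nominal cut-off by the number of questions asked."""
    if n_questions < 1:
        raise ValueError("need at least one question")
    return alpha / n_questions


def surviving(p_values, alpha=0.05):
    """Return the p-values at or below the adjusted cut-off."""
    cutoff = adjusted_cutoff(alpha, len(p_values))
    return [p for p in p_values if p <= cutoff]


# Hypothetical p-values from four questions asked of one data set.
example_ps = [0.04, 0.011, 0.2, 0.003]

print(adjusted_cutoff(0.05, 4))       # the .05/4 cut-off from the comment
print(surviving(example_ps))          # only ps at or below .05/4 remain
print(adjusted_cutoff(0.05, 10_000))  # cut-off when 10,000 questions are asked
```

With 10,000 questions the adjusted cut-off is 0.05/10,000 = 0.000005, a thousand times smaller than 0.005, which is the commenter's point that moving the line from 0.05 to 0.005 does nothing about multiple testing.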
