P-Value Hacking Is Finally Being Noticed

Fig. 2 from the paper.

Since I’m on the road, all typos today are free of charge.

Some reasonably good news to report. A peer-reviewed paper: “The fickle P value generates irreproducible results” by Lewis Halsey and three others in Nature Methods. They begin with a warning well known to regular readers:

The reliability and reproducibility of science are under scrutiny. However, a major cause of this lack of repeatability is not being considered: the wide sample-to-sample variability in the P value…

[Jumping down quite a bit here.]

Many scientists who are not statisticians do not realize that the power of a test is equally relevant when considering statistically significant results, that is, when the null hypothesis appears to be untenable. This is because the statistical power of the test dramatically affects our capacity to interpret the P value and thus the test result. It may surprise many scientists to discover that interpreting a study result from its P value alone is spurious in all but the most highly powered designs. The reason for this is that unless statistical power is very high, the P value exhibits wide sample-to-sample variability and thus does not reliably indicate the strength of evidence against the null hypothesis.

It do, it do. A short way of saying this is small samples mislead. Small samples in the kind of studies that interest most scientists, of course. Small is relative.
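To see how fickle the thing is, here is a minimal simulation sketch. It is not the paper's own analysis: the effect size of 0.5, the known standard deviation of 1, the z-test, and every function name are assumptions chosen for illustration. It repeats the same two-group experiment many times at a small and a larger sample size and reports how widely the p-value swings.

```python
import math
import random
import statistics


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(effect, n, reps, seed=0):
    """Repeat a two-group experiment `reps` times and return the p-values.

    Assumptions (illustrative only): each group has n observations with
    known standard deviation 1, the true difference in means is `effect`,
    and a z-test is used for simplicity.
    """
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n)  # standard error of the difference in means
    p_values = []
    for _ in range(reps):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(effect, 1.0) for _ in range(n)]
        z = (statistics.fmean(b) - statistics.fmean(a)) / se
        p_values.append(two_sided_p(z))
    return p_values


def summarize(p_values):
    """Return (5th percentile, median, 95th percentile, share below 0.05)."""
    s = sorted(p_values)
    k = len(s)
    share = sum(p < 0.05 for p in s) / k
    return s[int(0.05 * k)], s[k // 2], s[int(0.95 * k)], share


if __name__ == "__main__":
    for n in (10, 100):
        lo, med, hi, share = summarize(simulate_p_values(0.5, n, 2000))
        print(f"n={n:3d}  5%={lo:.4f}  median={med:.4f}  "
              f"95%={hi:.4f}  share<0.05={share:.2f}")
```

Under these assumptions the true effect never changes, yet with 10 per group the p-values should scatter across most of the unit interval, while with 100 per group they bunch up near zero. Same reality, very different wee-ness.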

But, as I’ve said many, many, __________ (fill in that blank) times, p-values are used as ritual. If the p-value is less than the magic number, SIGNIFICANCE is achieved. What a triumph of marketing it was to have chosen that word!

Why is this? As any statistician will tell you, the simplest explanation is usually the best. That’s true here. Why are people devoted to p-values? It isn’t because they understand them. Experience has taught me hardly anybody remembers their definition and limitations, even if they routinely use them—even if they teach their use to others!

Most people are lazy, and scientists are people. If work, especially mental toil, can be avoided, it will be avoided. Not by all, mind, but by most. P-values-as-ritual does the thinking for researchers. They remove labor. “Submit your data to SPSS” (I hear phrases like that a lot from sociologists). If wee p-values are spit out, success is announced.

Back to the paper:

Indeed most scientists employ the P value as if it were an absolute index of the truth. A low P value is automatically taken as substantial evidence that the data support a real phenomenon. In turn, researchers then assume that a repeat experiment would probably also return a low P value and support the original finding’s validity. Thus, many studies reporting a low P value are never challenged or replicated. These single studies stand alone and are taken to be true. In fact, another similar study with new, different, random observations from the populations would result in different samples and thus could well return a P value that is substantially different, possibly providing much less apparent evidence for the reported finding.

All true.

Replacement? The authors suggest effect size with its plus-or-minus attached. Effect size? That’s the estimate of the parameter inside some model, a number of no (direct) interest. Shifting from p-values to effect sizes won’t help much because effect sizes, since they’re statements of parameters and not observables, exaggerate, too.

The solution is actually simple. Do what physicists do (or used to do). Fit models and use them to make predictions. The predictions come true, the models are considered good. They don’t, the models are bad and abandoned or modified.

Problem with that—it’s called predictive statistics—is that it’s not only hard work, it’s expensive and time consuming. Takes plenty of grunting to come to a reasonable model—and then you have to wait until verifying data comes in! Why, it’s like doing the experiment multiple times. Did somebody mention replication?
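A bare-bones sketch of the predictive routine, for the curious. Everything in it is hypothetical: the straight-line model, the made-up data-generating process in `make_data`, and the choice of mean absolute error as the score. The point is only the order of operations: fit on the data you have, then judge the model by how well it predicts data it never saw, compared against a naive guess.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mean_abs_error(predictions, observed):
    """Average absolute gap between predictions and what was observed."""
    return statistics.fmean(abs(p - o) for p, o in zip(predictions, observed))


def make_data(n, rng):
    """Hypothetical data-generating process: y = 1 + 2x + noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 3.0) for x in xs]
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(42)
    old_x, old_y = make_data(30, rng)  # data used to build the model
    new_x, new_y = make_data(30, rng)  # verifying data that arrives later

    a, b = fit_line(old_x, old_y)
    model_err = mean_abs_error([a + b * x for x in new_x], new_y)
    # Naive competitor: always predict the old sample mean.
    naive_err = mean_abs_error([statistics.fmean(old_y)] * len(new_y), new_y)
    print(f"model error on new data: {model_err:.2f}")
    print(f"naive error on new data: {naive_err:.2f}")
```

If the model cannot beat the naive guess on the new observations, it goes back to the shop, whatever its parameters looked like on the old ones.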

P-value hacking, you asked? According to this study:

P-hacking happens when researchers either consciously or unconsciously analyse their data multiple times or in multiple ways until they get a desired result. If p-hacking is common, the exaggerated results could lead to misleading conclusions, even when evidence comes from multiple studies.
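The "multiple ways" part is easy to mimic. The sketch below is an assumption-laden toy, not the study's method: there is no real effect at all, each study measures several independent outcomes, and the analyst reports whichever outcome gives the smallest p-value. The function names, the sample size of 20, and the z-test with known standard deviation are all my own choices for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def min_p_over_outcomes(n, outcomes, rng):
    """One null study: test several independent outcomes, keep the smallest p."""
    best = 1.0
    for _ in range(outcomes):
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = mean * math.sqrt(n)  # known sd = 1 and true mean = 0
        best = min(best, two_sided_p(z))
    return best


def false_positive_rate(outcomes, n=20, studies=2000, seed=1):
    """Share of null studies whose best-looking p-value falls below 0.05."""
    rng = random.Random(seed)
    hits = sum(
        min_p_over_outcomes(n, outcomes, rng) < 0.05 for _ in range(studies)
    )
    return hits / studies


if __name__ == "__main__":
    for k in (1, 5, 10):
        theory = 1 - 0.95 ** k
        print(f"outcomes tried={k:2d}  simulated={false_positive_rate(k):.3f}  "
              f"theory={theory:.3f}")
```

With independent outcomes the chance of at least one "significant" result is 1 - 0.95^k, so ten tries on pure noise should announce SIGNIFICANCE roughly 40% of the time. The simulated figures will wobble around that from run to run.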

Funniest quote comes from one Dr Head (yes): “Many researchers are not aware that certain methods could make some results seem more important than they are. They are just genuinely excited about finding something new and interesting”.

Amen, Head, amen.

Bonus Plus! The answer to all will be in my forthcoming book. Updates on this soon.


Thanks to reader Al Perrella for alerting us to this topic.


  1. David in Cal

    Biostatistician Helena Kraemer gave a nice talk at the Stanford Medical School yesterday about this sort of problem and other errors in the use of hypothesis testing. Her view is that only a small portion of the published papers she sees have everything as it should be.

  2. First, spurious p-values have been announced since the day that R.A. Fisher invented tests. The problem made a big splash again in the 60s in Morrison and Henkel’s Significance Test Controversy. Hacked p’s are nominal or computed p’s, not actual ones. People have been piling up publications on this same exact point for 70 years (Cox says 100).
    But this only proves the value of error statistics. The real importance of the p-value, or rather the overall error statistical philosophy in which they exist as a very small part, is that we can demonstrate the cheating by proving the p-value is increased. For example, we can show how cherry picking, multiple testing and optional stopping alter the p-value, and in the case of the latter, whole books on how to make the adjustments have been written! Now on the Bayesian account, these shenanigans don’t show up. To alter the inference based on optional stopping, say, violates the likelihood principle. Which would you rather have, an account that can reveal cheating, or one in which it is not cheating at all (not to mention you get to put in your prior beliefs at the same time as you ignore these biasing selection effects). You can see one discussion in my April 4 blog in errorstatistics.com. Or look up optional stopping.

  3. The follow-up that got cut from my comment pertained to computing power in the case of statistical significance (as the author recommends). This recommendation falls out naturally from my account of severe tests that I’ve been discussing for 30 years. See for example Error and the Growth of Experimental Knowledge (1996), which won the Lakatos Prize in 1998 in philosophy of science, and my papers with Aris Spanos (2006 and others), and Sir David Cox in 2006/2010. However, the results reveal that likelihoodists and Bayesians severely exaggerate the upshot of the same result that post-data power shows is unwarranted.

  4. MattS

    Dr. Head is wrong. They don’t care about new or interesting, they only care about whether it is publishable or not.

  5. DAV

    The real problem with p-values is not so much that they can be gamed but instead that they provide nothing useful. In particular, they don’t provide anything informative about model or hypothesis validity. Unfortunately, they are too often falsely seen as supporting an hypothesis.

  6. I hope skeptics and statisticians don’t start using the phrase “p hacking” routinely, as seems to be the current trend. It’s so *stupid*. Even the phrase “p gaming” has more explanatory power to the public and doesn’t sound ‘cool’ like ‘hacking’ does.

  7. “The solution is actually simple. Do what physicists do (or used to do). Fit models and use them to make predictions. The predictions come true, the models are considered good. They don’t, the models are bad and abandoned or modified.”

    This is incredibly naive for two reasons. (1) How do you test? What do physicists do? Answer: check if they’ve got a genuine effect via p-values. Remember the Nobel prize winning Higgs? How do you determine what ought to be modified? (Duhem’s problem). That’s what error statistics is designed to accomplish.

    (2) The vast majority of interesting inferences in science are NOT settled by predictive fits of the sort being recommended. Were that the main criterion, they would have stopped with Newton. It’s understanding and explanation that’s needed. And this rarely is accomplished by formal statistics alone. The account of severe tests, however, neatly encompasses these substantive inferences, and reflects the criterion of informativeness and explanatory power.

    It’s time to get beyond logical positivism, yet the most famous probability texts (bought into by Briggs, I take it) were written by those mired in logical positivist thinking. Hence, verificationism, confirmationism, instrumentalism, phenomenalism, idealism, behaviorism, and a few more isms.

  8. DAV

    check if they’ve got a genuine effect via p-values.

    A p-value is a value concerning a model parameter. If it’s a probability at all (which it isn’t allowed to be called, hence “p-value”), it would be Pr(parameter | model, data). How does knowing the quality of a model’s parameters tell you anything about what is being modeled? After all, it’s not Pr(model | parameters, data).

  9. Mayo,

    Please flesh out what you mean by ” It’s understanding and explanation that’s needed” because at face value your last post didn’t communicate any useful information on the subject, although you may have something interesting to say.

    Freudian psychoanalysis has incredible depth of understanding and explanatory power. Actually, that’s it’s problem. Because of it’s incredible explanatory power, it can explain everything, no matter what happens next. That’s why it’s also useless as a scientific theory. So obviously, a good scientific theory has to have more to it than just ‘explanatory power’. There also have to be strict limits to what a scientific theory can explain.

  10. Will…both you and Briggs have enemies: “it’s” is the contraction of “it is” not the possessive form of the neuter pronoun. (It’s nice to be able to pin one of my typographical failings–I do it often–on someone intelligent.) There must be a finger reflex that does an apostrophe after “it” if the next finger is poised for s.

  11. You’re quite right Bob. I’ve always made that typo and I go through periods for months on end where I re-read and subtract all the apostrophes. And then I lapse back into illiteracy.

  12. “I do it often”: write “it’s” for “its'” (and “you’re” for “your”).

  13. DAV

    It could be those words are possessed so they do crazy things, the literary equivalent of photo bombing. Personally I think these words are just possessive and seek world domination. They do this by evading the watchful eyes of editors and forcing themselves into various texts. That there is a secret enemy doing this is just a phantasm created by Briggs’s paranoia. It’s also proof that idleness (long road trips; no job to speak of; reading books by the poolside at noon) is a breeding ground for these possessive words.

  14. John B()

    Knot HOMOPHONES! Call the RFRA!

  15. Pouncer

    Were the following illustration repeated about twenty times per each discussion of p-values, my models indicate that there would be about five people in a hundred who would become statistically enlightened:

