Since I’m on the road, all typos today are free of charge.
Some reasonably good news to report. A peer-reviewed paper: “The fickle P value generates irreproducible results” by Lewis Halsey and three others in Nature Methods. They begin with a warning well known to regular readers:
The reliability and reproducibility of science are under scrutiny. However, a major cause of this lack of repeatability is not being considered: the wide sample-to-sample variability in the P value…
[Jumping down quite a bit here.]
Many scientists who are not statisticians do not realize that the power of a test is equally relevant when considering statistically significant results, that is, when the null hypothesis appears to be untenable. This is because the statistical power of the test dramatically affects our capacity to interpret the P value and thus the test result. It may surprise many scientists to discover that interpreting a study result from its P value alone is spurious in all but the most highly powered designs. The reason for this is that unless statistical power is very high, the P value exhibits wide sample-to-sample variability and thus does not reliably indicate the strength of evidence against the null hypothesis.
It do, it do. A short way of saying this is: small samples mislead. Small samples in the kinds of studies most scientists are interested in, of course. Small is relative.
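You can watch the fickleness directly. Here is a minimal simulation, mine and not the authors’, assuming a two-sample t-test with a true effect of half a standard deviation; the per-group sample sizes are picked to give roughly 50% and roughly 90% power.

```python
# Minimal sketch (not from the paper): how much p-values bounce around
# from sample to sample when power is moderate vs. high.
# Assumes a two-sample t-test with a true effect of d = 0.5 SDs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d = 0.5          # true standardized effect size (an assumption)
sims = 10_000    # number of simulated experiments

for n, label in [(32, "~50% power"), (86, "~90% power")]:
    ps = np.array([
        stats.ttest_ind(rng.normal(0, 1, n), rng.normal(d, 1, n)).pvalue
        for _ in range(sims)
    ])
    lo, hi = np.percentile(ps, [2.5, 97.5])
    print(f"n={n}/group ({label}): middle 95% of p-values runs "
          f"{lo:.4f} to {hi:.2f}; {np.mean(ps < 0.05):.0%} are 'significant'")
```

At the moderate power typical of many studies, the identical true effect spits out p-values from the minuscule to the enormous. Only at high power does the p-value begin to settle down, which is the authors’ point.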
But, as I’ve said many, many, __________ (fill in that blank) times, p-values are used as ritual. If the p-value is less than the magic number, SIGNIFICANCE is achieved. What a triumph of marketing it was to have chosen that word!
Why is this? As any statistician will tell you, the simplest explanation is usually the best. That’s true here. Why are people devoted to p-values? It isn’t because they understand them. Experience has taught me hardly anybody remembers their definition and limitations, even if they routinely use them—even if they teach their use to others!
Most people are lazy, and scientists are people. If work, especially mental toil, can be avoided, it will be avoided. Not by all, mind, but by most. P-values-as-ritual does the thinking for researchers. It removes the labor. “Submit your data to SPSS” (I hear phrases like this a lot from sociologists). If wee p-values are spit out, success is announced.
Back to the paper:
Indeed most scientists employ the P value as if it were an absolute index of the truth. A low P value is automatically taken as substantial evidence that the data support a real phenomenon. In turn, researchers then assume that a repeat experiment would probably also return a low P value and support the original finding’s validity. Thus, many studies reporting a low P value are never challenged or replicated. These single studies stand alone and are taken to be true. In fact, another similar study with new, different, random observations from the populations would result in different samples and thus could well return a P value that is substantially different, possibly providing much less apparent evidence for the reported finding.
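That replication point is easy to check with a toy simulation (mine, again, not the authors’): run pairs of identical experiments at roughly 50% power and ask how often a “significant” original is followed by a “significant” exact replicate. The settings (true d = 0.5, 32 per group) are illustrative assumptions.

```python
# Sketch: among original experiments with p < 0.05, how often does an
# exact replicate, same design, new random data, also hit p < 0.05?
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d, n, sims = 0.5, 32, 10_000   # illustrative settings, ~50% power

def one_p():
    return stats.ttest_ind(rng.normal(0, 1, n), rng.normal(d, 1, n)).pvalue

pairs = [(one_p(), one_p()) for _ in range(sims)]
rep = np.array([p2 for p1, p2 in pairs if p1 < 0.05])
print(f"Of {len(rep)} 'significant' originals, "
      f"{np.mean(rep < 0.05):.0%} of exact replicates also got p < 0.05")
```

The replicate’s p-value doesn’t care that the original was “significant”: at 50% power, only about half the exact repeats clear the bar.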
Replacement? The authors suggest the effect size with its plus-or-minus (a confidence interval) attached. Effect size? That’s the estimate of a parameter inside some model, a number of no (direct) interest. Shifting from p-values to effect sizes won’t help much because effect sizes, being statements about parameters and not observables, exaggerate, too.
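The exaggeration is easy to demonstrate under the same made-up settings: estimate Cohen’s d in every simulated run, then look at the average estimate among only the “significant” runs.

```python
# Sketch: at modest power, effect sizes taken only from 'significant'
# runs systematically overshoot the truth. Settings are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
d, n, sims = 0.5, 32, 10_000
est, est_sig = [], []

for _ in range(sims):
    a, b = rng.normal(0, 1, n), rng.normal(d, 1, n)
    # Cohen's d estimate: mean difference over pooled SD
    sp = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    dhat = (b.mean() - a.mean()) / sp
    est.append(dhat)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        est_sig.append(dhat)

print(f"true d = {d}")
print(f"mean estimate, all runs:         {np.mean(est):.2f}")
print(f"mean estimate, significant only: {np.mean(est_sig):.2f}")
```

At modest power, only the luckier, larger-looking estimates clear the significance bar, so the ones that get reported overshoot.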
The solution is actually simple. Do what physicists do (or used to do). Fit models and use them to make predictions. If the predictions come true, the models are considered good. If they don’t, the models are bad and are abandoned or modified.
The problem with that approach (it’s called predictive statistics) is that it’s not only hard work, it’s expensive and time-consuming. It takes plenty of grunting to come to a reasonable model, and then you have to wait until the verifying data come in! Why, it’s like doing the experiment multiple times. Did somebody mention replication?
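For the curious, here is that loop in miniature, a sketch only: the model, the fake data, and the score are all stand-ins I made up for illustration.

```python
# Bare-bones predictive check: fit a model on today's data, commit to
# predictions, then score them against data that arrives later.
import numpy as np

rng = np.random.default_rng(4)

def make_data(n):                     # stand-in for "the experiment"
    x = rng.uniform(0, 10, n)
    return x, 2.0 + 0.7 * x + rng.normal(0, 1, n)

x_old, y_old = make_data(50)          # data in hand
coef = np.polyfit(x_old, y_old, 1)    # fit a simple linear model

x_new, y_new = make_data(50)          # verifying data comes in later
y_pred = np.polyval(coef, x_new)
mse = np.mean((y_new - y_pred) ** 2)
print(f"out-of-sample MSE: {mse:.2f}  (noise variance here is 1.0)")
# Keep the model if it keeps predicting well; abandon or modify it if not.
```

No p-value anywhere in sight: the model lives or dies on whether its predictions of new observables come true.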
P-value hacking, you asked? According to this study:
P-hacking happens when researchers either consciously or unconsciously analyse their data multiple times or in multiple ways until they get a desired result. If p-hacking is common, the exaggerated results could lead to misleading conclusions, even when evidence comes from multiple studies.
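To see how quickly this inflates false positives, here is a toy simulation of my own devising, not the study’s: with no real effect at all, analyse the same null data ten slightly different ways (the number ten and the “analyses”, here varying exclusion rules, are arbitrary assumptions) and keep the best p-value.

```python
# Sketch of the p-hacking mechanism: under the null (no effect),
# try several plausible analyses and report the smallest p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, sims, looks = 30, 5_000, 10
hits = 0

for _ in range(sims):
    a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)   # null: no effect
    ps = []
    for _ in range(looks):
        keep = rng.random(n) > 0.2   # e.g. varying data-exclusion rules
        ps.append(stats.ttest_ind(a[keep], b[keep]).pvalue)
    hits += min(ps) < 0.05

print(f"nominal false-positive rate: 5%; "
      f"after {looks} looks: {hits / sims:.0%}")
```

Ten peeks at pure noise and the nominal 5% swells well past 5%, with no fraud required, just flexibility.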
The funniest quote comes from one Dr Head (yes): “Many researchers are not aware that certain methods could make some results seem more important than they are. They are just genuinely excited about finding something new and interesting.”
Amen, Head, amen.
Bonus plug! The answer to all will be in my forthcoming book. Updates on this soon.
Thanks to reader Al Perrella for alerting us to this topic.