Reproducibility Crises Will Be Forever With Us Until We Abandon Classical Statistics

The scientist stood before them.
He announced his result with glee.
It sounds impossible?
Can you say it’s not bull?
It’s true ’cause my P is wee!

Until classical statistics, with its focus on parameters and hypothesis testing, is beheaded and buried in a field of garlic irrigated with holy water, with a spike of holly thrust through its heart, we will go on having reproducibility crisis after reproducibility crisis, with their flood of “new” and “novel” results corrupting the minds of eager young scientists, and blowing untold wads of money on error.

Our latest dreary example is from Brazil, noted in the paper “Estimating the replicability of Brazilian biomedical science” by Olavo Bohrer Amaral and what looks like a few hundred other authors. Snippet from the Abstract:

With this in mind, we set up the Brazilian Reproducibility Initiative, a multicenter replication of published experiments from Brazilian science using three common experimental methods: the MTT assay, the reverse transcription polymerase chain reaction (RT-PCR) and the elevated plus maze (EPM). A total of 56 laboratories performed 143 replications of 56 experiments; of these, 97 replications of 47 experiments were considered valid by an independent committee. Replication rates for these experiments varied between 15 and 45% according to five predefined criteria. In median terms, relative effects (expressed as ratios between group means) were 60% larger in original experiments than in replications, while coefficients of variation were 60% smaller. Effect size decrease was smaller for MTT, cell line experiments and original results with less variability, while t values for replications were positively correlated with researcher predictions about replicability. 

Between 15 and 45% stinks. Replication rates, counting direction and size simultaneously, ought to be near 100%. And they would be, if scientists followed rigorous procedures, nobody cheated, and the methods used to claim results were sound.

Cheating is far more common than we’d like, especially given the number of scientists and their absolute need to publish, the money flooding through the system, and the utter ease with which it can be accomplished. Skullduggery can be nothing more than leaving out that one data point which “didn’t look right”. A trick which would bypass any “preregistration of methods” requirement. Or it can be wholesale making it all up, which can now be done at the press of a button.

Procedural rigor is also lacking in science, more than you’d guess. It’s telling that “bench”-oriented research, with its greater complement of fixed measuring devices, was best replicated here. But also consider that those carrying out replications are likely to be a lot more careful than original researchers, who are anxious to see their theories proved out. This is human nature. But then, so is cheating.

What isn’t human nature are the methods. Human nature is a constant, and will ever be with us. Science, like every other area of intellectual endeavor, will never be perfect because of this. But those methods can be changed. And must be.

If we don’t change, we’re going to continue to see results like this forever. This is their Table 1:

The MTT is an assay, as is PCR, while the EPM is “the elevated plus maze test of anxiety”, obviously an ad hoc quantified behavioral outcome. The PI is a prediction interval, CI a confidence interval, the magical wee P you know, and, as always, “significant” means wee P. The only one of these criteria that has any bearing on the Real World is the prediction interval, so ignore the others.
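To make plain why only the prediction interval matters, here is a minimal sketch of my own, with made-up numbers and assuming normal data, not anything from the paper. It contrasts a confidence interval for a parameter (the mean) with a prediction interval for a new observation, which is what a replicator actually measures.

```python
# A sketch with invented numbers: confidence interval for the mean vs.
# prediction interval for a new observation, assuming normal data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=3.0, size=30)  # stand-in for "original" data

n = len(x)
xbar, s = x.mean(), x.std(ddof=1)
t = stats.t.ppf(0.975, df=n - 1)

# 95% confidence interval for the parameter (the mean)
ci = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

# 95% prediction interval for a new observation, i.e. what a replication
# would actually measure
pi = (xbar - t * s * np.sqrt(1 + 1 / n), xbar + t * s * np.sqrt(1 + 1 / n))

print(f"95% CI for the mean:    ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI for a new value: ({pi[0]:.2f}, {pi[1]:.2f})")
```

The CI shrinks toward nothing as the sample grows; the PI never shrinks below the spread of the data itself. Only the PI says anything about what the next experiment will see.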

Again, not surprisingly, those experiments which used considered-to-be objective measurement equipment had the highest replicability, though the levels were still lousy. The behavioral studies, with their wide-open nature, stunk up the place. So much so that if you as a reader of science had a rule of thumb to ignore all sociological-type research, you would not go far wrong.

This includes, I must warn you, even those studies you like.

But wait, it’s even worse. Consider that if a prediction interval from an experiment said you’d see results from negative infinity to positive infinity, every attempt at replication would prove it right. But the experiment would be useless. What’s wanted are tight bounds around observed measures.
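A quick sketch, again mine with invented numbers, of why coverage by itself is no measure of worth:

```python
# Invented numbers: a "replication rate" judged by interval coverage alone
# rewards vague intervals. A wide enough interval always "replicates".
import numpy as np

rng = np.random.default_rng(1)
true_mean, true_sd, n = 10.0, 3.0, 30
# means of 1,000 hypothetical replications of the same experiment
replication_means = rng.normal(true_mean, true_sd / np.sqrt(n), size=1000)

def coverage(lo, hi):
    return np.mean((replication_means >= lo) & (replication_means <= hi))

print("Tight interval (9, 11):  ", coverage(9, 11))
print("Wide interval (0, 20):   ", coverage(0, 20))
print("Infinite interval:       ", coverage(-np.inf, np.inf))
```

The last two “replicate” perfectly and predict nothing. High coverage and tight bounds together are the goal.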

Now take a look at what they found:

The original effect sizes, where larger means a stronger signal in line with whatever theory is being touted, are generally larger than those in the replications. That puts original papers in the same class as advertisements for pop or drugs. “If I take the advertised drug/drink this pop, I’ll be as happy as the people in the ad?” Take careful notice of the points on the graph around 0 to 2 (these are all normalized numbers). A lot of effects which were said to be positive turned out to be negative, i.e. the opposite of what was claimed.
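Here is a toy simulation of my own, with an invented true effect, noise level, and sample size, nothing from their data. It shows how filtering on a wee P inflates “original” effects, while replications of the same modest true effect come out smaller, some with the wrong sign:

```python
# Invented numbers: publish only the "significant" positive results and the
# published effects come out exaggerated; unfiltered replications do not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_effect, sd, n = 0.2, 1.0, 20   # small true effect, noisy measurements
n_sims = 10_000

def run_experiment():
    treat = rng.normal(true_effect, sd, n)
    ctrl = rng.normal(0.0, sd, n)
    res = stats.ttest_ind(treat, ctrl)
    return treat.mean() - ctrl.mean(), res.pvalue

results = np.array([run_experiment() for _ in range(n_sims)])
effects, pvals = results[:, 0], results[:, 1]

# only "significant" positive results get written up as originals
published = effects[(pvals < 0.05) & (effects > 0)]

print(f"True effect:                  {true_effect:.2f}")
print(f"Mean 'published' effect:      {published.mean():.2f}")
print(f"Mean replication effect:      {effects.mean():.2f}")
print(f"Replications with wrong sign: {100 * np.mean(effects < 0):.0f}%")
```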

I have written so many times in so many different ways to show you that the methods upon which Science relies are wrong. Not just off by a bit, but wrong. Every use of a p-value is a formal fallacy. Every. At the least, the methods are guaranteed to produce preposterous over-certainty. I don’t know how else to tell you. I’m open to ideas.
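If you want to feel the over-certainty yourself, run the identical experiment over and over and watch the p-value dance. Another toy of mine with invented numbers:

```python
# Invented numbers: the same true effect, the same protocol, wildly
# different p-values from run to run.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_effect, sd, n = 0.3, 1.0, 30

pvals = sorted(
    stats.ttest_ind(rng.normal(true_effect, sd, n), rng.normal(0.0, sd, n)).pvalue
    for _ in range(20)
)
print([f"{p:.3f}" for p in pvals])
```

The same world gives Ps from well under the magic number to well over it. Declaring truth because one draw happened to be wee is the over-certainty I mean.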

If you want to know more, you can take the Class. Free, no sign up. Show up here every Thursday.

