*The first part of this article also appears at the Broken Science Initiative. Go there to read it, too, and many other good ones by other authors.*

We already saw a study from Nate Breznau and others in which a great number of social science researchers were given identical data and asked to answer the same question, and in which the answers to that question were all over the map, with about equal numbers answering one thing and others answering its opposite, all answers with varying strengths of association, and with both sides claiming “statistical significance.”

If “statistical significance” and statistical modeling practices were objective, as they are claimed to be, then all those researchers should have arrived at the same answer, and the same significance. Since they did not agree, something is wrong with either “significance” or objective modeling, or both.

The answer will be: both. Before we come to that, Breznau’s experiment was repeated, this time in ecology.

The paper is “Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology” by Elliot Gould and a huge number of other authors, from all around the world, and can be found on the pre-print server EcoEvoRxiv.

The work followed the same lines as with Breznau. Some 174 teams, with 246 analysts in total, were given two identical datasets and they were asked “to investigate the answers to prespecified research questions.” One dataset was to “compare sibling number and nestling growth” of blue tits (*Cyanistes caeruleus*), and the other was “to compare grass cover and tree seedling recruitment” in *Eucalyptus*.

The analysts arrived at 141 different results for the blue tits and 85 for the *Eucalyptus*.

For the blue tits (with my emphasis):

For the blue tit analyses, the average effect was convincingly negative, with less growth for nestlings living with more siblings, but there was near continuous variation in effect size from large negative effects to effects near zero, and even effects crossing the traditional threshold of statistical significance in the opposite direction.

Here’s a picture of all the results, which were standardized (that Z_{r} in the figure) across all entries for easy comparison.

The center of each vertical line is the standardized effect found for each research result, with the vertical lines themselves being a measure of uncertainty of that effect (a “confidence interval”), also given by the researchers. Statistical theory insists most of these vertical lines should overlap on the vertical axis. They do not.
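To make the quantities in the figure concrete, here is a minimal sketch. It assumes the standardized effect Z_r is a Fisher z-transformed correlation with approximate standard error 1/sqrt(n - 3); that is an assumption for illustration, and the paper should be consulted for the exact standardization used. The analyst names, correlations, and sample size are invented.

```python
import math

def fisher_z(r):
    """Fisher z-transform of a correlation coefficient r, with |r| < 1."""
    return math.atanh(r)

def z_interval(r, n, crit=1.96):
    """Approximate 95% interval for Z_r, using standard error 1/sqrt(n - 3)."""
    z = fisher_z(r)
    se = 1.0 / math.sqrt(n - 3)
    return (z - crit * se, z + crit * se)

def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical correlations reported by three analysts given the same data.
reported = {"analyst_A": -0.40, "analyst_B": -0.05, "analyst_C": 0.30}
n = 200  # hypothetical sample size, identical for everyone

intervals = {name: z_interval(r, n) for name, r in reported.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

With these made-up numbers no two intervals overlap, which is the pattern the figure shows for many pairs of analyses.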

The red dots, as the text indicates, are for negative effects that are “statistically significant”, whereas the blue are for positive, also “significant”.

Now for the *Eucalyptus* (again my emphasis):

[T]he average relationship between grass cover and Eucalyptus seedling number was only slightly negative and not convincingly different from zero, and most effects ranged from weakly negative to weakly positive, with about a third of effects crossing the traditional threshold of significance in one direction or the other. However, there were also several striking outliers in the Eucalyptus dataset, with effects far from zero.

Here’s the similar picture as above:

This picture has the same interpretation, but notice the “significant” negative effects are more or less balanced by the “significant” positive effects. With one very large negative effect at the bottom.

If statistical modelling were objective, and if statistical practice and theory worked as advertised, all results should be the same, for both analyses, with only small differences. Yet the differences are many and large, as they were with Breznau; therefore, statistical practice is not objective, and statistical theory is deeply flawed.
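How can two analysts get opposite signs from the same data? A toy sketch follows, using invented numbers (not the blue tit or *Eucalyptus* data) and a hypothetical grouping variable called `site`. One analyst regresses `y` on `x` directly; the other first adjusts for `site` by centering within each group, which gives the same slope as a regression with a separate intercept per site.

```python
def slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def center_within(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    means = {}
    for g in set(groups):
        members = [v for v, gg in zip(values, groups) if gg == g]
        means[g] = sum(members) / len(members)
    return [v - means[g] for v, g in zip(values, groups)]

# Invented data: within each site y falls as x rises, but site 1 sits higher.
x = [1, 2, 3, 4, 6, 7, 8, 9]
y = [9, 8, 7, 6, 14, 13, 12, 11]
site = [0, 0, 0, 0, 1, 1, 1, 1]

pooled = slope(x, y)  # analyst 1 ignores site
adjusted = slope(center_within(x, site), center_within(y, site))  # analyst 2 adjusts for site

print("pooled slope:", round(pooled, 3))
print("site-adjusted slope:", round(adjusted, 3))
```

Here the pooled slope is positive and the site-adjusted slope is negative. Neither analyst made an arithmetic mistake; each answered the question conditional on a different set of modeling premises.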

There are many niceties in Gould’s paper about how all those analysts carried out their models, with complexities about “fixed” versus “random” effects, “independent” versus “dependent” variables, variable selection and so forth, even out-of-sample predictions, which will be of interest to statisticians. But only to depress them, one hopes, because *none* of these things made any difference to the outcome that researchers disagreed wildly on simple, well defined analysis questions.

The clever twist with Gould’s paper was that all the analyses were peer reviewed “by at least two other participating analysts; a level of scrutiny consistent with standard pre-publication peer review.” Some analyses came back marked “unpublishable”, other reviewers demanded major or minor revisions, and some said publish-as-is.

Yet the peer-review process, like details about modeling, *made no difference either*. The disagreements between analysts’ results were the same, regardless of peer-review decision, and regardless of modeling strategies. This is yet more evidence that peer review, as we have claimed many times, is of almost no use and should be abandoned.

If you did not believe Science was Broken, you ought to now. For both Breznau and Gould prove that you must not trust *any* research that is statistical in nature. This does not mean all research is wrong, but it does mean that there’s an excellent chance that if a study in which you take an interest were to be repeated by different analysts, the results could change, even dramatically. The results could even come back with an opposite conclusion.

What went wrong? Two things, which are the same problem seen from different angles. I’ll provide the technical details in another place. Those with some background could also benefit from reading my *Uncertainty*, where studies like the ones discussed above were anticipated and discussed.

For us, all we have to know is that the standard scientific practice of model building does not guarantee, or even come close to guaranteeing, that truth has been discovered. All the analyses handed in above were based on model fitting and hypothesis testing. And these practices are nowhere near sufficient for good science.

To fix this, statistical practice must abandon its old emphasis on model fitting, with its associated hypothesis testing, and move to making—and *verifying*—predictions made on data never seen or used *in any way*. This is the way sciences like Engineering work (when they do).

This is the Predictive Way, a prime focus of Broken Science.

## Technical Details

I’ve written about this so much that all I can do is repeat myself. But, for the sake of completeness, and to show you why all these current researchers, as with Breznau, thought their models were the right models, I’ll say it again.

All probability is conditional, meaning it changes when you change the assumptions or premises on which you condition it. Probability does not exist. All these statistical models are probability models. They therefore express only uncertainty in some proposition (such as size of nestling growth) *conditional on* certain assumptions.

Some of those assumptions are on the model form (“normal”, “regression”, etc.), some on the model parameters (the unobservable guts inside models), some on the “data” input (which after picking and choosing can be anything), some on the “variables” (the explicit conditions, the “x”s), and so forth.

Change *any* of these conditions, and you change the uncertainty in the proposition of interest. (Usually: changed conditions which do not *anywhere* change the uncertainty in the proposition are deemed *irrelevant*, a far superior concept to “independence”, if you’ve ever heard of that.)
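A small sketch of what conditioning on premises means in practice. The setup is assumed for illustration: a six-sided die and a coin, with every outcome the premises allow treated as equally likely. The function name `probability` and the premises themselves are hypothetical, not taken from any paper.

```python
from fractions import Fraction
from itertools import product

def probability(proposition, premise, outcomes):
    """P(proposition | premise), counting the outcomes the premise allows as equally likely."""
    allowed = [o for o in outcomes if premise(o)]
    favorable = [o for o in allowed if proposition(o)]
    return Fraction(len(favorable), len(allowed))

# Every (die face, coin side) pair.
outcomes = list(product(range(1, 7), "HT"))

def shows_six(outcome):
    return outcome[0] == 6

# Premise 1: only that a six-sided die and a coin were thrown.
p_base = probability(shows_six, lambda o: True, outcomes)
# Premise 2: we also know the die shows an even number.
p_even = probability(shows_six, lambda o: o[0] % 2 == 0, outcomes)
# Premise 3: we also know the coin shows heads.
p_heads = probability(shows_six, lambda o: o[1] == "H", outcomes)

print(p_base, p_even, p_heads)
```

The die never changed. Adding the premise about evenness changes the probability of a six; adding the premise about the coin leaves it where it was, which in this toy setting is what makes that premise irrelevant to the proposition.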

That’s it. That’s all there is to it. Or, rather, *that’s as it should be*. That’s the predictive way. That’s the way I advocate in *Uncertainty*.

What happens instead is great confusion. People forget why they wanted to do statistics—quantifying uncertainty in the propositions of interest—and become beguiled by the math of the models. People forget about uncertainty and instead focus on the guts of the model.

They become fascinated by the parameters. Likely because they believe probability is real, an ontological substance like tree leaves or electricity. They think, therefore, that their conclusions about model parameters are statements about the real world, about Reality itself! Yet this is not so.

Because of the belief in ontological probability, under which parameters must necessarily also be real, we have things like “hypothesis testing”, an archaic magical practice, not unlike scrying or voodoo. Testing leads to the spectacle of researchers waving their wee Ps in your face, as if they have made a discovery about Reality!

No.

That’s what happened here, with Gould, with Breznau, and with any other collection of studies you can find which use statistics. Only with most collections people don’t know it’s happened to them. Yet.

It’s worse than this. Because even if, somehow, hypothesis testing made any sense, researchers still stop short and only release details about *the model parameters!* This causes mass widespread continuous significant grief. Not the least because people reading research results forget they are reading about parameters (where uncertainty may be small) and think they are reading about Reality (where uncertainty may remain large).
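The gap between parameter uncertainty and uncertainty in the observable can be shown with arithmetic. This is a minimal sketch assuming a simple normal model with a known spread `sigma`; the model and the numbers are assumptions made for illustration only.

```python
import math

def parameter_se(sigma, n):
    """Standard error of the estimated mean (a parameter) after n observations."""
    return sigma / math.sqrt(n)

def predictive_sd(sigma, n):
    """Standard deviation of the prediction for one new observation."""
    return sigma * math.sqrt(1.0 + 1.0 / n)

sigma = 10.0  # assumed known spread of the observable
for n in (10, 100, 10_000):
    print(n, round(parameter_se(sigma, n), 3), round(predictive_sd(sigma, n), 3))
```

As `n` grows, the uncertainty in the parameter shrinks toward zero, while the uncertainty in the next observation never drops below `sigma`. A report that gives only the first number invites readers to mistake it for the second.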

I say this, have said this, yet I am doubted. I wrote a whole book discussing it, and am doubted. But that is exactly why Gould and Breznau exist. This is vindication.

The fix? There is no fix. There is a slight repair we can make, by acknowledging the conditional nature of probability, that it is only epistemological, that at a minimum the only way to trust any statistical model is to observe that it has made skillful (a technical term) useful (a technical term) predictions of *data never before seen or used in any way*.
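One way to picture verification on unseen data is a skill score. The sketch below assumes a particular definition, one minus the ratio of the model's mean squared error to that of a naive reference, and a particular reference (always predict the old average). Both choices, and all the numbers, are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def mse(predictions, observations):
    """Mean squared error of predictions against what was actually observed."""
    return sum((p - o) ** 2 for p, o in zip(predictions, observations)) / len(observations)

def skill(model_pred, reference_pred, observations):
    """Positive when the model beats the reference on the new data, negative when it loses."""
    return 1.0 - mse(model_pred, observations) / mse(reference_pred, observations)

# Old data: the only data used to build the model.
x_old = [1, 2, 3, 4, 5]
y_old = [2.1, 3.9, 6.2, 7.8, 10.1]

# New data: observed after the model was fixed, never used in fitting.
x_new = [6, 7, 8]
y_new = [12.2, 13.8, 16.1]

a, b = fit_line(x_old, y_old)
model_pred = [a + b * x for x in x_new]
reference_pred = [sum(y_old) / len(y_old)] * len(x_new)  # naive reference

print("skill on new data:", round(skill(model_pred, reference_pred, y_new), 4))
```

The point of the design is the order of operations: the model is frozen first, the new observations arrive afterward, and only then is it scored. A model that cannot beat the naive reference on data it has never seen has shown no skill, whatever its fit to the old data.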

It’s the last that’s a killer. We can just about convince people to think about uncertainty in observables (and not parameters). But that bit about waiting for confirmation of model goodness.

Waiting? *Waiting?*

That’s too expensive! That limits us! That slows science down!

It does. Slows it way down. For which I can only respond with “You’re welcome.” Do you even remember the last three years of The Science?


What’s missing?

Honesty or a moral compass. It’s hard to connect those two characteristics with research results. But it would be a start.

I once was involved in a Ph.D program and was ABD. I submitted a paper for review and responses ranged from groundbreaking to a complete waste of time. That helped me to move on. What really caused me to move on was my wife was fired from her job and I had three children in college/private schools.

Presented similar data to manufacturers who paid us nicely. Never regretted not having Ph.D.

How can “probability” not exist yet be “conditional”? How can “statistical models” exist if they are models of something that does not exist?

Please clarify / explain 🙂

gareth,

They are epistemological only, a way to quantify level of uncertainty.

“How can “statistical models” exist if they are models of something that does not exist?”

There is no such thing in the real world as a straight line or circle or square or right angle etc. They are extremely useful concepts and our modern world is the result of these concepts. But they are all mental abstracts. We then use these concepts to make physical entities that are extremely close to our mental abstracts.

We then have a bridge or a car or a violin etc.

Gareth,

When you talk about a coin being flipped, the conventional statistical explanation is that there is a 50% chance of it coming up heads on every flip. This is viewed as a property of the coin itself. If the chance is not 50%, then this is due to the coin having a different probability, not for any other reasons. The discussion will usually go immediately to models and parameters. That is, something like “the coin has a Bernoulli distribution with independent parameter p for the chance of seeing a heads and dependent parameter q = 1-p for the chance of seeing a tails. In this situation the parameter p = .5” Or they might say something like “we believe p to be .5, but there is a certain margin of error.” In any case p is viewed to be a real property of the coin itself.

It’s those things that don’t exist. The coin isn’t reading off of a Bernoulli distribution when it flips, and it does not have any intrinsic property which corresponds to the .5.

What we can do is associate probability with epistemological conditional statements. For example: “Supposing the coin can only take on the values of heads or tails, and each of these is equally likely, then the chance of it coming up heads is 50%.” What this really means is that from the given information our level of confidence that it is heads is only 50%. We could contrast this with the statement “supposing this die can take only the values of 1, 2, 3, 4, 5 or 6 and each is equally likely, then the chance of it coming up 1 is 1/6.” That is, we should have less confidence in the die coming up 1 than we do in the coin coming up heads. But all of this is based on the conditions in those statements. For example, suppose we instead knew “the coin can only take on the values of heads or tails, and it is being flipped by someone who has practiced for years to have it come up heads.” That’s not enough to get an exact percentage, but surely the confidence should be strictly between 50% and 100%. Above 50%, because the coin flipper has skill in getting a heads, but below 100%, because everyone makes mistakes. The coin itself has not changed, so if probability were intrinsic to the coin we should have gotten the same answer, but it’s clear that our probability should change. Therefore it is not intrinsic to the coin but instead is conditional on our knowledge.

If you talk to day traders rather than professional statisticians you will see this perspective of probability being used all the time.

“They become fascinated by the parameters. Likely because they believe probability is real, an ontological substance like tree leaves or electricity. They think, therefore, that their conclusions about model parameters are statements about the real world, about Reality itself! Yet this is not so.”

Ah, so you’ve encountered quantum physics. The Copenhagen Interpretation is nothing but the insistence that statistics are more real than reality.

Excellent post. I view the results in the context of Galit Shmueli’s paper “To Explain or to Predict?”. Explanatory modeling, which you describe nicely, almost always fails to accurately predict. Also, explanatory modeling is inherently reductionist and cannot cope with system complexity. I do think Leo Breiman’s warning about the instability of model building algorithms is at work here. The “one shot” approach to modeling results in models that are very unstable. Even if one does predictive modeling in an appropriate manner, instability is still a factor. We need more focus on ensemble modeling to overcome instability. In any case, thanks for the great discussion and examples. More ammunition I can use when warning scientists about models.

@JerryR, “What’s missing?

Honesty or a moral compass.”

Maybe, but I think it’s more likely Briggs has the right answer. Ignorance. People with PhDs don’t tend to be the best and brightest, regardless of what they (and others) think, but rather those who can correctly memorize and parrot the things their professors say. That’s not to say they could not learn things outside the box, but rather that they have not, at least in this regard. The institution rewards mimicry, and the institution believes in statistical modeling.

The “one funeral at a time” quip is starry-eyed idealism.

@Briggs, Rudolph Harrier:

Yes, I know this. I was querying the description. Why? Because if I were to make such a statement in conversation with someone (should I find such) interested enough to engage in debate, this is what they might well respond.

What I think is meant is that probability is entirely a mental construct, a model: it is not a physical thing, it has no independent existence and is not a property of any physical thing (such as a die or a coin). It is merely a way of looking at the behavior that we observe of the world.

So, maybe better to say “All probability is conditional . . . Probability does not exist except as a mental model of the supposed behavior of real world things. All these statistical models are probability models.” Would this be (approximately) correct?

“People with PhDs don’t tend to be the best and brightest, regardless of what they (and others) think…”

Steve–that’s exactly what I’ve observed in working with and dealing with people with PhDs for more than three decades. After all those interactions, I’m still waiting to be impressed. Anyone with average IQ can get a PhD. What’s really required isn’t powerful intellect, but obsession and persistence.

lol. Mr. Briggs, then you should not reference the results of the paper found in EcoEvoRxiv since the authors employ models with parameters. The field of statistics needs you to save it. So let’s see how you would analyze the data! Your chance of being famous.

Practitioners’ misuse of statistical methods only makes competent statisticians more valuable.

JH — never preface a comment with “lol”. It marks you as an unserious person.

I wonder what the study participants thought about the wide variety of results. It would be interesting to hear their discussions of what might have gone wrong. Did anyone learn anything? Interesting discussion here, thanks to all who commented. Even you, JH. You’re an interesting naysayer.

I wonder what YOS thinks of this. That dude’s been oddly silent.

Hagfish- YOS passed on about a month ago, to the detriment of us all.

Doing statistics and modeling properly requires brainpower most people don’t have. The problem is lack of merit in academia and government. There are only a fraction of useful and productive people, the rest are riding on their coattails for what they can grab at other people’s expense. It’s all fake and gay.

While acknowledging that probability is a mathematical construct that does not exist in the real world like measurables such as time, length, mass etc, it often consists of these real-world measurables. What is a histogram of the compressive strengths of a concrete bridge, for example? Is this real or not real? Is not such a histogram effectively a form of a probability density function? Can it be said that it does not exist?

Probability theory can be applied very effectively on real physical problems forward, as above, and in reverse. For example, measure theory can be used to deconstruct some physical object to define physical measures other than length, mass etc, but that are real and measurable, in order to solve a problem that otherwise might have appeared to be impossible.

To make the bold statement that “probability does not exist”, while absolutely true in the most theoretical sense, betrays the fact that it can be applied very effectively as though it does exist when it consists of properly acquired real physical measurements. Therefore from the viewpoint of a practitioner, I would argue that the distinction between it being real or not real is a moot point.

The key issue here is, I think, that too many people understand statistical models as broadly similar to equations, or sets of equations, describing physical processes – i.e. they understand in the abstract that correlation is not causation, but treat their own stats models as implying causation.

In physics y = something means that the relationship(s) is/are understood; have been measured/verified; and can be relied on. In the social “sciences” a pee value (or other statistic) attached to y=something usually means little more than “some data (whose origin and accuracy is dubious) suggests that the something has in some cases changed in some way relative to Y” but is usually interpreted as the something causes Y (some processing of the statistic used) of the time.

With a fair coin there will be a 50% chance of it coming up heads on every flip. So, what is a “fair coin”? One that has a 50% chance of it coming up heads on every flip.

I see a circular definition there. 🙂

The field of statistics is doomed? Nah. When the accuracy of the results matters, when there is an actual price to be paid for being wrong, the field seems to be doing just fine. When the accuracy of the results doesn’t matter, good statistical practices can die of neglect and no one cares. When a specific result is demanded, good statistical practices will be tortured to death in service of the cause. But this is true of any discipline when its proper application gets in the way of a desired result and accountability is missing or perverted.

Blaming scientific malpractice when humans are rewarded for that malpractice will be as effective as complaining about the weather. Solutions? Removing the reward for malpractice would be the obvious approach, but it seems to be far too entrenched. Rush Limbaugh used to complain about how much damage a handful of activists could do with a fax machine and an important-sounding letterhead. Rush also used to say “the aggressor sets the rules”. When your opponent is clobbering you with their Experts, what would any good trial lawyer do?

Another approach is what the conservative media is already doing – laughing at their stupid studies, and then fighting tooth and nail their attempts to censor our laughter.

The source data may not be accurate. The people in the field who went out and counted the number of starlings in the nest may not have gathered correct data.

This is especially true in climate science, where the scientists can’t even agree on what is being counted and how. Someone mentioned moral compass. Not only is there disagreement, there is active cheating.

De Finetti stated in the Foreword of his book titled ‘Theory of Probability’ that

I don’t see how the belief that “probability doesn’t exist” is relevant to the paper’s conclusion that

However, I admit that I am not all-seeing.

I don’t need to go through the experiment described in the paper, mimicking the one in another paper, to put forth the conclusion and to embarrass colleagues. I shall not comment on the coherency of statistical analyses employed and their data manipulations in the paper as it’s not going to be pretty, and I have no time.

If the main authors have included reviewers’ comments that include merits and shortcomings and suggestions for improvement, perhaps I’d conclude the paper is worth something. One can always learn from errors.

The pressure of publish or perish can be overwhelming. International statistical associations have published guidelines for practitioners and warned them that the pressure should not be the reason to publish junk and waste time on useless research objectives. But to no avail.

There is plenty of junk science in certain fields, but can Science itself be broken by human errors? What is science?

Hagfish Bagpipe, thanks for the pointer but I did find the title hilarious as explained in my previous comment.

“ Only with most collections people don’t it’s happened to them.” ? I think that, statistically, you need a proofreader.

Not to “pile on”, but this is in the same family of studies:

https://journals.sagepub.com/doi/10.1177/2515245917747646
