Critique Of Specification Curve Analysis

Dear reader, this will be a tad difficult, but I urge you to plow through as best you can, because this is becoming “a thing” in some areas. I stripped out most technicalities.

It will not, or rather should not, come as any surprise that most statistical models are entirely ad hoc. Researchers have some data, and they want to say what explains or is “correlated” with some “outcome” in that data. The outcome is one of the measures in the data. One researcher in a group will say, “Why not a regression?” And his colleagues will say, “Why not?”

As I have written an untold number of times, the goal of all analyses is to discover the cause, and conditions relevant to that cause, related to the outcome of interest. If we could measure these, we could easily put them in a model in some way appropriate to the nature of the outcome and causes acting on it. Our job would be done.

But for complex events, which are any that involve human behavior, this is not possible. With very rare trivial exceptions, we’ll never get full causes for groups of people. Take the paper “Grilling the data: application of specification curve analysis to red meat and all-cause mortality” by Yumin Wang and others in the Journal of Clinical Epidemiology.

People eat meat at various times, and in various quantities, at different locales and ages, not to mention other foodstuffs, and engage in an almost limitless number of behaviors that are related to the causes of conditions of dying.

Now we could collect some, but never all, data on this, not even for one person, let alone for large groups. Add to that the truth that no two people will be affected in the same way by all the correlates measured, and we arrive at the conclusion our models can never do better than to suggest correlations that may or may not be causally or conditionally related to the outcome. And only on average.

This is not to say such modeling is useless, and that correlations are not interesting or helpful. But it should be the cause of humility in researchers, and should stop them from making over-bold claims.

Again, most of the models used in fields like this are ad hoc, like regressions. Which, if you recall their math, are correlation-only by design: they only say how probabilities change by changing values of the measures in the model.

But hope springs eternal, and people will try to eke out as much as they can from their data and models, hoping they really have found cause. Enter the idea of “Specification Curve Analysis”. Do not be intimidated by the title. It sounds more sophisticated than it is.

Here’s how it works, as we gather from the 2020 introductory paper of the same name by Uri Simonsohn, Joseph P. Simmons, and Leif D. Nelson in Nature Human Behaviour.

The idea is simple enough. Instead of reporting on just one ad hoc model, which uses a given list of measures, report on other similar models, which may use slightly different combinations of those measures. If the goal is hunting for wee Ps, which it very sadly usually is, then report a curve of these Ps for all the different models and data combos. Or do the same for whatever metric is used to designate success, like effect size or (as below) hazard rates.

Maybe the simplest example, which I am making up, is this. Suppose our outcome is death in some year. A yes or no outcome. And we measure, or rather guess, total red meat consumption for that year for each person (good luck getting exact measurements!), for however many people we happen to get, under whatever circumstances.

We could do, say, an ad hoc logistic regression on this data, the outcome a function of red meat. Very common model.

But we could have done other ad hoc models, too. Maybe a random forest. Or a probit. Or some fancy “AI”. Whatever. We do all these models, as well as the logistic. And then report on the range across these models of how red meat is correlated with death.

We didn’t just have to put red meat in the models as itself, a raw measure. We could have also put in its square, as is done, or its log, as is also common. So we add these to the mix of measure and model manipulations, and report on the lot.
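To make the made-up example concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are simulated so that death has no connection to meat at all, the only model is a one-predictor logistic regression, and the "specifications" are three transformations (raw, square, log) crossed with three hypothetical sample choices. The names `fit_logistic`, `standardize`, and `specification_curve` are mine, not from any paper or library.

```python
import math
import random

def fit_logistic(x, y, iters=25):
    """One-predictor logistic regression by Newton-Raphson; returns (intercept, slope)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * delta = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def standardize(v):
    """Rescale to mean 0 and SD 1 so slopes are comparable across transformations."""
    m = sum(v) / len(v)
    s = math.sqrt(sum((a - m) ** 2 for a in v) / len(v))
    return [(a - m) / s for a in v]

def specification_curve(meat, died):
    """Fit every transform-by-sample combination; return sorted odds ratios per SD."""
    ranked = sorted(meat)
    n = len(meat)
    cut_lo, cut_hi = ranked[n // 10], ranked[n - n // 10]
    transforms = {"raw": lambda m: m, "square": lambda m: m * m, "log": math.log}
    samples = {
        "all": lambda m: True,
        "drop_low": lambda m: m >= cut_lo,    # hypothetical choice: exclude lightest eaters
        "drop_high": lambda m: m < cut_hi,    # hypothetical choice: exclude heaviest eaters
    }
    results = []
    for t_name, t in transforms.items():
        for s_name, keep in samples.items():
            rows = [(t(m), d) for m, d in zip(meat, died) if keep(m)]
            x = standardize([r[0] for r in rows])
            y = [r[1] for r in rows]
            _, slope = fit_logistic(x, y)
            results.append((math.exp(slope), t_name, s_name))
    return sorted(results)

random.seed(1)
n = 2000
meat = [random.uniform(0.5, 10.0) for _ in range(n)]
# Assumption: death is simulated with NO dependence on meat whatsoever.
died = [1 if random.random() < 0.2 else 0 for _ in range(n)]

curve = specification_curve(meat, died)
for odds_ratio, t_name, s_name in curve:
    print(f"{odds_ratio:.3f}  {t_name:<6}  {s_name}")
```

The sorted list is the "curve". Even with nothing going on in the simulated data, the nine odds ratios differ from one another, which is the spread an SCA plot displays.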

This is not the worst idea in the world.

One expects this technique would induce that humility I spoke of, and which is so lacking in science. After all, if the range of metrics across these models and measure manipulations is large, one should suspect that one doesn’t know what red meat and death have to do with each other. Especially because, by running all these variations, you are admitting you do not know what the right model and measure manipulations are.

But my guess is that it will have the opposite effect. It will be impossible to keep one’s eyes off the largest or strongest metrics in the mix. Researchers will become like the lousy golfer who routinely scores 110—on the front nine (me)—but who, after hitting the green in one on a rare outing, says to himself, “Boy am I good.”

In other words, we’ll see a lot of post hoc reasoning. “Oh, so that combination of measures in that particular model configuration gives the best results, in the sense of confirming my suspicions; therefore, this combo is probably the right one.” We might not even hear that the authors used a specification curve analysis (SCA); we’ll only see the final model.

In the 2024 red meat paper noted above, the authors say that SCA is “a novel analytic method that involves defining and implementing all plausible and valid analytic approaches for addressing a research question”. (We ignore here the now-obligatory use of novel, which must, by kingly decree, be in all science papers.)

This shows you how quickly things can get out of hand. It is not so that SCA uses “all” plausible, or all valid, analytical approaches. This is because the number of models is always infinite for any set of data. And (as long-time readers will recall) it is always possible to specify for any set of data a model that explains it perfectly, or to an arbitrary degree of precision.
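A small sketch of the perfect-fit point, under a simplifying assumption: one predictor with distinct values. Lagrange interpolation is just one of many ways to build such a model; the function name `lagrange_fit` and the numbers are made up for illustration.

```python
def lagrange_fit(xs, ys):
    """Return a polynomial passing through every (x, y) pair; xs must be distinct."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

xs = [1, 2, 3, 4, 5, 6]
ys = [3.1, -2.0, 7.5, 0.0, 4.2, -1.3]   # arbitrary made-up "observations"

model = lagrange_fit(xs, ys)
# Largest miss on the data the model was built from
max_error = max(abs(model(x) - y) for x, y in zip(xs, ys))
print("largest in-sample error:", max_error)
print("prediction at a new point x=7:", model(7))
```

The in-sample fit is as good as a fit can be, and it says nothing about whether the prediction at a new point is any good. Change the `ys` to any other numbers and the same thing happens.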

That the authors did not know this means SCA can already be a way to falsely proclaim certainty in results. All you need do is say you tried every “plausible and valid analytic approach”, and thus claim success. But of course you could not have: nobody could. Not unless the model is fully causal in the way I described above.

Here’s the lack of humility showing itself even stronger (my emphasis): “We enumerated all defensible combinations of analytic choices to produce a comprehensive list of all the ways in which the data may reasonably be analyzed.”

All. What else can you do but sigh? Especially when you read in their Methods, “We are unable to test for all possible combinations of covariates due to computational feasibility.” They kept a “core” group of measures, which were the same in all model combos, and varied some of the others. Which is not, I hope you see, all combinations.

Here are their results, which are worth a glance:

Our specification curve analysis included 1208 unique analytic specifications, of which 435 (36.0%) yielded a hazard ratio equal to or more than 1 for the effect of red meat on all-cause mortality and 773 (64.0%) less than 1. The specification curve analysis yielded a median hazard ratio of 0.94 (interquartile range: 0.83–1.05). Forty-eight specifications (3.97%) were statistically significant, 40 of which indicated unprocessed red meat to reduce all-cause mortality and eight of which indicated red meat to increase mortality.

Here’s the relevant portion of the picture:

(It’s hard to read in the paper, too.) The y-axis is hazard ratio, the dots are the estimates of the various combos, with confidence intervals, and the x-axis the combo number. Some of the results were “significant” against meat, some were “significant” for meat. Most were the middle.

To the authors’ credit, they don’t make any strong claims about the results, other than that SCA works.

This picture should remind you of Nate Breznau’s work, now repeated by others, in which the same data was handed to groups of researchers, and all were asked to answer the same questions; they had a range of results that looked just like this picture (blog, Substack). The lesson we learned from that is the same one I mentioned above: because of the blizzard of results, we don’t know what to believe.

Which is proper to admit—when, in fact, we don’t know the cause.

Which also means we should not be tempted to look at some equally weighted average of all these results. After all, if the set of measures that are used are flawed, or the models are, no combination of them will set things right. We’re in the realm of the Emperor’s Nose Voting Fallacy (blog, Substack). The average of ignorance is not intelligence.

Specification curve analysis cannot remove bias. Because that comes in picking the measures that will be used, and their transformations, and the models used. Even if all the combos supported some hypothesis (suppose all or most said meat was bad for you), it does not follow that we have more weight for the hypothesis than if we had used only one.

Remember: all probability is conditional on the assumptions made. Each of these combos varies conditions, and so the probabilities on the outcomes change accordingly. Each probability is locally correct, given the combo (and assuming no errors or cheating). Which means we can’t pick any combo as best or worst based on the results it gives, because those results are locally correct.

What we want is that combo which is closest to Reality. That is what we always want with any model: that its premises match Reality as closely as we can get. If you don’t know which combo best matches Reality, which is what you’re admitting when you do an SCA, then it will be too easy to fool yourself or to succumb to temptation.

Look: as I have said a very large number of times, we must move away from model-fit techniques. SCA is yet another model-fit technique. It announces models’ fits. To have confidence in any model, that model needs to skillfully predict observations never before used in the model-fitting process. SCA does not do this.
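Here is a rough sketch of what checking predictive skill could look like, as opposed to reporting fit. The assumptions are loud ones: the data are simulated with no meat effect, the model is a crude straight-line probability model fit by least squares (a stand-in, not anything from the papers discussed), and skill is measured with the Brier score against a reference forecast that always predicts the training base rate. The names `fit_line` and `brier` are mine.

```python
import random

def fit_line(x, y):
    """Least-squares straight line; returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def brier(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)

random.seed(7)
n = 1000
meat = [random.uniform(0.0, 10.0) for _ in range(n)]
# Assumption: death is simulated with NO dependence on meat.
died = [1 if random.random() < 0.2 else 0 for _ in range(n)]

# Fit on the first half only; the second half is never seen during fitting.
train_x, test_x = meat[: n // 2], meat[n // 2 :]
train_y, test_y = died[: n // 2], died[n // 2 :]

intercept, slope = fit_line(train_x, train_y)
base_rate = sum(train_y) / len(train_y)

def predict(xs):
    # Clip to [0, 1] so the straight line always yields a valid probability
    return [min(1.0, max(0.0, intercept + slope * x)) for x in xs]

# Skill = 1 - (model Brier score) / (reference Brier score); above 0 beats the reference.
in_sample_skill = 1.0 - brier(predict(train_x), train_y) / brier([base_rate] * len(train_y), train_y)
skill = 1.0 - brier(predict(test_x), test_y) / brier([base_rate] * len(test_y), test_y)

print("in-sample skill:    ", round(in_sample_skill, 4))
print("out-of-sample skill:", round(skill, 4))
```

On the data used for fitting, the line can never do worse than the base rate, so in-sample "skill" is never negative. Only the held-out number tells you whether the model predicts anything, and an SCA as described reports none of these.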

I repeat: SCA does not do that. It is nothing but an ordinary model fitting method, done in bulk. If it leads to the necessary humility and acknowledgement of Uncertainty, then it will have proven some positive benefit. But I’m concerned it will have the opposite effect of producing even more over-certainty.

As evidence for that, consider the original SCA paper by Simonsohn and colleagues, linked above. They labor mightily to produce yet another way to fret about null hypotheses and get p-values.

One of their examples was for the goofy claim “that hurricanes with more feminine names have caused more deaths.”

No, really. That was a bona fide 2014 paper in PNAS called “Female hurricanes are deadlier than male hurricanes”. The idea was that when some dumb white hick heard the hurricane descending upon him was a girl, he would try to ride it out and die. But when it was a man, he would rightly flee. Gender theory for storms.

This is an absurd, woke, asinine claim, which in no way should be taken seriously, let alone have p-values calculated from it. It doesn’t even matter that SCA more or less “confirmed” this judgement using wee p-values. No analysis was needed at all. Especially knowing most men instinctively flee from females when their flapping generates furious winds.

And so, once again, we have over-certainty.


  1. Wow! Thanks for the nice exposition of SCA. It sounds a lot like little kids playing at meta-analysis (which is extremely difficult, even for experts). I always loved to teach my students about Fisher’s chi-square test combining p-values, the perfect way to turn chickensh*t into fried chicken.

  2. This was an interesting read, I liked it. Here’s an alternative attack on SCA: the graph it outputs can’t be converted into information because it’s polluted by a lack of intelligence. Consider: the number of models that go into the graph is dependent on the number of model parameters, as well as the ranges those parameters can take. Which means that if you accidentally have two base models, one whose parameters have only a small valid range, and another whose parameters have a large valid range, the latter model will be overrepresented in the final graph and skew the results. I probably didn’t use the correct terminology, apologies, but the basic idea is that if you don’t intelligently understand the inputs to your math, you’re certain to not understand the outputs. Which in turn means SCA can’t be the magic black box technique that produces correct results regardless of the inputs given it; intelligence is still required for understanding. Reminds me of Babbage: “On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” 🙂

  3. Gespenst

    That technique looks to me like curve fitting, by trying a lot of different curve types.

    Am I being too simplistic?

  4. Briggs


    Something like that, in a weaker way.

  5. Bill Raynor

    If I understand your description, I start with a prior distribution of models, apply a scoring function and end with a posterior distribution of scores. How is that different from a plain old Bayesian analysis? (The score?)

  6. Briggs


    They don’t put priors on the models or combos of measures, hence an implied uniform. And the analysis of each is still frequentist.

  7. Thank you Mr Briggs, I have just found your presence on bitchute and hopped over hear (‘here’!)

    I bring a sophomoric interest in statistics at best, having at least covered some of the bases in my education, but I like asking questions (usually stimulated by what I am reading). I remember one time I tried my hand at running the Monte Casino method to estimate π based on the simple criterion ‘in[side the quarter circle] or out’? You take a random pair of numbers in the unit interval, square them and add them and declare an ‘in’ when the result is less than ‘1’ and an ‘out’ otherwise. Then after running your computer all day and all night you have an estimate:

    π ≈ 4 × #(in) / (#(in) + #(out))

    The exercise shows just how poor (a poorly designed) method can be, given its accuracy against the already known (much better) estimates arrived at more directly. (As I said .. I’m just ‘sophomoric’!)
