Statistics

Making Up Data To Prove Models?

There is this idea in statistics called the “bootstrap.” The idea is not crazy, but it’s not quite right either.

Works like this: data comes in, and usually some parameterized probability (PP) model is fit to it. Focus, sadly, is on the parameters and discovering their “true” values. But since probability doesn’t exist, neither do parameters, and there can be no true values of what does not exist.

But never mind. Statisticians are parameter crazy and it’s tough talking them out of their obsession. The real problem is that, often, there isn’t a lot of data, but there is a ton of desire to fit PP models. Solution? Make more data up. Then fit the PPs on the made-up data. Then pretend the fit on the made-up data tells you something about the parameters.

How to make up data? Well, you don’t really. You just re-use the old data, shuffled around a bit, while pretending the shuffle makes it new. Like I said, it’s not crazy. After all, if whatever caused the original data, all the causes, remains the same, more or less, then it’s not impossible those causes would produce the shuffled data.

If the causes change, or you never got hold of all of them in the original data, then the method produces over-certainty. And even if you’ve mined all the causes in the original data, you don’t know how they were employed and with what frequency (if I may), because if you did, you’d model cause and not probability. This means the bootstrap is like trying to analyze a symphony from a brief passage as it echoes off some distant canyon wall.
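To make the mechanics concrete, here is a minimal sketch of the classical bootstrap; the data and the choice of “parameter” (the mean) are illustrative, not from any real analysis:

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Classical bootstrap: resample the observed data with
    replacement, refit, and collect the estimates (here, the mean)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # "New" data, made entirely from the old data
        resample = [rng.choice(data) for _ in data]
        estimates.append(statistics.mean(resample))
    return estimates

data = [2.1, 3.4, 1.9, 4.0, 2.8, 3.1]
ests = bootstrap_means(data)
# The spread of `ests` is then read as uncertainty in the parameter,
# though it only reflects reshuffles of what was already observed.
```

Every “new” dataset is drawn, with replacement, from the same six numbers; the spread of the estimates can never know about causes absent from the original sample.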

How much better—this is going to be a rhetorical question—to use what data you have, fit whatever PP you like, “integrate out” the uncertainty in the non-existent parameters, and then treat the model in its predictive form, speaking only of what can be measured? In other words, take whatever data you can, and then make predictions of future data. If your predictions verify well, then you likely have a good model; if they don’t, then you know you have a lousy one.

This is called the Predictive Way, and the true path of science. It still uses models, because science is all about models, but it speaks of observables. Just like you do in Real Life when you have uncertainty about something but must make decisions anyway.
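A sketch of what the predictive form amounts to in practice (the numbers and the deliberately trivial model are made up for illustration; any PP model could sit in the same slot):

```python
import statistics

# Observed data: fit whatever model you like on it...
train = [2.1, 3.4, 1.9, 4.0, 2.8, 3.1]
# ...but judge the model only by its predictions of new observables.
new_obs = [3.0, 2.5, 3.6]

# A deliberately trivial "model": predict every new value as the
# training mean. The point is not the model but the test of it.
prediction = statistics.mean(train)

# Verification: how far off were the predictions of real, measurable
# things? Large error -> lousy model; small error -> likely a good one.
mae = statistics.mean(abs(y - prediction) for y in new_obs)
```

No talk of parameters is needed at the verification step: only the predicted observable against the measured observable.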

Point is, the “bootstrap” (and similar methods) try to eke out extra information, with rapidly diminishing return on value, and all without any guarantee of veracity. In the end, you’ll still be left with a fistful of parameters and you still won’t know if your model is any good.

Yet if you’re determined to talk about your PP, and you want to improve the bootstrap, how could you do it?

Right: add “AI” to its name!

Or “machine learning”. As we find in the paper “Prediction-powered inference” by Anastasios Nikolas Angelopoulos (great name) and some others in Science. Abstract abstraction:

Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system. The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients [which are all parameters] without making any assumptions about the machine-learning algorithm that supplies the predictions. Furthermore, more accurate predictions translate to smaller confidence intervals…

Their idea is simple. Use “machine learning” to make predictive models conditioned on data, and use the predictions as if they were “new” data to produce shinier P-values.
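For a quantity as simple as a mean, the construction the abstract describes reduces to roughly the following sketch (illustrative numbers only; `f` stands for whatever machine-learning predictor supplies the predictions): average the predictions on the big unlabeled set, then subtract the predictor’s average error as measured on the small labeled, gold-standard set.

```python
import statistics

def ppi_mean(preds_unlabeled, preds_labeled, y_labeled):
    """Prediction-powered estimate of a mean (a sketch):
    average the ML predictions on the large unlabeled set,
    then subtract the 'rectifier', i.e. the predictor's average
    error measured on the small gold-standard labeled set."""
    naive = statistics.mean(preds_unlabeled)
    rectifier = statistics.mean(
        f - y for f, y in zip(preds_labeled, y_labeled)
    )
    return naive - rectifier

# Illustrative numbers only:
preds_big = [2.9, 3.1, 3.0, 3.2, 2.8]  # f(X) on unlabeled data
preds_gold = [3.0, 2.9]                # f(X) on labeled data
y_gold = [3.2, 3.1]                    # the true labels
est = ppi_mean(preds_big, preds_gold, y_gold)
```

The predictions do the heavy lifting, and the small labeled set only “rectifies” them, all in service of a tighter interval around a parameter.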

So this is just the bootstrap, AI-ified. Which makes it sound like sexy science. (The paper itself confirms this, but it’s all math, and I wouldn’t want to shock you with equations.)

Funny thing about this is that Anastasios Nikolas Angelopoulos and his fellow researchers were so close to the right answer. They had just about touched Reality—but they veered away at the last second to grab onto their PPs.

They should have stopped with the “machine learning” predictions. Full stop. As in cease. Those predictions are the model.

We could then have verified those predictions against genuine new data and seen how good those models are.

Alas, it’s costly and time consuming to wait to make real tests on new data before announcing to the world your new model. Science must needs progress!

This is why the true Predictive Way is such a hard sell.



9 replies »

  1. I can attest, from working at a Russell Group University in the IT Department, that the people who come up with this stuff really are dumb enough to believe what they’re making up. The number of times I remoted into a computer and “recovered” all the mysteriously “lost” email by clicking on the column header to change the sort order boggles the mind.

  2. At the GLOBALISTICS 2023 Congress, a report by MSU scientists, “Reconsidering the Limits to Growth,” was presented on 14.11.2023.

    The report, edited by V.A. Sadovnichy, A.A. Akaev, I.V. Ilyin, S.Y. Malkov, L.E. Grinin, and A.V. Korotaev, was prepared in 2020–2022 as part of the development program of the Interdisciplinary Scientific and Educational School of Moscow State University “Mathematical Methods for the Analysis of Complex Systems” and the implementation of a grant from the Russian Science Foundation.

    The report was discussed at the meetings of the Club of Rome in 2021-2022, received the approval of full members of the Club of Rome and was published under the auspices of the Russian Association for the Promotion of the Club of Rome.

    In October 2023, Springer published the report.

  3. Patient thanks for that; it confirms the NWO thesis of interlocking
    cooperation of NWO principals Biden-Putin-Xi+Modi. Modi is somewhat on
    the periphery, providing a testbed for various biological and
    agricultural interventions.

  4. Looks like he’s been resurrected:

    Why are Indians demanding Bill Gates’ Arrest?
    https://www.desiblitz.com/content/why-are-indians-demanding-bill-gates-arrest

    Why Are Indians So Angry at Bill Gates?
    https://thediplomat.com/2021/06/why-are-indians-so-angry-at-bill-gates/

    All is forgiven…
    Microsoft co-founder Bill Gates to visit India next week for first time since pandemic
    https://www.businesstoday.in/technology/news/story/microsoft-co-founder-bill-gates-to-visit-india-next-week-for-first-time-since-pandemic-371194-2023-02-23

    Bill Gates: ‘India is not just a beneficiary of new breakthroughs, but an innovator of them’
    https://indianexpress.com/article/india/bill-gates-india-is-not-just-a-beneficiary-of-new-breakthroughs-but-an-innovator-of-them-8476560/

  5. “That CO2 stuff is dangerous! It’s changing the climate!”
    “Yeah? That’s not what the measurements show.”
    “Then the measurements must be wrong. Let’s make a model and then use THAT to, uh, “correct” the measurements!”
    “How much “correction” do you need?”
    “Just make the correction data a linear function of the CO2 concentration.”
    “You think anyone is going to fall for that? I mean, isn’t that just TOO obvious?”

    https://web.archive.org/web/20230115002156/https://realclimatescience.com/wp-content/uploads/2020/10/USHCN-Average-Temperature-Adjustments-Final-Minus-Raw-vs.-Atmospheric-CO2-1.png

  6. The paper suggests a method to correct the prediction/estimation bias of a machine learning model when applied to a large unlabeled dataset where only X variables are observed. The correction (rectifier) is estimated by utilizing a so-called small gold-standard dataset.

    No resampling is involved, hence no bootstrap at all.

    Briggs, please imagine what the authors would say if they read this post. @#$%^&!
