There is this idea in statistics called the “bootstrap.” The idea is not crazy, but it’s not quite right either.
Works like this: data comes in, and usually some parameterized probability (PP) model is fit to it. Focus, sadly, is on the parameters and discovering their “true” values. But since probability doesn’t exist, neither do parameters, and there can be no true values of what does not exist.
But never mind. Statisticians are parameter crazy and it’s tough talking them out of their obsession. The real problem is that, often, there isn’t a lot of data, but there is a ton of desire to fit PP models. Solution? Make more data up. Then fit the PPs on the made up data. Then pretend the fit on the made up data tells you something about the parameters.
How to make up data? Well, you don’t really. You just re-use the old data, resampled with replacement, while pretending the resampling makes it new. Like I said, it’s not crazy. After all, if all the causes of the original data remain the same, more or less, then it’s not impossible those causes would produce the resampled data.
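The resampling described above can be sketched in a few lines. The numbers and the choice of statistic (the mean) are my own toy illustration, not anything from a real analysis:

```python
import numpy as np

rng = np.random.default_rng(42)

# A small "original" dataset (toy numbers, purely illustrative).
data = np.array([2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9])

# The classic bootstrap: resample the data with replacement many times,
# recompute the statistic on each resample, and treat the spread of the
# recomputed statistics as the uncertainty in the parameter estimate.
B = 10_000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(B)
])

# A 95% "bootstrap confidence interval" for the mean.
lo, hi = np.quantile(boot_means, [0.025, 0.975])
print(f"sample mean: {data.mean():.2f}, bootstrap 95% CI: ({lo:.2f}, {hi:.2f})")
```

Note that every number in every resample already appeared in the original data; nothing new has been observed, which is the point made above.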
If the causes change, or you never got hold of all of them in the original data, then the method produces over-certainty. And even if you’ve mined all the causes in the original data, you don’t know how they were employed and with what frequency (if I may), because if you did, you’d model cause and not probability. This means the bootstrap is like trying to analyze a symphony from a brief passage as it echoes off some distant canyon wall.
How much better—this is going to be a rhetorical question—to use what data you have, fit whatever PP you like, “integrate out” the uncertainty in the non-existent parameters, and then treat the model in its predictive form, speaking only of what can be measured? In other words, take whatever data you can, and then make predictions of future data. If your predictions verify well, then you likely have a good model; if they don’t, then you know you have a lousy one.
This is called the Predictive Way, and the true path of science. It still uses models, because science is all about models, but it speaks of observables. Just like you do in Real Life when you have uncertainty in some thing but must make decisions on it.
Point is, the “bootstrap” (and similar methods) tries to eke out extra information, with rapidly diminishing returns, and all without any guarantee of veracity. In the end, you’ll still be left with a fistful of parameters and you still won’t know if your model is any good.
Yet if you’re determined to talk about your PP, and you want to improve the bootstrap, how could you do it?
Right: add “AI” to its name!
Or “machine learning”. As we find in the paper “Prediction-powered inference” by Anastasios Nikolas Angelopoulos (great name) and some others in Science. Abstract abstraction:
Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system. The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients [which are all parameters] without making any assumptions about the machine-learning algorithm that supplies the predictions. Furthermore, more accurate predictions translate to smaller confidence intervals…
Their idea is simple. Use “machine learning” to make predictive models conditioned on data, and then use the predictions as if they were “new” data to produce shinier P-values and confidence intervals.
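For the simplest case, estimating a mean, my reading of their scheme goes roughly like this: average the black-box predictions over a big unlabeled set, then correct that average with the prediction error measured on the small labeled set. Everything below (the sample sizes, the simulated data, the stand-in predictor `f`) is my own toy construction, not the paper’s:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (my own simulation): a small labeled set, a big unlabeled set.
n, N = 50, 5000
x_lab = rng.normal(0, 1, size=n)
y_lab = x_lab + rng.normal(0, 0.5, size=n)   # true mean of y is 0
x_unl = rng.normal(0, 1, size=N)

# Stand-in "machine learning" predictor: any black box f(x) will do.
def f(x):
    return 0.9 * x + 0.1   # deliberately a little biased, for illustration

# Prediction-powered estimate of the mean of y: average the predictions
# on the unlabeled data, then subtract the average prediction error
# measured on the labeled data.
theta_pp = f(x_unl).mean() - (f(x_lab) - y_lab).mean()

print(f"classical (labeled data only): {y_lab.mean():.3f}")
print(f"prediction-powered:            {theta_pp:.3f}")
```

The correction term keeps the estimate honest about the predictor’s bias; but notice the final product is still an estimate of a parameter, not a verified prediction of anything observable.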
So this is just the bootstrap, AI-ified. Which makes it sound like sexy science. (The paper itself confirms this, but it’s all math, and I wouldn’t want to shock you with equations.)
Funny thing about this is that Anastasios Nikolas Angelopoulos and his fellow researchers were so close to the right answer. They had just about touched Reality—but they veered away at the last second to grab onto their PPs.
They should have stopped with the “machine learning” predictions. Full stop. As in cease. Those predictions are the model.
We could then have verified those predictions against genuine new data and seen how good those models are.
Alas, it’s costly and time consuming to wait to make real tests on new data before announcing to the world your new model. Science must needs progress!
This is why the true Predictive Way is such a hard sell.