Kevin Gray is back with another question, this time about priors. His last led to the post “Was Fisher Wrong? Whether Or Not Statistical Models Are Needed.” (The answer was yes and no.)
Here’s his new one: “If our choice of priors substantially affects our coefficient estimates, this just means our sample was too small. After 25 years using Bayesian statistics, my answer is…”
Bruno de Finetti, as most of us know, shouted, in bold print and in Boomer all caps, PROBABILITY DOES NOT EXIST (ellipsis original):
PROBABILITY DOES NOT EXIST
The abandonment of superstitious beliefs about the existence of the Phlogiston, the Cosmic Ether, Absolute Space and Time, … or Fairies and Witches was an essential step along the road to scientific thinking. Probability, too, if regarded as something endowed with some kind of objective existence, is no less a misleading misconception, an illusory attempt to exteriorize or materialize our true probabilistic beliefs.
He was exactly perfectly beautifully succinctly correct in this.
True, there were many sad souls who did not believe him, and a scattering of kind folk who nodded along with him. But almost nobody understood him, not then, not now. If people had grasped the full implications of his simple statement, science wouldn’t be in the mess it’s in now.
Allow me to repeat: probability does not exist. If we knew this, really knew it, then we also would know it makes no sense to speak of “coefficient” or “parameter estimates”. Coefficients, a.k.a. parameters, are probability parameters, and since probability does not exist, neither do parameters, and since parameters do not exist, it makes no sense to speak of “estimating” them.
You cannot estimate what does not exist.
Believing we can, and therefore believing in probability, is what caused us to believe in our models, as if they were causal representations of Reality. This is why causal and semi-causal language about parameters (coefficients) saturates science discourse. It is all wrong.
Probability, like logic, is only a measure of uncertainty in propositions, given a set of assumptions. It is epistemic only. It has no physical existence; yes, I include the quantum world in this.
I haven’t forgotten the question about priors, which is answered like this.
Almost everybody picks for their analysis a parameterized probability model. This model will be ad hoc, chosen for convenience or custom, or by some vague hand-waving hope in some central limit theorem, which is mistaken as proof that probability exists (even if this were so, it would only be at infinity, which will never be reached).
Nothing has more effect on the outcome of the analysis than this ad hoc model. Often, even the data is not as important as the ad hoc model. Change the ad hoc model, change the analysis.
Enter the Bayesians, who not only write down an ad hoc model, but realize they must specify other ad hoc models for the uncertainty in the parameters of that model. This is a step in the right direction, a promotion over frequentism, a theory which insists probability exists, and therefore parameters also exist.
Bayesians are almost always frequentists at heart, just as all frequentists cannot help but interpret their analyses as Bayesians. The reasons are that Bayesians are all first trained as frequentists, and frequentist theory is incoherent; rather, it is impossible to use it in a manner coherent with theory. If you doubt, just ask any frequentist how their interpretation of their confidence interval accords with theory.
Being frequentists at heart, Bayesians fret that picking a prior, as your question suggests, is “informative”; that is, its choice affects the answer. It does. So does, and in a larger way, choosing the ad hoc model. But fretting is not done in model choice for some reason.
Anyway, great efforts are spent showing how little influence the priors have. It’s well enough, in a completist sense, and there is some mathematical fun to be had. But it’s beside the point, and doesn’t help answer which prior is best.
Here is what the best prior is in all circumstances. The one that, given the ad hoc model, makes the best predictions.
That turns out to be the same answer as what makes the best model. Amazing.
Only not so amazing when you consider probability doesn’t exist, and the whole point of modeling is to quantify uncertainty in some observable. The point of modeling shouldn’t be, but too often is, parameter “estimation”. Because you cannot estimate what does not exist.
In other words, and without going into the math which all who want to know already know, specify the ad hoc model and parameters and data, integrate out the parameters, and produce the so-called predictive probability distribution, i.e. the model, the complete whole model, the point of the exercise.
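For those who would rather see it than integrate it, here is a minimal sketch of "integrating out the parameters" in the one case where the integral is easy: a binomial ad hoc model with a Beta prior. The model, the function names, and the data are all my assumptions for illustration, not anything from Kevin's question. What comes out is a probability of the observable (future successes) with no parameter left in it.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def predictive_pmf(y, m, k, n, a=1.0, b=1.0):
    """Pr(y successes in m new trials | k successes in n past trials),
    with the binomial parameter integrated out under a Beta(a, b) prior."""
    a_post, b_post = a + k, b + (n - k)
    log_choose = lgamma(m + 1) - lgamma(y + 1) - lgamma(m - y + 1)
    return exp(log_choose
               + log_beta(a_post + y, b_post + m - y)
               - log_beta(a_post, b_post))

# Hypothetical data: the same observed rate (70%) at two sample sizes,
# each run under a flat Beta(1, 1) prior and a stronger Beta(10, 10) prior.
for k, n in [(7, 10), (700, 1000)]:
    flat = predictive_pmf(1, 1, k, n, a=1, b=1)
    strong = predictive_pmf(1, 1, k, n, a=10, b=10)
    print(f"n={n}: Pr(next success) flat={flat:.3f}, Beta(10,10)={strong:.3f}")
```

In this toy case the two priors disagree noticeably about the next observation at n=10 and hardly at all at n=1000. Note the comparison is made on the probability of the observable, not on any "estimate".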
Except in diagnosing model success or failure, ignore the “posteriors” on the parameters. Instead, vary the measure associated with the parameter and see how it changes the probability of the observable. For example, you want to know how changes in X change the probability of Y, then change X and watch the change in the probability of Y. Amazing.
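A rough sketch of "change X and watch the probability of Y", assuming an ad hoc logistic model, independent Normal(0, 2) priors, and a coarse grid standing in for the integral. Every one of those choices, and the data, is made up for illustration; the grid is an approximation, not the exact predictive distribution.

```python
from math import exp, log

# Made-up data for illustration only: x is some measure, y a 0/1 observable.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def grid(lo, hi, steps):
    """Evenly spaced points from lo to hi inclusive."""
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]

def posterior_grid(prior_sd=2.0):
    """Normalised weights over a grid of (b0, b1) for the logistic model
    with independent Normal(0, prior_sd) priors on both parameters."""
    points, logposts = [], []
    for b0 in grid(-6.0, 6.0, 61):
        for b1 in grid(-4.0, 4.0, 61):
            loglik = 0.0
            for x, y in zip(xs, ys):
                p = sigmoid(b0 + b1 * x)
                loglik += log(p) if y == 1 else log(1.0 - p)
            logprior = -0.5 * (b0 * b0 + b1 * b1) / prior_sd ** 2
            points.append((b0, b1))
            logposts.append(loglik + logprior)
    top = max(logposts)  # subtract the max before exponentiating, for stability
    weights = [exp(lp - top) for lp in logposts]
    total = sum(weights)
    return [(b0, b1, w / total) for (b0, b1), w in zip(points, weights)]

def predictive_prob(x_new, post):
    """Pr(Y = 1 | x_new, data): the parameters are summed out over the grid."""
    return sum(w * sigmoid(b0 + b1 * x_new) for b0, b1, w in post)

post = posterior_grid()
for x_new in (1.0, 2.0, 3.0, 4.0):
    print(f"x={x_new}: Pr(Y=1) = {predictive_prob(x_new, post):.3f}")
```

Nobody had to look at a posterior on b1 to learn how X bears on Y. Change `prior_sd` and rerun, and you see directly what the prior does to the thing you care about.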
Use the (whole) model to make predictions of observations never before seen or used in any way, and then see how well the (whole) model does against possible competitors (i.e. calculate skill). Either the (whole) model makes useful predictions or it doesn’t.
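One common way to calculate skill, sketched here under my own assumptions: score the predictive probabilities with the Brier score, and compare against a naive competitor that always announces the same base rate. The outcomes, the model's probabilities, and the 0.6 base rate are hypothetical numbers; in real use the outcomes must be ones the model never saw, and the base rate should come from earlier data.

```python
def brier(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)

def skill(model_probs, reference_probs, outcomes):
    """Brier skill score: positive means the model beats the reference,
    zero means a tie, negative means the reference did better."""
    return 1.0 - brier(model_probs, outcomes) / brier(reference_probs, outcomes)

# Hypothetical never-before-seen outcomes and two sets of predictions.
outcomes = [1, 0, 1, 1, 0]
model_probs = [0.8, 0.3, 0.6, 0.9, 0.2]   # from the (whole) model
reference_probs = [0.6] * len(outcomes)   # naive competitor: a constant base rate

print(f"skill vs. base rate: {skill(model_probs, reference_probs, outcomes):.3f}")
```

Swap in a different prior, or a different ad hoc model, and the one with the better skill on new observations wins. Other scores would serve as well; the Brier score is only the simplest to show.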
Simple as that. That’s old school science.