I was asked to comment on a post by Dan Simpson exploring the Bernstein-von Mises theorem.
This post fits in with the Data Science class, a happy coincidence, and has been so categorized.
A warning. Do not click on the link to a video by Diamanda Galas. I did. It was so hellishly godawful that I have already scheduled a visit with my surgeon to have my ear drums removed so that not even by accident will I have to listen to this woman again.
Now the Bernstein-von Mises theorem says, loosely and with a list of caveats given by Simpson, that for a parameterized probability model and a given prior, the posterior on the parameter converges (in probability) to a multivariate normal centered around the “true” parameter, with a covariance matrix that is an inverse function of n and of the Fisher Information Matrix.
It doesn’t matter here about the mathematical details. The rough idea is that, regardless of the prior used but supposing the caveats are met, the uncertainty in the parameter becomes like the uncertainty a frequentist would assess of the parameter. Meaning Bayesians shouldn’t feel too apologetic around frequentists frightened that priors convey information. It’s all one big happy family out at The Limit.
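For readers who want to see the rough idea in numbers, here is a small sketch. It assumes a simple 0/1 model with a Beta prior on the parameter (my choice for illustration, as are the two priors and the 60% proportion of 1s), and compares the posterior's standard deviation with the one the theorem points to, the square root of 1/(m I(p)), where I(p) = 1/(p(1-p)) is the Fisher information for one 0/1 observation.

```python
from math import sqrt

def beta_posterior_sd(a, b, k, m):
    """Posterior sd of the parameter under a Beta(a, b) prior, after k 1s in m draws."""
    alpha, beta = a + k, b + (m - k)
    s = alpha + beta
    return sqrt(alpha * beta / (s * s * (s + 1)))

def bvm_sd(k, m):
    """Limiting sd: sqrt(1 / (m * I(p))) with I(p) = 1 / (p * (1 - p)), p = k / m."""
    p = k / m
    return sqrt(p * (1 - p) / m)

# Hypothetical data: 60% 1s at every sample size; two illustrative priors.
for m in (10, 100, 10_000):
    k = round(0.6 * m)
    flat = beta_posterior_sd(1, 1, k, m) / bvm_sd(k, m)
    strong = beta_posterior_sd(20, 5, k, m) / bvm_sd(k, m)
    print(m, round(flat, 3), round(strong, 3))
```

Both ratios drift toward 1 as m grows, which is the happy family at The Limit. At small m the two priors plainly disagree with each other and with the limiting value.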
There is no Limit, though. It doesn’t exist for actual measures.
You know what else doesn’t exist? Probability. Things do not “have” probabilities or probability distributions (we should never say “This has a normal distribution”). It is only our uncertainty that can be characterized using probability (we should say “Our uncertainty in this is quantified by a normal”). And also non-existent, as a deduction from that truth, are parameters. Since parameters don’t exist, there can’t be a “true” value of them. Yet parameters are everywhere used and they are (or seem to be) useful. So what’s going on?
Recall our probabilistic golden rule: all probability is conditional on the assumptions made, believed, deduced or measured.
Probability can often be deduced. The ubiquitous urn models are good examples. In an urn are n_0 0s and n_1 1s. Given this information (and knowledge of English grammar and logic), the chance of drawing a 1 is n_1/(n_1+n_0). In notation:
(1) Pr(1 | D) = n_1/(n_1+n_0),
where D is a joint proposition containing the information just noted (the deduction is sound and is based on the symmetry of logical constants where there is no need to talk of drawing mechanisms, randomness, fairness, or whatever; see Uncertainty for details).
If all (take the word seriously) we know of the urn are its constituents, we have all the probability we need in (1). We are done. Oh, we can also deduce answers to questions like, “What are the chances of seeing 7 1s given we have already taken such-and-such from the urn.” But the key is that all is deduced.
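A minimal sketch of these deductions, using exact fractions. The urn contents (3 1s and 7 0s) are made-up numbers for illustration, and the second function assumes balls are taken without replacement, so to condition on balls already removed you just pass in what is left.

```python
from fractions import Fraction
from math import comb

def pr_one(n1: int, n0: int) -> Fraction:
    """Pr(1 | D): chance the next ball is a 1, given the urn's known contents."""
    return Fraction(n1, n1 + n0)

def pr_k_ones(k: int, draws: int, n1: int, n0: int) -> Fraction:
    """Chance of exactly k 1s among `draws` balls taken without replacement."""
    return Fraction(comb(n1, k) * comb(n0, draws - k), comb(n1 + n0, draws))

# Hypothetical urn: 3 1s and 7 0s.
print(pr_one(3, 7))
print(pr_k_ones(2, 4, 3, 7))
```

Nothing here is estimated. Every number follows from the stated contents of the urn.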
So what if we don’t know how many 0s and 1s there are, but we still want:
(2) Pr(1 | U) = ?,
where U means we know there are 1s and 0s but we don’t know the proportion (plus, as usual and forever, we also in U know grammar, logic, etc.). Well, it turns out the answer is still deducible as long as we assume a value n = n_1 + n_0 exists. We don’t even need to know it, really; we just need to assume it is less than infinity. Which it will be. No urn contains an infinite number of anything. Intuitively, since we have no information on n_1 or n_0, except that they must be finite, we can solve (2). The answer is 1/2. (Take a googol to the googol-th power a googol times; this number will be finite and bigger than you ever need.)
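Here is one way to check the 1/2 by brute force. The sketch assumes the symmetry argument is read as giving equal weight to each possible count n_1 = 0, 1, ..., n (that reading is my assumption for the code), and then averages the answer from (1) over those counts.

```python
from fractions import Fraction

def pr_one_unknown(n: int) -> Fraction:
    """Pr(1 | U) for an urn of n balls when no count n1 = 0..n is privileged."""
    weight = Fraction(1, n + 1)  # assumption: equal weight on each possible n1
    return sum(weight * Fraction(n1, n) for n1 in range(n + 1))

for n in (1, 2, 10, 1000):
    print(n, pr_one_unknown(n))
```

The value of n drops out, which is why we only need to assume it is finite.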
As above, we can ask any kind of question, e.g. “Given I’ve removed 18 1s and 12 0s, and I next grab out 6 balls, what is the chance at least 3 will be 1s?” The answer is deducible; no parameter is needed.
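The 18-and-12 question can be worked out directly. This sketch assumes an urn of n balls, equal initial weight on every composition n_1 = 0..n, and draws without replacement; the function names and the trial values of n are mine.

```python
from fractions import Fraction
from math import comb

def predictive(j, r, k, m, n):
    """Pr(j 1s in the next r draws | k 1s seen in m draws, urn of n balls).

    Assumes equal initial weight on each composition n1 = 0..n and
    drawing without replacement.
    """
    num = den = 0
    for n1 in range(k, n - (m - k) + 1):
        w = comb(n1, k) * comb(n - n1, m - k)  # proportional to the weight on n1 after the sample
        den += w
        # chance of j 1s among r balls drawn from what remains in the urn
        num += w * Fraction(comb(n1 - k, j) * comb(n - n1 - (m - k), r - j),
                            comb(n - m, r))
    return num / den

def at_least(jmin, r, k, m, n):
    """Chance that at least jmin of the next r draws are 1s."""
    return sum(predictive(j, r, k, m, n) for j in range(jmin, r + 1))

# n must be at least 30 + 6 = 36 for the question to make sense.
for n in (36, 100, 1000):
    print(n, at_least(3, 6, 18, 30, n))
```

Under these assumptions the printed answer is the same for every n, and no θ appears anywhere in the calculation.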
So what if we don’t know n? It turns out not to be too important after all. We can still deduce all the probabilities we want, as long as n is finite. What if n is infinite, though? Well, it can’t be. But what if we assume it is?
We have left physical reality and entered the land of math. We could have solved the problem above for any value of n short of infinity, and since we can let n be very large indeed, this is no limitation whatsoever. Still, as Simpson rightly says, asymptotic math is much easier than finite, so that if we’re willing to close our eyes to the problem of infinitely sized urns, maybe we can make our lives computationally easier.
Statistics revolves around the idea of “samples” taken from “populations”. Above, when n was finite, our population was finite, and we could deduce probabilities for the remaining members of the population given we’ve removed so much sample. (Survey statistics is careful about this.)
But if we assume an infinite population, no matter how big a sample we remove, we always have an infinite number left. The deductions we produced above won’t work. But we still want to do probability—without the hard finite population math (and it is harder). So we can try this:
(3) Pr(1 | Inf) = θ,
where Inf indicates we have an infinite urn and θ is a parameter. The value of this parameter is unknown. What exactly is this θ? It isn’t a probability in any ontological sense, since probabilities don’t exist. It’s not a physical measure as n_1/(n_1+n_0) was, because we don’t know what n_1 and n_0 are except that there are an infinite number of each of them and, anyway, we can’t divide infinities so glibly. (The probability in (1) is not the right hand side; it is the left hand side. The right hand side is just a number!)
The answer is that θ isn’t anything. It’s just a parameter, a placeholder. It’s a blank spot waiting to be filled. We cannot provide any answers to (3) (or questions like those above based on it) until we make some kind of statement about θ. If you have understood this last sentence, you have understood all. We are stuck, the problem is at a dead end. There is nowhere to go. If somebody asks, “Given Inf, what is the probability of a 1?” all you can say is “I do not know” because you understand saying “θ” is saying nothing.
Bayesians of course know they have to make some kind of statement about θ, or the problem stops. But there is no information about θ to be had. In the finite-population case, we were able to deduce the probability because we knew n_1 could equal 0, 1, …, n, with the corresponding adjustments made to n_0, i.e. n, n-1, …, 0. No combination (this can be made more rigorous) was privileged over any other, and the deduction followed. But when the population is infinite, it is not at all clear how to specify the breakdowns of n_1s and n_0s in the infinite urn; indeed, there are an infinite number of ways to do this. Infinities are weird!
The only possible way out of this problem is to do what the serial writer of old did: with a mighty leap, Jack was free of the pit! An ad hoc judgment is made. The Bayesian simply makes up a guess about θ and places it in (3). Or not quite, but that would work and would give us
(4) Pr(1 | Inf; θ = 0.5) = θ (= 0.5).
Hey, why not? If probability is subjective, which it isn’t, then probability can equal anything you feel. Feelings…whoa-oh-a feelings.
No, what the Bayesian does is invoke outside evidence, call it E, which sounds more or less scientific or mathematical, and information about θ, now called the prior, is given. The problem is then solved, or rather it is solvable. But it’s almost never solved.
The posterior is not the end
Having become so fascinated by θ, the statistician cannot stop thinking of it, and so after some data is taken, he updates his belief about θ and produces the posterior. That’s where we came in: at the end.
This posterior will, given a sufficient sample and some other caveats, look like the frequentist point estimate and its confidence interval. Frequentists are not only big believers in infinity, they insist on it. No probability can be defined in frequentist theory unless infinite samples are available. Never mind. (Frequentism always fails in finite reality.)
You know what happens next? Nothing.
We have the posterior in hand, but so what? Does that say anything about (3)? No. (3) was what we wanted all along, but we forgot about it! In the rush to do the lovely (and it is) math about priors and posteriors we mislaid our question. Instead, we speak solely about the posterior (or point estimate). How embarrassing. (Come back Monday for more on this subject.)
Well, not all Bayesians forget. Some take the posterior and use it to produce the answer to (3), or what turns out to be the modification of (3), and what is called the posterior predictive distribution.
(5) Pr(1 | Inf; belief about θ) = some number.
Here is the funny part, at least for this problem. If we say, as many Bayesians do say, that θ is equally likely to be any number between 0 and 1 (a “flat” prior), then the posterior predictive distribution is exactly the same as the answer for (1).
That’s looking at it the wrong way around, though. What happens is that if you take (1) (take it mathematically, I mean) and let n go to infinity in a straightforward way, you get the posterior predictive distribution of (3) (but only with a “flat” prior).
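A quick check of the match, for the chance that the next ball is a 1. The sketch assumes the flat prior is Beta(1, 1), so that after k 1s in m draws the predictive is the posterior mean (k+1)/(m+2), and it compares that with the finite-urn deduction under equal weight on each composition n_1 = 0..n. The sample of 18 1s in 30 draws is reused purely for illustration.

```python
from fractions import Fraction
from math import comb

def flat_prior_predictive(k: int, m: int) -> Fraction:
    """Pr(next is 1 | Inf, flat prior, k 1s in m draws): mean of Beta(1 + k, 1 + m - k)."""
    return Fraction(1 + k, 2 + m)

def finite_predictive(k: int, m: int, n: int) -> Fraction:
    """Same question for a finite urn of n > m balls, equal weight on n1 = 0..n."""
    # Weight on each n1 after the sample; comb() is 0 for impossible compositions.
    weights = {n1: comb(n1, k) * comb(n - n1, m - k) for n1 in range(n + 1)}
    total = sum(weights.values())
    return sum(Fraction(w, total) * Fraction(n1 - k, n - m)
               for n1, w in weights.items())

print(flat_prior_predictive(0, 0))      # no data at all
print(flat_prior_predictive(18, 30))
for n in (31, 100, 1000):
    print(n, finite_predictive(18, 30, n))
```

With no data the flat-prior answer is the 1/2 of (2), and with data the finite urn already gives the flat-prior number at every n tried, before any limit is taken.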
So, at least in this case, we needn’t have gone to the bother of assuming an infinite urn, since we had the right answer before. Other problems are more complex, and insufficient attention has been paid to the finite math, so we don’t have answers in every problem. Besides, it’s easier to assume an infinite-parameter based model and work out that math.
Assuming there is an infinite population, not only of 1s and 0s, but for any statistical problem, is what leads to the false belief that “true” values of parameters exist. This is why people will say “X has a distribution”. Since folks believe true values of parameters exist, they want to be careful to guess what they might be. That’s where the frequentist-Bayesian interpretation wars enter. Even Bayesians joust with each other over their differing ad hoc priors.
It should be obvious that, just as assuming a model changes the probability of the observable of interest (like balls in urns), so does changing the prior for a fixed model change the probability of the observable. Of course it does! And should. Because all probability is conditional on the assumptions made; our golden rule. Change the assumptions, change the probability.
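To put numbers on it, here is a small sketch with the model held fixed and only the prior changed. It assumes Beta(a, b) priors, for which the chance that the next ball is a 1, after k 1s in m draws, is (a+k)/(a+b+m). The three priors and the 18-of-30 sample are illustrative choices of mine, not anything canonical.

```python
from fractions import Fraction

def predictive_next_one(a, b, k: int, m: int) -> Fraction:
    """Pr(next is 1 | Beta(a, b) prior, k 1s in m draws) = (a + k) / (a + b + m)."""
    a, b = Fraction(a), Fraction(b)
    return (a + k) / (a + b + m)

# Hypothetical priors, same data each time: 18 1s in 30 draws.
priors = {
    "flat Beta(1, 1)": (1, 1),
    "Beta(1/2, 1/2)": (Fraction(1, 2), Fraction(1, 2)),
    "Beta(10, 1)": (10, 1),
}
for name, (a, b) in priors.items():
    print(name, predictive_next_one(a, b, 18, 30))
```

Same data, same model, three different probabilities for the observable, which is just the golden rule at work.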
There is no sense whatsoever that a “noninformative” prior can exist. All priors by design convey information. To say the influence of the prior should be unfelt is like saying there should be married bachelors. It makes no logical sense. There isn’t even any sense that a prior can be “minimally” informative. To be minimally informative is to keep utterly quiet and say nothing about the parameter.
If there is any sense that a “correct” prior exists, or a “correct” model for that matter, it is in the finite-deducible sense. We start with an observable that has known finite and discrete measurement qualities, as all real observables do, and we deduce the probabilities from there. We then imagine we have an infinite population, as an augmentation of finite reality, and we let the sample go to infinity. This will give an implied prior, posterior, and predictive distribution which we can compare against the correct finite sample answer.
But if we had the correct finite sample answer, why use the infinite approximation? Good question. The only answer is computational ease. Good answer, too.
Even though it might not look it, this little essay is in answer to Simpson. I’m answering the meta-question behind the details of the Bernstein-von Mises theorem, the math of which nobody disputes. As always, it’s the interpretation that matters. In this case, we can invert the BvM theorem and use it to show how far wrong frequentist point-estimates are. After all, frequentist theory can be seen as the infinite-approximation method to Bayesian problems, which themselves are, when using parameters, infinite-population approximations to finite reality. Frequentist methods are therefore a double approximation, which is another reason they tend to produce so much over-certainty.
What I haven’t talked about, and what there isn’t space for, are these so-called infinite dimensional models, where there are an infinity of parameters. I’ll just repeat: infinity is weird.
I know Diamanda Galas.
Diamanda Galas was a friend of mine.
And you, sir, are no Diamanda Galas.
[OK, her music is pretty ‘out there’. But this has been her shtick since forever. Basically comes from riffs on ecstatic Greek and Middle-Eastern music she’s heard since childhood, with a bow to post-avant garde vocal noodling. She figured out that what she produces scares the living bejeezus out of most people, and she’s gone with it. She casts it as lament music, wailing music. If you’ve ever heard women wailing at a Middle-Eastern funeral, it’s reminiscent. And the ‘lament’ side is real for her – her brother died young of AIDS (see below).
You have to admit that even the 10 minute YouTube is very…. well, emotional. It has an impact. And in real life, Diana is a pretty interesting person. Greek Orthodox – probably still devout. I remember a conversation where she remarked to me that without her faith, “you’d go crazy.” At 14, she made her orchestral debut with the San Diego Symphony as the soloist for Beethoven’s Piano Concerto No. 1. Her brother was homosexual and died of AIDS back in the day, which devastated her. And if you do click on the link (with the sound down, if you prefer), you’ll see a pretty darn well-preserved 62-year-old.
Still touring, still getting work, still doing what she likes and needs to do – my hat’s off to her.]
“So what if we don’t know how many 0s and 1s there are, but we still want:
(2) Pr(1 | U) = ?,
where U means we know there are 1s and 0s but we don’t know the proportion (…). Well, it turns out the answer is still deducible as long as we assume a value n = n_1 + n_0 exists. (…)The answer is 1/2.”
How in the world can that be right?
Olivier Paradis Béland:
The ‘trick’, or thing to note, is that we have stated that ALL we know is that there is a finite number of balls, which can take ONLY 2 states. Then the probability is assigned/deduced via the “statistical syllogism”.
site:wmbriggs.com “statistical syllogism”
You’ll find, e.g., https://www.wmbriggs.com/post/7942/:
“Donald Williams proposed the label Statistical Syllogism, an example of which is this. Premises: “There is a n-sided object, just one side of which is labeled ‘Q’; the object will be tossed and only one side can show.” Conclusion: “A ‘Q’ will show.” The objective deduction (given the premises) is the probability the conclusion is true is 1/n.”
Which is to say, since in our ‘urn’ case n = (the total number of possible states, 0 or 1) = 2, and that’s ALL we know (by our premises), we deduce the probability 1/2.
By the way — Matt: awesome post.
I sure hope that people who think about things on public transportation will think about this post.
I was having a discussion elsewhere about this, where it was pointed out the asymptotics are more supportive (about parameters) than it would seem from the BvM theorem alone. I said:
Obviously, I’m not as worried about asymptotics, however interesting they are mathematically, or for proving what happens to finite non-parameterized models at the limit. I’m perfectly happy with the conditional description of the model (not forgetting the parameterized model is nearly always ad hoc and much more informative than any prior).
Update: statistical syllogism derived in Uncertainty.
After listening to Diamanda Galas perform in several YouTubes, and listening to an interview of her in the 1990s, I have concluded that she is a woman with a very dark soul. Her talent on the keyboard is obvious, but she seems to stick to dark music, and highlights it with soul-rending vocals that sound like either a barmaid complaining about tips or Satan himself. I don’t fear her. I just feel sorry for someone with seemingly no joy in their work.