A primary justification for Bayesian probability is De Finetti’s representation theorem, which can be stated like this.
You are to observe a sequence of 0s and 1s, “failures” and “successes” if you like. These 0s and 1s will necessarily come to you in a certain order, and you want to quantify the probability that you witness this order.
If you assume that the order in which the failures and successes arrive does not matter—what matters is only the total number of successes (and failures)—and if this sequence is embedded in an infinite stream of failures and successes, then the probability distribution of the total number of successes can be represented as the integral of a binomial distribution with parameter θ weighted by a prior distribution over the possible values of θ.
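In symbols, and in its standard form: writing S_n for the number of successes in the first n observations, exchangeability of the infinite stream guarantees a prior measure π on [0,1] such that

$$ \Pr(S_n = k) = \binom{n}{k} \int_0^1 \theta^k (1-\theta)^{n-k}\, d\pi(\theta). $$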
Have all that? The assumption that the order doesn’t matter—called exchangeability in the parlance—is enough to prove both the existence of the binomial and its accompanying prior distribution. Ain’t that wonderful?
But it only works if there is an infinite stream of numbers coming at us. Let only a finite number arrive, and out goes the representation. We know this through the work of Persi Diaconis (with David Freedman), who discovered that approximate—not exact—representations can be had for finite data, but only if finite means very large.
If we only have one, two, or a small number of observations, then no representation theorems are possible. It’s not that we haven’t found them; it’s that they cannot be found, an important distinction.
However, we need not despair, because we can still get where we need to go by turning the problem around: by seeing that the fundamental problem is representing uncertainty in finite streams of data, not infinite ones. Once we have the answer, we can let our data grow large. We will discover that, in the limit, the binomial representation pops out naturally.
Thus it is the binomial that is the approximation of finite situations, not the other way around. How do we start?
In front of you lies a box which, you are told, contains N items, M of which may—or may not—be labeled “success”. This implies that the other items may be anything but successes: we have no information, for example, that the “non-successes” are all identical in nature. These are our 0s and 1s.
How many successes are in the box? You don’t know, but you can quantify your uncertainty. Using a simple principle of logical probability, the symmetry of individual constants (an axiom similar to the axiom of exchangeability), we can say that, given the evidence presented, the chance that there are no successes is the same as the chance that there is one, which is the same as the chance that there are two, and so on, up to N.
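In symbols, writing M for the unknown number of successes among the N items:

$$ \Pr(M = m \mid N) = \frac{1}{N+1}, \qquad m = 0, 1, \ldots, N. $$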
Suppose you take a handful of n items from the box, where n may be smaller than N. It turns out that, given the evidence we have, the probability distribution representing your uncertainty in the number of successes in your handful is the hypergeometric distribution.
Unlike the binomial, which has unobservable parameter θ, the hypergeometric deals only with what can be or has been observed. Its parameters are all numbers you have seen.
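A minimal sketch of this in Python with SciPy; the counts N = 20, M = 12, and n = 5 are invented for illustration, and since M is exactly what we do not know, the code conditions on one hypothesized value of it:

```python
# A sketch, assuming SciPy; N, M, and n are invented example numbers.
from scipy.stats import hypergeom

N = 20  # total items in the box
M = 12  # one hypothesized count of successes in the box
n = 5   # size of the handful drawn

# Chance of k successes in the handful of n, given N items of which M
# are successes; SciPy's argument order is (k, total, successes, draws).
for k in range(n + 1):
    print(k, hypergeom.pmf(k, N, M, n))
```

Every argument is a count you could, at least in principle, go and look at.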
In your hand now are a certain number of successes (call it k) and n − k non-successes, which is new evidence we can use to infer the probability that the remaining items in the box are also successes (or failures). We can work through the math and discover the representation of the probability distribution for the remaining items. This turns out to be a “beta-binomial” with fixed, observed, known parameters.
More data can be taken, and all the probability distributions can be updated systematically using just observed and known parameters.
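Here is a sketch of that predictive step, again with SciPy. Under the uniform assignment over the number of successes given above, the algebra lands on a beta-binomial whose parameters are the observed counts (the particular N, n, and k are invented):

```python
# A sketch, assuming SciPy; N, n, and k are invented example numbers.
from scipy.stats import betabinom

N = 20  # total items in the box
n = 5   # items drawn so far
k = 3   # successes seen among the n drawn

# Successes R among the remaining N - n items: beta-binomial with
# N - n trials and shape parameters k + 1 and n - k + 1, every one
# of them an observed, known number.
remaining = betabinom(N - n, k + 1, n - k + 1)
for r in range(N - n + 1):
    print(r, remaining.pmf(r))
```

Updating is then bookkeeping: draw more items, add the new successes to k and the new draws to n, and form the same distribution afresh.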
What’s interesting is that as you let N grow to infinity, the standard binomial, beta, and beta-binomial results of Bayesian statistics are recovered. But then, as now makes sense, the parameters of these distributions become unobservable.
In the finite case, the parameters were all known numbers, but in the infinite case we have to wait until—well, we have to wait until an infinite number of observations have arrived before we can claim to have observed all the facts.
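You can watch the limit arrive numerically. A sketch (same invented n and k, same uniform assignment) comparing the finite-box probability that the success fraction M/N is at most one-half with the corresponding probability under the Beta posterior of the standard infinite-data analysis:

```python
# A sketch of the limit, assuming SciPy; n and k are invented numbers.
from scipy.stats import betabinom, beta

n, k = 5, 3  # draws and successes observed

for N in (20, 200, 2000, 20000):
    # Finite box: M = k + R, with R beta-binomial over the N - n left,
    # so P(M/N <= 1/2) = P(R <= N/2 - k).
    remaining = betabinom(N - n, k + 1, n - k + 1)
    p_finite = remaining.cdf(N / 2 - k)
    # Infinite stream: theta ~ Beta(k + 1, n - k + 1).
    p_infinite = beta(k + 1, n - k + 1).cdf(0.5)
    print(N, p_finite, p_infinite)
```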
In the finite case, given the evidence and previous observations, the probability of future observations is always more spread out—it is more uncertain—than it would be if you assumed you would have an infinite amount of data. And since we never will see an infinite amount of data, the standard results make us more certain than is warranted.
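The extra spread is easy to exhibit. Reading the infinite-data case as a binomial with θ pinned at its posterior mean (a gloss on the standard plug-in result, not a formula from the paper), compare variances:

```python
# A sketch, assuming SciPy; m, a, and b are invented example numbers.
from scipy.stats import betabinom, binom

m = 15           # future observations to predict
a, b = 4, 3      # k + 1 and n - k + 1 from the example above
p = a / (a + b)  # plug-in probability, as if theta were known exactly

finite = betabinom(m, a, b)  # predictive honest about the unknown theta
plug_in = binom(m, p)        # predictive pretending theta is known

print(finite.mean(), plug_in.mean())  # identical centers...
print(finite.var(), plug_in.var())    # ...but the finite case is wider
```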
That’s the story in 750 words, but if you want to read more, and delve into the math, you can download this preprint. It’s a paper my friend Russ Zaretzki and I wrote; it was rejected (by The American Statistician) for “poor writing,” a damning criticism for a paper meant to appear in the “Teaching” section.
This shows you that peer review sometimes works. Because the paper is poorly written. We’re having another go at cleaning up the notation, which proliferated rather profusely.