The going started getting tough last Chapter. It doesn’t get any easier here. But stick with it, because once you finish with this Chapter you will see the difference between classical/Bayesian and modern statistics.
Here is the gist:
- Start with quantifying your uncertainty in an observable using a probability distribution
- The distribution will have parameters which you do not know
- Quantify your uncertainty in the parameters using probability
- Collect observable data, which will give you updated information about the parameters, which you still do not know and which still have to be quantified by a probability distribution
- Since you do not care about the parameters, and you do care about future observables, you quantify your uncertainty in these future observables given the uncertainty you still have in the parameters (through the information present in the old data).
If you stop at the parameters, step 4, then you are a regular Bayesian, and you will be too certain of yourself.
This Chapter shows you why. The computer code mentioned in the homework won’t be on-line for a week or so. Again, some of you won’t be able to see all Greek characters, and none of the pictures are given. You have to download the chapter. Here is the link.
Estimating and Observables
1. Binomial estimation
In the 2007-2008 season, the Central Michigan football team won 7 out of 12 regular season games. How many games will they win in the 2008-2009 season? In Chapter 4, we learned to quantify our uncertainty in this number using a binomial distribution, but we assumed we knew p, the probability of winning any single game. If we do not know p, we can use the old data from last season to help us make a guess about its value. It helps to think of this old data as a string of wins and losses. So, for the old x, we saw x_1 = 0, x_2 = 1, . . . , x_12 = 1, which we can summarize by k = Σ_i x_i, where k = 7 is the total number of wins in n = 12 games.
Here’s the binomial distribution written with an unknown parameter
(see the book)
where θ is the success parameter and k the number of successes we observed out of n chances.
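For readers without the book to hand, the standard binomial form is presumably what is meant here. The symbols follow the surrounding text; the book's typeset version may be arranged differently.

```latex
p(k \mid n, \theta, E_B) = \binom{n}{k}\, \theta^{k} (1 - \theta)^{n - k},
\qquad k = 0, 1, \ldots, n
```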
How do we estimate θ? Two ways again, a classical and a modern. The classical consists of picking some function of the observed data and calling it θ̂, and then forming a confidence interval. In R we can get both at once with this function
where you will see, among other things (ignore those other things for now),
95 percent confidence interval:
probability of success
This means that θ̂ = 7/12 ≈ 0.58, so again the estimate is just the arithmetic mean. The 95% confidence interval is 0.28 to 0.84. Easy. This confidence interval has the same interpretation as the one for μ, which means you cannot say there is a 95% chance that θ is in this interval. You can only say, “either θ is in this interval or it is not.”
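The function call itself is not reproduced in this text. Output lines of that shape are what R's `binom.test` prints, so a plausible reconstruction, assuming that is the function meant, looks like this (`res` is just a name chosen for the sketch):

```r
# Assumed reconstruction: exact binomial test for 7 wins in 12 games.
res <- binom.test(x = 7, n = 12)

res$estimate   # sample proportion, labelled "probability of success"
res$conf.int   # exact 95 percent confidence interval
```

Printing `res` on its own shows the full report, including the other things the text says to ignore for now.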
Here is Bayes’s theorem again, written as functions like we did for the normal distribution
(see the book)
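Written out with the terms the next paragraph names, Bayes's theorem for this problem takes the usual form below. This is a sketch in the notation of the surrounding text, not a copy of the book's typeset equation.

```latex
p(\theta \mid k, n, E_B) =
  \frac{p(k \mid n, \theta, E_B)\; p(\theta \mid E_B)}{p(k \mid n, E_B)}
```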
We know p(k|n, θ, E_B) (this is the binomial distribution), but we need to specify p(θ|E_B), which describes what we know about the success parameter before we see any data, given only E_B (p(k|n, E_B) will pop out using the same mathematics that gave us p(x|E_N) in equation (17)). We know that θ can be any number between 0 and 1; we also know that it cannot be exactly 0 or 1 (see the homework). Since it can be any number between 0 and 1, and we have no a priori knowledge that any number is more likely than any other, it may be best to suppose that each possible value is equally likely. This is the flat prior again. (As before, there are other choices for this prior distribution, but given even a modest sample size, the differences they make to the distribution of future observables are negligible.) Again, technically E_B should be modified to contain this information. After we take the data, we can plot p(θ|k, n, E_B) and see the entire uncertainty in θ, or we can pick a “best” value, which is (roughly) θ = 7/12 ≈ 0.58, or we can say that there is a 95% chance that θ is in the (approximate) interval 0.28 to 0.84. I say “roughly” and “approximate” here because the classical approximation to the exact Bayesian solution isn’t wonderful for the binomial distribution when the sample size is small. The homework will show you how to compute the precise answers using R.
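As a preview of the kind of calculation involved: with a flat prior, the posterior for the success parameter is a beta distribution, so base R can give the exact "best" value and interval directly. This is a minimal sketch, not the homework code; the variable names are chosen here for illustration.

```r
# Old data: 7 wins in 12 games.
k <- 7
n <- 12

# A flat prior on theta is Beta(1, 1); the posterior is then
# Beta(k + 1, n - k + 1). (a and b are names chosen for this sketch.)
a <- k + 1
b <- n - k + 1

# "Best" value: the posterior mode, which equals k / n under a flat prior.
post_mode <- (a - 1) / (a + b - 2)

# Central 95% credible interval from the exact posterior.
cred_int <- qbeta(c(0.025, 0.975), a, b)

post_mode
cred_int
```

To see the entire uncertainty rather than two summaries, `curve(dbeta(x, a, b), from = 0, to = 1)` draws the posterior density.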
2. Back to observables
In our hot little hands, we now have an estimate of θ which equals about 0.58. Does this answer the question we started with?
That question was: How many games will CMU win in the 2008-2009 season? Knowing that θ equals something like 0.58 does not answer this. Knowing that there is a 95% chance that θ is some number between 0.28 and 0.84 also does not answer the question. This question is not about the unobservable parameter θ, but about the future (in the sense of not yet seen) observable data. Now what? This is one of the key sections in this entire book, so take a steady pace here.
Suppose θ was exactly equal to 0.58. Then how many games will CMU win? We obviously don’t know the exact number even if we knew θ, but we could calculate the probability of winning 0 through 12 games using the binomial distribution, just as we did in Chapters 3 and 4. We could even draw the picture of the entire probability distribution given that θ was exactly equal to 0.58. But θ might not be 0.58, right? There is some uncertainty in its value, which is quantified by p(θ|k_old, n_old, E_B), where now I have put the subscript “old” on the old data values to make it explicit that we are talking about the uncertainty in θ given previously observed data. The parameter might equal, say, 0.08, and it also might equal 0.98, or any other value between 0 and 1. In each of these cases, given that θ exactly equalled these numbers, we could draw a probability distribution for future games won, k_new, given n_new = 12 (12 games next season) and given the value of θ.
Let us draw the probability distribution expressing our uncertainty in k_new given n_new = 12 (and E_B) for three different possible values of θ.
(see the book)
If θ does equal 0.08, we can see that the most likely number of games won next season is 1. But if θ equals 0.58, the most likely number of games won is 7; while if θ equals 0.98, then CMU will most likely win all their games.
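The three distributions are ordinary binomials with the success parameter fixed, so they can be tabulated with `dbinom`. A small sketch (the object names are illustrative):

```r
n_new  <- 12
thetas <- c(0.08, 0.58, 0.98)
k_new  <- 0:n_new

# One column of probabilities per assumed value of theta.
probs <- sapply(thetas, function(th) dbinom(k_new, size = n_new, prob = th))
colnames(probs) <- paste0("theta=", thetas)
rownames(probs) <- k_new

# Most likely number of wins under each assumed theta.
most_likely <- k_new[apply(probs, 2, which.max)]
most_likely
```

Each column of `probs` is one of the three pictures in table form; `barplot(probs[, 1])` and so on would draw them.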
This means that the picture on the far left describes our uncertainty in k_new if θ = 0.08. What is the probability that θ = 0.08? We can get it from equation (19), from p(θ|k_old = 7, n_old = 12, E_B). The chance of θ = 0.08 is about 1 in 100 million (we’ll learn how the computer does these calculations in the homework). Not very big! This means that we are very, very unlikely to have our uncertainty quantified by the picture on the left. What is the chance that θ = 0.98? About 3 in a trillion! Even less likely. How about 0.58? About 3 in 10,000. Still not too likely, but far more likely than either of those other values.
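Because the success parameter is continuous, the chance attached to a single value really refers to a thin slice around it, and the quoted odds depend on how thin that slice is, which is not stated here. The sketch below therefore reports posterior densities under the flat prior, whose ratios do not depend on the slice width. It is an illustration, not the homework code.

```r
# Posterior for theta after 7 wins in 12 games, assuming the flat prior:
# Beta(7 + 1, 12 - 7 + 1) = Beta(8, 6).
thetas <- c(0.08, 0.58, 0.98)
dens   <- dbeta(thetas, shape1 = 8, shape2 = 6)
names(dens) <- thetas

dens                   # posterior density at each value
dens / dens["0.58"]    # plausibility of each value relative to theta = 0.58
```

Multiplying a density by a small slice width turns it into an approximate probability for that slice.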
We could go through the same exercise for all the other values that θ could take, each time drawing a picture of the probability distribution of k_new. Each one of these would have a certain probability of being the correct probability distribution for the future data, given that its value of θ was the correct value. But since we don’t know the actual value of θ, but we do know the chance that θ takes any value, we can take a weighted sum of these individual probability distributions to produce one overall probability distribution that completely specifies our uncertainty in k_new given all the possible values of θ. This will leave us with
(see the book)
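One standard way to write such a weighted sum, in the notation of the surrounding text, is as an integral over the parameter. The book's typeset equation may be arranged differently, for example in a closed form.

```latex
p(k_{\mathrm{new}} \mid n_{\mathrm{new}}, k_{\mathrm{old}}, n_{\mathrm{old}}, E_B)
  = \int_0^1 p(k_{\mathrm{new}} \mid n_{\mathrm{new}}, \theta, E_B)\;
             p(\theta \mid k_{\mathrm{old}}, n_{\mathrm{old}}, E_B)\; d\theta
```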
Stare at equation (20) for two minutes without blinking. This, in words, is the probability distribution that tells us everything we need to know about future observables k_new given that we know there will be n_new chances for success this year, also given that we have seen the past observables k_old and n_old, and assuming E_B is true. Think about this. You do not know what future values of k will be, do you? You do know what the past values are, right? So this is the way to describe your uncertainty in what you do not know given what you do know, taking full account of the uncertainty in θ, which is not of real interest anyway.
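The weighted sum can be carried out numerically. The sketch below assumes the flat prior used earlier, so the weights come from a Beta(8, 6) posterior, and it uses R's `integrate` to do the summing; the function names are made up for this illustration.

```r
k_old <- 7;  n_old <- 12   # last season
n_new <- 12                # games next season

# Posterior density of theta under the flat prior.
posterior <- function(theta) dbeta(theta, k_old + 1, n_old - k_old + 1)

# Probability of k wins: the binomial probability at each theta,
# weighted by the posterior density and integrated over theta.
predictive_one <- function(k) {
  integrate(function(theta) dbinom(k, n_new, theta) * posterior(theta),
            lower = 0, upper = 1)$value
}

k_new      <- 0:n_new
predictive <- sapply(k_new, predictive_one)
names(predictive) <- k_new

round(predictive, 4)
sum(predictive)   # should be 1, up to numerical error
```

No single value of the parameter appears in the result: it has been averaged out, leaving only probabilities for the observable number of wins.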
The way to get to this equation uses math that is beyond what we can do in this class, but that is unimportant, because the software can handle it for you. This picture shows you what happens. The solid lines are the probability distribution in equation (20). The circles plotted over it are the probability distribution of a regular binomial assuming θ exactly equals 0.58. The key thing to notice is that the circles distribution, which assumes θ = 0.58, is too tight, too certain. It says the center values of 6 to 8 are more certain than is warranted (their probability is higher than the actual distribution). It agrees, coincidentally only, with the probability that the future number of wins will be 5 or 9, but then gives too little probability for wins less than 5 or greater than 9.
The actual distribution of future observable data (20) will always be wider, more diffuse and spread out, less certain, than any distribution with a fixed θ. This means we must account for uncertainty in the parameter. If we do not, we will be too certain. And if all we do is focus on the parameter, using classical or Bayesian estimates, and we do not think about the future observables, we will be far, far more certain than we should be.
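The comparison can be reproduced in a few lines. The sketch assumes the flat prior, under which the weighted sum has a known closed form (the beta-binomial distribution), and sets it beside the plug-in binomial with the parameter fixed at 0.58. Names such as `sd_of` are invented for the illustration.

```r
k_old <- 7;  n_old <- 12;  n_new <- 12
k_new <- 0:n_new

# Predictive distribution under the flat prior (beta-binomial form),
# with a = k_old + 1 and b = n_old - k_old + 1.
a <- k_old + 1
b <- n_old - k_old + 1
predictive <- choose(n_new, k_new) *
  beta(k_new + a, n_new - k_new + b) / beta(a, b)

# Plug-in distribution: pretend theta is exactly 0.58.
plug_in <- dbinom(k_new, size = n_new, prob = 0.58)

comparison <- data.frame(k_new,
                         predictive = round(predictive, 3),
                         plug_in    = round(plug_in, 3))
comparison

# Spread of each distribution (standard deviation of k_new).
sd_of <- function(p) sqrt(sum(k_new^2 * p) - sum(k_new * p)^2)
c(predictive = sd_of(predictive), plug_in = sd_of(plug_in))
```

In the table, the plug-in column is larger for 6 to 8 wins and smaller in both tails, and its standard deviation is the smaller of the two.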
3. Even more observables
Let’s return to the petanque example and see if we can do the same thing for the normal distribution that we just did for the binomial. The classical guess of the central parameter was μ = -1.8 cm, which matches the best-guess Bayesian estimate. The confidence/credible interval was -6.8 cm to 2.8 cm. In modern statistics, we can say that there is a 95% chance that μ is in this interval. We also have a guess for σ, and a corresponding interval, but I didn’t show it; the software will calculate it. We do have to think about σ as well as μ, however, because both parameters are necessary to fully specify the normal distribution.
As in the binomial example, we do not know what the exact value of (μ, σ) is. But we have the posterior probability distribution p(μ, σ|x_old, E_N) to help us make a guess. For every particular possible value of (μ, σ), we can draw a picture of the probability distribution for future x given that that particular value is the exact value.
(see the book)
The picture shows the probability densities for x_new for three possible values of (μ, σ). If (μ = -6.8 cm, σ = 4.4 cm), the most likely values of x_new are around -7 cm, with most probability given to values from -20 cm to 0 cm. On the other hand, if (μ = 2.8 cm, σ = 8.4 cm), the most likely values of new x are a little larger than 0 cm, but with most probability for values between -20 cm and 30 cm. If (μ = -1.8 cm, σ = 6.4 cm), future values of x are intermediate between the other two guesses. These three pictures were drawn (using the Advanced code from Chapter 5) assuming that the values of (μ, σ) are the correct ones. Of course, they might be the right values, but we do not know that. Instead, each of these three guesses, and every other possible combination of (μ, σ), has a certain probability, given x_old, of being true.
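The three curves are plain normal densities with the two parameters fixed, so their centers and rough ranges can be checked with `qnorm` and `dnorm`. This is a quick sketch, not the Advanced code from Chapter 5; the 95% range is used here as a stand-in for "most probability".

```r
# The three (mu, sigma) guesses quoted in the text, in cm.
guesses <- data.frame(mu = c(-6.8, -1.8, 2.8), sigma = c(4.4, 6.4, 8.4))

# For a normal distribution the most likely value is mu itself; the
# central 95% range shows where most of the probability sits.
guesses$lower <- qnorm(0.025, mean = guesses$mu, sd = guesses$sigma)
guesses$upper <- qnorm(0.975, mean = guesses$mu, sd = guesses$sigma)
round(guesses, 1)

# Density of x_new on a grid, one column per guess.
x_new <- seq(-30, 40, by = 0.5)
dens  <- mapply(function(m, s) dnorm(x_new, mean = m, sd = s),
                guesses$mu, guesses$sigma)
```

`matplot(x_new, dens, type = "l")` would overlay the three curves on one set of axes.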
Given the old data, we can calculate the probability that (μ, σ) equals each of these guesses (and equals every other possible combination of values). We can then weight each of the new x distributions according to these probabilities and draw a picture of the distributions of new values given old ones (and the evidence E_N) like we just did for the binomial distribution. This is
(see the book)
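In the notation of the surrounding text, this weighting can be written as a double integral over both parameters. This is one standard way to express it; the book's typeset equation may differ in form.

```latex
p(x_{\mathrm{new}} \mid x_{\mathrm{old}}, E_N)
  = \int_0^{\infty} \int_{-\infty}^{\infty}
      p(x_{\mathrm{new}} \mid \mu, \sigma, E_N)\;
      p(\mu, \sigma \mid x_{\mathrm{old}}, E_N)\; d\mu\, d\sigma
```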
Here is a picture of this distribution (generated by the computer, of course)
(see the book)
The solid line is equation (21), and the dashed line is a normal distribution with (μ = -1.8 cm, σ = 6.4 cm). The two distributions do not look very different, but they certainly are, especially for very large or very small values of x_new. The dashed line is too narrow, giving too much probability to too narrow a range of x_new. In fact, under the true distribution (21), values greater than 10 cm are twice as likely as under the normal distribution where we plugged in a single guess of (μ, σ); values greater than 20 cm are six times as likely. The same thing is repeated for values less than -10 cm, or less than -20 cm, and so on. Go back and read Chapter 6 to refamiliarize yourself with the fact that very small changes in the central or variance parameter can cause large changes in the probability of extreme numbers.
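The tail comparison can be sketched in R, though only under assumptions that go beyond this text. The sample size is not given in this section, so `n = 10` below is hypothetical. The prior is taken to be the common noninformative choice proportional to 1/σ, under which the predictive distribution is a Student t, and 6.4 cm is treated as the sample standard deviation. The book's prior and data may differ, so the ratios will not necessarily reproduce the "twice" and "six times" quoted above.

```r
m <- -1.8   # guess for mu (cm), from the text
s <- 6.4    # guess for sigma (cm), from the text
n <- 10     # HYPOTHETICAL sample size; not given in this section

# Probability that x_new exceeds a cutoff under the plug-in normal.
tail_plug_in <- function(cutoff) {
  pnorm(cutoff, mean = m, sd = s, lower.tail = FALSE)
}

# Same probability under the predictive distribution, ASSUMING the prior
# p(mu, sigma) proportional to 1/sigma: a Student t with n - 1 degrees of
# freedom, centre m and scale s * sqrt(1 + 1/n).
tail_predictive <- function(cutoff) {
  pt((cutoff - m) / (s * sqrt(1 + 1 / n)), df = n - 1, lower.tail = FALSE)
}

cutoffs <- c(10, 20)
ratio <- tail_predictive(cutoffs) / tail_plug_in(cutoffs)
names(ratio) <- paste0("> ", cutoffs, " cm")
ratio   # how many times more likely the predictive makes each tail
```

Whatever sample size is tried, the pattern is the one described in the text: the further out the cutoff, the more the plug-in normal understates the probability.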
The point again, like in the binomial example, is that using the plug-in normal distribution, the one where you assume you know the exact value of (μ, σ), leads you to be far more certain than you really should be. You need to take full account of the uncertainty in your guesses of (μ, σ); only then will you be able to fully quantify the uncertainty in the future values x_new.