I often say—it is even the main theme of this blog—that people are too certain. This is especially true when people report results from classical statistics, or use classical methods when implementing modern, Bayesian theory. The picture below illustrates exactly what I mean, but there is a lot to it, so let’s proceed carefully.
Look first only at the jagged line, which is something labeled “Anomaly”; it is obviously a time series of some kind over a period of years. This is the data that we observe, i.e. that we can physically measure. It, to emphasize, is a real, tangible thing, and actually exists independent of whatever anybody might think. This is a ridiculously trivial point, but it is one which must be absolutely clear in your mind before we go on.
I am interested in explaining this data, and by that I mean, I want to posit a theory or model that says, “This is how this data came to have these values.” Suppose the model I start with is
y = a + b*t
y are the observed values I want to predict,
b are something called parameters, and
t is for time, or the year, which goes from 1955 to 2005. Just for fun, I’ll plug in some numbers for the parameters so that my actual model is
y = -139 + 0.07*t
The result of applying model A’ gives the little circles. How does this model fit?
Badly. Almost never do the circles actually meet with any of the observed values. If someone had used our model to predict the observed data, he almost never would have been right. Another way to say this is
Pr(y = observed) ~ 0.04
or the chance that the model equals the observed values is about 4%.
We have a model and have used it to make predictions, and we’re right some of the time, but there is still tremendous uncertainty in our predictions left. It would be best if we could quantify this uncertainty so that if we give this model to someone to use, they’ll know what they are getting into. This is done using probability models, and the usual way to extend our model is called regression, which is this
y = a + b*t + OS
where the model has the same form as before except for the addition of the term
OS. What this model is saying is that “The observed values exactly equal this straight line plus some Other Stuff that I do no know about.” Since we do not know the actual values of
OS, we say that they are random.
Here is an interesting fact: model A, and its practical implementation A’, stunk. Even more, it is easy to see that there are no values of
b that can turn model A into a perfect model, for the obvious reason that a straight line just does not fit through this data. But model B always can be made to fit perfectly! No matter where you draw a straight line, you can always add to it
Other Stuff so that it fits the observed series exactly. Since this is the case, restrictions are always placed on
OS (in the form of parameters) so that we can get some kind of handle on quantifying our uncertainty in it. That is a subject for another day.
Today, we are mainly interested in finding values of
b so that our model B fits as well as possible. But since no straight line can fit perfectly, we will weaken our definition of “fit” to say we want the best straight line that minimizes the error we make using that straight line to predict the observed values. Doing this allows us to guess values of
Using classical or Bayesian methods of finding these guesses leads to model A’. But we are not sure that the values we have picked for
b are absolutely correct, are we? The value for
b might have been 0.07001, might it not? Or
a might have been -138.994.
Since we are not certain that our guesses are perfectly correct, we have to quantify our uncertainty in them. Classical methodology does this by computing a p-value, which for
b is 0.00052. Bayesian methodology does this by computing a posterior probability of
b > 0 given the data, which is 0.9997. I won’t explain either of these measures here, but you can believe me when I tell you that they are excellent, meaning that we are pretty darn sure that our guess of
b is close to its true value.
Close, but not exactly on; nor is it for
a, which means that we still have to account for our uncertainty in these guesses in our predictions of the observables. The Bayesian (and classical1) way to approximate this is shown in the dashed blue lines. These tell us that there is a 95% chance that the expected value of
y is between these lines. This is good news. Using model B, and taking account of our uncertainty in guessing the parameters, we can then say the mean value of
y is not just a fixed number, but a number plus or minus something, and that we are 95% sure that this interval contains the actual mean value of
y. And that interval looks pretty good!
Time to celebrate! No, sorry, it’s not. There is one huge thing still wrong with this model: we cannot ever measure a mean. The
y that pops out of our model is a mean and shares a certain quality with the parameters
b, which is that they are unobservable, nonphysical quantities. They do not exist in nature; they are artificial constructs, part of the model, but you will never find a
b anywhere, not ever.
Nearly all of statistics, classical and Bayesian, focuses its attention on parameters and means and on making probability statements about these entities. These statements are not wrong, but they are usually beside the point. A parameter almost never has meaning by itself. Most importantly, the probability statements we make about parameters always fool us into thinking we are more certain than we should be. We can be dead certain about the value of a parameter, while still being completely in the dark about the value of an actual observable.
For example, for model B, we said that we had a nice, low p-value and a wonderfully high posterior probability that
b was nonzero. So what? Suppose I knew the exact value of b to as many decimal places as you like. Would this knowledge also tell us the exact value of the observable? No. Well, we can compute the confidence or credible interval to get us close, which is what the blue lines are. Do these blue lines encompass about 95% of the observed data points? They do not: they only get about 20%. It must be stressed that the 95% interval is for the mean, which is itself an unobservable parameter. What we really want to know about is that data values themselves.
To say something about them requires a step beyond the classical methods. What we have to do is to completely account for our uncertainty in the values of
b, but also in the parameters that make up
OS. Doing that produces the red dashed lines. These say, “There is a 95% chance that the observed values will be between these lines.”
Now you can see that the prediction interval—which is about 4 times wider than the mean interval—is accurate. Now you can see that you are far, far less certain than what you normally would have been had you only used traditional statistical methods. And it’s all because you cannot measure a mean.
In particular, if we wanted to make a forecast for 2006, one year beyond the data we observed, the classical method would predict 4.5 with interval 3.3 to 5.7. But the true interval for the prediction of the interval, while still 4.5, has the interval 0.5 to 9, which is three and a half times wider than the previous interval.
…but wait again! (“Uh oh, now what’s he going to do?”)
These intervals are still too narrow! See that tiny dotted line that oscillates through the data? That’s the same model as A’ but with a sine wave added on to it, to account for possibly cyclicity of the data. Oh, my. The red interval we just triumphantly created is true given that model B is true. But what if model B was wrong? Is there any chance that it is? Of course there is. This is getting tedious—which is why so many people stop at means—but we also, if we want to make good predictions, have to account for our uncertainty in the model. But we’re probably all exhausted by now, so we’ll save that task for another day.
1Given the model and priors I used, this is true.