# You cannot measure a mean

I often say—it is even the main theme of this blog—that people are too certain. This is especially true when people report results from classical statistics, or use classical methods when implementing modern, Bayesian theory. The picture below illustrates exactly what I mean, but there is a lot to it, so let’s proceed carefully.

Look first only at the jagged line, which is something labeled “Anomaly”; it is obviously a time series of some kind over a period of years. This is the data that we observe, i.e. that we can physically measure. It, to emphasize, is a real, tangible thing, and actually exists independent of whatever anybody might think. This is a ridiculously trivial point, but it is one which must be absolutely clear in your mind before we go on.

I am interested in explaining this data, and by that I mean, I want to posit a theory or model that says, “This is how this data came to have these values.” Suppose the model I start with is

A: `y = a + b*t`

where `y` are the observed values I want to predict, `a` and `b` are something called parameters, and `t` is for time, or the year, which goes from 1955 to 2005. Just for fun, I’ll plug in some numbers for the parameters so that my actual model is

A’: `y = -139 + 0.07*t`

The result of applying model A’ gives the little circles. How does this model fit?

Badly. Almost never do the circles actually meet with any of the observed values. If someone had used our model to predict the observed data, he almost never would have been right. Another way to say this is

`Pr(y = observed) ~ 0.04`

or the chance that the model equals the observed values is about 4%.

We have a model and have used it to make predictions, and we’re right some of the time, but there is still tremendous uncertainty in our predictions left. It would be best if we could quantify this uncertainty so that if we give this model to someone to use, they’ll know what they are getting into. This is done using probability models, and the usual way to extend our model is called regression, which is this

B: `y = a + b*t + OS`

where the model has the same form as before except for the addition of the term `OS`. What this model is saying is that “The observed values exactly equal this straight line plus some Other Stuff that I do no know about.” Since we do not know the actual values of `OS`, we say that they are random.

Here is an interesting fact: model A, and its practical implementation A’, stunk. Even more, it is easy to see that there are no values of `a` and `b` that can turn model A into a perfect model, for the obvious reason that a straight line just does not fit through this data. But model B always can be made to fit perfectly! No matter where you draw a straight line, you can always add to it `Other Stuff` so that it fits the observed series exactly. Since this is the case, restrictions are always placed on `OS` (in the form of parameters) so that we can get some kind of handle on quantifying our uncertainty in it. That is a subject for another day.

Today, we are mainly interested in finding values of `a` and `b` so that our model B fits as well as possible. But since no straight line can fit perfectly, we will weaken our definition of “fit” to say we want the best straight line that minimizes the error we make using that straight line to predict the observed values. Doing this allows us to guess values of `a` and `b`.

Using classical or Bayesian methods of finding these guesses leads to model A’. But we are not sure that the values we have picked for `a` and `b` are absolutely correct, are we? The value for `b` might have been 0.07001, might it not? Or `a` might have been -138.994.

Since we are not certain that our guesses are perfectly correct, we have to quantify our uncertainty in them. Classical methodology does this by computing a p-value, which for `b` is 0.00052. Bayesian methodology does this by computing a posterior probability of `b > 0` given the data, which is 0.9997. I won’t explain either of these measures here, but you can believe me when I tell you that they are excellent, meaning that we are pretty darn sure that our guess of `b` is close to its true value.

Close, but not exactly on; nor is it for `a`, which means that we still have to account for our uncertainty in these guesses in our predictions of the observables. The Bayesian (and classical1) way to approximate this is shown in the dashed blue lines. These tell us that there is a 95% chance that the expected value of `y` is between these lines. This is good news. Using model B, and taking account of our uncertainty in guessing the parameters, we can then say the mean value of `y` is not just a fixed number, but a number plus or minus something, and that we are 95% sure that this interval contains the actual mean value of `y`. And that interval looks pretty good!

Time to celebrate! No, sorry, it’s not. There is one huge thing still wrong with this model: we cannot ever measure a mean. The `y` that pops out of our model is a mean and shares a certain quality with the parameters `a` and `b`, which is that they are unobservable, nonphysical quantities. They do not exist in nature; they are artificial constructs, part of the model, but you will never find a `mean(y)`, `a`, or `b` anywhere, not ever.

Nearly all of statistics, classical and Bayesian, focuses its attention on parameters and means and on making probability statements about these entities. These statements are not wrong, but they are usually beside the point. A parameter almost never has meaning by itself. Most importantly, the probability statements we make about parameters always fool us into thinking we are more certain than we should be. We can be dead certain about the value of a parameter, while still being completely in the dark about the value of an actual observable.

For example, for model B, we said that we had a nice, low p-value and a wonderfully high posterior probability that `b` was nonzero. So what? Suppose I knew the exact value of b to as many decimal places as you like. Would this knowledge also tell us the exact value of the observable? No. Well, we can compute the confidence or credible interval to get us close, which is what the blue lines are. Do these blue lines encompass about 95% of the observed data points? They do not: they only get about 20%. It must be stressed that the 95% interval is for the mean, which is itself an unobservable parameter. What we really want to know about is that data values themselves.

To say something about them requires a step beyond the classical methods. What we have to do is to completely account for our uncertainty in the values of `a` and `b`, but also in the parameters that make up `OS`. Doing that produces the red dashed lines. These say, “There is a 95% chance that the observed values will be between these lines.”

Now you can see that the prediction interval—which is about 4 times wider than the mean interval—is accurate. Now you can see that you are far, far less certain than what you normally would have been had you only used traditional statistical methods. And it’s all because you cannot measure a mean.

In particular, if we wanted to make a forecast for 2006, one year beyond the data we observed, the classical method would predict 4.5 with interval 3.3 to 5.7. But the true interval for the prediction of the interval, while still 4.5, has the interval 0.5 to 9, which is three and a half times wider than the previous interval.

…but wait again! (“Uh oh, now what’s he going to do?”)

These intervals are still too narrow! See that tiny dotted line that oscillates through the data? That’s the same model as A’ but with a sine wave added on to it, to account for possibly cyclicity of the data. Oh, my. The red interval we just triumphantly created is true given that model B is true. But what if model B was wrong? Is there any chance that it is? Of course there is. This is getting tedious—which is why so many people stop at means—but we also, if we want to make good predictions, have to account for our uncertainty in the model. But we’re probably all exhausted by now, so we’ll save that task for another day.

##### 1Given the model and priors I used, this is true.

1. Ian Rae

I read your blog regularly, and enjoy the statistic tutorials and critiques. However, today I’m confused by your use of the word ‘mean’ here. i thought ‘mean’ was a single value, such as” the mean income of Americans is \$48000″.

Here, y is a function, not a single value. Is the model A a “mean”?

2. Khebab

I’m not sure I completely understand, y is a non stationary random variable so, of course, you cannot derive a stationary mean (or variance) but OS will be a stationary random variable describing the residuals. You can still derive an uncertainty interval for the 2006 value by adding the trend and the residual uncertainty interval s.

3. Briggs

All,

I encourage you, if it is possible, to forget everything you have ever learned about statistics before thinking about today’s post. It can be a real distraction.

Ian, You’re right. A mean is a single value, but so is `y` for a single, fixed value of `t`. Another way to say “mean” is “expected value”, and that’s why `y` is because we can never know `OS`.

Khebab, Actually, in classical statistics, `y` is a stationary “random” variable: it still is even if we know `a` and `b` because we can never know `OS`.

And we have indeed derived an uncertainty interval, in two different ways: one around the expected value (the mean) and another around the observable.

I appreciate this comments, everybody. This is very confusing stuff.

Briggs

4. Mike Rankin

I appreciate your efforts to bring understanding to the assumptions built into statistical modeling. Your bringing in a cyclical term at the end really nailed the point. I expect none of us really noticed the pattern at first glance and jumped to only calculating parameters of a straight line regression.

5. Steve Hempell

Briggs,

It might not matter, but the graph you give doesn’t seem to fit the equation. Shouldn’t the line intercept the y axis at -139 not the + value it does? I’m thinking of the old y= mx + b from ancient times. Just don’t understand this, which bothers me.

6. Briggs

Steve,

Spot on. The graph does intersect at -139, but only when `t=0` (plug that value in to see).

Actually, you accidentally bring up an excellent point. Would the observation really equal -139 at year 0? That is what model B says.

Which goes to show you that most models, inexpertly applied, and usually because that’s the kind of model with which people are familiar, stink.

Briggs

7. Steve Hempell

Ah, the dreaded floating Y axis or whatever you call it.

It’s nice being accidental – actually learn or become more aware of something.

I find your discussions on this topic fascinating, I am following the discussion on Whats up with that? on the satellite/land based data and want to scream when everyone draws straight lines everywhere.

Your previous discussion on the satellite data seems to indicate that there is some sort of periodicity to the temperature data. I have this fantasy that there is some hidden meaning to it if only it could be untangled – probably not possible with climate.

When I play with residuals and lowess smoothing, and a trend analysis in a program called MR1D_v2 everything comes out curvy. I have no idea what I’m doing but it’s fun.

Certainly more interesting than a straight line to a hot doom. Maybe a rollercoaster to a hot or icey doom.

Another question: does the fact that satellite data is accurate to 0.05 degrees C make any trendline less than 0.05 Deg/X meaningless?

8. Wolfgang Flamme

William,
would you mind posting the data / code for the graph above? Just want to be sure my interpretation of your presentation is correct.

Just to add more confusion, your model is fit to “observed” data, but in many cases (global temps being a good example) the “observed” data is squishy, or as statisticians say, loaded with measurement error.

Some phenomena can be measured with great precision but some cannot. When the “yardstick” is rubbery, the data are more like guesses than precise readings that can be replicated by unbiased observers.

Measurement error throws an entirely different uncertainty into models, called statistical bias, that is difficult to account for. When measurement error is folded in, somehow or other, it invariably boosts the model variance (a stat pun) and widens the confidence intervals.

PS Satellite data has measurement error. Don’t kid yourself about that. And terrestrial weather station data is thoroughly messed up. And anytime an analysis is based on “proxies,” you can sure the “observed” data is as squishy as mud.

10. Whoops. I meant to say data are, not data is.

11. Bob North

Thank you for a well-written and easy to understand on the fact that the mean is simply a measure of central tendency and this does not require that the mean (or the results of a predictive formala) actually match any of the observed data. For me at least, it is always easier to understand a math concept when it is written in normal, expository english rather than what I’ll call “mathese” where everything is written as formulas and “proofs”.

I think one thing that some lose sight of when they get too caught up in the statistics and formulae is that math doesn’t explain the physical world around us, we are only trying to use math to explain the world around us as best we can. Sometimes it works, sometimes it doesn’t, but ultimately all mathematical descriptions of the physical world are subject to refutation or refinement. In other words, if there is an observable phenomenom that is proven to be true, but which is in conflict with one of our formulae with which we attempt to describe the physical world, it is the formulae that needs to modified, not the phenomenom.

Regards,
Bob North

12. Erik

This point is so obvious, but it somehow eludes lots of smart people.

A model of a mean is meaningless (?).

Even if you could be certain that the global mean temperature would rise by x, you would have no idea what that would me for people or polar bears or crops.

It’s easy to construct a scenario in which the global mean temp goes up by six degrees yet we all freeze to death.

13. PI

Can you elaborate on what you’re calculating? I’ll restrict myself to the Bayesian method.

“It must be stressed that the 95% interval is for the mean, which is itself an unobservable parameter. What we really want to know about is that data values themselves.”

1. This (and most of the rest of your post) implies that you’re talking about the posterior predictive distribution for the data, instead of the posterior for the model trend.

“What we have to do is to completely account for our uncertainty in the values of a and b, but also in the parameters that make up OS.”

2. On the other hand, this implies that you’re treating the distributional parameters (e.g., residual variance or autocorrelation) as uncertain with some prior, and marginalizing over them in the posterior.

Which of these represents the red lines? 1, 2, or both?

As for 1, I believe it’s useful to give both posterior trend and posterior predictive intervals. The underlying trend itself is interesting and informative, but if you want to predict what will be measured in a specific year, you need the posterior predictive distribution.

(Actually, instead of intervals, I prefer a collection of random draws from the posterior or posterior predictive distributions. Means and confidence intervals can hide structure.)

2 should always be done when computationally feasible, unless there is good reason to believe the priors on distributional parameters are very narrow.

14. Briggs

Wolfgang, I am working on a paper on this subject with my friend Russ Zaretzki (who, everybody should know, has no idea what the word “climate” even means, which I point out so nobody will think he is a crazy as I am; he is an excellent statistician, however). Data and code will be coming when finished. However, it’s no secret. I used normals and flat priors for all, which lead to t posteriors. See Bernardo and Smith, the Appendix to see the exact equations for regression.

Mike D, Amen. Measurement error just adds to the uncertainty. I have pointed this out elsewhere, but have given no examples yet.

Bob, right on. It’s not that a model cannot describe reality, it’s just that they often do a poor job.

Π 1. Yes, the posterior predictive, not the posterior for μ 2. Not quite. See today’s post on “meaning on mean means”. Red lines are just posterior predictive. The blue lines are the parameter posterior (on μ).

And I agree with you that it’s always better to show the full posterior distribution when possible. This is because showing an interval can give misleading results, as you say. However, these posteriors (for observables or μ) are all t, and are symmetric, so the intervals are reasonable summaries.

The term “random draws”, oh my. I’ll be talking about the mysticism of “random” in future posts.

And no matter if I know the exact value of the parameters, I still do not know the future value of the observables.

Briggs

15. PI

I don’t know what “mysticism” you’re talking about. There is nothing wrong with taking random draws from the posterior to see examples of what the predictions look like. (See, for instance, pretty much anything ever written by Andrew Gelman.) In general, not all of this information will be captured either by full posteriors of the parameters, or by confidence intervals on the predictions. What you want is posteriors of the predictions, but when the predictions are time series this is difficult. Hence, pick N samples and plot them on top of each other. As you say, when the posterior is of simple form this is not necessary, but in more complicated cases it is useful.

And while the trend doesn’t tell you the future values of observables, the posterior predictive distribution also obscures the trend. We always experience the combination of weather and climate, but it is of physical interest to disentangle weather from climate. Both kinds of inferences are useful.

16. Khebab

Re #3.

I’m a little bit lost here. I thought that a stationary process (wide-sense stationarity) requires that 1st and 2nd moments do not vary with respect to time:

E[y(t]]= m(t)= m(t+Dt)

which is not the case here.

17. Briggs

Khebab, it might help to just replace the “t” with an “x” in the models and ignore the “time” aspect. These are just made up data anyway.

Π, thank you for bringing up Gelman. I take his name in vain in a paper about logic in probability, which you can find on my Resume tab : look for “Broccoli”.

However, this does not answer your point, which I promise to do at a later time in a post about “randomness.”

The distinction between climate and weather is true, but it is separate entirely from this posting. Whether “trend” is useful is relevant, but only insofar as it helps us make predictions. There, the main point still stands: usual statistical methods lead to overconfidence.