Be sure to first read Statistical Significance Does Not Mean What You Think. Climate Temperature Trends, as this post is an extension of it.
As yesterday: If y is some temperature and t time, a simplified model might be
y_t = β0 + β1 t + ε
where we make the usual assumption that our uncertainty in ε is characterized by a normal distribution with parameters 0 and σ. If β1 is positive we announce that there is a “trend” in the data. Once more, this is to speak improperly. No model in the world can tell you if there was a trend with certainty. But a simple glance at the data can. Just look!
Please understand that it makes no difference how many complications you add to this model (e.g., “correlated residuals”, etc.), everything said below still holds.
Another example: If y is income and x the number of years of education, a standard model is
y = β0 + β1x + ε
where once more we make the usual assumptions. (How to pick the “best” x’s, I’ll answer another day.)
Incidentally, all normal distributions¹ are characterized by two parameters, a central parameter which tells us where the peak of the density is, and a spread parameter which controls the width of the distribution. I am not being pedantic when I insist that these are not to be called the “mean” and “standard deviation.” Those two objects are functions of the observable data which are often used as classical estimates of the parameters. But they are not the parameters.
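To make the distinction concrete, here is a minimal sketch using only Python's standard library. The parameter values (10 and 2), the sample size, and the seed are all invented for illustration. In real work nobody gets to see the parameters; here we cheat and fix them so the data-based guesses can be compared against them.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the run is repeatable

# Assumed (hypothetical) parameter values; in real work these are never seen.
central, spread = 10.0, 2.0

# Twenty simulated observations generated using those values.
y = [random.gauss(central, spread) for _ in range(20)]

# The mean and standard deviation are functions of the observed data ...
sample_mean = statistics.mean(y)
sample_sd = statistics.stdev(y)

# ... used as classical guesses of the parameters, which they do not equal.
print(f"central parameter = {central}, sample mean = {sample_mean:.3f}")
print(f"spread parameter  = {spread}, sample sd   = {sample_sd:.3f}")
```

Change the seed and the mean and standard deviation change with it; the parameters, being assumptions, stay put.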
The problem is that since the parameters do not exist, we can never know whether our guesses are accurate. I repeat: parameters are not observable. They are entirely metaphysical. Their placement in equations is the logical consequence of the premise that a normal (or other) distribution is assumed to characterize our uncertainty in some observable (like y).
The observable y itself is not “normally distributed.” No thing in the world is “normally distributed”, nor “binomially”, nor any-other-distribution-ally. It is true that we often say, perhaps after looking at graphical evidence of y, that y is or is not “normally distributed.” But this is to speak incorrectly. The observable takes whatever values it does because of some causal process. Each, and yes, every instance of y is caused. Probability distributions cannot cause.
Our slogan is: end the slavery of reification! (I’ll speak more on this another day.)
What we should say in regression (or any case in which we use probability) is that our uncertainty in y, for a given value of t or x, is characterized by a normal distribution with parameters
y ~ N(β0 + β1t, σ)
y ~ N(β0 + β1x, σ).
Regression, then, is a model not of the observable y but of the central parameter of the normal distribution which characterizes our uncertainty in y. That line we usually draw over scatterplots to indicate regression is a line of the central parameter given varying values of t and x. To emphasize: this line says nothing directly about the observable y; it is itself unobservable, metaphysical, a fiction.
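A small sketch of what that line is, assuming classical least-squares guesses for the parameters. The (t, y) pairs are invented for illustration and are not real temperatures. The line is computed from guesses of β0 and β1; the observed y sit beside it, untouched by it.

```python
import statistics

# Hypothetical (t, y) pairs, invented for illustration only.
t = [0, 1, 2, 3, 4, 5, 6, 7]
y = [14.1, 13.8, 14.4, 14.0, 14.6, 14.3, 14.2, 14.7]

t_bar = statistics.mean(t)
y_bar = statistics.mean(y)

# Classical least-squares guesses of the parameters.
s_tt = sum((ti - t_bar) ** 2 for ti in t)
s_ty = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
b1 = s_ty / s_tt
b0 = y_bar - b1 * t_bar

# The "regression line": the guessed central parameter at each t.
line = [b0 + b1 * ti for ti in t]

for ti, yi, ci in zip(t, y, line):
    print(f"t={ti}  observed y={yi:.1f}  guessed central parameter={ci:.2f}")
```

The right-hand column is the line drawn over the scatterplot. The middle column is what actually happened, and no model was needed to learn it.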
Even if we assume we know the values of the parameters β0, β1, and σ, we still know nothing directly about y. Thus, when we speak of “residuals”, which are had by plugging in classical guesses for the parameters and “solving” for y, we speak incorrectly. What we are solving for are the values of the central parameters, for given values of t and x.
We cannot solve for y in any probability model. Plus, there is no reason to. If we want to know what the values of y were, all we have to do is look!
What we can do is to assume our model is true and use it to quantify our uncertainty in values of y we have not yet seen. We do not need a model to quantify uncertainty in data we already know! We properly speak of our uncertainty in new values by “integrating out” the parameters of the model and (as yesterday) producing statements like this:
Pr(y_new > a | t_new, old observed data, model true)
Pr(y_new > a | x_new, old observed data, model true)
Where a can be any number we like, even an interval, and we assume that our model is perfectly true, flawlessly true, just plain true. There is no information in these probabilities about whether the model is true. Notice that there are no parameters in these equations.
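For the curious, one way such a probability might be computed is sketched below. It rests on assumptions the post does not specify: the standard "flat" reference prior on (β0, β1, log σ), under which β0 and β1 can be integrated out exactly and σ by simulation. The function name predictive_prob and the data are hypothetical, invented for this illustration.

```python
import math
import random
import statistics

def predictive_prob(x, y, x_new, a, draws=20000, seed=0):
    """Monte Carlo estimate of Pr(y_new > a | x_new, old data, model true)."""
    rng = random.Random(seed)
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    # Extra spread at x_new from not knowing beta0 and beta1.
    k = 1 + 1 / n + (x_new - x_bar) ** 2 / s_xx
    centre = b0 + b1 * x_new
    hits = 0
    for _ in range(draws):
        # sigma^2 | data ~ RSS / chi-square(n - 2) under the assumed flat prior.
        chi2 = rng.gammavariate((n - 2) / 2, 2.0)
        sigma2 = rss / chi2
        # y_new | sigma^2, data, with beta0 and beta1 already integrated out.
        y_new = rng.gauss(centre, math.sqrt(sigma2 * k))
        if y_new > a:
            hits += 1
    return hits / draws

# Hypothetical old data, invented for illustration only.
t_old = [0, 1, 2, 3, 4, 5, 6, 7]
y_old = [14.1, 13.8, 14.4, 14.0, 14.6, 14.3, 14.2, 14.7]

p = predictive_prob(t_old, y_old, x_new=8, a=14.5)
print(f"Pr(y_new > 14.5 | t_new = 8, old data, model true) is roughly {p:.2f}")
```

The function returns a single number about an observable. No β and no σ appear in the answer, and nothing in it says whether the model is any good.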
It is also improper to say that our model “makes predictions of y.” It does not. It gives us the probability that y takes certain values. This probability can be turned into a prediction of a unique y, but only after we marry the predictive probability distribution with a measure of how important prediction mistakes are to us. That is, a prediction of a unique y, since it is a distillation of the probability that y takes any value, is a kind of decision, and to understand those we need to enter into the subject of “decision analysis”, which I won’t do here today.
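Without entering decision analysis proper, a toy sketch shows how the same probabilities yield different guesses once costs are attached. Everything here is an assumption for illustration: the predictive distribution is taken to be a normal with made-up numbers, the loss is linear in the size of the miss, and best_guess is a hypothetical helper. Under that particular loss, the guess that minimizes expected cost is a quantile of the predictive distribution.

```python
from statistics import NormalDist

# Assumed predictive distribution for a new y (hypothetical numbers).
predictive = NormalDist(mu=14.6, sigma=0.3)

def best_guess(dist, cost_under, cost_over):
    """Guess minimizing expected loss when each unit of under-prediction
    costs cost_under and each unit of over-prediction costs cost_over."""
    return dist.inv_cdf(cost_under / (cost_under + cost_over))

print(best_guess(predictive, 1, 1))  # symmetric costs: the median
print(best_guess(predictive, 3, 1))  # guessing too low hurts more: guess higher
```

Same uncertainty, different guesses. The guess belongs to the person who bears the costs, not to the model.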
Suffice to summarize: We start by observing pairs of (y, t) or (y, x). We need never use probability to talk about these values: they are there, lying open; whatever knowledge about them we want can be had just for asking. But we do not know what values y will take when we observe new values of t or x. Thus we must characterize our uncertainty in new y given new t or x, assuming our model is true.
The parameters of our model are only of tertiary importance; they are dull things; they cannot be seen; they do not exist. Best to “integrate them out” and speak directly of our uncertainty in actual observables, like y, t, and x.
Technical Facts For Geeks
We might infer the posterior values of β0, β1, and σ to sufficiently high certainty. But the certainty in these parameters does not imply that we have sufficiently high certainty in new values of y (given new values of t or x and assuming our model is true).
Researchers often issue statements about their models, but talk only of their uncertainty in the parameter values. This certainty is “transferred” to certainty that the model is true, or that the new values of y can be predicted just as certainly. This is false. As in not true. As in not so.
Thus, climate temperature modelers might go on about how Pr(β1 > 0 | old data, model true) is high, and say therefore it is also highly probable that y itself will increase substantially in the future. Again, this is not so. We can know (by assumption, for instance) the value of β1 precisely, but this does not mean we know future values of y with precision.
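A quick sketch of that last point, with every number invented for illustration. Suppose we are handed β0, β1, and σ exactly, so that Pr(β1 > 0) is 1 by assumption. How sure are we that next year's y exceeds this year's central value?

```python
from statistics import NormalDist

# Assumed known exactly, for the sake of argument (hypothetical values).
beta0, beta1, sigma = 14.0, 0.02, 0.3

t_now, t_next = 10, 11
a = beta0 + beta1 * t_now  # this year's central value

# Our uncertainty in next year's y with every parameter known.
y_next = NormalDist(mu=beta0 + beta1 * t_next, sigma=sigma)

p = 1 - y_next.cdf(a)
print(f"Pr(beta1 > 0) = 1 by assumption; Pr(y_next > {a:.2f}) = {p:.2f}")
```

With these made-up values the probability comes out at about 0.53, barely better than a coin flip, even though the "trend" parameter is known to be positive with certainty.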
Even stronger, all these probabilities assume the model itself is true. Once more, there is no information in the old data which can prove a model is true or false. It is always an assumption—or an inference (of the kind frequentists are forbidden to make!).
What we can do is use the model to characterize our uncertainty in new values of y; then, after incorporating these probabilities into a decision process and producing guesses of new y, we can wait until we see new values of y and then see how useful the guesses were. And that is it.
We can, of course, compare the usefulness of other models. But—and here is the subtlety—to pick the “best” model from a suite of competitors is itself an inferential process, just like our original regression.
But enough! I can’t possibly re-create an entire theory of probability in one post.
¹ We’ll also never mind that normal distributions are always an approximation for our uncertainty: we always make a mistake of some kind when using a normal. Let he that readeth understand.