This is a technical addendum to the main series. I would have skipped this, but Climategate 2.0 revealed many misapprehensions of verification statistics that I want to clear up, particularly about R2 and skill. This will be fast and furious and directed at those who already have sufficient background.
A model is created for some observable y. The model will be conditional on certain probative information x and at least some unobservable parameters θ. All terms can of course be multidimensional. Classical procedure—in physics, climatology, statistics, wherever—first gathers a sample of (y,x) and uses this to find a best guess of the parameters, called hat-θ (no pretty way to display this in HTML). The “hat” indicates a guess. It does not matter to us how this guess is derived, merely that it exists. Nobody—and I mean nobody—believes the guess to be perfectly accurate.
The next classical step is to form “residuals”, which are derived by plugging hat-θ into the model and then back-solving for y: the results are called hat-y. From these we calculate R2, which is just a norm of (y – hat-y), i.e. of the “residuals”—in one dimension, the norm is just the normalized sum of squared residuals.
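The two classical steps can be sketched in a few lines. This is a toy example, assuming numpy; the linear model, the data, and the least-squares fit are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy sample (y, x): a straight line plus noise, purely illustrative.
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)

# Classical step 1: a best guess of the parameters, hat-theta,
# here via ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical step 2: plug hat-theta back into the model to get hat-y,
# then form the residuals y - hat-y.
y_hat = X @ theta_hat
residuals = y - y_hat

# R2 in its usual form: one minus the normalized sum of squared residuals.
r2 = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)
print(round(r2, 3))
```

Note that nothing in this computation carries any trace of the uncertainty in theta_hat: the guess is treated as if it were the truth.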
The problem is that since nobody believes the guess of hat-θ, nobody should believe the residuals. If we base our verification solely on R2, we will be too certain of ourselves. If you’ve based your confidence in a climate model solely on measures like R2, or on any other norm/utility that takes as input a guess of the parameters, you think you know more than you do. This is utterly indisputable. Every temperature reconstruction I’ve ever seen uses R2-like measures for verification: they are thus too certain. To eliminate over-certainty, you must account for the inaccuracy in the guess of θ.
This is easy to do in Bayes: one simply integrates out the parameters, giving as a result the probability distribution of y given x. This speaks directly in terms of the observables and only assumes the model is true—which is what R2 also assumes, but R2 adds the assumption that hat-θ is error free.
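The simplest conjugate case shows what integrating out the parameters buys you. In the toy model below (my choice, not from the text: normal data with known σ, unknown mean μ, flat prior), the predictive distribution for a new y, with μ integrated out, is wider than the plug-in distribution that pretends hat-μ is error free:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y ~ Normal(mu, sigma^2), sigma known, mu unknown.
sigma = 1.0
y = rng.normal(3.0, sigma, size=20)
n = y.size

# With a flat prior, the posterior for mu is Normal(ybar, sigma^2/n).
# Integrating mu out, the predictive for a new y is
#   Normal(ybar, sigma^2 * (1 + 1/n)),
# which is wider than the plug-in Normal(mu_hat, sigma^2),
# because the uncertainty in the parameter guess is carried along.
mu_hat = y.mean()
plug_in_sd = sigma
predictive_sd = sigma * np.sqrt(1 + 1 / n)

print(predictive_sd > plug_in_sd)  # the predictive is always wider
```

The extra width is exactly the over-certainty that the plug-in (R2-style) view throws away; it shrinks as n grows but never vanishes.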
Now it gets tricky. Using Bayes, we indeed have the probability distribution of new y’s given new x’s and given the information contained in the original sample (y,x). But all we have in front of us is that original sample. We can do one of two things: one weak, one strong.
Weak: We see how well we could have predicted the old sample assuming it is new. We have Pr(y-new | x-new, (y,x) ), which is the prediction of new observables y given new observables x and given the information contained in the old sample (the parameters are integrated out). We take each pair of old data (y,x)i and use the x from this pair as the x-new. We produce the prediction Pr(y-new | xi, (y,x) ). We compare this prediction of y-new with yi. The prediction is of course a probability distribution, and yi is a (possibly multi-dimensional) point. But we can use things like the continuous ranked probability score (CRPS) or another measure to score this prediction. Many other scores exist which will work: use the one that makes most sense to a decision maker who uses the prediction.
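Scoring a distribution against a point is straightforward with a sample-based CRPS estimate. A minimal sketch, assuming numpy; the predictive sample here is invented, standing in for draws from Pr(y-new | xi, (y,x) ):

```python
import numpy as np

def crps_from_samples(samples, obs):
    """Sample-based CRPS estimate:
    CRPS = E|X - y| - 0.5 * E|X - X'|, with X, X' ~ predictive."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - obs))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

rng = np.random.default_rng(0)
# Pretend these are draws from the predictive at one x_i.
pred = rng.normal(10.0, 2.0, size=2000)

# Lower is better: a prediction centered on the observation scores
# better than the same prediction far from it.
print(crps_from_samples(pred, 10.0) < crps_from_samples(pred, 20.0))
```

CRPS rewards predictions that are both calibrated and sharp, which is why it (or any other proper score the decision maker cares about) is the natural replacement for R2 here.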
This is weak because it double-uses the sample. But so does R2, and in just the same way. Everybody double-uses their sample. Even cross-validation is a double-use (or more!). It’s not wrong to do the double-use, since it does give some idea of model performance. But since—are you ready?—it is always—as in always—possible to build a model that fits (y,x) arbitrarily well, you will always—as in always—go away more confident about your model than you have a right to be.
I’ll repeat that: R2 (and similar measures) double-uses the original sample and does not account for uncertainty in the parameters. Over-certainty is not just likely, it is guaranteed. This is not Briggs’s opinion. This is true. Using Pr(y-new | xi, (y,x) ) also double-uses the original sample and also causes over-certainty.
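The “always possible to fit (y,x) arbitrarily well” claim is easy to demonstrate. In this toy sketch (assuming numpy; the pure-noise data are invented), a polynomial of degree n−1 passes through all n points, so in-sample R2 is essentially 1 even though there is nothing to explain:

```python
import numpy as np

rng = np.random.default_rng(2)

# n pure-noise points: there is no signal here at all.
n = 10
x = np.arange(n, dtype=float)
y = rng.normal(size=n)

# A degree n-1 polynomial interpolates every point exactly,
# driving the in-sample residuals to (numerically) zero.
coeffs = np.polyfit(x, y, deg=n - 1)
y_hat = np.polyval(coeffs, x)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2 > 0.999)
```

A perfect in-sample score on noise: the complex model “wins” by every double-use measure while being worthless for prediction.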
Strong: Wait for x-new and y-new to come along—ensure they are never seen before in any way, brand-spanking new observables! Produce your prediction Pr(y-new | x-new, (y,x) ) and then compare it to the y-new using CRPS or whatever loss function makes sense to the user of the prediction. This is the only way—as in the only way—to avoid over-certainty. This again is not opinion, this is just true.
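The strong procedure amounts to: build the model on data you have, then score it only on data that arrives later. A minimal sketch, assuming numpy; the series, the split point, and the straight-line “model” are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# A toy series; only the first chunk is used to build the model.
t = np.arange(60, dtype=float)
y = 0.1 * t + rng.normal(0, 1, size=t.size)

train_t, train_y = t[:40], y[:40]   # seen by the model
new_t, new_y = t[40:], y[40:]       # never touched until scoring

# "Model": a straight line fit on the training chunk alone.
a, b = np.polyfit(train_t, train_y, 1)
pred = a * new_t + b

# Score only on the brand-new points; mean absolute error here,
# but swap in CRPS or whatever loss the decision maker uses.
mae_new = np.mean(np.abs(new_y - pred))
print(round(mae_new, 3))
```

The point is the separation, not the particular model or score: the new data play no role whatsoever in building the model, so the resulting score cannot flatter it.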
Physicists, chemists, electronics engineers and the like of course do this sort of thing all the time. They are not satisfied to produce one version of the model, report once on R2 and call it a day. They test the models out again and again, and on new data. Models that perform poorly are scrapped or re-built. Statisticians should do the same.
Skill: I often speak of model skill and I want to give the technical definition. Skill also comes in weak and strong versions.
Weak: Take the verification measure from your model as above—whether it’s R2, CRPS, whatever—and save it. Then build a new model which should look like your old model, except that it should be “simpler.” Perhaps in your original model the dimension of θ is 12, but in the simpler model it is only 7. The choice is yours. For climate models the natural choice is called “persistence”, which is a model that says “the next time period will be exactly like this time period.” The choice of the simpler model should be directed by the question at hand.
Skill is when the more complex model beats the simpler model in terms of the verification score. Simple as that. If the more complex model cannot beat the simpler model, then the simpler model (of course) is better and should be used in preference to the complex model.
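A skill comparison is just two scores side by side. A toy sketch, assuming numpy; the trending series, the trend-line “complex” model, and the use of mean absolute error are all my invented illustration (and since the trend is fit in-sample, this is the weak version of skill):

```python
import numpy as np

rng = np.random.default_rng(4)

# A toy series with a genuine trend plus noise.
t = np.arange(300, dtype=float)
y = 0.05 * t + rng.normal(0, 1, size=t.size)

# Simple model, persistence: "next period equals this period."
persist_pred = y[:-1]
obs = y[1:]

# "Complex" model: a trend line (fit in-sample, hence weak skill).
a, b = np.polyfit(t, y, 1)
trend_pred = a * t[1:] + b

mae_persist = np.mean(np.abs(obs - persist_pred))
mae_trend = np.mean(np.abs(obs - trend_pred))

# Positive skill: the complex model beats the simple one.
skill = 1.0 - mae_trend / mae_persist
print(skill > 0)
```

If skill came out negative here, the honest conclusion would be to discard the trend model and just use persistence, exactly as the text says.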
Of course, if you’ve used R2 or the weak-Bayes prediction (re-used the sample), your estimate of skill will be too certain. Re-use is still re-use, even for skill.
Strong: Same as the weak, except the comparisons of performance are made on the models’ predictions of new data, as above.
Climate models don’t have strong skill (as far as I’ve seen) at predicting yearly global average temperature. They do not (again, as far as I’ve seen) beat persistence. Thus, their predictions cannot yet be trusted.
Last word: a sufficient sample of performance measures must be built to demonstrate there is high probability that skill is positive. We build these models (of future skill) as we build all probability models.
We’re all weary of this, so that is all I want to say on models and model performance at this time.