This brings us to the second reason for measuring model goodness. Or rather, to an incorrect implementation of it. A lot of folks announce how well their model fits past data, i.e. the data used to fit the model. Indeed, classical statistical procedure, which includes hypothesis testing, is built around telling the world how well models fit data. Yet model fit is of no real interest for measuring forecast goodness.
I’ll repeat that. Model fit is of no interest for measuring forecast goodness.
It matters not to the world that your patented whiz-bang prize-winning model, built by the best experts government grants can buy, can be painted onto past data so closely that nothing is left to the imagination, because a model that fits well is not necessarily one that predicts well. Over-fitting leads to great fits but bad predictions. Incidentally, this is yet another in a long list of reasons to distrust p-values.
You have to be careful. What we’re listening for are claims of model skill. What we sometimes get are announcements of skill that appear to be predictive skill but are really model-fit skill. A model, once created (and this can be a purely statistical model or a purely physics model or somewhere in between, as most are), is used to make “predictions” of the data used to create the model, and skill scores are calculated. But these are just another measure of model fit. They are not real predictive skill. We only care about predictions made on observations never made known (in any way) to the model developers.
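Here’s a minimal sketch of the point, in Python with invented numbers: a wiggly polynomial “fit” to noisy data scores beautifully on the very data used to build it, and badly on fresh data from the same process.

```python
# A sketch of why model fit is not predictive skill: an over-fitted
# polynomial nails the training data and flubs new data. All numbers
# are made up for illustration.
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def mse(y, yhat):
    """Mean squared error: a stand-in for whatever score you prefer."""
    return float(np.mean((y - yhat) ** 2))

# The true process: a straight line plus noise.
x_train = np.linspace(0.0, 1.0, 15)
y_train = 2.0 * x_train + rng.normal(0.0, 0.3, x_train.size)

# An over-fitted model: a degree-12 polynomial through 15 points.
model = Polynomial.fit(x_train, y_train, deg=12)
fit_error = mse(y_train, model(x_train))

# Fresh observations from the same process, never shown to the model.
x_new = rng.uniform(0.0, 1.0, 200)
y_new = 2.0 * x_new + rng.normal(0.0, 0.3, x_new.size)
pred_error = mse(y_new, model(x_new))

print(f"in-sample (model fit) MSE:      {fit_error:.4f}")   # small
print(f"out-of-sample (predictive) MSE: {pred_error:.4f}")  # typically much larger
```

The in-sample score can be made as flattering as you like by adding wiggles; only the out-of-sample score tells you whether the model is any good at its job.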
There are claims that climate models have skill or that they have good proper scores. This is false in the predictive sense for forecasts out past around a year or so (climate models out a few months actually have a form of skill, as noted below). What is usually meant, when people claim good model performance, is that the model either fit old data well or was able to reproduce features in old data. However much interest this has for model builders—and it does have some—it is of zero, absolutely zero, interest for judging how good the model is at its job.
There are two standard simple or naive models in meteorology and climatology: climatology (unfortunately named) and persistence. The climatology forecast is some constant, just like in the naive regression model. It’s usually the value over some period of time, like 30 years (with the mean and standard deviation of that period used to fit a normal, if a probability forecast is wanted). Obviously, a complex model that can’t beat the forecast of “It’ll be just like it was on average over the last 30 years” is of no predictive value. Persistence says the next time point will be equal to this time point (again, this time point might be fit to something like a normal, a procedure which uses more than just one time period, in order to make persistence into a probability forecast). Again, complex models which can’t beat persistence are of no predictive value.
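A hedged sketch of how the comparison works, with every series invented: score the “complex” model and each naive reference on the same held-out observations, and call the model skillful only if its score beats the reference’s.

```python
# Skill relative to the two naive references. Skill here is
# 1 - MSE_model / MSE_reference: positive means the model beats the
# reference, zero or negative means it adds nothing. All numbers are
# hypothetical.
import numpy as np

def mse(y, yhat):
    return float(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2))

def skill(obs, model_fcst, reference_fcst):
    return 1.0 - mse(obs, model_fcst) / mse(obs, reference_fcst)

# Observations for the verification period (never shown to the model).
obs = np.array([14.2, 15.1, 13.8, 16.0, 15.4, 14.9])

# Climatology: a long-run average (say, a 30-year mean), forecast as a constant.
climatology = np.full_like(obs, 15.0)

# Persistence: tomorrow equals today, so verification starts at the 2nd point.
persistence = obs[:-1]

# Forecasts from some complex model, aligned with the observations.
model = np.array([14.6, 14.8, 14.5, 15.2, 15.0, 15.1])

print("skill vs climatology:", skill(obs, model, climatology))
print("skill vs persistence:", skill(obs[1:], model[1:], persistence))
```

Mean squared error stands in for whatever (proper) score you prefer; the point is only that the comparison is against the naive reference, on observations the model never saw.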
Would you use a model which can’t beat just guessing the future will be like the past?
I’m not entirely sure, but I don’t think the sort of models on which the IPCC relies even have climatology predictive skill these last 20 or so years. None have persistence skill.
Again I ask: why use any model which can’t beat just guessing? The “model” of saying the future will be (more or less) like the past is beating the pants off of the highly sophisticated complex models. Why? I have no idea: well, some idea. But even if we’re right in that link, it doesn’t solve the model problems. Indeed, nobody really knows what’s wrong. If they did, they would fix it and the models would work. Since they don’t work, we know that nobody has identified the faults.
The third reason to check model performance is somewhat neglected. If the model over some time period has this-and-such score, it is rational to suppose, absent evidence to the contrary, that it will continue to perform similarly in the future. This is why, unless we hear of major breakthroughs in climate science, it is rational to continue doubting GCMs.
But past performance can also be quantified. In effect, the past scores become data for a new model of future performance. We can predict skill. Not only that, but we can take measures of things that were simultaneous with the forecast-observation pairs; these become part of the model to predict skill. If we have a good idea of the value of these other things (say, El Niño versus non-El Niño years), then we might be able to see in advance whether the forecast is worth relying on.
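As a rough illustration, with every number made up: treat each past year’s score as data, regress it on an El Niño indicator known before the forecast is issued, and read off how much skill to expect next year.

```python
# Past scores become data for a new model of future performance:
# regress each year's skill score on a covariate known at forecast
# time (an El Niño indicator). All numbers are invented.
import numpy as np

# One score per past year, paired with whether it was an El Niño year.
scores = np.array([0.30, 0.05, 0.28, 0.33, 0.02, 0.31, 0.27, 0.04])
el_nino = np.array([1, 0, 1, 1, 0, 1, 1, 0], dtype=float)

# Simple least-squares line: score ~ a + b * el_nino.
X = np.column_stack([np.ones_like(el_nino), el_nino])
a, b = np.linalg.lstsq(X, scores, rcond=None)[0]

# If we believe next year is (or isn't) an El Niño year, predict the
# skill we should expect, and decide in advance how far to trust the forecast.
for state, label in [(1.0, "El Niño"), (0.0, "non-El Niño")]:
    print(f"expected skill in a {label} year: {a + b * state:.2f}")
```

Nothing deep is going on: it is the same verification logic applied one level up, with the scores themselves as the observations.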
Those are the basics of forecast verification. There is, of course, much more to it. For instance, we haven’t yet discussed calibration. Of that, more later.
Bonus: Via Bishop Hill, this.