My heart soared like a hawk1 after learning that Messieurs Philip Klotzbach and William Gray have admitted in print the hardest thing scientists can ever confess. That they were wrong.
If only other climatologists would follow suit!
The abstract of their paper Qualitative Discussion of Atlantic Basin Seasonal Hurricane Activity for 2012 (pdf) says it all (see also this press report):
We are discontinuing our early December quantitative hurricane forecast for the next year and giving a more qualitative discussion of the factors which will determine next year’s Atlantic basin hurricane activity. Our early December Atlantic basin seasonal hurricane forecasts of the last 20 years have not shown real-time forecast skill even though the hindcast studies on which they were based had considerable skill. Reasons for this unexpected lack of skill are discussed.
I wept joyously when I read this paragraph because Bill Gray’s pre-season hurricane predictions (how many, what strengths) have been annual events for two decades. They are covered by the media. The forecasts themselves are used in decisions involving real money, real lives. In short, they are important. This is why admitting that they aren’t accurate is so momentous.
And, yes, there is strong suspicion that because we cannot forecast how many hurricanes the coming season will have, we might also not be able to forecast what the global average temperature will be to the nearest tenth of a degree fifty years hence.
The key lies in understanding “real-time” forecast skill and “hindcast” skill. Two alternate names are predictive skill and model fit. If we can get these two concepts, we will appreciate the vast amount of over-certainty in science.
All models are fit to past observed data. This is true of correlational hurricane models, statistical-physical GCMs, purely statistical models used in sociology, psychology, etc.; that is, any model that uses observed data. Hindcast skill is when the model fits that past data well.
It turns out that for any set of observed data, a model that demonstrates good fit, a.k.a. hindcast skill, can always be found. By always I mean always, as in always. There is thus nothing special about reporting that a statistical model fit past data well (demonstrated hindcast skill)—except in the rare situation where the form of the model has been deduced, and is therefore rigidly fixed. That was not the case with Gray’s hurricane models, nor is it the case in any of the social-science statistical models that I know. (It is true for example in casino games.)
Again, a model that demonstrates hindcast skill or good fit can always be found. So the true test of a model is how well it predicts data that was not used in any way to fit itself. That is, new data. If a model can skillfully predict brand new data, if it can demonstrate “real-time” forecast or prediction skill, then we know the model is good, is worthy, is telling something true about the world.
There is, incidentally, nothing special in the use of the word forecast. It is only shorthand to indicate that never-seen-before-new data is being predicted. It is a natural word when talking of data that comes to us indexed by time, but that index isn’t needed. All that is required to demonstrate real-time or prediction skill is data not seen before; whether that data comes from the future or the past is immaterial.
Gray’s hurricane models showed hindcast skill but not real-time forecast skill. This means that the model should not be used and that the model used as a reference, either persistence of climatology (I believe the case here), should be looked to. That is, Gray’s very clever statistical model is giving results which are poorer than the model that says, “Use the average number of storms as a prediction for the number of storms this year.” That is the “climatology” model.
It’s even worse than it appears, for Gray was using (in part) an over-certain measure of skill, R-square. R-square always—as in…well, you get the idea—inflates the certainty one has in model fit or prediction skill. The reason this is so is explained in this series What Is A True Model? What Makes A Good One? (skill is also defined here). This means that if Gray were to use a better measure of performance, his confidence in the usefulness of his model would decrease further.
But give Gray—and most meteorologists—credit: they do what most users of statistical models never do, and what should be mandatory. The used their models to predict new data. That’s how they were able to learn that the model didn’t work. They could not have learned this lesson using hindcast skill/model fit. I cannot recall seeing any sociologist, psychologist, educationist, etc., etc. report prediction skill of their models. They only report (an over-confident version of) model fit. Over-certainty is thus guaranteed.
1A favorite saying of Old Lodge Skins from Thomas Berger’s Little Big Man: highly recommended.