Hurricane Predictors Admit They Can’t Predict Hurricanes

My heart soared like a hawk[1] after learning that Messieurs Philip Klotzbach and William Gray have admitted in print the hardest thing scientists can ever confess. That they were wrong.

If only other climatologists would follow suit!

The abstract of their paper Qualitative Discussion of Atlantic Basin Seasonal Hurricane Activity for 2012 (pdf) says it all (see also this press report):

We are discontinuing our early December quantitative hurricane forecast for the next year and giving a more qualitative discussion of the factors which will determine next year’s Atlantic basin hurricane activity. Our early December Atlantic basin seasonal hurricane forecasts of the last 20 years have not shown real-time forecast skill even though the hindcast studies on which they were based had considerable skill. Reasons for this unexpected lack of skill are discussed.

I wept joyously when I read this paragraph because Bill Gray’s pre-season hurricane predictions (how many, what strengths) have been annual events for two decades. They are covered by the media. The forecasts themselves are used in decisions involving real money, real lives. In short, they are important. This is why admitting that they aren’t accurate is so momentous.

And, yes, there is strong suspicion that because we cannot forecast how many hurricanes the coming season will have, we might also not be able to forecast what the global average temperature will be to the nearest tenth of a degree fifty years hence.

Bill Gray's hurricane models

The key lies in understanding “real-time” forecast skill and “hindcast” skill. Two alternate names are predictive skill and model fit. If we can get these two concepts, we will appreciate the vast amount of over-certainty in science.

All models are fit to past observed data. This is true of correlational hurricane models, statistical-physical GCMs, purely statistical models used in sociology, psychology, etc.; that is, any model that uses observed data. Hindcast skill is when the model fits that past data well.

It turns out that for any set of observed data, a model that demonstrates good fit, a.k.a. hindcast skill, can always be found. By always I mean always, as in always. There is thus nothing special about reporting that a statistical model fit past data well (demonstrated hindcast skill)—except in the rare situation where the form of the model has been deduced, and is therefore rigidly fixed. That was not the case with Gray’s hurricane models, nor is it the case in any of the social-science statistical models that I know. (It is true for example in casino games.)

Again, a model that demonstrates hindcast skill or good fit can always be found. So the true test of a model is how well it predicts data that was not used in any way to fit itself. That is, new data. If a model can skillfully predict brand new data, if it can demonstrate “real-time” forecast or prediction skill, then we know the model is good, is worthy, is telling something true about the world.
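
To make the distinction concrete, here is a minimal sketch using made-up data, not Gray's model or any real hurricane record. It assumes 40 hypothetical seasons whose counts are pure noise around a constant mean, plus 15 invented predictors that have nothing to do with the counts. A least-squares model fit to the first 25 seasons still fits them at least as well as the plain average does (hindcast skill), while its error on the 15 held-out seasons is expected to be noticeably larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: storm counts are pure noise around a constant mean,
# so there is genuinely nothing for a predictor to find.
n_years, n_predictors = 40, 15
counts = 10 + rng.normal(0, 3, n_years)
predictors = rng.normal(size=(n_years, n_predictors))  # unrelated to counts

train, test = slice(0, 25), slice(25, None)

def design(x):
    """Add an intercept column to the predictor matrix."""
    return np.column_stack([np.ones(len(x)), x])

def mse(pred, obs):
    return float(np.mean((pred - obs) ** 2))

# Fit by ordinary least squares on the first 25 seasons only.
coef, *_ = np.linalg.lstsq(design(predictors[train]), counts[train], rcond=None)

fit_mse = mse(design(predictors[train]) @ coef, counts[train])  # hindcast
new_mse = mse(design(predictors[test]) @ coef, counts[test])    # unseen data

# Reference: always predict the training-period average.
train_mean = counts[train].mean()
clim_fit_mse = mse(np.full(25, train_mean), counts[train])
clim_new_mse = mse(np.full(15, train_mean), counts[test])

print("hindcast MSE  model:", round(fit_mse, 2), " average:", round(clim_fit_mse, 2))
print("new-data MSE  model:", round(new_mse, 2), " average:", round(clim_new_mse, 2))
```

The split into 25 and 15 seasons is arbitrary; any data held back from the fitting step would serve the same purpose.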

There is, incidentally, nothing special in the use of the word forecast. It is only shorthand to indicate that never-seen-before-new data is being predicted. It is a natural word when talking of data that comes to us indexed by time, but that index isn’t needed. All that is required to demonstrate real-time or prediction skill is data not seen before; whether that data comes from the future or the past is immaterial.

Gray’s hurricane models showed hindcast skill but not real-time forecast skill. This means that the model should not be used, and that the reference model, either persistence or climatology (I believe the latter is the case here), should be used instead. That is, Gray’s very clever statistical model is giving results which are poorer than the model that says, “Use the average number of storms as a prediction for the number of storms this year.” That is the “climatology” model.
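
As a rough illustration of what "poorer than climatology" means, the sketch below uses one common skill score based on mean squared error: one minus the ratio of the model's error to the reference's error. That definition is an assumption here, since the post does not say which score was used, and every number is invented. A positive score means the model beats the reference, zero means a tie, and a negative score means the reference wins.

```python
def mse(pred, obs):
    """Mean squared error between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)

def skill_score(model_pred, reference_pred, obs):
    """1 - MSE(model) / MSE(reference); negative when the reference is better."""
    return 1.0 - mse(model_pred, obs) / mse(reference_pred, obs)

# Invented storm counts for five seasons (not real data).
observed = [8, 4, 6, 7, 3]
model = [5, 7, 9, 4, 6]   # hypothetical statistical forecast
climatology = [6] * 5     # hypothetical long-term average, reused every year

score = skill_score(model, climatology, observed)
print("skill relative to climatology:", round(score, 2))
```

With these invented numbers the score is negative, which is the situation the abstract describes: the clever model loses to the long-term average.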

It’s even worse than it appears, for Gray was using (in part) an over-certain measure of skill, R-square. R-square always—as in…well, you get the idea—inflates the certainty one has in model fit or prediction skill. The reason this is so is explained in this series What Is A True Model? What Makes A Good One? (skill is also defined here). This means that if Gray were to use a better measure of performance, his confidence in the usefulness of his model would decrease further.
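
One way to see R-square flattering a model, offered as an illustration and not necessarily the argument made in the linked series: on the data used for fitting, R-square cannot go down when another predictor is added, even a predictor made of pure noise. The sketch below assumes a made-up target and 20 junk predictors, all random.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
y = rng.normal(size=n)            # hypothetical target: pure noise
junk = rng.normal(size=(n, 20))   # predictors unrelated to y

def r_squared(X, y):
    """In-sample R-square of an ordinary least-squares fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return float(1.0 - resid @ resid / np.sum((y - y.mean()) ** 2))

# R-square using the first k junk predictors, for k = 1..20.
r2_by_k = [r_squared(junk[:, :k], y) for k in range(1, 21)]

for k in (1, 5, 10, 20):
    print(k, "junk predictors -> R-square", round(r2_by_k[k - 1], 2))
```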

But give Gray, and most meteorologists, credit: they do what most users of statistical models never do, and what should be mandatory. They used their models to predict new data. That’s how they were able to learn that the model didn’t work. They could not have learned this lesson using hindcast skill/model fit. I cannot recall seeing any sociologist, psychologist, educationist, etc., etc. report prediction skill of their models. They only report (an over-confident version of) model fit. Over-certainty is thus guaranteed.


[1] A favorite saying of Old Lodge Skins from Thomas Berger’s Little Big Man: highly recommended.


  1. Gavin

    “we might also not be able to forecast what the global average temperature will be to the nearest tenth of a degree fifty years hence. ”

    Strawman (again).

  2. obiwankenobi

    Anecdote: While involved in the international model validation exercise using blind (Chernobyl) data, our probabilistic model outperformed the other 17 non-stochastic models. When the model creator told me that overpredicting the actual value by only a factor of 4 was “really good for a computer model,” I shook my head and went back to my paperwork.

  3. Bob T.
  4. genemachine

    Regarding the correct “null model” to compare model forecast skill to, is it always best to compare against a static model, or might it be worthwhile also comparing against a linear equation, a polynomial, or another simple model?

  5. Briggs


    Any comparator may be used. Usually it is one that “makes sense” in the situation at hand, as persistence and climatology do in forecasting the weather or climate.

    For a typical sociological model, say one which demonstrates a new form of bias, the relevant natural comparator is the identical model but without the bias terms. But in any case, the comparator is an extra-statistical choice.

    Gav! How I’ve missed you!

  6. genemachine

    Thanks Briggs. I see that in the Hargreaves paper linked to above they did actually consider a persistence model and a trend and, yeah, persistence generally makes sense in temperature trends.

  7. j ferguson

    Gavin’s right. It was the tenth of a degree resolution that no-one is forecasting.

  8. Will

    This is really good news. This is, dare I say it, scientific progress. Now they can move on and start building new models using different ideas. Heck, this openness and honesty might even be seen by some as an invitation to contribute!

  9. Matt

    So…Gavin’s strawman was really hyperbole?

  10. David S

    Hello Gavin. In support of your argument that WMB’s aside about climate was a straw man, can you remind us what the climate models’ predicted global mean temperature anomalies were for this year 10 years ago, and how many of them are currently inside their error bars as issued at the time? Obviously if they have any skill at all, far more of them will be within their forecast error bars than one would have expected to have happened by chance.

  11. If not hyperbole, at least toes being tightly pinched.

  12. A significant amount of money could be saved if resources were applied to improvements in estimates of the tracks: narrow the angle of that cone of potential landfall. At the present time, the early estimates are almost of no practical value.

  13. Gavin

    David S.

    The spread of AR4 model simulations for 2011 (global mean temperature) compared to the 1980-1999 baseline is [-0.12, 0.87] deg C (55 simulations), approximately N(0.43,0.20) (i.e. 0.43 +/- 0.4 deg C, 95%).
    Preliminary estimates for 2011 are around 0.27 deg C with respect to the same baseline (GISTEMP). Thus the actual temperature is well within error bars. The spread of the simulations is a function of the internal variability that for short periods overwhelms any forced signal, and so this isn’t a particularly useful test. See also:


    “Gav! How I’ve missed you!”

    Why? You surely don’t need me to remind you of the difference between logic and rhetoric. Though your frequent confusion of the two on this particular subject does give one pause…

  14. Will

    Gavin: Was hoping you might answer a few questions. My apologies if they appear ignorant. Thanks for the link btw.

    1. the estimates were made using a model that’s seen no refinement in ten years? Or the estimates were made using a model trained only using data up to 2000? I’m asking because the original question was about a prediction made 10 years ago. He wasn’t looking for an estimate from a model made this year using 10 year old data.

    2. It looks as though the uncertainty margin in the graph you provided seems rather large. One could draw a flat line from 1980 to 2010 and still be within the 95 percentile. Am I misreading that? Is this uncertainty for the models output, or the observed temperature?

    3. What do you mean by 55 simulations? Is the data the same each run, with slight alterations to the model, or is it the data that changes? I’m assuming this is an ensemble approach; is the final result a simple average, or is it a weighted average?

    4. Maybe a dumb question, but do you know of a graph that shows model prediction versus a null hypothesis? In terms of model validation it would help in understanding performance against margins. A histogram of the error, made by each of the 55 simulations, would also provide a “risk assessment” overview.

    Thank you. Hope you do respond.

  15. Noblesse Oblige

    “If only other climatologists would follow suit!”

    Messrs. Gray, Klotzbach, and a few others (e.g., Roger Pielke Sr.) exist at the impurity level in the climate business — one part per million! They are honest and trustworthy scientists ‘of the old school’ who let the chips fall where they may. Follow their work and their thinking and you will learn a lot.

  16. Noblesse Oblige

    Some of us miss real science instead of realclimate.

    As usual you have to keep your eye on the pea.

    What the boys do is to take the ensemble of models and treat them as a statistical distribution, even though different models have different physics and thus different climate sensitivities. The envelope of uncertainty of this ensemble is so large that it doesn’t yet rule out ‘consistency’ with the observed trend at the 95% confidence level. Of course this means that treating the models this way has little or no predictive value over a 10-20 year period because their spread is so large. Thus anything that happens ‘is consistent’ with the models. OK so far?

    But in reality, the models do not comprise a statistical distribution of attempts to ‘measure’ the same thing. Some models have lower climate sensitivities (e.g., < about 2 degrees for 2 x CO2) so that they cannot yet be falsified when compared to observation, whereas the ones with larger climate sensitivities can be falsified. In real science, what would be done now is to throw out the models with the high sensitivities and keep for now the ones with lower sensitivities. One would then announce that the temperature record is consistent with lower climate sensitivity, which it is. But that is not how the boys operate. They circle the wagons around ALL the models and say that they are consistent.

    If you think I just made this up, you would be wrong.

  17. Andy

    “Gav! How I’ve missed you!”

    Why? You surely don’t need me to remind you of the difference between logic and rhetoric. Though your frequent confusion of the two on this particular subject does give one pause…

    Unintentionally ironic methinks.

  18. I’m paying Gavin to massage (and occasionally waterboard) data, to defend those tender ministrations on a PR firm’s website, and, occasionally, to grace the pages of Briggs’s blog?

    Well, OK, if I must…

  19. mike williams

    Gavin says:
    14 December 2011 at 6:46 pm “compared to the 1980-1999 baseline”

    $CAGW$ Strawman again..sigh..they will never learn..
    Special time scale chosen ..why not go further back in the time series..go on..dont be have time to waste using those “special” start dates.. If thats an example of “logic” rather than hand waving rhetoric than sorry..ya get a fail.. 🙂
    Please read the links posted and try again..

  20. obiwankenobi

    Let me see if I’m following all of this stuff. Way back when the FDA set a 1e-06 limit for manmade carcinogens in foodstuffs. We now know that the test used to determine the carcinogenicity of manmade substances was bogus (50% of all chemicals, including manmade ones are carcinogens according to the FDA approved test). Moving along we now learn that the Linear No Threshold Theory of Harm (again cancer is the endpoint) of ionizing radiation is bogus (low doses elicit different mechanisms than high doses and the resulting risk of cancer is vastly sublinear and possibly even hormetic). Now, the same bunch that told us in the ’70s that low doses of chemicals and ionizing radiation are cancer-causing is telling us that low doses of CO2 will destroy the planet. Go figure.

  21. Michael Ozanne

    “David S says:

    14 December 2011 at 2:46 pm

    can you remind us what the climate models’ predicted global mean temperature anomalies were for this year 10 years ago, ”

    From the PDF of the WG1 technical summary Section F.3 Projections of Future Changes in Temperature Figure 22 from IPCC Third Assessment Report – Climate Change 2001 logged on the IPCC website by GRID-Arendal in 2003.

    Temperature change at the beginning of 2012 was predicted at 1995 + 0.2 to 0.6 deg C with individual predictions given as

    IS92 (TAR) 0.3
    A1T 0.45
    B2 0.45
    B1 0.4
    A1FI 0.35
    A2 0.3
    A1B 0.35

    The Forecast range at Year 2100 varies between +- 30% and +- 50% which at rough glance would agree with 0.4 +- 0.2

    Figures not as precise as I would like, as I couldn’t easily locate a data view with the actual numbers in it, and I do have a day job after all.


    Arithmetic mean of the Global Temperature anomaly in 1995 was 0.44 so predicted temp anomaly would be 0.84 +- 0.2

    Arithmetic mean of 2011 (Jan to Nov) GTA is 0.52 and is currently (i.e Nov value) 0.45

    Apologies for hurried and imprecise nature of this response, what you’ll do to avoid boredom during a database restore, but I hope that it is at least responsive.

  22. JH

    I wish the paper had briefly described the forecasting schemes under discussion instead of referring to the papers, some of which are incorrectly referenced. Where is Klotzbach (2008) cited in Figure 2?

    The sample correlation coefficient r is defined to measure the LINEAR strength between two variables. Rsquare is only equal to r*r =r^2 for a simple (one independent) linear model.

    Using r to measure the performance of a forecasting scheme should be accompanied by a scatter plot of observed versus predicted values. A small r between them might not be an indication of large deviations (poor fit) between them; it could be the relationship between them deviates from the linear pattern. So using r to conclude a poor skill of the scheme may not be appropriate. How about using MSE or MAD?

  23. JH

    One probably can simulate a time series such that x(t)= [ x(t-1)+x(t-2)+x(t-3) ] /3 + 0.1 *w[t], where w[t] is a standard normal white noise. Notice the small standard deviation by multiplying 0.1. Let’s use the scheme of the average of observations from the three past time periods to make a one-step-ahead forecast. Now, do the hindcasting or backtesting, I think…just my professional intuition that could be wrong… it’d not be hard to generate a series that would result in a higher r between observed and predicted values for the first, say, 45 time periods, and a very low, possibly negative, r for the last 4 or 10 time periods. Yet, the scheme seems adequate for the model. The uncertainty in the conclusion made based on a sample correlation calculated from 4 time periods (see Figure 2 in the paper) is high.

  24. Paul Linsay

    Way back in the bad old days, Steve M. saw fit to print a little note of mine about hurricanes. I pointed out that annual hurricane counts were consistent with a Poisson process. It’s easiest to see in the Atlantic hurricane count since the average is about 5.2 hurricanes per year. Sure enough, the distribution of annual counts fits a Poisson distribution with a mean of 5.2. (Actually, no fitting is needed, just plot the distribution with its area normalized to one and plot a Poisson distribution on top, they pretty much match.)

    A second test for a Poisson process is that the time between hurricanes should follow an exponential distribution if the rate of hurricane production is constant. This also works fine for intervals between about 2 and 100 days.

    The number of hurricanes in a year is going to follow a random walk since the process is described by a Poisson process, and it does. Early in the 20th century there was a lull, towards the end there was an uptick. We’re now back to a lull but it’s all consistent with a random walk.

    The Poisson distribution is harder to see in the various Pacific basins because there are so many more hurricanes per year. It would require many more years of data to see a nice distribution. Within each basin the time distribution between hurricanes falls on a nice exponential as predicted by a Poisson process.

  25. Will

    Michael Ozanne: thank you for posting that information. Answered a lot of questions.
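
JH's thought experiment in comment 23 can be tried directly. The sketch below is one possible reading of it, not a reproduction of anything in the Klotzbach and Gray paper: it simulates the series x(t) = [x(t-1) + x(t-2) + x(t-3)]/3 + 0.1 w(t), forecasts each value with the average of the three before it, and computes the correlation r between observed and predicted values over the first 45 forecasts and over the last 4. The series length, the random starting values, and the 200 repetitions are assumptions made for illustration.

```python
import numpy as np

def window_correlations(seed, n=52, noise_sd=0.1):
    """Simulate the series once; return r over the first 45 and last 4 forecasts."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    x[:3] = rng.normal(size=3)  # arbitrary starting values (assumption)
    for t in range(3, n):
        x[t] = x[t - 3:t].mean() + noise_sd * rng.normal()
    # One-step-ahead forecast: the average of the three previous observations.
    pred = np.array([x[t - 3:t].mean() for t in range(3, n)])
    obs = x[3:]
    r_first = np.corrcoef(obs[:45], pred[:45])[0, 1]
    r_last = np.corrcoef(obs[-4:], pred[-4:])[0, 1]
    return float(r_first), float(r_last)

# Repeat with different random seeds to see how much each r moves around.
results = np.array([window_correlations(seed) for seed in range(200)])
r_first, r_last = results[:, 0], results[:, 1]

print("first 45 periods: r from", r_first.min().round(2), "to", r_first.max().round(2))
print("last 4 periods:   r from", r_last.min().round(2), "to", r_last.max().round(2))
```

The forecasting scheme is by construction the best available one for this series, so whatever spread appears in the four-period r comes from sampling noise alone.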

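The two checks Paul Linsay describes in comment 24 are two views of the same thing: if the gaps between storms are exponential with a constant rate, the yearly counts are Poisson. Here is a minimal sketch with synthetic data only, not the actual Atlantic record. It takes the rate of 5.2 per year from the comment, assumes 2000 simulated years, and ignores the hurricane season entirely. It draws exponential gaps, counts events per year, and compares the observed frequencies with the Poisson formula.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
rate = 5.2       # mean hurricanes per year, taken from the comment
n_years = 2000   # hypothetical record length, far longer than any real one

# Waiting times between events, in years, for a constant-rate process.
gaps = rng.exponential(scale=1.0 / rate, size=int(rate * n_years * 1.5))
times = np.cumsum(gaps)
times = times[times < n_years]

# Number of events falling in each whole year.
counts = np.bincount(times.astype(int), minlength=n_years)

def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

empirical = np.bincount(counts) / n_years
for k in range(0, 13, 2):
    observed = float(empirical[k]) if k < len(empirical) else 0.0
    print(k, "per year: simulated", round(observed, 3),
          " Poisson", round(poisson_pmf(k, rate), 3))

# For a Poisson distribution the mean and the variance are equal.
print("mean", round(float(counts.mean()), 2), " variance", round(float(counts.var()), 2))
```

Real storms cluster within a season, so the constant rate is a simplification; the sketch only shows what agreement with a Poisson process would look like.
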