Climate Model Uncertainty: Part II

Read Part I

The Analysis (cont.)

Two problems arise when comparing a model’s integration (the forecast) with an analysis of new observations, which are not found when comparing the forecast to the observations themselves. Verifying the model with an analysis, we compare two equally sized “grids”; verifying the model with observations, we compare a tiny number of model grid points with reality.

Now, some kinds of screwiness in the model are also endemic in the analysis: the model and analysis are, after all, built from the same materials. Some screwiness, therefore, will remain hidden, undetectable in the model-analysis verification.

However, the model-analysis verification can reveal certain systematic errors, the knowledge of which can be used to improve the model. But the result is that the model, in its improvement cycle, is pushed towards the analysis. And always remember: the analysis is not reality, but a model of it.

Therefore, if models over time are tuned to analyses, they will reach an accuracy limit which is a function of how accurate the analyses are. In other words, a model might come to predict future analyses wonderfully, but it could still predict real-life observations badly.
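To make the point concrete, here is a toy simulation, not anything taken from a real forecasting system. Every number in it (grid size, number of observed points, error sizes) is an assumption chosen only for illustration. A "forecast" that has been pushed towards the analysis scores very well against the analysis, yet noticeably worse against the handful of real observations.

```python
import math
import random

random.seed(0)

N_GRID = 5000  # hypothetical number of model grid points
N_OBS = 50     # hypothetical number of points that have a real observation


def rmse(a, b):
    """Root mean squared difference between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# "Reality" at every grid point (in practice unknown except where observed).
truth = [random.gauss(15.0, 5.0) for _ in range(N_GRID)]

# The analysis is a model of reality: truth plus its own error (sd 1.0, assumed).
analysis = [t + random.gauss(0.0, 1.0) for t in truth]

# A forecast tuned towards the analysis tracks it closely (sd 0.2, assumed).
forecast = [a + random.gauss(0.0, 0.2) for a in analysis]

# Real observations exist at only a few points, with small measurement error.
obs_idx = random.sample(range(N_GRID), N_OBS)
obs = [truth[i] + random.gauss(0.0, 0.1) for i in obs_idx]

rmse_vs_analysis = rmse(forecast, analysis)
rmse_vs_obs = rmse([forecast[i] for i in obs_idx], obs)

print(f"RMSE against analysis, all {N_GRID} points: {rmse_vs_analysis:.2f}")
print(f"RMSE against observations, {N_OBS} points: {rmse_vs_obs:.2f}")
```

The gap between the two scores is built in by construction: the forecast inherits the analysis error, and only a check against observations can see it.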

Which brings us to the second major problem of model-against-analysis verification. We do not know actually how well the model is performing because it is not being checked against reality. Modelers who rely solely on the analysis model-checking method will be—they are guaranteed to be—overconfident.

The direct output of most climate and weather models is difficult to check against actual observations because models make predictions at orders and orders of magnitude more locations than there are observations. Yet modelers are anxious to check their models at all places, even where there are no observations. They believe that analysis-verification is the only way they can do this.

This is important, so allow me a redundancy: models make predictions at wide swaths of the Earth’s surface where no observations are taken. At a point near Gilligan’s Island, the model says “17°C”, yet we can never know whether the model was right or wrong. We’ll never be able to check the model’s accuracy at that point.

We can guess accuracy at that point by using an analysis to make a guess of what the actual temperature is. But since model points—in the atmosphere, in the ocean, on the surface—outnumber actual observation locations by so much, our guess of accuracy is bound to be poor.


Actual observations can be brought into the picture by matching model forecasts to future observations and then building a statistical model between the two. This is called model output statistics, or MOS. The whole model, at all its grid points, is fed into a statistical model: luckily, many of the points in the model will be found to be non-predictive and thus are dropped. Think of it like a regression. The model’s outputs are like the Xs, and the observations are like the Ys, and we statistically model Y as a function of the Xs.

So, when a new model integration comes along, it is fed into a MOS model, and that model is used to make forecasts. Forecasters will also make reference to the physical model integrations, but the MOS will often be the starting point.
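The regression idea can be sketched in a few lines. This is a deliberately stripped-down illustration, not an operational MOS: it uses a single predictor at a single hypothetical station, whereas a real MOS screens many model outputs. The data are simulated, and the assumed relation between raw forecast and observation (intercept 2.0, slope 0.8, noise sd 1.0) is invented for the example.

```python
import math
import random

random.seed(1)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def rmse(pred, actual):
    """Root mean squared error of predictions against actual values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, actual)) / len(pred))


def make_pairs(n):
    """Simulated (raw model forecast X, observed temperature Y) pairs.

    Assumed for illustration: the raw model runs about a degree warm on
    average and overstates swings.
    """
    xs = [random.gauss(15.0, 5.0) for _ in range(n)]
    ys = [2.0 + 0.8 * x + random.gauss(0.0, 1.0) for x in xs]
    return xs, ys


# Past forecasts matched to the observations that later came in.
train_x, train_y = make_pairs(500)
a, b = fit_line(train_x, train_y)

# New integrations: apply the fitted correction, then check against new observations.
new_x, new_y = make_pairs(500)
mos_forecast = [a + b * x for x in new_x]

raw_rmse = rmse(new_x, new_y)
mos_rmse = rmse(mos_forecast, new_y)

print(f"fitted intercept {a:.2f}, slope {b:.2f}")
print(f"raw model RMSE {raw_rmse:.2f}, model+MOS RMSE {mos_rmse:.2f}")
```

Note that the check at the end is against observations, not an analysis, and it can only be made where observations exist.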

Better, MOS predictions are checked against actual observations, and it is by these checks that we know meteorological models are improving. And those checks are also fed back into the model building process, creating another avenue for model improvement. MOS techniques are common for meteorological models, but not yet for climatological models.

Measurement Error

MOS is a good approach to correct gross model biases and inaccuracies. It is also used to give a better indication of how accurate the model—the model+MOS, actually—really is, because it tells us how the model works at actual observation locations.

But MOS verification will still give an overestimate of the accuracy of the model. This is because of measurement error in the observations.

In many cases, nowadays, measurement error of observations is small and unbiased. By “unbiased” I mean, sometimes the errors are too high, sometimes too low, and the high and low errors balance themselves out given enough time. However, measurement error is still significant enough that an analysis must be used to read data into a model; the raw data measured with error will lead to unphysical model solutions (we don’t have space to discuss why).
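The sense of “unbiased” used here can be shown with a small simulation. The true temperature and the error size below are assumptions picked for illustration only. The average error shrinks towards zero as readings accumulate, even though the error in any single reading stays about the same size.

```python
import random

random.seed(2)

TRUE_TEMP = 17.0  # hypothetical true temperature, in degrees C
ERROR_SD = 0.5    # assumed standard deviation of the measurement error


def mean_error(n):
    """Average signed error over n simulated readings of TRUE_TEMP."""
    errors = [random.gauss(TRUE_TEMP, ERROR_SD) - TRUE_TEMP for _ in range(n)]
    return sum(errors) / n


few = mean_error(10)
many = mean_error(100_000)

# Typical size of the error in one reading (mean absolute error).
single = [abs(random.gauss(0.0, ERROR_SD)) for _ in range(100_000)]
typical_single = sum(single) / len(single)

print(f"mean error over 10 readings:      {few:+.3f}")
print(f"mean error over 100,000 readings: {many:+.3f}")
print(f"typical error in a single reading: {typical_single:.3f}")
```

Unbiased, in other words, is a statement about the long run; it says nothing reassuring about any one measurement.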

Measurement error is not harmless. This is especially true for the historical data that feeds climate models, especially proxy-derived data. Proxy-derived data is itself the result of a model from some proxy (like a tree ring) and a desired observation (like temperature). The modeled—not actual—temperature is fed to an analysis, which in turn models the modeled observations, which in turn is physically modeled. Get it?

Measurement error is a problem in two ways. Historical measurement error can lead to built-in model biases: after all, if you’re using mistaken data to build (or, if you like, “inform”) a model, that model, while there is a chance it will be flawless, is not likely to be.

Plus, even if we use a MOS-type system for climate models, if we check the MOS against observations measured with error, and we do not account for that measurement error in the final statistics (and nobody does), then we will be too certain of the model’s accuracy in the end.

In short, the opportunity for over-certainty is everywhere.


  1. DAV

    I dunno. It sounds like MOS amounts to an extension of the original model. A Stage II after which you are back to square one.

  2. Steve E

    I can see how MOS improves meteorological forecasting. Forecasts and observations can be made in small enough time chunks to keep the error bars small. How can you make it work in climatology when the forecasts and observations need to occur over significantly longer time intervals? By the time I reach a point in time to test a forecast (say 30 years) the potential error bars could be significant. Of course I can retune at that time, but my retuning could dramatically change the much longer term (say 100 year) forecast. Meanwhile I’ve made significant capital investments and perhaps socially reengineered my society and now I may need to move in a dramatically different direction.

    My brain hurts.

  3. Ray

    I always wondered how you could use tree rings as a proxy for temperature. Way back when I was doing work in optimal control using estimation theory, there had to be some functional relationship between what was measured and what was being estimated from the measurement. As an example, in radio location finding you measure the angle of arrival to estimate transmitter location. There is a functional relationship between AOA and location.
    I don’t see any functional relationship between tree ring width and temperature.

  4. Steven Mosher

    Also, it’s instructive to note that some folks focus on the model successes (like predicting the effect of volcanoes) without noting how many predictions a GCM actually makes. I have a model for human growth. My model predicts that every male will be 5 foot 9 inches. I average this output. It’s 5 foot 9! I take some observations. I sample. I analyze that sample and conclude that the average of the sample is 5 foot 9.1 inches +- 2 inches. There, my model of height matched my analysis of the observations. How good is my model? Two other people build models. One predicts that average height will be 5 foot 7. The other predicts the average will be 5 foot 11. We average our models and presto, we have an ensemble average that is 5 foot 9, and we have a confidence interval for our collection of models. They range from 5 foot 7 to 5 foot 11! And our analysis of observations is consistent with these models. Averaging models works! We add more and more models. Some models have one set of input files for the amount of food someone eats. Others use different models of this input parameter (hint: look at TSI inputs). Some model exercise, others don’t (see volcanoes).

    This of course is a silly example but it drives home the point that a model output is compared to the analysis of the observations. Some details, the average, may match, while finer grained details, like the distribution, are wildly wrong. The analog of this of course is the desire to have better regional forecasts for GCMs. Some approaches to mitigation and adaptation could rely on better regional forecasts. In essence the granularity of the forecast drives the policy. If the granularity is global, then it follows “naturally” that global action is “required”.

    One last little tidbit. When the community of modelers is trying to conduct an ATTRIBUTION study, that is, a study that turns CO2 on and off to see if the historic warming can be explained by natural causes, do they?

    1. Use all the models in their arsenal, even those with poor skill?
    2. Use models that Drift after they are spun up?

    or would that procedure lead to wide error bars….

    When they are trying to forecast (where a nice big CI is a benefit to consistency tests) do they

    1. Reduce the number of models to those that have more skill.
    2. Eliminate models that drift after spin up

    Which procedure gets used?

  5. Steven Mosher

    tuning the model to ‘observations’ of course never happens:

    In the second type of calculation, the so-called ‘inverse’
    calculations, the magnitude of uncertain parameters in the
    forward model (including the forcing that is applied) is varied in
    order to provide a best fit to the observational record. In general,
    the greater the degree of a priori uncertainty in the parameters of
    the model, the more the model is allowed to adjust. Probabilistic
    posterior estimates for model parameters and uncertain forcings
    are obtained by comparing the agreement between simulations
    and observations, and taking into account prior uncertainties
    (including those in observations; see Sections, 9.6 and
    Supplementary Material, Appendix 9.B).

  6. John Bowman

    It seems to me “modelling” would be more accurately termed “muddling”.

    Climate modellers would then be the muddle-men between scientists’ conceit and hubris and politicians’ tractability and cupidity.

  7. dearieme

    It really is all rather primitive stuff, isn’t it? When I wrote physico-chemical models (starting more than 40 years ago) the idea was to get all parameter values by separate experiments in the lab, and then compare model output to experiments on the whole system, or on a prototype of the system. I didn’t “tune” models to fit the system, though I did sometimes waggle parameter values to see how much they’d have to be changed to yield a better fit to system performance. If the required change was within their individual confidence intervals, my satisfaction with the model increased, and if not, not. Even then, the point of the models was to be used in (1) interpolation within experimental results, and (2) in designing new experiments. The idea that you would gaily extrapolate, as the “climate modellers” do, would have seemed far too cocky. For systems of the size of climate, it is plain hubristic to suppose that you know all the effects to include, and all the associated parameter values well enough, to be prepared to stake trillions of dollars on the predictions. Hubristic, dim, mad or criminal, I suppose.

  8. Yes, dearieme, and in my business we call them separate-effects experiments. And these are supplemented with what we call integral-effects experiments for which additional phenomena and processes are added so as to capture coupled effects. And finally there are full-scale tests. These latter are generally not available unless there are sufficient requirements to justify the enormous costs.

    As the spatial and temporal scales increase there are generally decreasing amounts of very high quality data to be obtained.

    We also test only realistic parameter space and basic numerical values for engineering correlations are based only on highly-focused, separate-effects tests having high-quality data. We generally don’t fiddle with parameters at the integral-effects and full-scale levels. These are typically confirmation type tests.

    Interestingly, there is usually a one-to-one mapping between the coding routines for the models and the separate-effects level. After Verification that the continuous equations and the coding were correctly implemented, Validation of the models could be based on separate-effects data. And, if the instrumentation and data were sufficient at the larger scales, Validation can also be based on these data.

    I really don’t completely follow the practice of injection of measured data into an ongoing simulation for several reasons. The system being tested exhibits all aspects of the fundamental equations. The model equations almost never include accurate descriptions of all the possible physical phenomena and processes, not even at the continuous equation level. Then there’s the matter of spatial resolution in the simulation compared to the real-world data. These usually do not coincide, with the resolution of the data being much coarser than that in a simulation. The spatial, and sometimes temporal, locations at which the data are measured must be processed in some way to get it mapped onto the simulation resolution. And, typically, everything being calculated in the simulation is not measured in the data. The data not injected might satisfy the simulation equations under the conditions of the simulation, but they do not satisfy the continuous equations that the system satisfies, and are surely not consistent with the measured data being injected.

    So, even given that there are no data-processing and no missing data, and complete agreement between what is being calculated and what is being measured, the measured data do not satisfy even the continuous equations. At the discrete-approximation level plus the all-important run-time decisions regarding temporal and spatial resolutions, there are many more important details. Injection of measured data into a simulation is sure to set the calculations off on a perturbation, even if there are no errors in the measurements.

    How are the effects of these purely numerical perturbations due to data injection, the errors ( due to incomplete continuous equations ) in the model equations, the errors in the numerical values of the parameterizations, the errors due to lack of sufficient resolution in the simulations, plus errors in the measurements all sorted out?

    It seems to me that estimating parameters under these conditions has a potential to be kind of fuzzy. The parameters are very likely sucking up into the numerical values lots o’ stuff not associated with physical phenomena and processes.

  9. Bernie

    We seem to be assuming a lot about how models are actually refined or modified. Is there a written procedure for evaluating the accuracy of GCMs?

  10. j ferguson

    Dan Hughes,
    Do you think the process you describe so well is what Gavin and company are doing? If so, it would support his statement that they are not incorporating measured data in the models.

    But if they compare the models’ output to what had happened (measured data) and what is happening (measured data again) and make adjustments in their equations to better agree with what they’ve seen outdoors, doesn’t this mean then that the models come to reflect the measurements?

    If the models produce results to a fine edge and the measurement data to which they have been tuned do not contain these fine edges, then the models can be corrupted just as are the temperature time series.

    So if what I’m saying above accurately describes what these guys are doing, they are not injecting historical data into the model, but they are referencing it. Making any adjustments in its direction must produce contamination of the model if the data are contaminated.

    So how are they able to discount this effect?

    Steve Mosher,

    Over at Lucia’s, you objected to my use of the term “validation” in connection with model making. My take of the basis of your objection was that “validation” had hard edges and the models and data sets might be a little softer. This would mean that “good enough” temperature time series could still be “useful” and usefulness was a good test of “something” if not “validity.”

    I don’t know enough to discuss this at the level of model-making, or even temperature data analysis, and I had hoped to avoid getting into a philosophical discussion – thinking that these don’t belong here, but don’t you agree that if the model makers touch the equations where they diverge from the history, they become inextricably linked to the history? And if it’s a not so good history?


    maybe this is a rehash, but I thought Dearieme was suggesting two model runs starting at different points in the past, not running a model backwards? Wasn’t that both possible and reasonable? If you’ve covered this, don’t waste time rehashing, I’ll read it all again.

    Matt, these threads are very useful to me and your writing is clear and comprehensible. Thank you.

  11. Ken

    Briggs, Gavin,

    I’m reading your discourse with rapt amusement…and still cannot get the “GIGO” (to way oversimplify Briggs’ simplification) out of mind as a fundamental theme here.

    SO HERE’s A BASIC QUESTION to a basic observation: How do the climate models predict, or explain, the recent approximately decade-long period of minimal, or no, “global warming”????

    If the models don’t explain it, please explain why.

    While at it, could you explain how they completely failed to predict this — even after the stagnation in warming TREND was readily apparent??

    Sure, they may explain things like the cooling effect of a major volcanic eruption shading the planet & so forth…but…are such things really that impressive relative to what these climate models are supposed to be, are asserted to be, doing??? Seems to me that the effects of shade on any scale really aren’t that hard to model….

    ANOTHER QUESTION, with some background first: I’ve noticed Gavin & his cronies have been pre-emptively attacking solar effects on climate, with particular interest in lambasting cursory (and I use that term intentionally) studies involving cosmic ray & cloud interactions. Many of these positions are freely available at RealClimate online. It is clear that, whatever the reported observation, the response (or “retaliation”?) fits a neat template ending with something to the effect of ‘we’ve already addressed that & the matter is settled.’ HOWEVER, those deadbeat CERN scientists–a whole gaggle of them–are spending $Ms on the CLOUD experiment to objectively study & identify what, and to what extent (we know that there is “some” extent under certain circumstances from the SKY experiment) solar outputs, cosmic rays, and cloud formation may be intertwined. Two facts are: 1) the available data on clouds is poor due to a lack of measurements, and 2) CLOUD with objective data results is still pending.

    SSSsssssooooooooo, How can we simply dismiss–as so many posts at RealClimate do–solar/cosmic-ray/cloud interactions in the absence of data that is in the process of being collected??? Given the stature of those CERN researchers, a whole international team of them, this seems, to me, to be a pretty insulting slight to them if our “climate researchers” can dismiss the focus of the need for their study in advance.

    It seems to me that if we don’t understand what some of the physics might be we cannot simply dismiss such unknowns ‘out of hand’ before some data is collected. But that’s what the undeniable message from RealClimate is. This is one area in the climate models that cannot possibly be modeled correctly–unless they’ve got it right by pure luck.

    Either way, that’s not science by any objective standard.

    And one is judged by the company they keep; see: which notes that Kevin Trenberth has [intentionally or not] acknowledged & published words in direct contradiction to his testimony to Congress.

  12. Steven Mosher

    j fergy.

    “Over at Lucia’s, you objected to my use of the term “validation” in connection with model making. My take of the basis of your objection was that “validation” had hard edges and the models and data sets might be a little softer. This would mean that “good enough” temperature time series could still be “useful” and usefulness was a good test of “something” if not “validity.””

    In industry we talk about models being “validated and verified”.

    Let me see if I can explain.

    I ask you for a model of how fast a car will go.
    I ask you to predict this to within 1MPH top speed in the quarter mile.

    You respond with a model specification and design.
    You describe all the things that you will calculate.
    horsepower, traction, drag, engine dynamics, tires heating.

    I then demand that your model undergo independent verification and validation.

    verification: Did you DO what you said you would do? You said you would model traction.
    did you? Check. verify that your model meets its specification.

    is it valid? Wikipedia is your friend:

    Validation is the process of determining the degree to which a model, simulation, or federation of models and simulations, and their associated data are accurate representations of the real world from the perspective of the intended use(s).

    The problem with GCMs is that there is no design spec. So they can’t be validated.
    make sense? That’s why guys like dan hughes and I tear our hair out when we read about what they do. In fairness, it’s research code. So questions of validation are always squishy when no criteria are set out.

    You continue:

    “I don’t know enough to discuss this at the level of model-making, or even temperature data analysis, and I had hoped to avoid getting into a philosophical discussion – thinking that these don’t belong here, but don’t you agree that if the model makers touch the equations where they diverge from the history, they become inextricably linked to the history? And if it’s a not so good history?”

    ya that’s a danger. One of the things Judith Curry and I called for was a FORMAL separation of the people who work on models from the people who produce observation data MODELS

  13. j ferguson

    Thanks Steve, I can work with that explanation.

    I think my suspicions that what you say is the case were deflected by all the assurances that the models were unaffected by the quality of the histories, or current measurements to which they are compared. Obviously they would be in any case, but to the greater detriment to their usefulness if the histories or current measurements are messy.

    the histories and current measurements do look messy.

  14. Ken says: “minimal, or no, “global warming”????”

    Ken, Ken, Ken… Don’t you read your national newspaper? Just last Thursday, USA Today quoted Mann saying “there’s a better than 50-50 chance” that 2010 will be the hottest year ever. He also says “the core argument – that the Earth [sic] is warming, humans are at least partly responsible and disaster may await unless action is taken – remains intact.”

    Must be true if USA Today said so. 🙂

  15. Dan Hughes and all,

    As I said in a comment to part I, from reading the description of model E I find that there are stability problems in the dynamics solver and that these are kept in check by using diffusion, filters and also parameterizations. While stability problems could be caused by their choice of spherical coordinates, grid size, step size or even the numerical method of solution, I also wonder if perhaps the physical equations of dynamics are chaotic. If they are chaotic, this would seem to limit their use in any connection to extrapolation of future climate. I admit I only know the very basics of how these solutions work but I also have some intuition of how they can react.

    Could stability issues with the solution reduce the usefulness of these simulations?

  16. VS

    Do come over, though… inviting all statisticians.. 😉


    PS. I did not claim temp’s are a random walk… I claim that the series contains a unit root, making regular OLS based trend inference invalid.

  17. Briggs


    Thanks for the invite!
