Read Part I
What we’re after is a score that calculates how close a prediction is to its eventual observation when the prediction is a probability. Now there are no such things as unconditional probabilities (see the Classic Posts page at the upper right for dozens of articles on this subject). Every probability is thus conditional on some evidence, i.e. premises. This evidence is called the model. Assuming no mistakes in computation or measurement (which, unfortunately, are not as rare as you’d like), the probability put out by the model is correct.
The probability is correct. But what about the model? If the model itself were correct, then every probability it announced would be 0 or 1, depending on whether the propositions put to it were deduced as true or false. Which is to say, the model would be causal and always accurate. We thus assume the model will not be correct, but that its probabilities are.
Enter scoring rules. These are some function of the prediction and the outcome, usually a single number where lower is better (even when the forecast is a vector or matrix). Now there is a bit of complexity involving proper scoring rules (a lot of fun math), but it doesn't make much real-life difference. A proper score is defined as one which (for our purposes) is "honest". Suppose our scoring rule awarded all forecasts with probabilities greater than 0.9 a 0 regardless of the outcome, but used (probability – I(event occurred))^2 for probabilities 0.9 or less (the I() is an indicator function). Obviously, any forecaster seeking a reward would forecast probabilities greater than 0.9 regardless of what his model said. (The analogy to politics is obvious here.) What we're after are scores which are "fair" and which accurately rate forecast performance. These are called proper.
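To see why such a cutoff rule is improper, here is a small sketch (numbers invented for illustration) comparing an honest forecast with a gamed one under the rule just described:

```python
# Sketch of the improper rule from the text: forecasts above 0.9 score 0
# no matter what happens; otherwise the squared error applies. Lower is
# better. All numbers here are invented for illustration.

def gamed_score(p, outcome):
    """The improper rule: a free pass for any probability above 0.9."""
    return 0.0 if p > 0.9 else (p - outcome) ** 2

def expected_score(p_forecast, p_true):
    """Expected penalty when the event truly occurs with probability p_true."""
    return (p_true * gamed_score(p_forecast, 1)
            + (1 - p_true) * gamed_score(p_forecast, 0))

p_true = 0.3                            # what the evidence actually entails
honest = expected_score(0.3, p_true)    # announce the true probability
gamed = expected_score(0.95, p_true)    # always claim > 0.9
print(honest, gamed)                    # honesty pays a penalty; gaming pays none
```

The honest forecaster expects a positive penalty while the gamer expects zero, so the rule rewards dishonesty: exactly what "proper" rules out.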
A common score for dichotomous outcomes is as above: (probability – I(event occurs))^2, or its mean over a collection of forecast-observation pairs. This is the Brier score, which is proper. The score is symmetric, meaning the penalty paid for giving large probabilities to events that don't happen equals the penalty paid for giving small probabilities to events that do. Yet for many decisions there is an asymmetry. You'd most likely feel less bad about a false positive on a medical test for pancreatic cancer than about a false negative, though it must be remembered that false positives are not free. Even small costs for false positives add up when screening millions. But that is a subject for another time.
The Brier score is thus a kind of compromise when the actual decisions made based on forecasts for dichotomous events aren’t known. If the costs are known, and they are not symmetric, the Brier score holds no meaning. Neither does any score which doesn’t match the decisions made with the forecast! (I keep harping on this hoping it won’t be forgotten.)
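A hedged sketch of this asymmetry point: with invented forecasts and outcomes, the symmetric Brier score and a cost-weighted score (false negatives assumed, for illustration, to be ten times as costly as false positives) can rank two forecasters in opposite orders:

```python
# Sketch with invented data: symmetric vs asymmetric scoring can disagree.

def mean_brier(probs, outcomes):
    # symmetric squared-error score; lower is better
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def mean_cost(probs, outcomes, c_fn=10.0, c_fp=1.0):
    # asymmetric variant: misses (events given low probability) weighted
    # c_fn times as heavily as false alarms; the 10x weight is assumed
    return sum((c_fn if y == 1 else c_fp) * (p - y) ** 2
               for p, y in zip(probs, outcomes)) / len(probs)

outcomes = [1, 0, 0, 0, 0]
probs_a = [0.7, 0.3, 0.3, 0.3, 0.3]       # hedges upward, fearing a miss
probs_b = [0.5, 0.05, 0.05, 0.05, 0.05]   # confident the event won't happen

# B wins under the symmetric Brier score...
print(mean_brier(probs_a, outcomes), mean_brier(probs_b, outcomes))
# ...but A wins once false negatives are costed asymmetrically
print(mean_cost(probs_a, outcomes), mean_cost(probs_b, outcomes))
```

Which forecaster is "better" depends on the decision the forecast feeds, which is the point being harped on.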
There are only three main reasons to score a model. (1) To reward a winner: more than one model is in competition and something has to judge between them. (2) As a way to understand particularities of model performance: suppose our model did well—it had high probabilities—when events happened, but it did poorly—it still had high-ish probabilities—when events didn’t happen; this is good to know. (3) As a way to understand future performance.
Senate hearing on EPA CO2 rules: EPA air chief says climate change science is 'clear.' Yes, — climate models failed. pic.twitter.com/815hke0lKF
— Steve Milloy (@JunkScience) February 11, 2015
The first reason brings in the idea of skill scores. You have a collection of temperature model forecasts and its proper score; say, it’s a 7. What does that 7 mean? What indeed? If your score was relevant to the decisions you made with the forecast, then you wouldn’t ask. It’s only because we’re releasing a general-purpose model into the wild which may be used for any number of different reasons that we must settle on a compromise score. That score might be the continuous ranked probability score (CRPS; see Gneiting and Raftery, Strictly Proper Scoring Rules, Prediction, and Estimation, 2007, JASA, 102, pp 359–378), which has many nice properties. This takes a probability forecast for a thing (which itself gives the probability of every possible thing that can happen) and the outcome and measures the “distance” between them (not all scores are distances in the mathematical sense). It gives a number. But how to rate it?
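For an empirical ensemble forecast, the CRPS can be computed from the standard identity CRPS = E|X − y| − ½ E|X − X′|, where X, X′ are independent draws from the forecast distribution and y is the observation. A minimal sketch, with invented ensemble values:

```python
# Sketch: CRPS of an empirical ensemble via the kernel identity
# CRPS = mean|x_i - y| - (1/2) * mean|x_i - x_j|. Values are invented.

def crps_ensemble(ensemble, obs):
    n = len(ensemble)
    term1 = sum(abs(x - obs) for x in ensemble) / n
    term2 = sum(abs(xi - xj) for xi in ensemble for xj in ensemble) / (2 * n * n)
    return term1 - term2

forecast = [14.2, 15.0, 15.5, 16.1, 14.8]   # e.g. a temperature ensemble, deg C
print(crps_ensemble(forecast, 15.3))        # observation near the ensemble
print(crps_ensemble(forecast, 20.0))        # a bad miss scores higher (worse)
```

Lower is better: the same ensemble is penalized more the further the eventual observation falls from it.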
There has to be a comparator, which is usually a model which is known to be inferior in some aspect. Take regression (see the Classic Posts for what regression really is; don’t assume you know, because you probably don’t). A complex model might have half a dozen or more “x” variables, the things we’re using to predict the “y” variable or outcome with. A naive model is one with no x variables. This is the model which always says the “y” outcome will be this-and-such fixed probability distribution. Proper scores are calculated for the complex and naive model and their difference (or some function of their difference) is taken.
This is the skill score. The complex model has skill with respect to the simpler model if it has a better proper score. It’s usually worked so that skill scores greater than 0 indicate superiority. If a complex model does not have skill with respect to the simpler model, the complex model should not be used. (Unless, of course, your real-life decision score is completely different than the proper score underlying skill.)
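A minimal sketch of a skill score, using the mean Brier score as the proper score and a naive model that always issues the observed base rate (all numbers invented for illustration):

```python
# Sketch: skill of a "complex" model relative to a naive base-rate model,
# with the mean Brier score as the proper score. Data are invented.

def mean_brier(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

outcomes = [1, 0, 1, 1, 0, 0, 1, 0]
complex_probs = [0.8, 0.2, 0.7, 0.9, 0.3, 0.1, 0.6, 0.2]

base_rate = sum(outcomes) / len(outcomes)      # the naive model's one number
naive_probs = [base_rate] * len(outcomes)

s_complex = mean_brier(complex_probs, outcomes)
s_naive = mean_brier(naive_probs, outcomes)
skill = 1 - s_complex / s_naive    # > 0: the complex model has skill
print(skill)
```

If `skill` came out at or below 0, the moral of the text applies: drop the complex model and keep the naive one.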
We’ll wrap it up in Part III.
Update: Here is an example of an improper score in real life: tanking in sports.
There are only main reasons to score a model.
There are only [three?] main reasons to score a model.
Briggs, I’m sorry, this bear of very little brain can’t get his head around what you’ve said. Could you please give (or give a link to) an example with numbers?
Or maybe it’s just that the synapses are getting much further apart with the cold weather.
Another question: are the dots in your figure the troposphere measurements? (I would guess so from the differences between those and reported weather station temperatures.)
A proper score is defined as one whose expected value is maximized at the true probabilistic forecast. If one sees the score as a reward system, it’s reasonable that a correct forecast should be rewarded the most.
In statistical modeling, one relies on the evidence or data available to formulate a probability forecast. A forecaster would want to stay as true (honest) as possible to what the evidence entails, because doing so maximizes one’s credibility in terms of the proper score.
Hence, “the role of scoring rules is to encourage the assessor to make careful assessments and to be honest,” as stated in the paper. To say the score is honest sounds a bit strange to me.
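This propriety ("honesty") property can be checked numerically; a sketch, assuming the Brier score (penalty form, lower is better) and a true probability of 0.3:

```python
# Numerical check (sketch) of propriety for the Brier score: the expected
# penalty is smallest when the announced probability equals the true one.

def expected_brier(q, p_true):
    # expectation of (q - outcome)^2 when the event has probability p_true
    return p_true * (q - 1) ** 2 + (1 - p_true) * q ** 2

p_true = 0.3
grid = [i / 100 for i in range(101)]
best = min(grid, key=lambda q: expected_brier(q, p_true))
print(best)  # 0.3: shading the forecast in either direction only costs you
```

The minimizing announcement over the grid is the true probability itself, which is what "encouraging the assessor to be honest" means in practice.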
What??? So, if “this evidence is called the model,” wouldn’t replacing the word “model” with “evidence” throughout this post make sense also? But no.
I believe, but I am not sure, that the image Briggs posted belongs to Roy Spencer. The plot by him I’m more familiar with is here:
Which more clearly identifies which observational series were used, though not which model runs were used to compute the ensemble mean. It doesn’t matter much, though, because both plots suffer a far more serious issue: Spencer zeroed the 5-year running means at a single year, which is a no-no. The IPCC uses a 20-year baseline from 1986–2005: 20 years to get a reasonable climatology reference, and 2005 because that’s the final year that historical forcings were used in the model runs. 2006 is the year the scenario projections begin, based on assumed future emissions.
It makes a difference:
The series with markers are from Spencer’s original plot with the spaghetti removed. Series without markers are 5-year running means (trailing) with the reference period set to 1986-2005 prior to doing the running mean calculation.
All data are through 2014 except Spencer’s plot which only included data through 2013 because his image is almost exactly a year old at this point.
And yes, one of the observational series is lower troposphere from Spencer’s very own UAH product. HADCRUT4 is surface based observation, of course.
If I correctly understand part of the issue you raise here, the paleo modelers are apparently acutely aware of uncertainty in observations as well: http://www.clim-past.net/9/811/2013/cp-9-811-2013.pdf
Especially in the case of paleoclimates, the uncertainty on the data is often substantial (Hargreaves et al., 2011), and must be taken into account for a fair evaluation.
So they go on to discuss including the observational uncertainties (1-sigma std. deviation) in the numerator and denominator of the classic formula for root mean square difference skill scoring, and further comment:
Note that the skill score here becomes undefined if either the model or the reference agrees more closely with the data than the data errors indicate should be possible. Such an event would be evidence either that the data errors are overestimated, or else that the model had already been over-tuned to the observations. In principle, no model should agree with the data with smaller residuals than the observational errors, since even reality only just achieves this close a match, and only then if the observational errors have not been underestimated.
So many interesting wrinkles in this topic.
I revisited my friendly Psychrometric Chart yesterday and peeked at the enthalpy ranges associated with different temperatures.
If you do not include % humidity in the equation for “averaging”, you ignore more than 50% of the information. Folks can wave their hands about this. Those folks who ignore this aspect of the weather equation need to be sent back to their thermo class. Water is a miracle substance. It exists in all three states on this planet. It transitions between those phases with huge energy transformations.
The transformations are well understood within closed loops. Even inside those closed loops the models trip up. Open loops make the equations so much worse. If your model outputs an average temperature, there is a disconnect in your model. Temperatures are nice approximations for what we evaluate day to day. Temperatures are crap for what they actually represent. Energy is what you are interested in.
Temperatures are a marketing ploy. They are a marketing ploy that works, though. We are here talking about matching up the temperatures of the models’ output to the actual data. The error bars on both items are bigger than the chart they are displayed on. Not by a little. The error bars are likely 10x bigger than the axis of the chart.
The irrational part of this is that the
“However, I think that it is important to be very clear what we mean by “statistically valid”, before we decide whether the evidence that we have conforms to that or not. What would statistically valid evidence for global warming look like? Perhaps more usefully, what would statistically valid evidence against the theory of anthropogenic induced warming look like?
To say that observational evidence is not “statistically valid” is probably more a comment on our statistical framework, than our knowledge of the climate. I think it is unfair to make the argument that there is no “statistically valid” evidence, without stating what statistical framework we are working under, and then going on and showing that the evidence for warming is not statistically valid. To further extend this and say that there is “no observational evidence” for global warming is stretching the point even further.”
(Doug McNeall, Met Office statistician)
For example, if W ~ Normal with a mean of 0 (MODEL/premise), then Prob(W > 0) = 0.5 (CORRECT PROBABILITY).
If this is what you meant to convey, then isn’t this tautological? I am trying to figure what the point of the above quoted paragraphs is!
In practice, a probability forecast is estimated under the assumption of a model. One shall not claim the probability forecast has no uncertainty and is truly correct. Again, a model is postulated based on the observations and other relevant information, and is never assumed to be true or correct, for obvious reasons; hence there is a vast literature on the performance of models, not on whether a model is true.
Thanks for the paper.
Let a be the sum of the squared e_i’s in the modified skill score (3). From the notation, it appears that the same quantity a is subtracted in both the numerator and denominator of the quantity inside the square root.
If this is the case, for comparative purpose, I don’t see the advantage of the modified skill formula (3) over formula (2). Instead, (3) might be undefined since the square root of a negative number is undefined. When (3) is well defined, there is a monotone relation between (3) and (2), and the values of (3) and (2) have the same sign. So, subtracting a in (3) would not change the comparison conclusion.
Basically, for both formulas, one only needs to compare the goodness-of-fit measures (sum of squared residuals/prediction errors, GOF) for the two models under evaluation. The one with a better GOF wins.
If a depends on the forecasting model, then it’d be a different story. (I only read Section 3.2 this morning, and didn’t look into the paper to see how those observational errors (e_i) are estimated.)
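A sketch of this monotonicity point, assuming (as one plausible reading of the comment, not the paper's exact notation) that the quantity inside the square root is M/R for formula (2) and (M − a)/(R − a) for formula (3), with M and R the sums of squared errors for model and reference, and a the same data-error term in both:

```python
# Sketch under assumed forms: S2 = sqrt(M/R), S3 = sqrt((M-a)/(R-a)).
# For fixed R and a, S3 is monotone in M wherever it is defined, so the
# model comparison is unchanged; but S3 is undefined when M < a, i.e. when
# the model fits the data better than the data errors should allow.
import math

def s2(M, R):
    return math.sqrt(M / R)

def s3(M, R, a):
    return math.sqrt((M - a) / (R - a))  # ValueError if the radicand < 0

R, a = 10.0, 2.0
models = [3.0, 5.0, 8.0]                 # three models' sums of squared errors
print([s2(M, R) for M in models])        # ordering under (2) ...
print([s3(M, R, a) for M in models])     # ... matches the ordering under (3)
# s3(1.0, R, a) would raise: the "undefined" case the paper warns about
```

So with a common a, (3) reshuffles the numbers but not the verdict, exactly as the comment argues.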
I do think it’s misleading to call synthesis results from multi-model ensemble “data” though.
Here is a better graph tracking models from 1950:
Mysteriously, models hindcast with remarkable precision for 50 years but diverge the moment they are expected to forecast. An important reason why one should not take these researchers seriously.