We must resist extensive quotation, except to note that Keenan caught the ear of the man himself, Richard Muller, particularly over a dispute of what smoothing time series prior to analysis does to certainty (it increases it unduly). Muller says that there “are certainly statisticians who vigorously oppose this approach, but there have been top statisticians who support it.”
Vigorously oppose it I do. The reasons are laid out by Keenan and by me in the links Keenan provides. Except to agree with Keenan’s critiques—and they are many and fundamental—that’s all I’ll say about the matter here. Instead, I’ll provide my own commentary on the BEST paper “Berkeley Earth temperature averaging process.” These do not duplicate Keenan’s criticisms. I do not attempt simplicity or exhaustiveness.
I agree with Keenan: the general point estimate is probably roughly in the ballpark, more or less, plus or minus, but the uncertainty bounds are far too narrow.
The authors use the model:
T(x,t) = θ(t) + C(x) + W(x,t)
where T is the temperature at spatial location x and time t, θ() is a trend function, C() is spatial climate (integrated to 0 over the Earth’s surface), and W() is the departure from C() (integrated to 0 over the surface or time). It is applied only over land. Not all land, but most of it. The model becomes probabilistic by considering how an individual measurement di(tj) relies on trend θ(), W(), a “baseline” measure, plus “error.” By this, they do not mean measurement error, but the standard model residual. Actual measurement error is not accounted for in this part of the model.
The model takes into account spatial correlation (described next) but ignores correlation in time. The model accounts for height above sea level but no other geographic features. In its time aspect, it is a singly naive model. Correlation in time in real life is important and non-ignorable, so we already know that the results from the BEST model will be too sure. That is, the effect of ignoring correlation in time is to produce uncertainty bounds which are too narrow. How much too narrow depends on the (spatial) nature of the time correlation. The stronger it is in reality, the more the model errs.
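To put a rough number on this, here is a minimal sketch. It assumes, purely for illustration, that the values being averaged follow a stationary AR(1) process with lag-1 correlation rho. That is my assumption, not anything taken from the BEST paper, and the function name is made up. It computes how much wider the standard error of a simple average ought to be than the one you get by pretending the values are independent in time.

```python
import math


def se_inflation(n: int, rho: float) -> float:
    """Ratio of the true standard error of a sample mean to the naive
    (independence-assuming) one, for n equally spaced values from a
    stationary AR(1) process with lag-1 correlation rho (an assumption)."""
    # Var(mean) = (sigma^2 / n) * [1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * rho^k]
    bracket = 1.0 + 2.0 * sum((1.0 - k / n) * rho**k for k in range(1, n))
    return math.sqrt(bracket)


if __name__ == "__main__":
    n = 120  # e.g. ten years of monthly values; an illustrative choice
    for rho in (0.0, 0.3, 0.5, 0.8):
        print(f"rho={rho:.1f}: true SE is {se_inflation(n, rho):.2f}x the naive SE")
```

With no correlation the ratio is exactly 1. With 120 values and rho = 0.5 the honest standard error is roughly 1.7 times the naive one, and it keeps growing as rho does.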
Kriging (a standard practice which comes with its own set of standard critiques with which I assume the reader is familiar) is used to model non-observed spatial locations. Its correlation function is a fourth-order polynomial of distance (eq. 14). A fourth-order, yes. Smells like some significant hunting took place to discover this strange creature. Its average fit to a mass of blue dots (Fig. 2) appears well enough. But stop to consider those dots. The uncertainty around the fit is huge. Update: correction: I meant to write exponential of a fourth-order polynomial. The rest of the criticism stands.
This is important because the authors used a fixed correlation function with plug-in estimates. They say (p. 12) that “further refinements of the correlation function are likely to be a topic of future research.” The problem is that their over-certain estimate will cause the certainty in the final model results to be overstated. No Bayesian techniques were harmed during the creation of this model, but it would have been better if they had been (see Note 2). The uncertainty in this correlation absolutely needs to be accounted for. Since the mass of blue dots (Fig. 2) has such an enormous spread, this uncertainty is surely not insignificant. Stop and understand: this correlation function was assumed the same everywhere, at every point on the Earth’s surface, an assumption which is surely false.
Update: correction: I mean this last statement to be a major criticism. If you have a mine in which at various spots some mineral is found and you want to estimate the concentration of it in places where you have not yet searched, but which are places inside the boundaries of the places you have searched, kriging is just the thing. Your mine is likely to be somewhat homogeneous. The Earth’s land surface is not homogeneous. It is, at the least, broken by large gaps of water and by mountains and deserts and so forth. To apply the same kriging function everywhere is too much of a simplification (leading to over-certainty).
About measurement error (p.15), the authors repeat the common misconception, “The most widely discussed microclimate effect is the potential for ‘urban heat islands’ to cause spuriously large temperature trends at sites in regions that have undergone urban development.” This isn’t poor statistics, but bad physics. Assuming the equipment at the stations is functioning properly, these trends are not “spurious”. They indicate the actual temperature that is experienced. As such, these temperatures should not be “corrected.” See this series for an explanation.
To account for one aspect of estimated measurement error, the authors develop an approach on which they bestow a great name: the “scalpel.”
Our method has two components: 1) Break time series into independent fragments at times when there is evidence of abrupt discontinuities, and 2) Adjust the weights within the fitting equations to account for differences in reliability. The first step, cutting records at times of apparent discontinuities, is a natural extension of our fitting procedure that determines the relative offsets between stations, encapsulated by b̂i, as an intrinsic part of our analysis.
It’s not clear how uncertainties in this process carry through the analysis (they don’t, as near as I can tell). But the breaking-apart step is less controversial than the “outlier” weighting technique. There are no such things as “outliers”: there is only real data and false data. A transposition error, for example, is false data. Inverting the sign for a temperature is false data. Very large or small observations in the data may or may not be false data. There are a huge number of records and all can’t be checked by hand without substantial cost. Some process that estimates the chance that a record is false is desirable: those points with high suspicion can be checked by hand. No process is perfect, of course, especially when that process is for historical temperature measurements.
Update: A change in a station’s siting does not introduce a “bias” in that station’s records. It becomes a new station. See the temperature homogenization series for more about this.
The authors did do some checking; e.g. they remove truly odd values (all zeros, etc.), but this cleaning appears minimal. They instead modeled temperature (as above) and checked the given observation against the model. Those observations that evinced large deviations from the model were then down-weighted and the model re-run. The potential for abuse here is obvious, and is the main reason for suspicion of the term “outlier.” If the data doesn’t fit the model, throw it out! In the end, you are left with only that data that fits, which (need I say it?) does not prove your model’s validity. No matter what, this procedure will narrow the final model’s uncertainty bounds. The authors claim that this down-weighting process was “expected” to affect about 1.25-2.9% of the data.
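The narrowing is easy to see in a toy version. The sketch below is a simplified stand-in, not the BEST procedure: it fits a constant mean, throws out (rather than down-weights) every point more than k standard deviations from the fit, and refits. The function name, the cutoff k = 2, and the simulated data are all my own assumptions. Note that every simulated point is genuine data from the assumed model, yet the spread still shrinks.

```python
import random
import statistics


def trim_and_refit(values, k=2.0):
    """Fit a constant-mean 'model', drop points more than k standard
    deviations from the fit, and refit.
    Returns (sd_before, sd_after, number_kept)."""
    mean = statistics.mean(values)
    sd_before = statistics.stdev(values)
    kept = [v for v in values if abs(v - mean) <= k * sd_before]
    sd_after = statistics.stdev(kept)
    return sd_before, sd_after, len(kept)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Hypothetical data: every point is "real", none is a recording error.
    data = [rng.gauss(10.0, 1.0) for _ in range(1000)]
    before, after, kept = trim_and_refit(data)
    print(f"kept {kept} of {len(data)} points")
    print(f"sd before trimming: {before:.3f}, after: {after:.3f}")
```

Whenever at least one point is dropped by this rule, the refit standard deviation is guaranteed to be smaller than the original. The data did not become more certain; the procedure just removed the evidence of its spread.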
The next step in “correcting” the data is more suspicious. They say, “In this case we assess the overall ‘reliability’ of the record by measuring each record’s average level of agreement with the expected field T̂(x̃, t) at the same location.” At least reliability is used with scare quotes. Once again, this has the direct effect of moving the actual observations towards the direction of the model, making the results too certain.
Do the results on pages 24 and 25 register all the actual changes? It’s not clear. Dear Authors: what percentage of data was affected, taking account of the raw data removal, scalpel, outlier down-weighting, and reliability down-weighting? 5%, 10%, more? And for what time periods was this most prevalent?
In Section 9, Statistical Uncertainty, I am at a loss. They take each station and randomly assign it to one of five groups, n = 1, 2, …, 5, and say “This leads to a set of new temperature time series hat-θn(tj)….As each of these new time series is created from a completely independent station network, we are justified in treating their results as statistically independent.” I have no idea what this means. The five series are certainly not independent in the statistical sense (not in space or time or in sample).
The procedure attempts to estimate the uncertainty of the estimate hat-θn(tj)—i.e. the parameter and not the actual temperature. Treating the samples as independent will cause this uncertainty to be underestimated. But leave all this aside and let’s move to what really counts, the uncertainty in the model’s final results.
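Here is one hedged way to see why disjoint station groups need not give independent results. Suppose, as a toy assumption of mine and not something in the paper, that each station's error is the sum of a component shared by all stations (common weather, common instrument practice, common model misfit) and an independent local component. The function names and variances below are hypothetical.

```python
import math


def group_mean_correlation(shared_var, local_var, stations_per_group):
    """Correlation between the errors of two disjoint group averages when
    each station's error = shared component + independent local component."""
    group_var = shared_var + local_var / stations_per_group
    return shared_var / group_var


def spread_understatement(shared_var, local_var, stations_per_group):
    """Ratio of the true error sd of a group average to the sd implied by
    the spread between groups. The shared component cancels in the
    between-group spread, so that spread only reflects the local part."""
    local_part = local_var / stations_per_group
    return math.sqrt((shared_var + local_part) / local_part)


if __name__ == "__main__":
    # Hypothetical variances: shared = 1, local = 4, four stations per group.
    print(group_mean_correlation(1.0, 4.0, 4))
    print(spread_understatement(1.0, 4.0, 4))
```

In this toy setup the two group averages are correlated at 0.5 even though no station appears in both groups, and the between-group spread understates the real error by a factor of about 1.4. Splitting the stations does nothing to remove what they have in common.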
Fig. 4b is slightly misleading—I’m happy to see Fig. 4a—in that, say, in 1950 85% of the Earth’s surface was not covered by thermometers. This is coverage in terms of model space, not physical space. This is proved by Fig. 4a which shows that physical space coverage has decreased. But let that pass. Fig. 5 is the key.
This is not a plot of the actual temperature and it is not a plot of the uncertainty of the actual temperature. It is instead a plot of the parameter hat-θn(tj) and the uncertainty given by the methods described above. Users of statistics have a bad, not to say notorious, habit of talking of parameters as if they were discussing the actual observables. Certainty of the value of a parameter does not translate into the certainty of the observable. Re-read that last sentence, please.
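A small sketch of the difference, under the simplest assumptions I can think of: a normal model, n observations with standard deviation sd, and the usual 1.96 multiplier in place of a proper t quantile. None of this is the BEST model; the function name is mine. It computes the approximate 95% half-width for the mean parameter and the corresponding half-width for a single new observation.

```python
import math


def half_widths(sd: float, n: int, z: float = 1.96):
    """Approximate 95% half-widths under a simple normal model:
    one for the mean parameter, one for a single new observable."""
    parameter = z * sd / math.sqrt(n)               # shrinks as n grows
    observable = z * sd * math.sqrt(1.0 + 1.0 / n)  # never drops below z * sd
    return parameter, observable


if __name__ == "__main__":
    for n in (24, 100, 1000):
        p, o = half_widths(1.0, n)
        print(f"n={n}: parameter +/- {p:.3f}, observable +/- {o:.3f}, ratio {o / p:.1f}")
```

The ratio of the two is the square root of n + 1, so with n = 24 the bound on the observable is 5 times the bound on the parameter, and the gap only widens with more data. Collecting more observations makes you surer of the parameter, not of the thing itself.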
From p. 26:
Applying the methods described here, we find that the average land temperature from Jan 1950 to Dec 1959 was 8.849 +/- 0.033 C, and temperature average during the most recent decade (Jan 2000 to Dec 2009) was 9.760 +/- 0.041 C, an increase of 0.911 +/- 0.042 C. The trend line for the 20th century is calculated to be 0.733 +/- 0.096 C/century, well below the 2.76 +/- 0.16 C/century rate of global land-surface warming that we observe during the interval Jan 1970 to Aug 2011. (All uncertainties quoted here and below are 95% confidence intervals for the combined statistical and spatial uncertainty). [To avoid HTML discrepancies, I have re-coded the mathematical symbol “+/-” so that it remains readable.]
Note the use of the word “temperature” in “we find that the average land temperature…” etc., where they should have written “model parameter.” From 1950 to 1959 they estimate the parameter “8.849 +/- 0.033 C”. Question to authors: are you sure you didn’t mean 8.848 +/- 0.033 C? What is the point of such silly over precision? Anyway, from 2000 to 2009 they estimate the parameter as 9.760 +/- 0.041 C, “an increase of 0.911 +/- 0.042 C.” Update: this criticism will be unfamiliar to most, even to many statisticians. It is a major source of error (in interpretation); slow down to appreciate this. See this example: the grand finale.
Accept that for the moment. The question is then why choose the 1950s as the comparator and not the 1940s when it was warmer? Possible answer: because using the 1950s emphasizes the change. But let’s not start on the politics, so never mind, and also ignore the hyper precision. Concentrate instead on the “+/- 0.033 C”, which we already know is not the uncertainty in the actual temperature but that of a model parameter.
If all the sources of over-certainty which I (and Keenan) mentioned were taken into account, my guess is that this uncertainty bound would at least double. That would make it at least +/- 0.066 C. OK, so what? It’s still small compared to the 8.849 C (interval 8.783 – 8.915 C; and for 2000-2009 it’s 9.678 – 9.842 C). Still a jump.
But if we added to that the uncertainty in the parameter so that our uncertainty bounds are on the actual temperature, we’d again have to multiply the bounds by 5 to 7 (see Note 3). Taking the lower multiplier of 5, this makes the 1950-1959 bound at least 0.330, and the 2000-2009 at least 0.410. The intervals are then 8.519 – 9.179 C for the ’50s and 9.350 – 10.170 C for the oughts. Still a change, but one which is now far less certain.
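The arithmetic behind these intervals can be checked directly. The sketch below takes the two decade estimates quoted from p. 26, doubles the bounds (my guessed correction for the sources of over-certainty), and then multiplies by 5, the low end of the parameter-to-observable multiplier. The variable and function names are mine; the multipliers are guesses, as said above.

```python
# Published estimates (center, half-width) in degrees C, from p. 26.
DECADES = {"1950-1959": (8.849, 0.033), "2000-2009": (9.760, 0.041)}

DOUBLING = 2       # guessed factor for the ignored sources of over-certainty
PARAM_TO_OBS = 5   # low end of the guessed parameter-to-observable multiplier


def interval(center, half_width):
    """Return (low, high), rounded to three decimals."""
    return round(center - half_width, 3), round(center + half_width, 3)


def widened_intervals(center, half_width):
    """Return the doubled interval and the doubled-then-times-5 interval."""
    doubled = half_width * DOUBLING
    observable = doubled * PARAM_TO_OBS
    return interval(center, doubled), interval(center, observable)


if __name__ == "__main__":
    for decade, (center, hw) in DECADES.items():
        doubled, observable = widened_intervals(center, hw)
        print(decade, "doubled:", doubled, "observable:", observable)
```

The doubled bounds are 0.066 and 0.082; the final bounds are 0.330 and 0.410. The two widened intervals still do not overlap, which is the "still a change" above.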
Since the change is still “significant”, you might say “So what?” Glad you asked: Look at those bounds on the years before 1940, especially those prior to 1900. Applying the above changes pushes those bounds way out, which means we cannot tell with any level of certainty if we are warmer or cooler now than we were before 1940, and especially before 1900. Re-read that sentence, too, please.
And even if you want to be recalcitrant and insist on model perfection and you believe parameters are real, many of the uncertainty bounds before 1880 already cover many modern temperatures. The years around 1830 are already not “statistically different” than, say, 2008.
An easier way to look at this is in Fig. 9, which attempts to show the level of uncertainty through time. All the numbers in this plot should be multiplied by at least 5 to 10. And even after that, we still haven’t accounted for the largest source of uncertainty of all: the model itself.
Statisticians and those who use statistics never or rarely speak of model uncertainty (same with your more vocal sort of climatologist). The reason is simple: there aren’t cookbook recipes that give automatic measures of this uncertainty. There can’t be, either, because the truth of a model can only be ascertained externally.
Yet all statistical results are conditioned on the models’ truth. Experience with statistical models shows that they are often too sure, especially when they are complex, as the BEST model is (and which assumes that temperature varies so smoothly over geography). No, I can’t prove this. But I have given good reason to suspect it is true. You may continue to believe in the certainty of the model, but this would be yet another example of the triumph of hope over experience. What it means is that the uncertainty bounds should be widened further still. By how much, I don’t know.
Update: Neither the BEST paper nor my criticisms say word one about why temperatures have changed. Nobody nowhere disputes that they have changed. See this discussion.
1. “Say, Briggs. You’re always negative. If you’re so smart, why don’t you do your own analysis and reveal it?” Good question. Unlike the BEST folks, and others like my pal Gav, I don’t have contacts with Big Green, nor do I have a secretary, junior colleagues, graduate students, IT people, fancy computer resources, printer, copier, office supplies, access to a library, funds for conference travel, money for page charges, multi-million dollar grants, multi-thousand dollar grants, nor even multi-dollar grants. All the work I’ve ever done in climatology has been pro bono. I just don’t have the time or resources to recreate months’ worth of effort.
2. The authors used only classical techniques, including the jackknife. They could have, following this philosophy, bootstrapped the results by resampling this correlation function.
3. This is what experience shows is the difference for many models. For the actual multiplier, we’d have to re-do the work. As to that, see Note 1.
The statistics on the BEST paper were designed with the assistance of the eminent (no sarcasm) David Brillinger, though he does not appear as a co-author. Charlotte Wickham was the statistician and was a student of Brillinger’s. “Charlotte Wickham is an Assistant Professor in the Department of Statistics at Oregon State University. She graduated with her PhD in Statistics from the University of California, Berkeley, in 2011.”