How not to plot
The following plot was sent to me yesterday for comment. I cannot disclose the sender, nor the nature of the data, but neither of these is in the least essential to our understanding of this picture and what has gone horribly, but typically, wrong.
There is one data point per year, measured unambiguously, with the item taking values in the range from the mid 20s to the high 50s. Let’s suppose, to avoid tortured language, the little round circles represent temperatures at a location, measured, as I say, unambiguously, without error and such that the manner of measurement was identical each year.
What we are about to discuss applies to any—as in any—plot of data which is measured in this fashion. It could be money instead of temperature, or counts of people, or numbers of an item manufactured, etc. Do not fixate on temperature, though it’s handy to use for illustration, the abuses of which we’ll speak being common there.
The little circles, to emphasize, are the data and are accurate. There is nothing wrong with them. As the box to the right tells us, there are 18 values. The green line represents a regression (values as a linear function of year); as the legend notes, the gray area shows the 95% confidence limits. Let’s not argue why frequentism is wrong and that, if anything, we should have produced credible intervals. Just imagine these are credible intervals. The legend also has an entry for “95% Prediction Limits”, but this isn’t plotted. Ignore this for now. The box to the right gives details on the “goodness” of fit of this model: R-Square, MSE, and the like.
Questions of the data
Now let me ask a simple question of this data: did the temperature go down?
Did you say yes? Then you’re right. And wrong. Did you instead say no? Then you too are right. And just as wrong.
The question is not simple and is ill phrased: as it is written, it is ambiguous. Let me ask a better question: did the temperature go down from 1993 to 2010? The only answer is yes. What instead if I asked: what is the probability the temperature went down from 1993 to 2010? The only answer (given this evidence) is 1. It is 100% certain the temperature decreased from 1993 to 2010.
How about this one? Did the temperature go down from 1993 to 2007? The answer is no; it is 100% certain the temperature increased. And so forth for other pairs of dates (or other precise questions). The data is the data.
Did the temperature go down in general? This seems to make sense; the eye goes to the green line, and we’re tempted to say yes. But “in general” is ambiguous. Does it mean that, from year to year, there were more decreases than increases? There were half of each: so, no. Does the question mean that temps in 2001 or 2002 were lower than 1993 but higher than 2010? Then yes, but barely. Does it mean that the mean temp from 1993 to 2001 was higher than the mean from 2002 to 2010? Then maybe (I didn’t do the math).
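To see how each precise question gets its own definite answer, here is a minimal sketch in Python. The 18 values are invented for illustration (I cannot share the real ones); they are assumptions chosen only to resemble the description above, and the names `obs` and `went_down` are likewise made up for this example.

```python
# Hypothetical yearly values, invented for illustration only.
# They are NOT the values from the plot discussed in the text.
years = list(range(1993, 2011))
temps = [45, 52, 40, 57, 48, 36, 50, 42, 25,
         44, 38, 47, 33, 41, 55, 30, 39, 28]
obs = dict(zip(years, temps))


def went_down(start, end):
    """Precise question: was the value in `end` lower than in `start`?"""
    return obs[end] < obs[start]


# Year-to-year changes: count the decreases and the increases.
diffs = [later - earlier for earlier, later in zip(temps, temps[1:])]
decreases = sum(d < 0 for d in diffs)
increases = sum(d > 0 for d in diffs)

# Mean of the first nine years against the mean of the last nine.
mean_first = sum(temps[:9]) / 9
mean_last = sum(temps[9:]) / 9

print("Down from 1993 to 2010?", went_down(1993, 2010))
print("Down from 1993 to 2007?", went_down(1993, 2007))
print("Decreases vs increases:", decreases, "vs", increases)
print("Mean 1993-2001 vs mean 2002-2010:", mean_first, "vs", mean_last)
```

Each line of output answers one unambiguous question with no plus or minus attached. No model appears anywhere in the sketch, because none is needed.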
Asking an ambiguous question lets the user “fill in the blank”; different opinions can be had merely because nobody is being precise. What we should do is just plot the data and leave it at that. Any precise question we can ask of it can be answered with 100% certainty. The data is the data. That green line, which is not the data, and particularly that gray envelope are an enormous distraction. So why plot them?
What is a trend?
It appears as if somebody asked: was there a trend? Again, this is ambiguous. What’s a “trend”? This person thought it meant the straight line one could draw with a regression. That means this person said it was 100% certain that this regression model was true; that no model other than this one could represent the uncertainty in the observed data. But there are many, many, many other meanings of “trend” and other models which are possibilities.
No matter which model is chosen, no matter what, the data trumps the model. The green line is not the data. The data is the data. It makes no sense to abandon the data and speak only of the model (or its parameters). You cannot say: temperatures decreased, for we already have seen this is false or true depending on the years chosen. You can say “there was a negative trend” but only conditional on the model being true. And then a negative trend in the model does not correspond to a negative change in the data, not always.
Assume the regression is the best model of uncertainty. Is the “trend” causal? Does that regression line (or its parameters) cause the temperatures to go down? Surely not. Something physical causes the data to change: the model does not. There are no hidden, underlying forces which the model captures. The model is only of the data’s uncertainty, quantifying the chance the data takes certain values.
But NOT the observed data. Just look at the line: it only goes through one data point. The gray envelope only contains half or fewer of the data points, not 95% of them. In fact, the model is SILENT on the actual, already observed data, which is why it makes no sense to plot a model on the data, when the data does not need this assistance. Since the model quantifies uncertainty, and there is no uncertainty in the observed values, the model is of no use to us. It can even be harmful if we, as many do, substitute the model for the data.
We cannot, for instance, say “The mean temperature in 2001, according to the model, was 38.” This is nonsensical. The actual temperature in 2001 was 25, miles away from 38. What does that 38 mean? Not a thing. It quite literally carries no meaning, unless we consider this another way to say “false.” It was 100% certain the temperature in 2001 was 25, so there is no plus or minus to consider, either.
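The gap between line and data is easy to check for yourself. Below is a small sketch that fits an ordinary least-squares line by hand and sets its value for 2001 beside the observation. The data are again invented for illustration and are an assumption, not the values in the plot, so the fitted number will not be the 38 quoted above; `fitted` is a hypothetical helper name.

```python
# Hypothetical yearly values, invented for illustration only.
# They are NOT the values from the plot discussed in the text.
years = list(range(1993, 2011))
temps = [45, 52, 40, 57, 48, 36, 50, 42, 25,
         44, 38, 47, 33, 41, 55, 30, 39, 28]
obs = dict(zip(years, temps))

# Ordinary least squares for a straight line, computed by hand.
n = len(years)
x_bar = sum(years) / n
y_bar = sum(temps) / n
s_xx = sum((x - x_bar) ** 2 for x in years)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, temps))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def fitted(year):
    """Value of the regression line at `year` (a model output, not data)."""
    return intercept + slope * year


print("Slope of the line:", round(slope, 3))
print("Line vs observation, 2001:", round(fitted(2001), 1), "vs", obs[2001])
print("Observed change, 1993 to 2007:", obs[2007] - obs[1993])
```

With these made-up numbers the slope is negative, the line sits far above the 2001 observation, and the observed change from 1993 to 2007 is still an increase. A negative trend in the model and a positive change in the data coexist without contradiction, because they answer different questions.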
What’s a model for?
Again I say, the data is the data, and the model something else. What, exactly?
Well, since we are supposing this model is the best way to represent our uncertainty in values the data will take, we apply it to new data, yet unseen. We could ask questions like, “Given the data observed and assuming the model true, what is the probability that temperatures in 2011 are greater than 40?” or “Given etc., what is the probability that temps in 1992 were between 10 and 20?” or whatever other years or numbers tickle our fancy. It is senseless, though, to ask questions of the model about data we have already seen. We should just ask the data itself.
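A question of this kind might be answered along the following lines. This is only a rough sketch under stated assumptions: the data are invented, the regression is taken as true, and the predictive distribution is approximated by a normal so the example needs nothing beyond the standard library (a Student t with n - 2 degrees of freedom would be the more usual choice). The function name `prob_greater` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical yearly values, invented for illustration only.
years = list(range(1993, 2011))
temps = [45, 52, 40, 57, 48, 36, 50, 42, 25,
         44, 38, 47, 33, 41, 55, 30, 39, 28]

# Least-squares line and residual standard deviation.
n = len(years)
x_bar = sum(years) / n
y_bar = sum(temps) / n
s_xx = sum((x - x_bar) ** 2 for x in years)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, temps))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
residuals = [y - (intercept + slope * x) for x, y in zip(years, temps)]
s = sqrt(sum(r ** 2 for r in residuals) / (n - 2))


def prob_greater(new_year, threshold):
    """Approximate P(value in new_year > threshold | data, model true).

    Normal approximation to the predictive distribution (an assumption).
    """
    mean = intercept + slope * new_year
    # Spread for a NEW observation: noise plus uncertainty in the line.
    se_pred = s * sqrt(1 + 1 / n + (new_year - x_bar) ** 2 / s_xx)
    return 1 - NormalDist(mean, se_pred).cdf(threshold)


print("P(2011 value > 40):", round(prob_greater(2011, 40), 2))
```

Note what the function is asked about: a year not yet observed. Whatever number comes out is conditional on the model being true, and only the arrival of the 2011 observation can begin to tell us whether that condition was a good one.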
Then we must wait, and this is painful, for waiting takes time. A whole year must pass before we can even begin to see whether our model is any good. Even then, it might be that the model “got lucky” (itself ambiguous), so we’d want to wait several years so we can quantify the uncertainty that our model is good.
This pain is so acute in many that they propose abandoning the wait and substituting for it measures of model fit (the R-Squared, etc.). These being declared satisfactory, the deadly process of reification begins and the green line becomes reality, the circles fade to insignificance (right, Gav?). “My God! The temperatures are decreasing out of control!” Sure enough, by 2030, the world looks doomed—if the model is right.
Measures of model fit are of very little value, though, because we could always find a model which recreates the observed data perfectly (fun fact). That is, we can always find a better fitting model. And then we’d still have to wait for new observations to check it.
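The fun fact is quickly demonstrated. One model that recreates the observed data perfectly is plain connect-the-dots; that choice is mine, for illustration, and many others would do. The sketch below, on invented data, compares its R-squared with that of the straight line. The names `connect_the_dots`, `straight_line`, and `r_squared` are hypothetical.

```python
# Hypothetical yearly values, invented for illustration only.
years = list(range(1993, 2011))
temps = [45, 52, 40, 57, 48, 36, 50, 42, 25,
         44, 38, 47, 33, 41, 55, 30, 39, 28]
pairs = list(zip(years, temps))

# Straight-line model fitted by least squares.
n = len(years)
x_bar = sum(years) / n
y_bar = sum(temps) / n
s_xx = sum((x - x_bar) ** 2 for x in years)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def straight_line(x):
    return intercept + slope * x


def connect_the_dots(x):
    """Piecewise-linear model passing through every observed point."""
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("outside the observed years: this model says nothing")


def r_squared(predict):
    """Share of the variation in the observed values matched by `predict`."""
    ss_tot = sum((y - y_bar) ** 2 for y in temps)
    ss_res = sum((y - predict(x)) ** 2 for x, y in pairs)
    return 1 - ss_res / ss_tot


print("R-squared, straight line:   ", round(r_squared(straight_line), 3))
print("R-squared, connect-the-dots:", r_squared(connect_the_dots))
```

The second model fits perfectly and, as written, refuses to say anything about 2011. A perfect fit to what we have already seen tells us nothing about whether a model is any good for what we have not.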
Lastly, if we were to plot future values, then we’d want to use the (unseen) prediction limits, and not the far-far-far-too-narrow confidence limits. The confidence limits have nothing to say about actual observable data and are of no real use.
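How much narrower the confidence limits are can be sketched as follows, again on invented data and with the textbook formulas for simple linear regression. The multiplier 2.12 is the approximate 97.5% Student t quantile for 16 degrees of freedom; `half_widths` is a hypothetical helper.

```python
from math import sqrt

# Hypothetical yearly values, invented for illustration only.
years = list(range(1993, 2011))
temps = [45, 52, 40, 57, 48, 36, 50, 42, 25,
         44, 38, 47, 33, 41, 55, 30, 39, 28]

# Least-squares line and residual standard deviation.
n = len(years)
x_bar = sum(years) / n
y_bar = sum(temps) / n
s_xx = sum((x - x_bar) ** 2 for x in years)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, temps))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
s = sqrt(sum((y - (intercept + slope * x)) ** 2
             for x, y in zip(years, temps)) / (n - 2))

T_975_16DF = 2.12  # approx. 97.5% Student t quantile, 16 degrees of freedom


def half_widths(year):
    """Return (confidence, prediction) 95% half-widths at `year`."""
    leverage = 1 / n + (year - x_bar) ** 2 / s_xx
    conf = T_975_16DF * s * sqrt(leverage)      # uncertainty in the line
    pred = T_975_16DF * s * sqrt(1 + leverage)  # uncertainty in a new value
    return conf, pred


# How many observed points does the confidence envelope happen to cover?
inside_conf = sum(
    abs(y - (intercept + slope * x)) <= half_widths(x)[0]
    for x, y in zip(years, temps)
)

conf_2011, pred_2011 = half_widths(2011)
print("2011 confidence half-width:", round(conf_2011, 1))
print("2011 prediction half-width:", round(pred_2011, 1))
print("Observed points inside the confidence envelope:", inside_conf, "of", n)
```

The confidence half-width describes uncertainty in the line itself, a thing nobody will ever observe. The prediction half-width adds the scatter of actual values around that line, which is why it is so much wider and why it is the one to use when speaking of temperatures yet to come.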
The data is the data. When desiring to discuss the data, discuss the data; do not talk about the model. The model is always suspect until it can be checked. That always takes more time than people are willing to give.