Update This post is of such importance that it remains on top today. See below for more comments.
Presented for your satisfaction: a way to cheat either yourself or others using time series. The patter below is only a suggestion.
Just look at these anomalies, which are related to rampant, deadly climate change. Higher anomalies are worse for all of mankind in every imaginable way.
The anomalies are presented as monthly measures, over roughly a 10-year period. A regression was fit to them and is plotted. The increase in this not-good anomaly is 0.87 per decade. Why, after 20 years, the anomaly will be almost twice as large as it is now!
The 95% confidence interval for the decadal change is 0.44 to 1.3. That means that the anomalies are surely heading up!
But ignore these sorrowful facts, because there is good news to be had. Here are some more anomalies.
These anomalies are on the way down, thus our spirits should be on the rise. In fact, the anomalies will drop by 0.57 over the next 10 years. And after 20 years, they’ll be down more than one full point!
The 95% confidence interval for the decadal change is -1.1 to 0. That means that the anomalies are surely heading down!
How the trick works
The pictures are the same!
Even if you flash up both pictures, the audience members will never notice that they are seeing the same anomalies. Yes, it’s true. You’ll worry that somebody will catch on, but they won’t! I have seen this done many times and nobody ever notices that the pictures are identical—except, of course, for those colorful straight lines. And the starting date.
Now take a look at these anomalies, which are the same as above, and see if you can spot the difference.
Instead of one regression line, there are 24. The first one is drawn using the entire time series. The second one is drawn using the entire time series except for time point number 1. The third removes time points 1 and 2, and so on. There are 24 lines in total, showing anything from a large increase to a large decrease, and each drawn by choosing a new starting point.
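The shifting-start-point lines are easy to reproduce. Here is a minimal sketch in Python (a stand-in for the post's R; `decadal_slopes` and all its parameters are my own illustrative names): fit an ordinary least-squares line from each candidate starting point and report the trend per decade.

```python
import numpy as np

def decadal_slopes(y, n_starts=24, per_year=12):
    """OLS trend (units per decade) fit from each candidate starting point.

    y is a monthly series; start s drops the first s points, just as each
    of the 24 lines in the figure drops one more early month.
    """
    t = np.arange(len(y)) / per_year          # time in years
    return np.array([10 * np.polyfit(t[s:], y[s:], 1)[0]
                     for s in range(n_starts)])

# even trendless noise yields a whole menu of "trends" to choose from
rng = np.random.default_rng(0)
slopes = decadal_slopes(rng.normal(size=120))
```

Pick whichever slope tells the story you need; that is the entire trick.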
Do you get it? This is the whole trick! Nobody ever asks why you chose a particular starting point. You can tell any story you like and people will never think to ask what would happen if you were to use a slightly different data set.
Of course, very clever magicians will manipulate both starting and end points, but it’s best not to meddle with the end points until you become a master. People will (or should) naturally ask why you haven’t included the most up-to-date data, but they will absolutely never ask why you only used some of the history and not all of it.
The time series above was generated by the R arima.sim() function, using a mean 0, standard deviation 1, AR(0.64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.35) process, which mimics many different real-world monthly time series. But try your own model. It works for models of any kind. And it’s fun!
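If you prefer Python to R, the same process can be simulated by hand; this sketch (function name and seed are my own) uses the AR coefficients above, 0.64 at lag 1 and 0.35 at lag 12, with standard normal innovations:

```python
import numpy as np

def simulate_ar12(n=120, burn=240, seed=42):
    """AR process with coefficients 0.64 (lag 1) and 0.35 (lag 12),
    driven by mean-0, sd-1 innovations, mimicking the arima.sim() call."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(12)
    phi[0], phi[11] = 0.64, 0.35
    x = np.zeros(n + burn)
    eps = rng.normal(0.0, 1.0, n + burn)
    for t in range(12, n + burn):
        x[t] = phi @ x[t - 12:t][::-1] + eps[t]  # phi against x[t-1]..x[t-12]
    return x[burn:]                              # discard the burn-in

series = simulate_ar12()  # 120 "monthly" anomalies with no true trend
```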
The next thing is to show how reliable this trick is. The true answer (given our evidence E that the model is mean 0, etc.) is that the anomalies neither increase nor decrease over a decade. The slope of any regression line, in other words, should be 0. Or the confidence intervals of any line drawn should include 0. Of course the actual results will vary.
It’s your confidence intervals which are the real convincers in the trick. Did you notice that both confidence intervals (for the first two figures) confirm the hypothesis that things are getting better and things are getting worse? Isn’t that great!
To show the reliability of this, suppose your funding depends on things getting worse: you need the anomalies to increase. Therefore, you’ll pick a starting date which gives you the best evidence. Not every time series that is truly unchanging (as our E says it is) will cooperate such that you can definitely show an increase. But you can limit the damage against yourself by showing the smallest possible decrease.
I simulated 1000 different time series, each time picking the best starting point (to show the largest possible increase). Remember: if no cheating occurred, the mean of these samples should be 0. It isn’t. It’s much higher at 0.21—with a 95% confidence interval of -0.88 to 1.31.
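The simulation is easy to repeat. This Python sketch (my own stand-in for the original R run; white noise substitutes for the AR process, and the exact numbers will differ from the post's) shops among 24 starting points for the largest "trend" in each of 1000 trendless series:

```python
import numpy as np

rng = np.random.default_rng(2024)

def best_start_slope(y, n_starts=24, per_year=12):
    """Largest decadal trend obtainable by shopping among starting points."""
    t = np.arange(len(y)) / per_year
    return max(10 * np.polyfit(t[s:], y[s:], 1)[0] for s in range(n_starts))

# white noise stands in for the AR process; either way the true trend is 0
picks = np.array([best_start_slope(rng.normal(size=120)) for _ in range(1000)])
print(picks.mean(), np.percentile(picks, [2.5, 97.5]))
```

The mean of the cherry-picked slopes comes out positive, even though every series was generated with no trend at all.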
Notice how much wider this (better) interval is. It’s better because it takes into account cheating.
What if you don’t want to cheat? Well, your interval will still be wider than if you just ran the regression on the data at hand. Unless the data at hand is all the data that will ever occur (and if it is, there is no reason to run a time series model), the arbitrariness of the starting (and ending) point must be accounted for. If it isn’t, then you will go away too confident of yourself.
The lesson is, of course, that straight lines should not be fit to time series.
Update More comments.
Question: why fit a straight (or any shaped) line to a time series like this? There are three reasons: (1) to discover whether there was a trend, (2) to predict the future, and (3) to use the analysis as part of a larger analysis.
(2) is a respectable goal, and should be encouraged. Most who fit lines to time series have this goal in mind, at least tacitly; that is, they at least imply that the line they have fitted will “continue” into the future. Therein lies a problem. For that line is an all-too-sure guess of what the future will be.
Notice that we stated specifics of the line in terms of the “trend”, i.e. the unobservable parameter of the model. The confidence interval was also for this parameter. It most certainly was not a confidence interval on the actual anomalies we expect to see.
If we use the confidence interval to supply a guess of the certainty in future values, we will be about 5 to 10 times too sure of ourselves. That is, the actual, real, should-be-used confidence interval should be the interval on the anomalies themselves, not the parameter.
In statistical parlance, we say that the parameter(s) should be “integrated out.” So when you see a line fit to time series, and words about the confidence interval, the results will be too certain. This is an inescapable fact.
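The gap between the two intervals can be seen with the textbook formulas for a straight-line fit. In this sketch (variable names are my own; a trendless series stands in for the anomalies), `se_line` is the standard error for the fitted line, i.e. the parameter story, while `se_obs` is the standard error for a new observation:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 120
t = np.arange(n) / 12                       # time in years
y = rng.normal(size=n)                      # the true trend is zero

slope, intercept = np.polyfit(t, y, 1)
resid = y - (intercept + slope * t)
sigma = np.sqrt(resid @ resid / (n - 2))    # residual standard error
Sxx = ((t - t.mean()) ** 2).sum()
h = 1 / n + (t - t.mean()) ** 2 / Sxx       # leverage term

se_line = sigma * np.sqrt(h)      # CI on the fitted line (the parameter)
se_obs = sigma * np.sqrt(1 + h)   # interval on the anomalies themselves
ratio = (se_obs / se_line).mean() # routinely several-fold wider
```

The observation interval is several times wider than the parameter interval, which is exactly the overconfidence described above.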
(1) is also a goal, but a shady one. If we want to know if there has been a change from the start to the end dates, all we have to do is look! I’m tempted to add a dozen more exclamation points to that sentence, it is that important. We do not have to model what we can see. No statistical test is needed to say whether the data has changed. We can just look.
I have to stop, lest I become exasperated. We statisticians have pointed out this fact until we have all, one by one, turned blue in the face and passed out, the next statistician in line taking the place of his fallen comrade.
It is true that you can look at the data and ponder a “null hypothesis” of “no change” and then fit a model to kill off this straw man. But why? If the model you fit is any good, it will be able to skillfully predict new data (see point (2)). And if it’s a bad model, why clutter up the picture with spurious, misleading lines?
Why should you trust any statistical model (by “any” I mean “any”) unless it can skillfully predict new data?
Again, if you want to claim that the data has gone up, down, did a swirl, or any other damn thing, just look at it!
(3) If you fit a line and then use the parameter estimates of that line as input into another analysis (as was done in our sample paper, referenced below), your results will be too certain. We all know the dangers of smoothing time series. If you’ve forgotten, I, II, III.
This post was inspired by an actual paper—where I do not accuse the authors of cheating; but they do use time series with different starting and ending dates and then combine those time series to make a conclusion. We can see now that they will be too sure of themselves.
Update See this cartoon which shows that the IPCC has been known to employ the technique of variable start dates.