Class 81: The Second Biggest Error In Time Series


Reminder: The Thursday Class is only for those interested in studying uncertainty. I don’t expect everyone will want to read these posts. So please don’t feel like you must. Yet I have nowhere else to put them besides here. Your support makes this Class possible for those who need it. Thank you.

Do not smooth time series and then pretend the smoothed series was the “actual” data.

Video

Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty

HOMEWORK: Find your own time series with trendy trend lines. See if they suffer from the Deadly Sin of reification.

Lecture

The first big error in time series we’ve discussed endlessly. It is the same big error in all of science. False, fallacious and badly over-certain ascriptions of cause. Desiring cause is the correct attitude. But that desire leads to methods, dressed in fancy math, which guarantee failure. See all hypothesis testing, Bayes factors, all regression and on and on. These we’ve covered.

The second is smoothing, which we can call Models of Models of Models. The idea is simple. If you model data and then use that model as input to a second model, or then a third and so on in a great Daisy Chain Of Science, and do not carry forward the uncertainty, then your end result, the Final Boss Model, will be wrong.

For the life of me, I cannot see how this is not obvious. Okay, that’s a double negative. Stated positively, I think it is plain as sunshine. Yet I used to have such fights over this. You will not be surprised (I hope) to learn that forgetting uncertainty led to mini-panics in what we used to call global warming (a phrase in common use before people realized a slight increase in temperature was a welcome thing).

This can be related to an elementary result in probability.

Suppose you have a normal model of the average of some thing, perhaps an average temperature, with known parameters (a, b²), i.e. parameters which are given to you somehow. Thus you have uncertainty in the average temperature; there is about a 68% chance the average is in a ± b.

A friend has a Boss Model of the full temperature (the actual measure, not its mean; it doesn’t have to be temperature, it could be anything modeled with a normal). This is also a normal model with parameters (mu, 1) (i.e. the 1 for the ‘variance’ is known but the central parameter is unknown).

Your friend decides to use your (smoothed) model as input to his Boss Model. But he’s lazy so only takes your “mean”, i.e. he uses your parameter a for his mu. In other words, his uncertainty in temperature is Normal (a,1).

The Boss Model does not take into account the uncertainty in the average model. It ignores it. It has forgotten it. Whereas, if your friend did it properly, he would have formed the predictive posterior in his Boss of Normal(a, 1+b²). In other words, the uncertainty in the full temperature is greater when the Boss Model fully accounts for the uncertainty in your first model.

This is obviously a trivial example. And I can already hear some carping, “Well, if b is small, what’s the difference!?” Probably little. But of course, this example is severely constrained to make the idea easy to see. In real modeling, there’d be a lot more to it, with the shortcuts taken quickly adding up into enormous perposterosities (yes). Besides, why not do it right?
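The little normal example is easy to check numerically. Here is a minimal sketch; the values a = 10 and b = 2 are my own picks for illustration, not from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(42)
a, b = 10.0, 2.0       # illustrative values only: center and sd of the average-model
n = 200_000

# Done properly: draw the uncertain average first, then the temperature around it.
mu = rng.normal(a, b, size=n)
temp_proper = rng.normal(mu, 1.0)

# The lazy Boss Model: plug in a for mu and forget b entirely.
temp_plugin = rng.normal(a, 1.0, size=n)

print(f"proper predictive sd:  {temp_proper.std():.3f}")   # ~ sqrt(1 + b**2) = 2.236
print(f"plug-in predictive sd: {temp_plugin.std():.3f}")   # ~ 1.0
```

The proper predictive spread is √(1 + b²), not 1: the plug-in version is guaranteed too certain, and by more than a factor of two with these made-up numbers.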

I am now going to quote from one of my original articles on the subject “Do not smooth times series, you hockey puck!”, inspired by the litigious Michael Mann, a sometime climate scientist. But only a brief quote, because you can go there and read the rest. This is just enough to give you a flavor of how common these analytical atrocities are.


Mann and others have published a new study melding together lots of data and they claim to have again shown that the here and now is hotter than the then and there. Go to climateaudit.org and read all about it. I can’t do a better job than Steve, so I won’t try. What I can do is to show you what not to do. I’m going to shout it, too, because I want to be sure you hear.

Mann includes at this site a large number of temperature proxy data series. Here is one of them called wy026.ppd (I just grabbed one out of the bunch). Here is the picture of this data:
wy026.ppd proxy series

The various black lines are the actual data! The red-line is a 10-year running mean smoother! I will call the black data the real data, and I will call the smoothed data the fictional data. Mann used a “low pass filter” different than the running mean to produce his fictional data, but a smoother is a smoother and what I’m about to say changes not one whit depending on what smoother you use.

Now I’m going to tell you the great truth of time series analysis. Ready? Unless the data is measured with error, you never, ever, for no reason, under no threat, SMOOTH the series! And if for some bizarre reason you do smooth it, you absolutely on pain of death do NOT use the smoothed series as input for other analyses! If the data is measured with error, you might attempt to model it (which means smooth it) in an attempt to estimate the measurement error, but even in these rare cases you have to have an outside (the learned word is “exogenous”) estimate of that error, that is, one not based on your current data.

If, in a moment of insanity, you do smooth time series data and you do use it as input to other analyses, you dramatically increase the probability of fooling yourself! This is because smoothing induces spurious signals—signals that look real to other analytical methods. No matter what you will be too certain of your final results! Mann et al. first dramatically smoothed their series, then analyzed them separately. Regardless of whether their thesis is true—whether there really is a dramatic increase in temperature lately—it is guaranteed that they are now too certain of their conclusion.


If you want more of that, also read “How Smoothing Time Series Generates Massive Over-Certainty”. This is a real-life example of how bad forecasts can be made to look good using smoothing.

Here’s a forecast (in red) of actual data (in black):

Stinks, right? The R^2 between the forecast and the data, a measure which I cannot love, and indeed beg you not to use, is only 0.03. That number runs from 0 to 1, with higher indicating better fits. We will cover all measures like this later.

But suppose we apply a little running-mean or a loess (a model) to the data, i.e. smooth it. We do this because everybody does. (Don’t forget our lesson about signal and “noise.”)

The more we smooth, the better the R^2 gets, as this picture shows (comparing the prediction against the smoothed data):
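You can reproduce the effect with a toy version. The straight-line “forecast” below is my own invention for the demonstration, and the “data” is pure noise, so the forecast has no skill whatsoever; yet its average R² against the smoothed data climbs well above its R² against the real data:

```python
import numpy as np

def running_mean(x, k):
    """Running mean with window k (shortens the series by k - 1)."""
    return np.convolve(x, np.ones(k) / k, mode="valid")

def r2(x, y):
    """Squared Pearson correlation between two series."""
    return np.corrcoef(x, y)[0, 1] ** 2

rng = np.random.default_rng(0)
n, sims = 100, 300
forecast = 0.02 * np.arange(n)   # an arbitrary straight line: no skill by construction

raw, smoothed = [], []
for _ in range(sims):
    data = rng.normal(size=n)    # the "actual" series is pure noise
    raw.append(r2(forecast, data))
    s = running_mean(data, 30)   # smooth the data, as everybody does
    smoothed.append(r2(forecast[: s.size], s))

print(f"mean R^2 vs raw data:      {np.mean(raw):.3f}")
print(f"mean R^2 vs smoothed data: {np.mean(smoothed):.3f}")
```

The window of 30 is an arbitrary choice; widen it and the “improvement” grows further. Nothing about the forecast changed, only the data was replaced by fictional data.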

It ought to be obvious that the more you smooth, the more the data looks like a straight line, so that any other straight line, like the forecast, will show a better and better “correlation” with the smoothed data. This is formally proved in Uncertainty, but I take it as so obvious I’m not going to belabor the point much further. I’ll instead add details in the video. Here’s an (edited) quote and then a picture from Chapter 10 in Uncertainty.

Shown in Figs. [below] is a succession of images with two simulated normal noise time series per panel that have nothing to do with one another. Fig. [first] shows the series with the correlation in the titles, and Fig. [second] shows the series in x-y plot fashion overplotted with a regression line. The smoothing was produced by increasing the window ($k$) of running means, but any type of smoothing (e.g. “low-pass filters”) will produce similar effects. As the smoothing increases, the correlations increase from near 0 to something quite high (in absolute value). I urge the reader to try it for himself, experimenting with different kinds of smoothers (and not just running means). Some surprising results can be had.

Of course, any given smoothing may decrease (in absolute value) the correlation between two or more series and not increase it. To discover how general the increase caused by smoothing is would require specifying not only the kind of smoother, but the probabilistic structure of the time series, and so forth; a worthy investigation but one which would take us too far afield here.

Experience shows the danger, however, is real and common. The reason the trick “works” is that smoothing takes uneven points and “straightens” them, making them more line-like, as the plots in Fig. [first] show. Any two lines with non-zero slopes have perfect Pearson correlation, as is trivially proved below. Never should any time series that has been smoothed be used as input for “downstream” analyses, e.g. that which shows how the time series is associated with some external $x$. This substitutes fake data for real, and causes massive over-certainty. Yet this mistake is often found. And not only in time series. Regression analyses often make the same error.

FIRST FIGURE [pairs of smoothed noise series, correlations in the titles]

SECOND FIGURE [the same series in x-y plots, overplotted with regression lines]
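The experiment in that quote is easy to run for yourself. Here is a sketch: two independent normal noise series, smoothed with wider and wider running means (the window sizes are my own choices), with the mean absolute correlation reported over repeated simulations:

```python
import numpy as np

def running_mean(x, k):
    """Running mean with window k (shortens the series by k - 1)."""
    return np.convolve(x, np.ones(k) / k, mode="valid")

rng = np.random.default_rng(1)
n, sims = 200, 200
mean_abs_r = {}

for k in (1, 5, 20, 50):
    rs = []
    for _ in range(sims):
        x = rng.normal(size=n)   # two series that have nothing to do with one another
        y = rng.normal(size=n)
        r = np.corrcoef(running_mean(x, k), running_mean(y, k))[0, 1]
        rs.append(abs(r))
    mean_abs_r[k] = np.mean(rs)
    print(f"window {k:2d}: mean |correlation| = {mean_abs_r[k]:.2f}")
```

A window of 1 is no smoothing at all, and the correlations sit near zero, as they should; by a window of 50 the two unrelated series are, on average, strongly “correlated.” Swap in a loess or a low-pass filter and you will see the same disease.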

And now it occurs to me that this is too much for one class. Read ahead if you like, but I’ll cover these separately.
