Reminder: The Thursday Class is only for those interested in studying uncertainty. I don’t expect all want to read these posts. So please don’t feel like you must. Yet, I have nowhere else to put them besides here. Your support makes this Class possible for those who need it. Thank you.
HOMEWORK: Find your own time series with trendy trend lines. See if they suffer from the Deadly Sin of reification.
Lecture
We have before us a collection of 50 stopping distances of cars. They were, we deduce, going before they could be stopping. Yes? Here is a histogram of those distances, in the civilized units of feet.
Suppose we want to predict new stopping distances. We do not need to predict old ones. If all we wanted was to know about the old ones, we are done. Yes? We never need a model to tell us what happened. So?
Any number of models can be pulled out of a statistician’s bag of tricks. Perhaps a discretized gamma distribution. (Discretized because stopping distances will always be measured in a finite discrete set, whereas gammas give probabilities to an uncountably infinite set.) Whatever you like.
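That predictive step can be sketched in code. Here is a minimal Python version, assuming the gamma form; the distances below are illustrative stand-ins, not the actual 50 values:

```python
import numpy as np
from scipy import stats

# Illustrative stopping distances in feet -- stand-ins, not the
# actual 50 values from the cars data.
dist = np.array([4., 10, 18, 22, 26, 30, 34, 36, 40, 46,
                 50, 54, 56, 60, 64, 70, 80, 85, 93, 120])

# Fit a gamma distribution to the old distances (location fixed at 0).
shape, loc, scale = stats.gamma.fit(dist, floc=0)

# Discretize: probability a NEW stopping distance lands in [d, d+1) feet,
# since distances are only ever measured to a finite, discrete set.
grid = np.arange(0, 301)
bin_probs = np.diff(stats.gamma.cdf(grid, shape, loc=loc, scale=scale))
```

The discretization is the point: the gamma hands out probability over an uncountable set, but predictions should be stated over the finite set of measurable distances.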
Just as you’re doing this, you hear a voice from the crowd. It says, “Wouldn’t it be nice to know something about the cars’ going?”
Why, yes; yes, it would. In fact, we can find all manner of things that we believe might be correlated to distance and we can introduce these measures into our New & Improved! model. Like a regression model. We could plot up a scatterplot of the speed cars were going by the distance required to stop. Then we could use new values of the speed to help predict new values of distance. Brilliant!
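The New & Improved! model can be sketched as well. A minimal ordinary-least-squares version, with made-up (speed, distance) pairs in the spirit of the cars data, not the actual values:

```python
import numpy as np

# Made-up (speed mph, stopping distance ft) pairs in the spirit of
# the cars data -- not the actual values.
speed = np.array([4., 7, 9, 12, 14, 16, 18, 20, 23, 25])
dist = np.array([4., 12, 18, 26, 36, 40, 56, 64, 85, 98])

# Ordinary least squares: dist = a + b * speed
X = np.column_stack([np.ones_like(speed), speed])
(a, b), *_ = np.linalg.lstsq(X, dist, rcond=None)

# Use a NEW value of speed to help predict a NEW value of distance.
new_speed = 15.0
predicted_dist = a + b * new_speed
```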
But do not forget: We were already using just the old values to predict new ones.
I took the advice and plotted this same picture as in the AGI post, only I can now reveal to you that this was the cars data all along. The red dot is the prediction, the green the actual value (not used in any way in making the model).
This is also Granger causality.
No, really, that’s it. Granger causality is badly named, which is generally recognized. It ought to be called Granger correlation. Which can be shortened even more to just correlation.
Something happens to mathematicians, AI mavens, economists, sociologists and the like when they think about time series. I’ve discussed this several times, but I wonder if it sticks. Time series people all hope, and believe, that there lurk in the past data occult signals that can be revealed if only the right machine learning, neural net, AI, Fourier analysis, or whatever, can be applied.

This hope is forlorn. It will not be realized, except in the too-rare-to-mention special cases where the data was produced by some known, fixed formula. In real life? No.
I am speaking of accidental time series. To be distinguished from per se series. Accidental series are those which occur over time, like stock prices, weather measures, and so forth. Per se series happen right now: together. Only in per se series do we have hope of generally recognizing cause.
The classic example is something like a baseball breaking a window. The ball is right now hitting the window. The window is right now breaking. The two things are happening simultaneously, through time. The ball is the cause of the window breaking in time. The ball exercising its causal power to break is prior to the window in this sense. But the event is happening at once, as one thing.
They are not separate. It is not “baseball first” and then, some time later, “window breaks.” It is together. We know what is cause and what is effect because we understand the properties of these objects, we know their essences. We know the power is inherent in the ball, we know the susceptibility is in the window.
We do not know this is cause-and-effect because the ball struck and then, at some point, who knows when, the window broke. The event is not events: the “events” are not, as Hume said, “loose and separate”. They are one thing, one happening. Cause and effect together.
This is the per se time series. We know, when we know anything, where the cause lay and why. And it has nothing to do with separation in time.
This is crucial to understand.
Enter the accidental series. This is where, as we have been stressing, the points are related only because we—as in we—envision a set of causes and conditions which persist in the series. They are not the same every time, but they are from the same fixed set, we believe. After all, that is why we grouped together this set of points. Like that IBM stock price in time. The causes of any one point are many, as are the conditions, but the set we believe is the same in any run of data. If we believe the set has changed, for whatever reason, we can say the series is not “stationary”.
But stationarity, or its lack, is not a property of the data itself. It is the causes we do not, and almost never can, witness that bring us the data. These myriad causes and conditions are “loose and separate”. That is why we can’t be sure of cause from correlation. Hume was right, but only here.
If the difference between per se and accidental series were known, we wouldn’t have so much magic substituting for science.
All right. Now for the badly misnamed Granger causality. Here from Scholarpedia is the idea:
The basic “Granger Causality” definition is quite simple. Suppose that we have three terms, X_t, Y_t, and W_t, and that we first attempt to forecast X_{t+1} using past terms of X_t and W_t. We then try to forecast X_{t+1} using past terms of X_t, Y_t, and W_t. If the second forecast is found to be more successful, according to standard cost functions, then the past of Y appears to contain information helping in forecasting X_{t+1} that is not in past X_t or W_t. In particular, W_t could be a vector of possible explanatory variables. Thus, Y_t would “Granger cause” X_{t+1} if (a) Y_t occurs before X_{t+1}; and (b) it contains information useful in forecasting X_{t+1} that is not found in a group of other appropriate variables.
In other words, if we have a model with X and W, and we add in Y, and find this model superior to the one with X and W alone, then we have Granger “causality”. Which isn’t causality, but correlation. Nothing more.
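The definition is nothing but a comparison of two regressions, one with lagged Y and one without. A sketch in Python, using a synthetic series in which lagged Y is, by construction, merely a useful correlate of X:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series: x depends on its own past and on lagged y,
# so lagged y will "Granger-cause" x -- i.e., correlate with it.
n = 500
y = rng.normal(size=n)
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + rng.normal()

def sse(design, target):
    """Residual sum of squares from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    return float(resid @ resid)

target = x[1:]
ones = np.ones_like(target)
restricted = np.column_stack([ones, x[:-1]])            # past x only
unrestricted = np.column_stack([ones, x[:-1], y[:-1]])  # past x and y

sse_r = sse(restricted, target)
sse_u = sse(unrestricted, target)
# If adding lagged y shrinks the forecast error, y "Granger-causes" x --
# which is just to say lagged y is correlated with future x.
```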
Granger rediscovered correlation and regression, but this wasn’t seen because it was applied to time series, which for whatever reason always seem more mysterious. This is partly because people think data possesses probability, that probability is real, and so on, all of which we know is false. Granger’s efforts might have been seen as duplicative at the time, but weren’t, because regression and other common models are almost never put in terms of predictive probability, as they ought to be, and are instead left lingering at correlation and weak parameter-analysis. So his idea seemed new.

There just is no causality to this, though. And people cannot shake the idea that time series are something unlike other data. The Scholarpedia article says neuroscientists have now taken up GC, to tie signals from different parts of the brain together. And they are not so cautious about ascribing cause. Sadly.
It’s worth spending a moment with Wokepedia’s explanation of GC, which displays in full the magic aura we’ve come to expect of time series:
Granger defined the causality relationship based on two principles:
The cause happens prior to its effect.
The cause has unique information about the future values of its effect.
Given these two assumptions about causality, Granger proposed to test the following hypothesis for identification of a causal effect of X on Y:

Pr[Y(t+1) in A | I(t)] ≠ Pr[Y(t+1) in A | I_X(t)]

where Pr refers to probability, A is an arbitrary non-empty set, and I(t) and I_X(t) respectively denote the information available as of time t in the entire universe, and that in the modified universe in which X is excluded. If the above hypothesis is accepted, we say that X Granger-causes Y.
By which they mean X doesn’t cause Y, or that if it does, we can’t know it this way.
Let’s be careful here. We want to know the uncertainty about a new value of Y, or Y(t+1), appearing in some set A. We condition on either I(t), which is some set of evidence, or I_x(t), which is the same set of information but where all the evidence in X is removed. This is nothing special. Suppose I(t) is all the past data of Y, i.e. Y_t-1, Y_t-2, and so on. But we call X = Y_t-3, for whatever reason. Then I_x(t) = Y_t-1, Y_t-2, Y_t-4, Y_t-5, and so on.
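In code the notation is nothing but set subtraction. A trivial sketch, using labels for the pieces of evidence:

```python
# I(t): the evidence set -- here, just past values of Y (as labels).
I_t = {"Y_{t-1}", "Y_{t-2}", "Y_{t-3}", "Y_{t-4}", "Y_{t-5}"}

# We happen to call one piece of that evidence "X".
X = {"Y_{t-3}"}

# I_X(t): the same evidence with the X part removed. Nothing special.
I_X_t = I_t - X
```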
I spell this out because this kind of notation is often confusing. The curliness makes it look like special things are going on. Which often happens in math. Like I said from day one, notation can be a hindrance to clear understanding.
Anyway, as long as you take I(t) in a sober sense, there is nothing wrong. If adding evidence X moves the probability of Y(t) in A closer to 1 or 0, then X is helping, assuming the model is good (which we’ll cover another day). If adding in X moves the probability closer to 1/2, then X is hurting. If adding X does nothing to the probability, then X is irrelevant—given everything else in I(t).
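A quick numerical sketch of relevance, with made-up frequencies: unconditionally Pr(Y) sits near 1/2; conditioning on X pushes it toward 1, so X is helping.

```python
import random

random.seed(1)

# Made-up setup: X is strongly associated with Y.
pairs = []
for _ in range(10_000):
    x = random.random() < 0.5
    y = random.random() < (0.9 if x else 0.1)
    pairs.append((x, y))

# Pr(Y) given only the background evidence: near 1/2.
pr_y = sum(y for _, y in pairs) / len(pairs)

# Pr(Y | X): adding the evidence X moves the probability toward 1.
pr_y_given_x = sum(y for x, y in pairs if x) / sum(x for x, _ in pairs)
```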
But here I(t) is not sober. If I(t) is all “the information available as of time t in the entire universe”, then necessarily Pr(Y(t) in A | I(t)) = 0 or it equals 1. Because I(t) will have the known cause of Y in it. If X is that known cause, then Pr(Y(t) in A | I_X(t)) in (0,1), because we removed the known cause of Y! You can even make the case that, if X contains all that is known about Y, then Pr(Y(t) in A | I_X(t)) = (0,1), where all that is left in I_X(t) is the tacit knowledge that Y is contingent. Notice that the second equation is strict equality, and not “in (0,1)” but “= (0,1)”, i.e. the whole interval (and not, say, “= 1/2”).
All this means is that causality in accidental time series is often confused. Which is our lesson for the day. Stay sharp.