The term predictive statistics describes a focus on observables, and not on the invisible model-based parameters found in estimation and null hypothesis significance testing.
It isn’t sticking, the term. Perhaps it is confusing. Is it—help me, here—that it seems it can only be used in what are thought of as traditional forecasting scenarios? And not in every single situation? What exactly comes to your mind when you hear predictive statistics? I want to hear from detractors especially.
Anticipating bad news, another possible tag is observable-based statistics. I don’t think that’ll fly, since it sounds too close to observation statistics, which would be redundant in most ears. Aren’t all statistics based on observations? (Stats, yes; probability, no.)
A while ago I tried out reality-based statistics, which has excellent connotations. But it’s so mysterious, especially in our reality-shunning society, that it doesn’t work.
Causal statistics is a possibility. By far, most models are not causal, so the phrase is a sort of misnomer. However, the term is headed in the right direction, because part of the modeling process would be to declare the model causal or not causal (i.e. purely correlative). Since most models aren’t causal, a constant reminder of this would be of great benefit, especially to users of models, like reporters. This is because, as regular readers know, almost everybody takes statistical models as if they were causal.
The positive connotation is that it pushes models in the right direction, that of cause. If we know cause, we don’t need probability. The more we have to rely on probability, the further we are from cause. This might help eliminate the media-word “linked to”. Probably not, though, since the temptation to ascribe cause is overwhelming.
I’m going to try this out for a while and see if anybody salutes. Unless somebody else has a better idea.
Anyway, let’s do an example of … causal statistics!
This data is taken from the ground-shaking paper “Uncertainty in the MAN Data Calibration & Trend Estimates”. The “LML” is a site in the Netherlands at which atmospheric ammonia is measured.
The solid black line is the yearly mean computed from the hourly data.
The two questions which excite the most interest are (1) Is there a trend, and (2) What might the next value be? Let’s do this in inverse order.
First, the real question, the one that should interest us most: what caused these observations? If we knew that, we’d know all. Do we know the cause? No, sir, we do not. We have already said this is a simple numerical average of hourly observations. Each of those observations has a cause, some shared, some not. We deduce this because not all hourly values are identical, and because we know something about what makes ammonia flee to the air. To know the cause of this yearly data, we’d need to know the cause of each hourly value, which we do not know.
The causes we do know won’t be incorporated into the model, because we did not measure them. All we have are the values and times and no accompanying information. We do not know cause in the model.
This causal-statistical model thus first admits it will not be causal.
Many happy things flow from this acknowledgement, as we shall see.
(2) The model itself, given from on high, is a simple regression on time. There are many other ad hoc models that can be tried, and should be, but this ad hoc model is the one used by competent authorities, so we’ll use it, too. It is ad hoc because we have already declared it is not causal. Here it is:
y_t = beta_0 + beta_1 * t + epsilon_t

It looks causal! y_t is caused or determined by the date t, it says, and by something called epsilon_t. Classically, the parameters are estimated. In our causal model, we don’t care what they are. At all. But if you did care, each estimate has a +/- around it, an acknowledgment that we don’t know the values of these ad hoc, non-existent creations.
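To make the classical step concrete, here is a minimal sketch in Python using statsmodels, assuming the ad hoc model is an ordinary regression of the yearly mean on the year. The numbers are invented stand-ins for the LML means (the real values are in the paper); statsmodels itself is my choice, not anything prescribed above. It prints the point estimates, the +/- around each (90% intervals), and the p-values people stare at, none of which we care about for their own sake.

```python
import numpy as np
import statsmodels.api as sm

# Invented yearly means standing in for the LML series (real values are
# in the paper); thirteen years of data, 2005-2017, for illustration only.
years = np.arange(2005, 2018, dtype=float)
y = np.array([9.1, 8.4, 7.9, 8.8, 7.2, 6.9, 7.8, 6.5, 7.4, 6.1, 6.8, 5.9, 6.6])

# The ad hoc model: y_t = beta_0 + beta_1 * t + epsilon_t.
X = sm.add_constant(years)
fit = sm.OLS(y, X).fit()

print(fit.params)                # point estimates of beta_0 and beta_1
print(fit.conf_int(alpha=0.10))  # the +/- around each estimate (90%)
print(fit.pvalues)               # the wee p-values everyone stares at
```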
In the upper left you see two equations related to the unknown future point:
Pr(LML in (6.2, 9.6) | date = 2018) = 90%? NO!
Pr(LML in (4.2, 11.6) | date = 2018) = 90%? YES!
The conditions should also include the model assumptions and data, but they’re—very dangerously—left out here. Don’t forget them.
The narrow interval (dotted green inner lines) is the parametric uncertainty, i.e. the uncertainty in the beta_1 parameter applied to 2018 (the next year). It’s the uncertainty most people use! That’s because classical statistics, frequentist or Bayesian, cares about parameter estimation.
The narrow interval is obviously wrong. The predictive interval (dashed green outer lines), found by integrating out the uncertainty in the non-existent parameters, is obviously right.
If you use the parametric instead of the predictive interval you will be far too certain. In this case, about (11.6-4.2)/(9.6-6.2) = 2.2 times too certain (there are other ways to define over-certainty). This is a huge mistake to make.
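Here is a sketch of the two intervals with the same invented stand-in numbers as above (again, not the LML data). The parametric interval is the interval for the model’s mean at 2018, which reflects only the uncertainty in the betas; the predictive interval is the interval for the observable itself, which also carries the epsilon. The last line computes the over-certainty ratio the same way as in the text.

```python
import numpy as np
import statsmodels.api as sm

# Invented yearly means standing in for the LML series.
years = np.arange(2005, 2018, dtype=float)
y = np.array([9.1, 8.4, 7.9, 8.8, 7.2, 6.9, 7.8, 6.5, 7.4, 6.1, 6.8, 5.9, 6.6])

fit = sm.OLS(y, sm.add_constant(years)).fit()

# Predict the next, unseen year: exog row is [intercept, year].
X_new = np.array([[1.0, 2018.0]])
row = fit.get_prediction(X_new).summary_frame(alpha=0.10).iloc[0]

# Parametric interval: uncertainty in the betas only.
param_lo, param_hi = row["mean_ci_lower"], row["mean_ci_upper"]
# Predictive interval: parameter uncertainty plus the epsilon noise.
pred_lo, pred_hi = row["obs_ci_lower"], row["obs_ci_upper"]

print(f"parametric 90% interval: ({param_lo:.1f}, {param_hi:.1f})")  # too narrow
print(f"predictive 90% interval: ({pred_lo:.1f}, {pred_hi:.1f})")    # the honest one
print("over-certainty ratio:", round((pred_hi - pred_lo) / (param_hi - param_lo), 1))
```

With any reasonable data the predictive interval comes out noticeably wider, which is the whole point: the extra width is uncertainty you actually have.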
(1) Was there a trend? Well, what do you mean by trend? A steady cause that is always pushing data at the yearly level down? Or at the hourly level?
Do you think you have discovered a cause because both ends of the +/- around the estimate of beta_1 are negative numbers? Is there certainly a cause because the p-value was wee (which it is in this case)?
Strange cause! The data is all over the place: up, down, jump jump jump. The cause is really steady? How do you know? It doesn’t look like it. How do you know?
We don’t know. Not from this model. We started by admitting we didn’t know the cause. We can’t end by discovering it from a p-value.
Anyway, there is only one way to truly test this model and say whether there was this mysterious trend—which might be true!—and that is to test the model.
Use it to make predictions (hence predictive statistics): if it makes useful ones, then the model is on to something. If not, not.
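One sketch of what that test might look like, again with the invented stand-in numbers: hold out the most recent year, fit the same ad hoc model on the rest, issue a 90% predictive interval for the held-out year, and see whether the observable actually landed inside it. With many such predictions you would check whether roughly 90% of the intervals cover, and whether they are tighter than naive guesses.

```python
import numpy as np
import statsmodels.api as sm

# Invented yearly means standing in for the LML series.
years = np.arange(2005, 2018, dtype=float)
y = np.array([9.1, 8.4, 7.9, 8.8, 7.2, 6.9, 7.8, 6.5, 7.4, 6.1, 6.8, 5.9, 6.6])

# Hold out the most recent year and fit the same ad hoc model on the rest.
fit = sm.OLS(y[:-1], sm.add_constant(years[:-1])).fit()

# Issue a 90% predictive interval for the held-out year.
X_new = np.array([[1.0, years[-1]]])
row = fit.get_prediction(X_new).summary_frame(alpha=0.10).iloc[0]
lo, hi = row["obs_ci_lower"], row["obs_ci_upper"]

covered = lo <= y[-1] <= hi
print(f"predicted ({lo:.1f}, {hi:.1f}); observed {y[-1]}; covered: {covered}")
```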