The term predictive statistics is used to describe a focus on observables, and not on the invisible, model-based parameters found in estimation and null hypothesis significance testing.
It isn’t sticking, the term. Perhaps it is confusing. Is it—help me, here—that it seems it can only be used in what are thought of as traditional forecasting scenarios? And not in every single situation? What exactly comes to your mind when you hear predictive statistics? I want to hear from detractors especially.
Anticipating bad news, another possible tag is observable-based statistics. I don’t think that’ll fly, since it sounds too close to observation statistics, which would be redundant in most ears. Aren’t all statistics based on observations? (Stats, yes; probability, no.)
A while ago I tried out reality-based statistics, which has excellent connotations. But it’s so mysterious, especially in our reality-shunning society, that it doesn’t work.
Causal statistics is a possibility. By far, most models are not causal, so the phrase is a sort of misnomer. However, the term points in the right direction, because part of the modeling process would be to declare the model causal or not causal (i.e. purely correlative). Since most models aren't causal, a constant reminder of this would be of great benefit, especially to users of models, like reporters. This is because, as regular readers know, almost everybody takes statistical models as if they are causal.
The positive connotation is that it tries to direct models in the right direction, that of cause. If we know cause, we don't need probability. The more we have to rely on probability, the further we are from cause. This might help eliminate the media-word "linked to". Probably not, though, since the temptation to ascribe cause is overwhelming.
I’m going to try this out for a while and see if anybody salutes. Unless somebody else has a better idea.
Anyway, let's do an example of… causal statistics!
Example
This data is taken from the ground-shaking paper "Uncertainty in the MAN Data Calibration & Trend Estimates". The "LML" is a site in the Netherlands at which atmospheric ammonia is measured.
The solid black line is a yearly mean computed from hourly data.
The two questions which excite the most interest are (1) Is there a trend, and (2) What might the next value be? Let’s do this in inverse order.
First, the real question, the one that should interest us most: what caused these observations? If we knew that, we'd know all. Do we know the cause? No, sir, we do not. We have already said this is a simple numerical average of hourly observations. Each of those observations has a cause, some shared, some not. We deduce this because not all hourly values are identical, and because we know something about what makes ammonia flee to the air. To know the cause of this yearly data, we'd need to know the cause of each hourly value: and we do not.
The causes we do know won’t be incorporated into the model, because we did not measure them. All we have are the values and times and no accompanying information. We do not know cause in the model.
This causal-statistical model thus first admits it will not be causal.
Many happy things flow from this acknowledgement, as we shall see.
(2) The model itself, given from on high, is a simple regression on time. There are many other ad hoc models that can be tried, and should be, but this ad hoc model is the one used by competent authorities, so we’ll use it, too. It is ad hoc because we have already declared it is not causal. Here it is:
y_t = beta_0 + beta_1 * y_{t-1} + epsilon_t
It looks causal! y_t is caused or determined by y_{t-1}, it says, and by something called epsilon_t. Classically, the parameters are estimated. In our causal model, we don't care what they are. At all. But if you did care, each estimate has a +/- around it, an acknowledgment that we don't know the values of these ad hoc, non-existent creations.
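Here is a minimal sketch, in Python with statsmodels, of what fitting this ad hoc model looks like. The yearly values are invented for illustration (they are not the LML data), so the numbers mean nothing; the only point is that the classical output is a pair of parameter estimates, each with its +/-.

```python
# A minimal sketch of fitting the ad hoc model above. The yearly means are
# invented for illustration; they are NOT the LML data.
import numpy as np
import statsmodels.api as sm

y = np.array([8.1, 7.6, 9.0, 7.2, 8.4, 6.9, 7.8, 6.5, 7.1, 6.0, 6.8, 5.9, 6.4])

X = sm.add_constant(y[:-1])        # regressors: a constant (beta_0) and y_{t-1}
fit = sm.OLS(y[1:], X).fit()       # y_t regressed on y_{t-1}

# Classical output: a point estimate for each beta, with a +/- around it.
print(fit.params)                  # estimated beta_0 and beta_1
print(fit.bse)                     # their standard errors (the +/-)
```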
In the upper left you see two equations related to the unknown future point:
Pr(LML in (6.2, 9.6) | date = 2018) = 90%? NO!
Pr(LML in (4.2, 11.6) | date = 2018) = 90%? YES!
The conditions should also include the model assumptions and data, but they’re—very dangerously—left out here. Don’t forget them.
The narrow interval (dotted green inner lines) is the parametric uncertainty, i.e. the uncertainty in the beta_1 parameter applied to 2018 (the next year). It’s the uncertainty most use! That’s because classical statistics, frequentist or Bayesian, cares about parameter estimation.
The narrow interval is obviously wrong. The predictive interval (dashed green outer lines), found by integrating out the uncertainty in the non-existent parameters, is obviously right.
If you use the parametric instead of the predictive interval you will be far too certain. In this case, about (11.6-4.2)/(9.6-6.2) = 2.2 times too certain (there are other ways to define over-certainty). This is a huge mistake to make.
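To see where the two intervals come from, here is a minimal sketch using statsmodels, again with invented yearly values rather than the LML data, so the endpoints will not match the figure. The narrow interval answers a question about the fitted mean; the wide one answers the question we actually asked, about the next observable.

```python
# A minimal sketch contrasting the two intervals, with invented yearly values
# (NOT the LML data), so the endpoints will not match the figure above.
import numpy as np
import statsmodels.api as sm

y = np.array([8.1, 7.6, 9.0, 7.2, 8.4, 6.9, 7.8, 6.5, 7.1, 6.0, 6.8, 5.9, 6.4])
X = sm.add_constant(y[:-1])                  # columns for beta_0 and beta_1 * y_{t-1}
fit = sm.OLS(y[1:], X).fit()

x_next = np.array([[1.0, y[-1]]])            # predict 2018 from the last observed value
frame = fit.get_prediction(x_next).summary_frame(alpha=0.10)   # 90% intervals

# Parametric interval: uncertainty in the fitted mean alone (the narrow one).
print(frame[['mean_ci_lower', 'mean_ci_upper']])

# Predictive interval: parameter uncertainty integrated out plus the epsilon
# term, i.e. uncertainty in the next observable itself (the wide one).
print(frame[['obs_ci_lower', 'obs_ci_upper']])
```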
(1) Was there a trend? Well, what do you mean by trend? A steady cause that is always pushing data at the yearly level down? Or at the hourly level?
Do you think you have discovered a cause because the +/- of the estimate of beta_1 are both negative numbers? Is there certainly a cause because the p-value was wee (which it is in this case)?
Strange cause! The data is all over the place: up, down, jump jump jump. The cause is really steady? How do you know? It doesn’t look like it. How do you know?
We don’t know. Not from this model. We started by admitting we didn’t know the cause. We can’t end by discovering it from a p-value.
Anyway, there is only one way to truly test this model and say whether there was this mysterious trend—which might be true!—and that is to test the model.
Use it to make predictions (hence predictive statistics): if it makes useful ones, then the model is on to something. If not, not.
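What such a test looks like, in a minimal sketch: hold back the latest year, refit the ad hoc model, predict the held-back year, and check. The values are invented; with the real data you would of course wait for genuinely new years.

```python
# A minimal sketch of testing by prediction: hold back the latest year, refit,
# predict it, and check. Invented values; NOT the LML data.
import numpy as np
import statsmodels.api as sm

y = np.array([8.1, 7.6, 9.0, 7.2, 8.4, 6.9, 7.8, 6.5, 7.1, 6.0, 6.8, 5.9, 6.4])
train, holdout = y[:-1], y[-1]

X = sm.add_constant(train[:-1])              # beta_0 and beta_1 * y_{t-1}
fit = sm.OLS(train[1:], X).fit()

x_next = np.array([[1.0, train[-1]]])        # predict the held-back year
frame = fit.get_prediction(x_next).summary_frame(alpha=0.10)

lo, hi = frame['obs_ci_lower'].iloc[0], frame['obs_ci_upper'].iloc[0]
print("held-out value:", holdout, "| inside 90% predictive interval:", lo <= holdout <= hi)
```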
I dispute that just because one uses a model, say
y_t = beta_0 + beta_1 * y_{t-1} + epsilon_t
that they are therefore saying the right hand side literally “causes” the left hand side.
It is just a model, i.e. we tentatively assume it is true, but only for the purposes of getting estimates and furthering understanding a little beyond where we were before, and the model remains open to rejected terms, to refinement, and to error.
One interval is not "wrong" and the other "correct"; it is just that one interval is for a modeled mean, and the other is for a single y based on a single x.
As far as finding trends goes (and the trend can change depending on where you start, etc.; you just demonstrated that the reference set from frequentism is important), the point is to have an objective, repeatable method and to state your assumptions clearly, which is a much more defensible process than any subjective 'just look' method of assessing.
Justin
Perhaps you could emphasize “skill” instead of “significance”. Try to change the view from small “p-values” to skillful forecasts/predictions. Increasing discussion of “skill” will help emphasize the poor performance of today’s “linked to” focus.
Keith,
Just so. Skill over significance.
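One simple way to put a number on skill (a general sketch, not anything from the paper above): compare the model's mean squared prediction error against a naive reference forecast, say persistence. Positive skill means the model beats doing nothing clever; zero or negative means it doesn't.

```python
# A minimal sketch of a skill score: the model's mean squared prediction error
# compared with a naive "persistence" reference. Hypothetical values, not LML data.
import numpy as np

observed = np.array([7.1, 6.0, 6.8, 5.9, 6.4])   # what actually happened
model    = np.array([6.8, 6.3, 6.5, 6.1, 6.2])   # what the model predicted
naive    = np.array([6.5, 7.1, 6.0, 6.8, 5.9])   # persistence: last year's value again

def mse(pred):
    return np.mean((observed - pred) ** 2)

# Skill of 1 is perfect, 0 is no better than the naive reference, negative is
# worse than doing nothing clever at all.
skill = 1 - mse(model) / mse(naive)
print(round(skill, 2))
```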
Justin,
The “mean” is meaningless here. You can create a point guess, which has meaning in a decision, but the eventual observation will be a point, not a mean. The narrow interval is just plain wrong when categorizing uncertainty in the observations.
Absolutely “trends” depend on start and stop times, a point I have made repeatedly. But this is the data we have, and none other, and this is the model that is used (I didn’t pick it). There is nothing frequentist about it at all.
predictive statistics? psychics, prognosticators, con artists, fake news, marketers
Justin: I disagree. As far as I can see, a model means the cause is whatever the model shows and only that cause. One sees that in climate science and in evil personal-injury lawsuit lies. Zantac does NOT cause cancer, but the proper model gets it that designation and the lawyer billions of dollars (as in the case of Roundup and cancer). It's all a lie, treating a multicaused phenomenon as the single cause of a disease, but it is ONE HUNDRED PERCENT believed by many, many idiots out there. Ask a "green fool" and they will attest to Roundup killing people, thousands of people. The model means the cause is whatever the model concluded. People DO NOT distinguish, even scientists. We live in voodoo and witch-doctor days, not enlightened times.
The example shown barely illustrates a trend line. This is all ridiculous. However, a blood sugar continuous monitoring system sells on this LIE. It says it can tell you where your blood sugar will be. NO, it cannot. But for a few hundred bucks, paid for by someone other than yourself, you can believe the lie and wonder why your sugars aren't like other people's sugars that obviously follow predictions, or the seller wouldn't say the machine predicts. You dare not ask, lest you look like a fool for spending hundreds on a lie equivalent to a gypsy with a crystal ball. Science has little or no use these days. I thank God I started out with diabetes before the lie became the "truth" and fools lived their lives based on idiotic gypsy prognostications and the fear of a little computerized box's reporting. Diabetics do not die from lack of money; they die from lack of knowledge and intelligence, the fear of actually thinking for themselves. This predictive lie problem can kill, and kill quickly in some cases. This is a microcosm of what science has become, using the lie of trend lines and "causes".
Deducible statistics?
Variation is due to causes. There are often a host of such causes. Sometimes, one or a few causes dominate the variation. The chart you displayed looks a great deal like one I encountered in a metal stamping/forming operation. The trend had a cause; viz., tool wear. The variation around the trend had causes, too; viz., all the other causes except tool wear. The data, which was a dimension on the stamped part, could be predicted from a combination of the assignable cause (tool wear) and the "random" causes (everything else). They were called "random" because the variation was due to the particular combination of mini-causes that happened to be in effect at any given time, but which could not in a practical sense be measured or known. The green boundaries, the prediction interval, told us whether the projected results would fall within specs, and indicated how long the tool could remain in use before preventive maintenance should replace it.
The internal limits are on the slope of the line, not on the data. We called it the “slop in the slope,” and it meant that the rate of tool wear might be steeper or shallower than the estimate shown. (Rotate the regression line as long as it stays within the inner limits.) This can be useful for tooling usage planning.
You will notice in your example that this allows one extreme of the regression slope to be flat. IOW, there is no trend. Had this been the case in my tool wear example (it wasn't), it would have suggested a step-change ("shift") rather than a continual change ("trend"). Other possibilities included spikes (or icicles) and cycles, as well as more complex patterns (mixtures, stratification, et al.).
Self-validating statistics?
“The narrow interval is just plain wrong when categorizing uncertainty in the observations.”
Sorry, saying “wrong” is not correct no matter how often you say it. It is just a different interval based on a different calculation: Variance of estimated mean at X_i vs. Variance of prediction at X_i.
Speaking of the average of X, both intervals are narrowest at the average of X.
Yes, you do have a single dataset, which could have arisen many different ways, arose this way with measurement error, and another one can presumably be obtained in future studies. We can assess a specific dataset using that specific dataset of course, and also via the sample space/sampling distribution.
Cheers,
Justin
Justin,
You use the inner interval, I’ll use the outer one and we both bet. Who’s going to win more?
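To make the bet concrete, here is a minimal simulation sketch, with data invented from a known straight line plus noise (so not the LML data): generate fresh observations and count how often each 90% interval captures them.

```python
# A minimal sketch of the bet: simulate fresh observations from a known straight
# line plus noise (invented, not the LML data) and count how often each 90%
# interval captures the new value.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
trials, hits_inner, hits_outer = 2000, 0, 0
x = np.arange(20.0)
x_new = 20.0

for _ in range(trials):
    y = 8.0 - 0.1 * x + rng.normal(0.0, 1.0, size=x.size)   # "truth" plus noise
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    y_new = 8.0 - 0.1 * x_new + rng.normal(0.0, 1.0)         # the bet's payoff
    frame = fit.get_prediction([[1.0, x_new]]).summary_frame(alpha=0.10)

    hits_inner += frame['mean_ci_lower'].iloc[0] <= y_new <= frame['mean_ci_upper'].iloc[0]
    hits_outer += frame['obs_ci_lower'].iloc[0] <= y_new <= frame['obs_ci_upper'].iloc[0]

print("inner (parametric) interval catches the new value:", hits_inner / trials)
print("outer (predictive) interval catches the new value:", hits_outer / trials)
```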
“Justin,
You use the inner interval, I’ll use the outer one and we both bet. Who’s going to win more?”
We’re both winners!
Depends on if we are trying to get at E(Y|X), or predicting the next Y we will observe.
If I use the latter to get at the former, it is like saying I am right because I miraculously “predicted” someone’s age to be in the interval 0 to 120.
Either one shows frequentist confidence intervals work well though. 🙂
Justin