
Using P-Values To Diagnose “Trends” Is Invalid

Look at the picture, which shows real data, disguised to obscure its source. It is a physical measurement taken monthly by a recognized authority. The measurements are thought to have little error, which we can grant (at first); the numbers are used to make decisions of importance.

Question: is there a trend in the data?

Answer: there is no answer, and can be no answer. For we did not define trend.

What is a trend? There is no unique definition. One possible definition is the majority of changes: do more values increase than decrease? Another: is the mean of the first half higher or lower than that of the second? Another: are the values increasing or decreasing more strongly? For instance, there could be just as many ups as downs, but the ups are on average some percentage larger than the downs.

We could go on like this. In the end, we’d be left with a definition that fits the decisions we wish to make. Our definition of trend might therefore be different than somebody else’s.

Whatever definition we settle on, if we wish to declare a trend is present, all we have to do is look. Does the data meet the definition or not? If it does, then we have a trend; if it doesn’t, then it does not. Simple as that.
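
To make “just look” concrete, here is a minimal sketch, in Python, of the three possible definitions above. The series, function names, and numbers are made up for illustration; they are not the disguised data.

    # Three possible definitions of "trend", each checked by simple looking.

    def trend_by_sign_count(y):
        """More up-moves than down-moves?"""
        ups = sum(1 for a, b in zip(y, y[1:]) if b > a)
        downs = sum(1 for a, b in zip(y, y[1:]) if b < a)
        return ups > downs

    def trend_by_half_means(y):
        """Is the mean of the second half higher than that of the first?"""
        half = len(y) // 2
        first, second = y[:half], y[half:]
        return sum(second) / len(second) > sum(first) / len(first)

    def trend_by_move_size(y):
        """Are up-moves larger, on average, than down-moves?"""
        moves = [b - a for a, b in zip(y, y[1:])]
        ups = [m for m in moves if m > 0]
        downs = [-m for m in moves if m < 0]
        mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
        return mean(ups) > mean(downs)

    series = [3.1, 2.9, 3.4, 3.0, 3.6, 3.2]  # made-up numbers
    print(trend_by_sign_count(series),
          trend_by_half_means(series),
          trend_by_move_size(series))

Each function answers yes or no by inspecting the data itself; no p-value appears anywhere.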

We don’t need p-values, we don’t need Bayes factors, we don’t need models of any formality. We just look. Are there more than four coins on the table today (to suppose another question), as opposed to yesterday when there were three? There are or there aren’t. We just look. There are still three coins on the table today. Is three more than four? We do not need a formal model.

We do need a model of a kind, which is the simple counting model. It is a model, but it cannot be said to be formal in any sense statisticians use the word.

Now, keeping with our coin example, we will all agree that something caused the number of coins to be what it was. Perhaps several causes, taking cause in its full sense of formal, material, efficient, and final aspects. If we decide there are more than four coins—by simple counting—it is clear that a different cause or causes would be in effect than if there were four or fewer coins. Obviously!

There is no difference between the coin example and our physical-measurement example. Yet the two are treated entirely differently by statistical trend hunters.

Statistical trend hunters will do something like compute a regression on the data. If the coefficient for trend is coupled with a wee p-value, the trend is declared to be present; else it is not. This is different in spirit from the definitions of trend above. One definition could be an overall mean decrease or increase, as in a regression. But there is no sense to the idea that the mean change is or is not there unless a p-value or some other measure takes a certain value.
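
For concreteness, the trend hunter’s procedure amounts to something like the following sketch (scipy is an assumption here, as are the data and the usual 0.05 cutoff; this is a sketch, not an endorsement).

    # The "wee p" rule: declare a trend only if the slope's p-value is small.
    from scipy.stats import linregress

    y = [3.1, 2.9, 3.4, 3.0, 3.6, 3.2, 3.8, 3.5]  # illustrative data
    t = list(range(len(y)))                        # time index

    fit = linregress(t, y)
    trend_declared = fit.pvalue < 0.05             # the step objected to below
    print(fit.slope, fit.pvalue, trend_declared)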

How is it that the coins are “really” greater than four even though we see three? How is it that the coins “really” aren’t three unless some function of the number of coins through time gives a certain value? How is it that the mean increase or decrease “really” isn’t there, even though we can see it, unless the p-value is wee?

It’s the same question asked after a medical trial which showed a difference in treatments, but where the statistician says the difference that was seen isn’t “really” there, because the p was not wee.

If the difference is not “really” there, but visible, the not-really-there difference is said to be “caused by chance.” Same with trends.

There we have the real lesson. It’s all about cause. Or should be.

Chance is not real, thus it cannot cause anything. Yet some believe chance does exist, and that probability exists, too. If chance and probability exist, then causes can operate on them, for cause operates on real things. Cause must then act on the parameters of probability models, at least indirectly. How? Nobody has any idea how this might work. It can’t work, because it is absurd.

A complete discussion of cause and probability is in this paper. It is long and not light reading. That cause cannot operate on probability is another reason to reject p-values (and Bayes factors), for both of those measures ask us to believe that probability itself has been changed, i.e. caused to take different values.

(Incidentally, if you’re inclined to say “P-values have some good uses”, you’re wrong. Read this and the paper linked within.)

Cause is crucial. If a trend has been judged present, which happens when the p is wee, correlation suddenly becomes causation. The judgement is that the trend has a cause. It is true that all trends, however defined, have causes, but that is because every observation has a cause, and observations make up a trend.

Trend-setters say something different. They say the trend itself, the straight line, is real. Therefore, since the line is real, and real things have causes, the line must have a cause. That cause must have been a constant force of some kind, operating at precise regular intervals. If such a cause exists, as it can, then it should be easy to discover.

The problem is not that this kind of cause cannot exist, but that the identification is too easily made. Consider the problem of varying the start date of the analysis. We have observations from 1 to t, and we check for a trend using (the incorrect) statistical means. The trend, as in the picture above, is declared. It is negative. Therefore, the cause is said to be present.

Then redo the analysis, this time using observations 2 to t, then 3 to t, and so on. You will discover that the trend changes, and even changes sign, with every change verified by wee p-values. But this cannot be! The first analysis said a linear force was in operation over the entire period. The second, third, and subsequent analyses also claim linear forces were in operation over their entire periods, but these are different causes.

This picture shows just that, for the series above. For every starting point i, a regression with a linear trend was run on the data from i to t and the trend estimate plotted: blue for wee p-values and decreasing trends, red for wee p-values and increasing trends, black for trends, increasing or decreasing, with non-wee p-values.
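
Something like the following sketch would produce such a picture; it is a reconstruction under assumptions (scipy, a 0.05 cutoff, at least three points per fit), not the actual code behind the plot.

    # Refit the linear trend from every starting point i to the end of the series.
    from scipy.stats import linregress

    def rolling_start_trends(y, alpha=0.05):
        t = list(range(len(y)))
        results = []
        for i in range(len(y) - 2):                  # keep at least 3 points per fit
            fit = linregress(t[i:], y[i:])
            if fit.pvalue < alpha and fit.slope < 0:
                colour = "blue"                      # wee p, decreasing
            elif fit.pvalue < alpha and fit.slope > 0:
                colour = "red"                       # wee p, increasing
            else:
                colour = "black"                     # non-wee p
            results.append((i, fit.slope, fit.pvalue, colour))
        return results

Run it on almost any real monthly series and the declared “cause” comes and goes with the starting point, which is the point of the exercise.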

The statistician would be convinced a negative linear cause was in effect for the first few months. A different cause for the entire series from i to t. Then it went away! No causes were present, except “chance”, for a while; then a positive linear cause appeared. And appeared again. And again, each time different, each valid for the entire series from i to t. It becomes silly around 2013, where one month we are certain of a positive linear trend, the next we are certain there is “nothing”, then we are certain again of another positive trend, then “nothing” again. And so on.

This is a proof by absurdity that cause has not been identified when a trend is accompanied by a wee p-value or large Bayes factor.

Correlation is not causation. But we can put correlation to use. We can use the correlational (not causational) line to predict future values of the series. In a probabilistic sense, of course. Since linear causes might be in operation, perhaps approximately, the linear probability model might make skillful predictions.

But since we almost never have the future data in hand when we want to convince readers we have discovered a cause, it’s best then to do two things.

(1) Make the predictions; say that at time t+1 the value will be X +/- x, at t+2 it will be something else, and so on. This allows anybody to check the prediction, even if they don’t have access to the original data or model. What could be fairer?
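
A minimal sketch of what (1) could look like, assuming Python’s statsmodels, a made-up series, and a 90% predictive interval (the level is an illustrative choice, not a recommendation).

    # State the predictions up front so anyone can check them later.
    import numpy as np
    import statsmodels.api as sm

    y = np.array([3.1, 2.9, 3.4, 3.0, 3.6, 3.2, 3.8, 3.5])  # illustrative data
    t = np.arange(len(y))

    fit = sm.OLS(y, sm.add_constant(t)).fit()
    future = np.arange(len(y), len(y) + 3)                   # t+1, t+2, t+3
    pred = fit.get_prediction(sm.add_constant(future))
    frame = pred.summary_frame(alpha=0.10)                   # 90% intervals
    print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])   # "X +/- x" per step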

(2) DO NOT SHOW THE TREND. Showing it each and every time causes the Deadly Sin of Reification to be committed not just by you, but by your reader. He sees the line first and foremost. The line becomes realer than Reality. The stuff that happens outside the line is called “noise”, or something worse. No! The data is real: the data is what happened. The world felt the data, not the line.

Please pass this on to anybody you see, especially scientists, who use statistical methods to claim their trends are “significant”.


  1. I don’t know if this was intended as a ‘for dummies’ explanation but it certainly helped this dummy understand your thesis better. Thank you!

  2. X13-ARIMA-SEATS is used by many government agencies, banks, etc. around the world for seasonal adjustment, and it uses p-values and a variety of statistics. What is your well-developed and accepted substitute, exactly? Because “all we have to do is look” doesn’t pass the smell test.

    And modern time series essentially uses an iterated moving average window filter for a trend component, not a straight line. These things are quite well defined in the documentation and program code.
    Yes, output can change based on your model span (input). That’s not surprising or undesirable whatsoever.

    Justin

  3. Trend is one of several patterns that can be spied in charts. They are so easy to spot that many people see them when they are not there, ‘p-value’ or no. For example, at the following link, regression would have “found” a trend when what really happened was a shift. https://tofspot.blogspot.com/2013/05/the-wonderful-world-of-statistics-part.html

    The reason why it’s useful to distinguish among shifts, trends, cycles, etc. is that they point toward different species of causes. A shift indicates a cause that occurred at a particular time and suggests reviewing logs and such for clues; while a trend indicates a cause that occurred over a particular period of time, and for clues, the engineer should consider factors like tool wear, accumulation, aging, depletion, and the like. They don’t tell you what the cause is, but they do indicate what kind of cause you should look for.

  4. Justin:

    ‘What is your well-developed and accepted substitute, exactly because “all we have to do is look” doesn’t pass the smell test.’

    There is extensive research activity among a community of statisticians who are developing nonparametric and predictive methods to replace traditional parametric statistics. This author, with a shaky grasp of high-school math, is, needless to say, not part of this effort: he doesn’t publish in their journals, and seems unaware of their work, never citing any of it. He’s just a polemicist; a non-physicist who thinks he knows that all the physicists are wrong about climate, for example, because of “over certainty”, or something. You’re wasting your time if you expect a substantive response from this guy.

  5. Lee! We thought you might be dead. Nice to see you’re still animated.

    Still relying on the old argument-by-insult fallacy, eh? Stick with what works is what I say.

  6. Justin,

    Sure, there are scads of models in use. Obviously (well, maybe not; see Lee), this wasn’t a disquisition on strengths and weaknesses of every kind of time series model.

    The point is that no matter which model you use, the data is still the data, and the trend is there or not regardless of the size of any p. And that start and stop points can make a huge difference. You have read the many arguments against p-values in the linked paper, yes? And the other words about cause in that same paper?

    You can, like YOS, turn the model around and not make predictions of the future, but of the past. This is like a murder mystery. At what point did the manufacturing process go bad? I.e., at what point did a particular cause (out of presumably many) change? Well, a model can work for that (and certainly not a straight line).

    But that then becomes, like all statistics should become, a prediction problem. And that, therefore, calls for probability. As in “Given the model, data, and suppositions about causes, the probability this cause changed at X is P”. That sort of thing.

    It remains true that all uses of p-values are invalid, as proved previously.

  7. “You have read the many arguments against p-values in the linked paper, yes?”

    Yes, I am familiar with them. I am more wondering, however, what the arguments are against the proposed alternatives to p-values, or how the pros of said alternatives stack up against the pros of p-values. For example, a reader may be wondering why ‘looking’ is not relied upon by official agencies to make important decisions but X13-ARIMA-SEATS is.

    Justin

  8. Justin,

    X13-ARIMA-SEATS is used for forecasting, primarily. And forecasting, which is to say predicting, is exactly what I do advocate. That is the alternative to hypothesis testing/Bayes factors. If you have to adjust for seasonality, and want to assume constant variance, then go for ARIMA. Make predictions with that model, using, of course, predictive probabilities (and not confidence or credible intervals).
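
    For the flavor of it, a minimal sketch, assuming Python’s statsmodels and made-up data; the ARIMA order and the 90% level are illustrative, not recommendations.

        # Fit an ad hoc ARIMA model and state predictive bounds for future points.
        import numpy as np
        from statsmodels.tsa.arima.model import ARIMA

        rng = np.random.default_rng(1)
        y = np.cumsum(rng.normal(0.05, 1.0, 120))   # made-up monthly series

        fit = ARIMA(y, order=(1, 1, 1)).fit()       # illustrative order only
        fc = fit.get_forecast(steps=6)              # predict the next 6 months
        print(fc.predicted_mean)                    # central predictions
        print(fc.conf_int(alpha=0.10))              # 90% predictive bounds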

    If you don’t like ARIMA, use GARCH, or use whatever you like. It’s all ad hoc anyway. I am not pushing this or that model. I am discussing how we know if we have identified cause—and the possible nature of that cause. I used a linear model for ease of explanation, which I assumed as clear.

    I know I said in all those papers that ad hoc models can be useful in making predictions. I have been conducting an on-line class in these techniques for over a year, expounding the same ideas. Try it!

    That a model fits past data well does not mean it will make good predictions, nor does it imply that the model has identified causality.

    If you agree p-values are invalid, I don’t see that we have any disagreement.

  9. “X13-ARIMA-SEATS is used for forecasting, primarily. And forecasting, which is to say predicting, is exactly what I do advocate. That is the alternative to hypothesis testing/Bayes factors.”

    I don’t agree that p-values are invalid.

    I’m saying X13-ARIMA-SEATS is successful and it uses p-values.

    Justin

  10. Justin,

    It might help if you spelled out your argument in a little more detail. For instance, you can’t be saying that p-values are reliable because they are relied on, since, after all, plenty of people rely on horoscopes (say). But if you say p-values have a track record of success, I wonder whether you judge that using p-values, or just looking, or some other way. In the first case, you beg the question; in the second, you concede the argument to Matt. So do you have a third way of judging the success of p-values?

    Anyway, if you spell out the argument, I’m sure all the confusion will be cleared up.

    “For instance, you can’t be saying that p-values are reliable because they are relied on, since, after all, plenty of people rely on horoscopes (say). But if you say p-values have a track record of success, I wonder whether you judge that using p-values, or just looking, or some other way. In the first case, you beg the question; in the second, you concede the argument to Matt.”

    Tim, I’m not going to go back and forth with word games. One uses p-values (the distance the observed test statistic is from the model) to formalize the ill-defined “just look”. Experimental design, time series, quality control, and survey sampling, to name a few areas, are successful sciences using p-values and not using “just look”.

    I’ll just post further about it at my site.

    Cheers,
    Justin

  12. Another view: You haven’t provided us with enough information to decide whether there is a meaningful trend in the data. I’m happy to personally define a trend as the slope returned by linear regression. By eye, a linear trend doesn’t explain much of the variance in this data. I suspect a 95% confidence interval for the linear term would easily include zero. So I don’t need to waste time looking for a cause of the 30-year change in this data when it isn’t obvious there must be a cause. There may be an interesting cause for the high-frequency changes, but their magnitude isn’t very constant.

    Without knowing the nature of the data, it may not make scientific sense to even look for a linear trend. If this were global temperature anomaly change, I would expect the fairly linear increase in forcing (0.4 W/m2/decade) to produce a linear trend (called the transient climate response) with a lot of noise from chaotic heat transfer within our climate system. In that case, looking for a linear trend makes sense.

  13. The goal of statistics is to help abstract some meaning from raw data, but scientific progress does NOT begin with a purely statistical analysis of data. There are an infinite number of possible statistical models that can be applied to any data set. Scientific progress BEGINS with a hypothesis, and that hypothesis determines what model should be applied to the data. Any time you present data in the absence of a hypothesis, you aren’t illustrating anything useful about how scientists obtain meaning from data. What we are trying to do is determine how consistent the data is WITH A PARTICULAR HYPOTHESIS, not to rule out all other hypotheses.

    There is no doubt from laboratory experiments and experiments in the atmosphere itself that rising GHGs will slow the rate of radiative cooling to space. This slowdown is called radiative forcing. In the absence of other changes in the radiative balance at the TOA, conservation of energy demands that it warm somewhere below the TOA. After several decades, a radiative imbalance on the order of 1 W/m2 causes a buildup of energy so big that it can’t easily be hidden (except in the deepest oceans or underground, places current theories of heat transfer say are impossible to reach). Two-compartment models for heat transfer show that the surface should warm at a linear rate on a decadal time scale. Of course, you won’t find a convincing plot of forcing vs warming anywhere, because climate scientists don’t really like to do experiments that test their hypotheses.
