Today, just one example, and the simplest kind, to show you that using regular regression, with its assumptions of “normality”, can quickly lead to absurdities—absurdities which will pass unnoticed using classical methods.
This is a real example, using real data, an actual regression used operationally. The data could not be simpler: time to a certain event for one of five departments (here labeled A – E; the source of the data is confidential). Times to events, of course, must be greater than 0. The question is whether different departments have different times; the hope was that the two departments compared below would not.
This example is typical of analyses carried out everywhere, every day. I am not justifying the use of regression here (better models exist), but I am saying what would happen if it were used, as it was, and as similar models routinely are.
An examination of the boxplots of the times for each department would show roughly “normal” data. There is nothing untoward in the data (to the eye of the typical user). This is classical analysis of variance, a.k.a. ANOVA, which is equivalent to linear regression. Here’s the ANOVA table from the R glm() function:
We needn’t say much about this except that each p-value is pleasingly small, i.e. publishable, i.e. less than the magic number. Null hypotheses aplenty were rejected. The person writing up the results said that there were “statistically significant differences” between the departments. That there were differences we already knew: for one, the departments are different! For another, we could have just looked at the data we took and seen there were differences. We didn’t need a model for that.
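Since the real data are confidential, the mechanics can only be sketched with invented stand-in numbers. Below is a minimal hand-rolled one-way ANOVA, which is equivalent to the regression described above; the five department means and the sample sizes are made up purely for illustration and are not the actual data.

```python
import random
from statistics import mean

random.seed(1)

# Invented stand-in data: the real times are confidential.
# Five departments, "normal"-looking times around different means.
groups = {
    d: [random.gauss(mu, 2.0) for _ in range(30)]
    for d, mu in zip("ABCDE", [5.0, 4.0, 6.5, 5.5, 7.0])
}

# Classical one-way ANOVA computed by hand (equivalent to the
# linear regression of time on department indicators).
grand = mean(t for g in groups.values() for t in g)
k = len(groups)                          # number of departments
n = sum(len(g) for g in groups.values()) # total observations

ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
ss_within = sum((t - mean(g)) ** 2 for g in groups.values() for t in g)

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f"F({k - 1}, {n - k}) = {f_stat:.2f}")  # a large F yields a small p-value
```

With group means spread apart like this, the F statistic comes out large and every classical test declares "significance", exactly as in the table described above.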
Now look at this:
The first five plots are the posterior distributions of the parameters of the model. Each has a probability near 1 that the parameter is far from 0, i.e. high probability that the parameter “belongs” in the model. Thus, whether frequentist or Bayesian, one would say that, yes, clear differences in “mean” times could be seen between departments.
However, gaze at the bottom right picture. It shows two distributions (actually “densities”) for our uncertainty in future times for the departments B (solid line) and C (dashed line) given the old data and assuming the model we used is true.
Assuming a true model and the old data, we can calculate the probability that future (not yet seen) values of time in department C will be greater than times in department B: this probability is 78%. If our uncertainty in the values of times for both departments were the same, the probability that times in department C would be greater than times in department B would be 50%. Thus, just like the classical analysis, the modern one would indicate that knowledge of department is relevant to our knowledge of the uncertainty in times. So all is well.
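Under this model the predictive distribution for a new time in each department is normal, so the comparison reduces to asking about a normal difference. The parameters below are invented for illustration (they are not the actual fit); they are merely chosen so the computed probability lands near the 78% reported above.

```python
from statistics import NormalDist

# Hypothetical predictive distributions for a future time in each
# department (illustrative parameters, not the actual fitted values).
pred_B = NormalDist(mu=4.0, sigma=2.0)
pred_C = NormalDist(mu=6.2, sigma=2.0)

# P(future C time > future B time): the difference C - B of two
# independent normals is itself normal.
diff = NormalDist(mu=pred_C.mean - pred_B.mean,
                  sigma=(pred_B.stdev ** 2 + pred_C.stdev ** 2) ** 0.5)
p_c_gt_b = 1 - diff.cdf(0)
print(f"P(C > B) = {p_c_gt_b:.2f}")
```

If the two predictive distributions were identical, `diff` would be centered at 0 and the probability would be exactly 50%, which is the "no relevance" benchmark the text mentions.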
Except take a closer look. Our uncertainty in future values of times in department B indicates there is about a 40% chance of times less than 0, i.e. values which are absurd. This is a huge error, but one commonly seen.
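Whatever the actual fitted numbers were, any normal distribution whose mean is small relative to its spread behaves this way. With invented illustrative numbers:

```python
from statistics import NormalDist

# Any normal distribution puts some mass below zero; when the mean
# is small relative to the spread, that mass is substantial.
# Illustrative numbers only: mean 0.5, standard deviation 2.0.
pred = NormalDist(mu=0.5, sigma=2.0)
p_negative = pred.cdf(0)
print(f"P(time < 0) = {p_negative:.2f}")  # about 0.40
```

No data-set of strictly positive times can make this mass disappear; it is baked into the normal assumption itself.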
The fault lies in assuming the times were “normal.” They were not. No thing is. What we assumed was that our uncertainty in the times could be characterized (or quantified) by a normal distribution, the central parameter of which was allowed to vary between departments: that’s our model. A logical consequence of this assumption is that bottom-right picture.
Understand, what that picture shows is true assuming our premises, i.e. our model, are true. We cannot learn from that picture that the model is absurd unless we look outside the model, as we did when we recalled that times less than 0 are impossible.
You might object, “Look, we wanted to say whether there were differences. We have both small p-values and high posterior probabilities. So what if the predictive distribution is ridiculous? I have what I wanted.”
But those p-values and posteriors are also conditional on the model’s truth. Since, upon accepting the premise that times less than 0 are impossible, we know the model is false and not even close to a good approximation, we have good evidence that those p-values and posteriors are also bad, i.e. misleading, i.e. false. You are too sure of your conclusions.
Besides, if all you wanted was to say there were differences, all you had to do was look: yes, the box plots of times between B and C were different. Accepting this, the probability they were different was 1, i.e. 100%, i.e. it is true there were differences. What better evidence could you want?
“But how could I know whether those differences arose by chance? That’s why I had to use regression. P-values and posteriors can tell me whether the differences I saw were real or were due to chance.”
Did you think the differences you saw were unreal? Something caused the differences we saw. What? “Chance” isn’t alive, Chance isn’t a mystical entity, small-c-chance isn’t a cause. But if we could identify the actual causes, then, once again, we would know with certainty not only there were differences but why those differences arose. It’s only because we don’t know the causes that we had to resort to characterizing our uncertainty using a probability model.
Since we did see differences, the only question is whether those differences will persist (if we continue observing data). And we can’t know that unless we characterize our uncertainty in the times. We did this using regression, where we allowed the central parameter of a normal distribution to vary based on department. But we saw that, even though, assuming this model true, we believe there is a 78% chance the differences would persist, we also had solid evidence that the model is false. What we should do is begin again and better characterize our uncertainty in the times, i.e. come up with a better model. And that is all we can say with sufficient certainty for now.
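One route, offered only as a sketch and not as the author's recommended fix: characterize uncertainty in a strictly positive time with a distribution that respects the constraint, for instance by modeling log(time) as normal (so time itself is lognormal). The parameters below are invented for illustration.

```python
import math
from statistics import NormalDist

# Sketch only: model log(time) as normal, so time is lognormal and
# strictly positive by construction. Parameters are invented.
mu_B, mu_C, sigma = math.log(4.0), math.log(6.0), 0.5

# P(future C time > future B time) = P(log C - log B > 0), and the
# log-difference of two independent lognormals is normal.
diff = NormalDist(mu=mu_C - mu_B, sigma=math.sqrt(2) * sigma)
p_c_gt_b = 1 - diff.cdf(0)

# A lognormal assigns zero probability to negative times:
p_negative = 0.0  # exp(anything) > 0

print(f"P(C > B) = {p_c_gt_b:.2f}")
print(f"P(time < 0) = {p_negative}")
```

The point is not that the lognormal is the right model, only that it cannot produce the impossible-negative-times absurdity, so its predictive probabilities are at least not logically disqualified before we start.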
Incidentally, I did not have to ask what is the probability that times in C would be greater than B: I could have asked any question that was important to me. Anyway, everything I know about future values of C and B is shown in that picture.
This is only one example. I have not, by far, exhausted all the ways classical methods lead to over-certainty and mistakes.