We’ve done regression a hundred times, but it isn’t sticking. That means my explanations are failing. Let me try again.
Everybody knows normal distributions, i.e. bell-shaped curves: in shorthand N(m,s), where the “m” is the central parameter, which describes where the peak of the bell is centered, and “s” is the spread parameter, which describes the width of the bell. Both parameters are needed to draw the curve. The one at the top of the post has m = 3 and s = 1.
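To make the two parameters concrete, here is a small sketch (the function name is mine, purely illustrative) evaluating the density of the very curve described above, with m = 3 and s = 1:

```python
import math

def normal_pdf(x, m, s):
    """Density of N(m, s) at x: the peak sits at m, the width is set by s."""
    return math.exp(-((x - m) ** 2) / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

# The curve at the top of the post: m = 3, s = 1.
peak = normal_pdf(3.0, m=3.0, s=1.0)       # density at the peak, x = m
one_s_out = normal_pdf(4.0, m=3.0, s=1.0)  # one spread-unit away: lower
```

Move m and the whole bell slides left or right; change s and the bell widens or narrows. That is all the two parameters do.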
Suppose we wanted to characterize our uncertainty in the GPA of a Harvard student. Well, that’s 4.0, because the little darlings enrolled there deserve nothing less. So let’s pick another school. Bowling Green. We might, for no good reason and many bad ones, use a normal distribution to quantify this uncertainty. Hey. Everybody else does it. Why not us?
That means the normal distribution which characterizes our uncertainty in the GPA of a Bowling Green student (singular) has some m and some s. Both are needed. Right?
Regression is the formula:

m = b_0 + b_1*x_1 + b_2*x_2 + … + b_p*x_p

where the b's are coefficients to be estimated and the x's are the "explanatory" variables.
Do you see? We model the m; we say the m is a function of various things, the x’s. Maybe one of the x’s is age, another is sex, a third is income, a fourth is BMI, a fifth is height, a sixth is presence of some gene, a seventh is whether the individual is a science major, an eighth is whether the individual is Caucasian, a ninth is high school GPA, a tenth is SAT score, an eleventh, a twelfth, thirteenth, and on and on and so on some more. You’d never make it as a sociologist unless you can think of at least two dozen entries.
Wee p-values are invoked, via mathematical incantation, to decide which of the x’s to keep. But whichever x’s are there, the following interpretation holds. Each combination of x’s implies a different value of m. Each combination of x’s puts the peak of the bell curve at a new value. The spread is always the same. It does not matter what any x equals, s gives the same spread to everything. Where by “everything”, I mean everything.
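The setup can be sketched in a few lines. The coefficients and the spread below are made-up numbers, there only to show the structure: each combination of x's moves m, while s never budges.

```python
# Uncertainty in GPA is modeled as N(m, s), with m a linear
# function of the x's. Coefficients are purely hypothetical.
b0, b_age, b_hs_gpa = 1.0, 0.01, 0.5  # made-up coefficients
s = 0.4                               # ONE spread, the same for everybody

def m_of(age, hs_gpa):
    """Central parameter: the only thing the x's are allowed to move."""
    return b0 + b_age * age + b_hs_gpa * hs_gpa

# Two different students => two different peaks, but the SAME s.
m1 = m_of(age=19, hs_gpa=3.2)
m2 = m_of(age=22, hs_gpa=3.9)
```

Whatever the x's equal, the bell keeps the identical width s; only its center shifts.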
Sharp readers will have noticed that there is not word one about causality. Because why? Because regression is silent on this important subject. Regression only does what we asked it to do: to characterize our uncertainty in some observable (here GPA) using a normal distribution with a central parameter given as a function of some x’s.
Regression does not say anything direct about the observable. It does not say what the value of the observable will equal given some combination of x’s. Regression would be a causal model if that were true. And it only indirectly says anything about the probability the observable will take any value given the x’s—but that’s because of screwy limitations of normal distributions and classical procedures, which we’ll skip here.
Consider that the x which represents whether the individual is Caucasian increases m. That does not therefore mean whites have higher GPAs than non-whites. No no no no no. No. It says, very indirectly and after a manipulation most people forget or neglect to do, that given this model and these x's, the probability whites have higher GPAs than non-whites is greater than 50%. (The exact probability can be calculated.)
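That "exact probability" is a straightforward calculation, since the two groups share the same s and differ only in m: the difference of two independent normals with means differing by Δm and common spread s is N(Δm, s√2), so P(A > B) = Φ(Δm / (s√2)). A sketch, with made-up numbers:

```python
import math

def prob_a_exceeds_b(delta_m, s):
    """P(A > B) when A ~ N(m + delta_m, s) and B ~ N(m, s), independent.
    A - B ~ N(delta_m, s*sqrt(2)), so P(A > B) = Phi(delta_m / (s*sqrt(2)))."""
    z = delta_m / (s * math.sqrt(2))
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical numbers: the coefficient nudges m up by 0.1, with s = 0.4.
p = prob_a_exceeds_b(delta_m=0.1, s=0.4)
# Any positive delta_m pushes the probability above 50% -- here only slightly.
```

Notice how modest the answer is for a small shift in m: a "statistically significant" coefficient can translate into a probability barely above a coin flip.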
Are these the right x’s? Maybe. What do you mean by “right”? Remember: regression doesn’t say what causes the observable, only which x’s change our understanding of the uncertainty of the observable. So if by “right” you mean just those x’s which cause the observable, then almost certainly we don’t have them. Consider our example.
Race cannot cause GPA to take any value. How could it? You might try claiming that “racism” is what accounts for whites having higher m’s, but you do so with no direct proof. All you can see is that, with the combination of x’s in your model of m, whites have a positive contribution to m. To say that racism is the cause is to eschew all others—an act of will. Why? Because there are an infinite number of possible causes of why whites have, in the presence of these x’s, a positive contribution. Plus, this is just one model, the normal, out of many we could have chosen.
I emphasize “the presence of these x’s”, because in the presence of other x’s, the contribution of Caucasian, or any other variable, might switch signs, or might even evince a p-value not wee enough to keep it in the model. The only way to know what these changes would be is to check and see. But since we work with what we have, we’re stuck with the x’s on hand.
Yet it is difficult—impossible?—to find any “researcher” who does not abuse the definition of regression, who does not imply that his x’s are true causes, who does not say that changes in m are changes in the observable.
What bad teachers we statisticians are.