Be sure to first read “Statistical Significance Does Not Mean What You Think. Climate Temperature Trends” and then “Regression Is Not What You Think. Climate And Other Examples,” as this post is an extension of them.
This post will only be a sketch, and a rough one, of how to pick explanatory variables in regression. The tools used work for any kind of statistical model, however, and not just regression.
First remember that a regression is not a function of observables, like y, but of the central parameter of the normal distribution which represents our uncertainty in y. This language is tangled, but it is also faithful. Regression is not “empirical fits” or “fitting lines to the data” or any other similar phrase. We must always keep firmly in mind that we are fitting a model to parameters of a probability distribution which itself represents our uncertainty in some observable. Deviations from the language I use are part of what is responsible for the rampant over-certainty we see (a standard topic of this blog).
Here is a regression:
y_t = β_0 + β_1 x_1 + β_2 x_2 + … + β_p x_p + ε
where each of the x_i is potentially probative of y. It is up to you to collect these x’s. How many x’s are there for you to consider in any problem? A whoppingly large number. A number so large that it is incomprehensible.
Anything can be an x. The color of the socks of the people who generated or gathered the data can be an x. The temperature in Quebec might just be correlated with your y. The number of monkeys in Suriname could be predictive of your y. The number of nose hairs of Taipei bus drivers could be relevant to saying something about your y.
You might think this silly, but how can you know these x’s are not correlated with your y if you don’t try? You cannot. That is, you cannot know with certainty whether any x is uncorrelated with your y unless you can prove logical independence between x and y. And that isn’t easy: it really can’t be done except in mathematical and logical proofs using highly defined objects. For real-world (contingent) data, logical independence is hard to come by (see the footnote).
A sociologist (whose name I cannot look up because I am too pressed for time) said words to the effect that, in his field, everything is correlated with everything else. This is only a slight exaggeration. In any case, it remains true that for any contingent x and y, logical independence¹ is denied us and so we must instead look to irrelevance.
Irrelevance is when, for some propositions x and y (data are observation statements, or propositions),
Pr(y|x & E) = Pr(y|E).
That is, the probability of y given some evidence E remains unchanged if we also suppose we know x. This tells us that, to decide whether an x should be in our regression equation, we should examine whether x is relevant or irrelevant to knowing (future) values of y.
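To make the definition concrete, here is a minimal sketch with two true/false propositions. The joint probabilities are invented purely for illustration, and the function names (pr_y, pr_y_given_x, is_irrelevant) are my own labels, not anything standard.

```python
# Hypothetical joint probabilities Pr(x & y | E) for two true/false
# propositions. Keys are (x, y). The numbers are invented for illustration.
joint_irrelevant = {
    (True, True): 0.12, (True, False): 0.28,
    (False, True): 0.18, (False, False): 0.42,
}
joint_relevant = {
    (True, True): 0.30, (True, False): 0.10,
    (False, True): 0.10, (False, False): 0.50,
}


def pr_y(joint):
    """Pr(y | E): add up every case in which y is true."""
    return sum(p for (x, y), p in joint.items() if y)


def pr_y_given_x(joint):
    """Pr(y | x & E) = Pr(x & y | E) / Pr(x | E)."""
    pr_x = sum(p for (x, y), p in joint.items() if x)
    return joint[(True, True)] / pr_x


def is_irrelevant(joint, tol=1e-9):
    """x is irrelevant to y when knowing x leaves Pr(y | E) unchanged."""
    return abs(pr_y_given_x(joint) - pr_y(joint)) < tol


for name, joint in (("first table", joint_irrelevant),
                    ("second table", joint_relevant)):
    print(name, round(pr_y(joint), 3), round(pr_y_given_x(joint), 3),
          is_irrelevant(joint))
```

In the first table, learning x does nothing to the probability of y. In the second, it moves that probability a good deal, so x is relevant.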
Recall that our goal for any probability model is to make statements like this:
Pr(y_new > a | x_new, old observed data, model true)
where we pick interesting a’s or we pick other questions about y_new which are interesting to us. If this probability is the same if we do not condition on x, then x is irrelevant to y and should not be included in our model. In notation (for our math readers): If
Pr(y_new > a | x_i, other x’s, old data, model true) = Pr(y_new > a | other x’s, old data, model true)
then x_i is irrelevant to y, so it should not appear in our model.
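A rough sketch of this comparison for a single x follows. Loud assumptions: the data are invented, the helper names are hypothetical, and the probabilities are "plug-in" values that treat the fitted parameters as if they were known. A fuller treatment would account for the uncertainty in the parameters too, which this sketch ignores for brevity.

```python
from statistics import NormalDist, mean


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual standard deviation."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5


def pr_exceeds_with_x(xs, ys, x_new, a):
    """Plug-in Pr(y_new > a | x_new, old data, model true)."""
    b0, b1, s = fit_line(xs, ys)
    return 1 - NormalDist(b0 + b1 * x_new, s).cdf(a)


def pr_exceeds_without_x(ys, a):
    """Plug-in Pr(y_new > a | old data, model true), with x left out."""
    ybar = mean(ys)
    s = (sum((y - ybar) ** 2 for y in ys) / (len(ys) - 1)) ** 0.5
    return 1 - NormalDist(ybar, s).cdf(a)


# Invented data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 7.2, 7.8, 9.1]
a = 8.0  # the question we chose to ask: will y_new exceed 8?

for x_new in (2, 5, 8):
    with_x = pr_exceeds_with_x(xs, ys, x_new, a)
    without_x = pr_exceeds_without_x(ys, a)
    print(x_new, round(with_x, 3), round(without_x, 3))
```

If the two columns barely differ for the a's and x_new's you care about, x is (practically) irrelevant to that question. How big a difference matters is your decision, not the software's.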
Classical statistics tells us you cannot say which x’s you should include unless you first do a hypothesis test. This is a highly artificial construct which often leads to error. That is, some x’s can be relevant to knowing y even though the p-values of those x’s are larger than the magic number (oh, 0.05! how I love thee!), and some x’s can be irrelevant to y even though their p-values are less than the magic number.
And this holds equally for Bayesian posterior distributions of the parameters. Some x’s can be relevant to y even though the posteriors of their parameters show a large probability of equaling (or being near) 0, and other x’s can be irrelevant to y even though the posteriors of their parameters show a large probability of not equaling (or being near) 0.
In other words, relevance as a measure of model inclusion does away with all discussions of “clinical” versus “statistical significance.” It also removes all tricks, like the one where increasing your sample size guarantees a publishable p-value (one which is less than the magic number).
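The sample-size trick can be sketched with a little arithmetic. Everything here is an assumption made for illustration: a two-group comparison stands in for a regression with one yes/no x, the spread sigma is taken as known, the true difference delta is tiny but not zero, and the observed difference is taken to equal delta exactly. The function names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def typical_p_value(delta, sigma, n_per_group):
    """Two-sided z-test p-value when the observed mean difference is delta."""
    z = delta / (sigma * (2 / n_per_group) ** 0.5)
    return 2 * (1 - std_normal.cdf(abs(z)))


def predictive_gap(delta, sigma, a):
    """Pr(y_new > a | x = 1) - Pr(y_new > a | x = 0), parameters known."""
    pr_with = 1 - NormalDist(delta, sigma).cdf(a)
    pr_without = 1 - NormalDist(0.0, sigma).cdf(a)
    return pr_with - pr_without


delta, sigma, a = 0.02, 1.0, 0.0  # invented values

for n in (100, 10_000, 20_000, 100_000):
    print(n, typical_p_value(delta, sigma, n))

# The answer to the question about the observable does not depend on n.
print(predictive_gap(delta, sigma, a))
```

The p-value marches toward zero as n grows, while the change in the probability of the observable stays under one percentage point no matter how much data you collect.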
Relevance is the fairest measure because it puts the decision directly in terms of observables—and not in terms of unobservable parameters. We ask questions of the y that are meaningful to us—and these questions will change from problem to problem. We create the questions, not some software package. We need not rely on a one-size-fits-all approach like hypothesis testing or posterior examinations. We can adapt each analysis to the problem at hand.
Why isn’t everybody jumping on the relevance bandwagon? Ah, this is it. Ease. The relevance way puts the burden of decision making squarely on you. It is (as we shall see when I do examples later) more work. Not computationally; not really. But it doubles the amount of mental effort an analyst must put into a project. It makes you really think about what probabilities mean in terms of observables. It also removes the incredible ease of glancing at p-values (or posteriors), of having the software make the decisions for you.
But this is the least of it. Far, far worse is that relevance absolutely destroys the goosed-up certainty found in classical (hypothesis testing and posterior examination) methods. Whereas before relevance you might find dozens of x’s that are “highly significant!” for explaining y, with relevance you’ll be lucky to find one or two, and those won’t be nearly as exciting for explaining y as you had thought (or hoped).
And that is bad news for your prospects of publishing papers or developing new “findings.”
¹ Logical independence exists if and only if none of the conjunctions “x & y”, “x & not y”, “not x & y”, and “not x & not y” is necessarily false. It is also the case that for some logically independent x and y, the x (or y) might be relevant to knowing y (or x).
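For highly defined objects the footnote's check can be carried out by brute force. The sketch below is only an illustration under a strong assumption: enumerating a small finite domain of integers stands in for "not necessarily false." The propositions and the helper names are hypothetical choices of mine.

```python
from itertools import product


def possible_combinations(x, y, domain):
    """Every (x, y) truth combination that actually occurs over the domain."""
    return {(x(n), y(n)) for n in domain}


def logically_independent(combos):
    """True when all four conjunctions of x, y and their negations can occur."""
    return all(c in combos for c in product((True, False), repeat=2))


def is_even(n):
    return n % 2 == 0


def divisible_by_3(n):
    return n % 3 == 0


def divisible_by_4(n):
    return n % 4 == 0


domain = range(1, 13)

# "n is divisible by 4" and "n is even": the conjunction
# "x & not y" can never be true, so these are not independent.
print(logically_independent(
    possible_combinations(divisible_by_4, is_even, domain)))

# "n is divisible by 3" and "n is even": all four conjunctions occur.
print(logically_independent(
    possible_combinations(divisible_by_3, is_even, domain)))
```

For contingent, real-world x and y no such enumeration is available, which is why we fall back on irrelevance.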