Which x’s to pick? Unfortunately, there is an infinite universe from which to draw, and here we come closer to resembles. It could be, for instance, that a person’s shoe color is important to understanding the uncertainty in y. Don’t laugh. Brown- and black-shod people would tend to wear dress or adult shoes, while white- and multi-color-shod folks tend to wear casual shoes, perhaps coming from a different socio-economic-political background; and surely men and women wear different colors. So understanding a person’s shoe color could change our understanding of y. Do not forget that regression is not telling us what causes an outcome, but how our uncertainty in the outcome changes with respect to external factors, like shoe color.
We could go on endlessly, telling stories about possible x’s. Each could be tested; its effect on the uncertainty of y measured. Those x’s which do not change the uncertainty in y (in the presence of other x’s), or change it very little, are usually tossed out of the model, while those that change it a lot—where “a lot” depends on the situation, i.e. how the results will be used—should be kept. There is an art form to this, with many techniques and rules-of-thumb (note: the x’s already in the model change how new x’s interact; the real process is explosively complex).
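A minimal sketch of "testing" an x, under loud assumptions: the data are invented, y is yes/no, and the probabilities come from simple counting with a uniform prior (Laplace's rule of succession) rather than from a fitted regression. The names `predictive_prob`, `probs_by_level` and `spread` are made up for illustration. It also looks at one x at a time, which is exactly the simplification warned about above, since the x’s already in the model change what a new x does.

```python
from collections import defaultdict

def predictive_prob(successes, trials):
    """Pr(next y = 1 | data) under a uniform prior (rule of succession)."""
    return (successes + 1) / (trials + 2)

def probs_by_level(rows, x_name):
    """Return {level of x: Pr(y = 1 | x = level, data)} for one candidate x."""
    counts = defaultdict(lambda: [0, 0])  # level -> [successes, trials]
    for row in rows:
        counts[row[x_name]][0] += row["y"]
        counts[row[x_name]][1] += 1
    return {level: predictive_prob(s, n) for level, (s, n) in counts.items()}

def spread(rows, x_name):
    """Largest gap in Pr(y = 1) across the levels of x: a crude measure of
    how much knowing x changes our uncertainty in y."""
    probs = probs_by_level(rows, x_name).values()
    return max(probs) - min(probs)

# Invented data, purely for illustration.
rows = [
    {"y": 1, "sex": "F", "shoe": "brown"},
    {"y": 1, "sex": "F", "shoe": "brown"},
    {"y": 1, "sex": "F", "shoe": "white"},
    {"y": 0, "sex": "F", "shoe": "white"},
    {"y": 0, "sex": "M", "shoe": "brown"},
    {"y": 0, "sex": "M", "shoe": "brown"},
    {"y": 0, "sex": "M", "shoe": "white"},
    {"y": 1, "sex": "M", "shoe": "white"},
]

for x_name in ("sex", "shoe"):
    print(x_name, probs_by_level(rows, x_name), round(spread(rows, x_name), 3))
```

In this toy data, knowing sex moves the probability of y and knowing shoe color does not, so shoe color would be the one tossed. Whether a given spread counts as "a lot" is still a judgment about how the results will be used.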
But since there are always an infinite number of x’s, it is we who must decide which enter our “model” and which are left out. The same is true of any study: it is the researchers in charge who decide which x’s to put in and which to eschew. There may be better x’s which they ignored, where better is used in the sense of changing the uncertainty in y more, or they may have left in x’s which are only coincidentally related to y, perhaps fooled by a mathematical quirk (again, regression is not a causal model; think of the correlation of ice cream sales and drownings). The point to remember is that the list you see is not, except in very special situations where the experimental environment was tightly controlled, “scientifically determined.”
And even when it was controlled, there are still unmeasured “factors” which might have caused y to change. This is the nature of a contingent y. A contingent thing does not necessarily (by logical necessity) take the values it does: the y’s of regression are always contingent on the state of other “variables”; y only takes the values it does because of these other things. The y’s are the effects of some cause of which we are admitting ignorance. If we knew the cause of y in its totality, there would be no need for regression or probability. Science’s only concern, then, is with the contingent; things which are permanent or are necessary belong to metaphysics, to philosophy. Thus no matter what, the list of x’s in a regression is always to some extent the result of subjective judgment. This is why there is always room for criticism.
We can now define, loosely, resembles: when future data isn’t “too far away” in its x’s from the x’s used to build the model. “Far away” implies a distance, but how do you measure how far away data is? If our only modeled x was biological sex, we can easily see if future data has men and women. But the initial people we used as our data have an infinite list of attributes we didn’t measure, and some combination of these was the cause of y. It’s those characteristics that have to resemble the characteristics of future data. Incidentally, “randomizing” samples does not—no, no, no—guarantee a “balance” of unmeasured characteristics. It is time for this cherished myth to die.
We might guess that, because the men and women we measured were, say, spoiled college kids, the factors which go into being a spoiled college kid are important to understanding y. We didn’t use these factors in the model, but we suspect that if we did, the probabilities of y would change. Thus, future data should resemble other spoiled college kids; those people belonging to a retirement community in Florida are probably “farther away” and the results not as applicable. Or maybe college kids in (for instance) Pakistan are also distant. Or it could be spoiled college kids of this country, but matriculating ten years hence, are far away.
Except when y is something physical, and small and tightly constrained, i.e. where we are confident of a multitude of outside information that describes and delineates y, we will never be able to say with accuracy how much new data resembles old (and even in the physical cases, we are never completely sure; if we were, then y would not be contingent). Just how “far” are college students in Pakistan from those in the States? Nobody knows. Or, that is, anybody that claims to know is fibbing or fooling himself.
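For the x’s that were actually measured, "far away" can at least be roughed out. Here is one hedged sketch: it scores a new observation by its largest gap from the old data, in units of the old data's standard deviation. The function name `standardized_distance`, the variables and the numbers are all invented, and this is only one of many possible distances. It says nothing whatever about the unmeasured attributes, which is where the real trouble lives.

```python
from statistics import mean, stdev

def standardized_distance(old_xs, new_x):
    """Largest per-variable gap between new_x and the old data,
    measured in standard deviations of the old data."""
    gaps = []
    for name, value in new_x.items():
        column = [row[name] for row in old_xs]
        gaps.append(abs(value - mean(column)) / stdev(column))
    return max(gaps)

# Invented measurements: age in years, weekly study hours.
old_xs = [
    {"age": 19, "hours": 10},
    {"age": 20, "hours": 12},
    {"age": 21, "hours": 14},
    {"age": 20, "hours": 12},
]
college_kid = {"age": 20, "hours": 12}
retiree = {"age": 70, "hours": 12}

print(standardized_distance(old_xs, college_kid))  # sits at the center of the old data
print(standardized_distance(old_xs, retiree))      # dozens of standard deviations out on age
```

A small number here does not certify resemblance. Two people can match on every measured x and still differ in whatever combination of unmeasured things caused y.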
Point is: even if results were given in the manner outlined above, there are always residual uncertainties which could be enormous. It’s a good bet that if the regression in any way models human behavior, it is over-certain.
Now it doesn’t matter what the y is or the x’s are, the idea is always the same: to understand how the x’s change the uncertainty in y. Software will handle the tricky mathematics, but users have to create the scenarios (the combinations of x’s) that are of interest. Unfortunately, regression as she is used is rarely like this. Why? It’s a lot of work! Instead, something else happens. To understand what, we have to delve under the hood.
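To make "create the scenarios" concrete, here is a small sketch that lists every combination of two x’s and reports the probability of y for each. Everything in it is an assumption for illustration: the data are invented, the names `predictive_prob` and `scenario_table` are hypothetical, and the probabilities come from counting with a uniform prior instead of from regression software.

```python
from itertools import product

def predictive_prob(successes, trials):
    """Pr(next y = 1 | data) under a uniform prior (rule of succession)."""
    return (successes + 1) / (trials + 2)

def scenario_table(rows, x_names):
    """Map each combination of x levels to Pr(y = 1 | that combination, data)."""
    levels = [sorted({row[name] for row in rows}) for name in x_names]
    table = {}
    for scenario in product(*levels):
        # y values of the rows that match this scenario on every x
        matches = [
            row["y"] for row in rows
            if all(row[name] == value for name, value in zip(x_names, scenario))
        ]
        table[scenario] = predictive_prob(sum(matches), len(matches))
    return table

# Invented data, purely for illustration.
rows = [
    {"y": 1, "sex": "F", "shoe": "brown"},
    {"y": 1, "sex": "F", "shoe": "brown"},
    {"y": 1, "sex": "F", "shoe": "white"},
    {"y": 0, "sex": "F", "shoe": "white"},
    {"y": 0, "sex": "M", "shoe": "brown"},
    {"y": 0, "sex": "M", "shoe": "brown"},
    {"y": 0, "sex": "M", "shoe": "white"},
    {"y": 1, "sex": "M", "shoe": "white"},
]

for scenario, prob in scenario_table(rows, ["sex", "shoe"]).items():
    print(scenario, prob)
```

With two x’s of two levels each there are four scenarios. Add more x’s, or x’s with more levels, and the count multiplies quickly, which is a fair part of why this is a lot of work.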
Next time: enter parameters. Probably Sunday or Monday. Who wants to read about regression on the weekend?