Many, many more details are available in Uncertainty: The Soul of Modeling, Probability & Statistics and at this page.
Last time we learned that the way to do probability models was this:
(1) Pr( Y | X D M )
where Y is the proposition of interest, X is an assumption or supposition, D is a set of past observations, and M is a group of premises which comprise our model, propositions which “map” or relate the X and D to Y. Nearly always, M is not causal, merely correlational. Causal models are as rare [as me remembering to fill in a hilarious simile].
As a for-instance we assumed Y = “The patient improves”, and X_0 = “The old treatment”, X_1 = “The New & Improved! treatment.” D are a group of observations of treatment, whether the patient improved, and any number of things we think might be related in a correlational way to Y. By “correlational” way we mean something in the causal path, or a partial cause, or something related to a cause or partial cause. If we had the causes of Y, that would be our model, and we would, scientifically speaking, be done forevermore.
M is almost always ad hoc. The usual excuse is laziness. “We’re using logistic regression,” says researcher. Why? Because the people before him used logistic regression. M can be deduced in many cases, but it is hard, brutal work—though only because our best minds have not set themselves to creating a suite of these kinds of models as they have for parameter-centric models.
Parameters do not exist (parameters in a logistic regression relate the X to the Y, etc.). They are not ontic. Because M is ad hoc, parameters are ad hoc. Which is what makes the acrimony over “priors” on parameters so depressing. By the time we’ve reached thinking about priors, we are already two or three levels of ad hociness down the hole. What’s a little more?
As I say, M can be deduced, and when it is there are no parameters anywhere ever. But, as it is, we can “integrate them out”, and we must do so, because (again) parameters do not exist, and because certainty in some unobservable, non-existent parameters in some ad hoc model does not, it most certainly does not, translate into certainty about Y. But, of course, everybody acts as if it does.
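To make “integrating out” concrete, here is a minimal sketch in Python. It assumes a deliberately ad hoc M that is not from the discussion above: each patient improves with some unobservable chance theta, and theta gets a uniform Beta(1, 1) prior. The function names and the counts are hypothetical. Under those assumptions the integral of theta against its posterior has a closed form, so the probability of Y mentions no parameter at all.

```python
from math import sqrt

# Assumed ad hoc M: Bernoulli outcomes with a Beta(a, b) prior on theta.
# This model and all names below are illustrative assumptions only.

def posterior_params(improved, not_improved, a=1.0, b=1.0):
    """Beta(a, b) prior plus Bernoulli data gives Beta posterior parameters."""
    return a + improved, b + not_improved

def predictive_prob(improved, not_improved, a=1.0, b=1.0):
    """Pr(next patient improves | D, M), with theta integrated out.

    For this model the integral of theta * p(theta | D, M) over [0, 1]
    equals the posterior mean, so no parameter appears in the answer.
    """
    a_post, b_post = posterior_params(improved, not_improved, a, b)
    return a_post / (a_post + b_post)

def parameter_sd(improved, not_improved, a=1.0, b=1.0):
    """Posterior standard deviation of the unobservable theta."""
    a_post, b_post = posterior_params(improved, not_improved, a, b)
    total = a_post + b_post
    return sqrt(a_post * b_post / (total ** 2 * (total + 1.0)))

# Two made-up data sets: 10 patients and 10,000 patients, half improving.
for improved, not_improved in ((5, 5), (5000, 5000)):
    print(
        improved + not_improved,
        predictive_prob(improved, not_improved),
        parameter_sd(improved, not_improved),
    )
```

In this sketch the spread of theta shrinks as the made-up data grow, yet the probability that the next patient improves stays at one half in both cases. Tight knowledge of the parameter has bought no extra certainty about Y.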
So our cry is not only “Death to P-Values!” but “Death to Parameters!”
If we are using a parameterized model, as all regression models are, the propositions about which priors we are using are just part of M; they are part of the overall ad hociness. Point is, our bookkeeping in (1) is complete.
Enough introduction. Let’s get down to a fictitious, wholly made up, imaginary example using our scenario.
M contains a list of correlates; these are the X (M is more than the X, of course). As is usual, we suppose there are p of them, i.e. X is the compound proposition X_1 & X_2 & … & X_p. Just to hammer home the point, ideally X are those observations which give the cause of Y. Barring that, they should be related to the cause or causes. Barring that, and as is most usual, X will be—can you guess?—ad hoc.
With so much ad hociness you might ask, “Why do people take statistical models so seriously?” And you would be right to ask that, just as you are right to suspect the correct answer to that question.
Anyway, suppose X_j = “Physician’s sock color is blue”, a 0-1 “variable”. We can then compute these two probabilities:
(1) Pr( Y | X D M ),
(2) Pr( Y | X_(-j) D_(-j) M_(-j) ) = Pr( Y | [X D M]_(-j) ).
Equation (1) is the “full” M, and eq. (2) is the model sans socks. Which of these two probabilities is the correct one?
THEY BOTH ARE!
Since all probability is conditional, and we pick the X and the X are not the causes, both probabilities are correct.
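A rough sketch of how (1) and (2) might be computed side by side, in Python. Everything here is an assumption for illustration: the records are invented, the field layout is hypothetical, and M is an ad hoc uniform-prior Beta-Bernoulli model applied to whichever records match the assumed X. Dropping socks simply means not conditioning on that column.

```python
# Invented records: (treatment, blue_socks, improved). Not real data.
D = (
    [("new", 1, 1)] * 12 + [("new", 1, 0)] * 11
    + [("new", 0, 1)] * 13 + [("new", 0, 0)] * 14
    + [("old", 1, 1)] * 10 + [("old", 1, 0)] * 15
    + [("old", 0, 1)] * 11 + [("old", 0, 0)] * 14
)

# Hypothetical column positions inside each record.
FIELDS = {"treatment": 0, "blue_socks": 1}

def predictive(records, **assumed):
    """Pr(improved | assumed X, D, M) under the assumed ad hoc M.

    Records matching every assumed X are kept, then the parameter is
    integrated out with a uniform prior: (1 + improved) / (2 + n).
    """
    matching = [
        r for r in records
        if all(r[FIELDS[name]] == value for name, value in assumed.items())
    ]
    improved = sum(r[2] for r in matching)
    return (1 + improved) / (2 + len(matching))

# Eq. (1): the "full" model, socks included.
p_full = predictive(D, treatment="new", blue_socks=1)
# Eq. (2): the model sans socks.
p_sans_socks = predictive(D, treatment="new")

print(p_full, p_sans_socks)
```

The two numbers differ because they are conditioned on different things. Neither is a mistake; each answers the question its own premises pose.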
Suppose we observed (1) = 0.49876 and (2) = 0.49877. This means exactly what the equations say they mean. In (1), it is the probability the patient gets better assuming all the old data including physician sock color; in (2) it is the probability the patient improves assuming all data but socks. Both assume the model.
Now I ask you the following trick question, which will be very difficult for those brought up under classical statistics to answer: Is there a difference between (1) and (2)?
The answer is yes. Yes, 0.49876 does not equal 0.49877. They are different.
Fine. Question two: is the difference of 0.00001 important?
The answer is there is no answer. Why? Because probability is not decision. To one decision maker, interested in statements about all of humanity, that difference might make a difference. To a second decision maker, that difference is no difference at all. Fellow number two drops socks from his model. The statistician has nothing to say about the difference, nor should he. The statistician only calculates the model. The decision maker uses it.
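One hedged way to picture the split between probability and decision, sketched in Python. The two probabilities come from the example above; the population sizes, the one-patient cutoff, and the function names are all hypothetical choices that belong to the decision makers, not to the statistician.

```python
P_WITH_SOCKS = 0.49876  # eq. (1), from the example
P_SANS_SOCKS = 0.49877  # eq. (2), from the example

def expected_patient_gap(p_a, p_b, n_patients):
    """Expected gap in the number of improved patients among n_patients."""
    return abs(p_a - p_b) * n_patients

def difference_matters(p_a, p_b, n_patients, min_patients):
    """A decision rule, not a probability: each user supplies the inputs."""
    return expected_patient_gap(p_a, p_b, n_patients) >= min_patients

# Decision maker one: statements about all of humanity (assumed 8 billion).
humanity = difference_matters(
    P_WITH_SOCKS, P_SANS_SOCKS, n_patients=8_000_000_000, min_patients=1
)

# Decision maker two: a single clinic (assumed 200 patients).
clinic = difference_matters(
    P_WITH_SOCKS, P_SANS_SOCKS, n_patients=200, min_patients=1
)

print(humanity, clinic)
```

The same two probabilities go in both times. Only the consequences each decision maker cares about change, and with them the answer.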
That’s it. That’s how all of statistics should work. There remains only one small thing to note about the Xs.
It is this: unless we are dealing with causes, the list of X is infinite. Infinite is a big number. Who gets to decide which X to include and which to leave out? Who indeed. To include any X is to assume implicitly that there is a causal connection, however weak or distantly related, to Y. These implicit premises are in M, but of course are not written out. (The mistake most make is reification; the mathematical model becomes more important than reality.)
Sock color could be causally related, weakly and distantly, to patient health. It could be that more of those docs with blue socks wear manly shoes (i.e. leather) and since manly shoes cost more, some of these docs have more money, and perhaps one reason some of these docs have more money is because they are better docs and see more or wealthier patients.
You can always tell stories like this; indeed, you must, and you do. If you did not, you would have never put the X in the model in the first place. The most important thing to recognize is this: probability is utterly and forever silent on the veracity of any causal story (unless cause is complete and known). This is why hypothesis testing (p-values, Bayes factors, etc.) is always fallacious. It mixes up probability with decision.