We want Pr(Y|X), where Y is a question about the world and X is all, everything, in toto, of what we assume is probative. That is not usually what we get. But we should get it! Here's why.
Video
Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty
HOMEWORK: Given below; see end of lecture.
Lecture
This is an excerpt from Chapter 9 of Uncertainty.
The Second Best Model
The goal of modeling should be this: to gather evidence X to make probability statements about propositions Y. Schematically, as in the last Chapter (this is expanded in a practical sense several Sections below):
$$\Pr(\mbox{Y} | \mbox{X}). \mbox{ [eq 1]}$$
Although probability is general and places no restrictions on the nature of the propositions Y, statistics and the physical sciences, by custom and as a useful division of labor, but not of philosophy, restrict Y, and much of X, to be observable (contingent). X will contain past observations and other information, including deductions and possibly causal or logical propositions, thought probative of Y. Premise, assumption, or observation as names for X are preferred over "variable", because premise and the like are easily seen to be what they are: assumptions. "Variable", unless one is careful, can lead to reification and mistakes in assigning cause. There are no parameters here, nor are there decisions: this model is pure probability. In this section, only the idea behind this approach is given; there are some suggestions about how to go about this at the end of this Chapter, and specific implementations in the next Chapter.
Most unfortunately, statistics as practiced is rarely like [eq 1] and is something else entirely. Which is to say, the material that follows is not standard—but should be. Statistics as classically practiced is like this:
$$\mbox{Y} \sim D(\mbox{X},\theta), \mbox{ [eq 2]}$$
where the observable Y is said to be "distributed" according to some probability distribution D, which is a function of premises X (usually a smaller set than in [eq 1]), themselves indexed by parameters $\theta$. Now [eq 2] can be turned into [eq 1] by "integrating out" the uncertainty in the unobservable parameters. The operation is an integration over [eq 2] if $\theta$ is continuous, else it is a sum. In other words, given X, the probability of Y is the value of [eq 2] weighted by the uncertainty in $\theta$. Schematically:
$$\Pr(\mbox{Y} | \mbox{X}) = \sum_i \Pr(\mbox{Y} |\mbox{X},\theta_i)\Pr(\theta = \theta_i|\mbox{X}). \mbox{ [eq 3]}$$
where X “contains” everything we know, including past observations of the proposition of interest and probative observational data, other premises specifying the model, and specifications of new values of the probative data. This is a cartoon equation, which is made specific in individual implementations.
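To make [eq 3] concrete, here is a minimal numerical sketch in Python, assuming a toy beta-binomial setup: the past counts, the flat prior on $\theta$, and the discretization of $\theta$ are all assumptions invented only for the illustration, not a recommendation of any particular model.

```python
# A minimal sketch of [eq 3] in a toy beta-binomial setup. The parameter
# theta (a "success" chance) is integrated (here: summed) out, leaving a
# direct probability for the observable Y. All numbers are invented.
import numpy as np

# X: past observations, say 7 "successes" in 10 trials, plus a flat
# prior on theta (both are premises we assume).
successes, trials = 7, 10

# Discretize theta so [eq 3] is literally a weighted sum.
theta = np.linspace(0.005, 0.995, 199)

# Pr(theta = theta_i | X): posterior weights given the past data
# (unnormalized Beta density, then normalized).
post = theta**successes * (1 - theta)**(trials - successes)
post /= post.sum()

# Pr(Y | X, theta_i): probability the next observation is a "success"
# under each candidate theta.
pr_y_given_theta = theta

# [eq 3]: Pr(Y | X) = sum_i Pr(Y | X, theta_i) * Pr(theta = theta_i | X).
pr_y = np.sum(pr_y_given_theta * post)
print(f"Pr(Y | X) = {pr_y:.3f}")  # close to (7 + 1) / (10 + 2), about 0.667
```

The parameters do the bookkeeping inside the sum, but the statement reported at the end is about the observable Y, not about $\theta$.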
In modeling as commonly practiced, discussion settles on measures and statements about the parameters, that is, about objects like $\Pr(\mbox{X}|\theta = h)$ in frequentist statistics or $\Pr(\theta = h|\mbox{X})$ in Bayesian statistics, for some value of the parameter $h$ (this can be a set). The remainder of the equation is forgotten. Because all or most of the math is (as is proper) set aside when communicating results, classical methods make errors in inference hard to spot, and this has led to a ritualized form of statistics. Decisions about Y based on statements about parameters mix up probability and decision. Even in "non-parametric" statistics, the goal is decision, not probability.
In the predictive approach, as given in [eq 1] or [eq 3], measures of relevance should replace hypothesis testing, and direct calculation of propositions of actual interest should replace estimation. The classical methods of testing and estimation are discussed below. The replacement in both cases is the same, which is to say equation [eq 1] or [eq 3] and not [eq 2]. The onus of decision should be removed from the method and put where it belongs, on the narrow shoulders of users. Statistics must become less like ritual and more like hard work. Statistical pronouncements must be put in a form where they can be verified, independently verified, by direct comparison with reality. Models which fail to conform to reality are to be expunged.
Some rough but familiar examples first. Suppose X is a compound proposition about past observations, nature of employment, residence and the like, and sex (M or F), and Y is the proposition Y = "Income is greater than fifty thousand." We might be interested in (with obvious shorthand notation) $\Pr(\mbox{Y} | \mbox{X}_{M})$ and $\Pr(\mbox{Y} | \mbox{X}_{F})$. If both of these are equal, then knowledge of sex is irrelevant to knowledge of Y, given the remainder of what's in X. If the other constituents of X change, sex might become relevant to Y (as was shown previously). If Y is modified to a different obtainable monetary figure, say $\mbox{Y}'$ = "Income is greater than seventy thousand", and $\Pr(\mbox{Y}' | \mbox{X}_{M})=\Pr(\mbox{Y}' | \mbox{X}_{F})$, then again, knowledge of sex is irrelevant to knowledge of Income at this level (as is obvious), assuming again, of course, the other conditions in X apply. Of course, there may be amounts of Y at which sex is relevant. Plots can be made in which the amounts indexing Y form the abscissa and the probabilities, calculated assuming the X of interest, form the ordinate (note the amount $y$ goes on the x-axis!). In other words, plot $y$ against the probability of Y = "Income is greater than $y$", conditioned on whatever (combination of) X is of interest.
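A sketch of the plot just described, assuming invented income figures; simple empirical frequencies stand in here for whatever predictive model actually supplies $\Pr(\mbox{Y}|\mbox{X})$.

```python
# Sketch: Pr("Income > y" | X) against the amount y, one curve per sex.
# The incomes are invented, and raw empirical frequencies stand in for
# a real predictive model.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
income_m = rng.gamma(shape=2.0, scale=30_000, size=500)  # hypothetical
income_f = rng.gamma(shape=2.0, scale=28_000, size=500)  # hypothetical

ys = np.linspace(0, 200_000, 201)
pr_m = [(income_m > y).mean() for y in ys]  # Pr("Income > y" | X_M)
pr_f = [(income_f > y).mean() for y in ys]  # Pr("Income > y" | X_F)

plt.plot(ys, pr_m, label="X_M")
plt.plot(ys, pr_f, label="X_F")
plt.xlabel("y (income amount)")
plt.ylabel('Pr("Income > y" | X)')
plt.legend()
plt.show()
```

Where the two curves coincide, sex is irrelevant to Y at that income level, given the rest of X; where they separate, it is relevant, and by an amount the reader can see directly.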
If, say, $\Pr(\mbox{Y} | \mbox{X}_{M}) = \Pr(\mbox{Y} | \mbox{X}_{F}) + \epsilon$ for some small $\epsilon > 0$, then knowledge of sex is relevant to knowledge of Y (and its stated income figure; recall Y is a fixed proposition). Whether $\epsilon$ is "large" and "important" or "small" and "negligible" or whatever are not statistical questions. Any $\epsilon>0$ is enough to prove relevance. Whether this is "practical" relevance or not depends on the decisions to be made with the information. The size of $\epsilon$ might be important in one context and ignorable in another. Suppose $\Pr(\mbox{Y} | \mbox{X}_{M}) = 0.4$ and $\Pr(\mbox{Y} | \mbox{X}_{F})= 0.45$. There is a higher chance that women, given what we know in X, will make the income stated in Y. Is that extra 5 percentage points "enough" to make a difference? That depends on what decisions are going to be made with these probabilities. Is somebody going to be sued? How many people matching the premises implied by X exist? Probability is silent, and should be, on the import of any number. Now it might be that the person making the calculation judges this difference of 0.05 probability negligible. That being so, sex can be removed from X, i.e. stricken from the model; in more classical language, this makes the model more parsimonious. Or it might be kept in and used for downstream decisions.
To emphasize: there is no telling which $\epsilon$ is important. None. An $\epsilon = 0.01$ may be crucial to one man and less than trivial to another. Relevance is always an extra-statistical question. Its importance is always conditional on outside criteria. May the first man who says, "Why not make $\epsilon = 0.05$ the standard?" be anathema.
If an element, such as sex in the example, was among the main reasons for a study (or was the main reason), then it can be argued that no $\epsilon>0$, however small, justifies excluding that element from the model. It is the value of $\epsilon$ that should be reported to the world. Let each make of it what they will. Of course, the investigator can make of it what he wills, too. He can show the consequences of its removal, or of retaining it, under circumstances he judges interesting. That is, as it sounds, a lot of work; vastly more effort than throwing data at a software package and hunting for wee p-values. But this approach is vastly more honest.
Another example. Suppose Y = "This person has COPD" and X is a compound proposition about the nature of a particular group of people, including their body mass index (BMI), a weak and somewhat inaccurate measure of obesity. If $\Pr(\mbox{Y} | \mbox{X}_{BMI = 29}) = \Pr(\mbox{Y} | \mbox{X}_{BMI = 30})$ then knowledge of the difference in BMI at these two levels is irrelevant to knowledge of Y. A plot may be made of $\Pr(\mbox{Y} | \mbox{X}_{BMI = x})$ by $x$. If this probability varies at all, BMI is relevant to Y, given the remainder of X (each premise in X will be fixed at some value). If the probability does not vary, BMI is irrelevant, given X. If the levels do vary, what is important depends not only on the difference between BMI levels, but on which BMIs are "actionable". This depends both on measurement and on the consequences of decisions. We discuss measurement next. What levels of BMI are "expected" (where I use this word in its plain-English sense)? That could depend on premises partly in X, and partly not in X. That is, outside information may have to be incorporated to see how serious or how likely any differences in probabilities between different levels of BMI would be.
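Here is what such a BMI plot might look like as code; since this is only an illustration, an invented logistic curve stands in for whatever model actually produced $\Pr(\mbox{Y} | \mbox{X}_{BMI = x})$.

```python
# Sketch: Pr("has COPD" | X, BMI = x) plotted against x, with the other
# premises in X held fixed. The logistic curve below is an invented
# stand-in; in practice these probabilities come from [eq 1]/[eq 3].
import numpy as np
import matplotlib.pyplot as plt

bmi = np.arange(18, 41)                            # BMI levels of interest
pr_copd = 1 / (1 + np.exp(-(-4.0 + 0.08 * bmi)))   # invented stand-in curve

plt.plot(bmi, pr_copd, marker="o")
plt.xlabel("BMI (x)")
plt.ylabel('Pr("has COPD" | X, BMI = x)')
plt.show()
# A flat curve would mean BMI (at these levels, given the rest of X) is
# irrelevant to knowledge of Y.
```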
Another example. A group of persons with measured (medical) characteristics X. We’re interested in Y = “Person $p$ will live past time $t$” given his characteristics and the information from the measured people. Note very carefully that we’re talking about future events for this person: if he is already dead, we know that and don’t need probability. Compute $\Pr(\mbox{Y} | \mbox{X}_{p})$, where $\mbox{X}_{p}$ represents the characteristics of the measured people and person $p$. A plot of $t$ by this probability can be made. This is, of course, survival analysis, but it differs in that this curve will not have uncertainty in it: there will be no “confidence” or “credible” bounds. The probability is the direct prediction for this person who has certain stated characteristics. Plots for fictional or representative persons having stated characteristics can be made in the obvious way. Relevance is ascertained as before: if the survival curves do not differ for different levels or values of some pertinent characteristic, then this characteristic is irrelevant (given the others) else it is relevant.
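A sketch of the predictive survival curve, assuming invented predictive draws for the survival time of person $p$; the point is only that the plotted curve is itself the prediction, with no added bounds.

```python
# Sketch: Pr("person p lives past t" | X_p) against t. There are no
# confidence or credible bounds because the curve is the prediction.
# The draws below are invented stand-ins for the predictive
# distribution implied by X_p.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
predictive_years = rng.weibull(a=1.5, size=5_000) * 10  # hypothetical years

t = np.linspace(0, 30, 301)
pr_survive = [(predictive_years > ti).mean() for ti in t]  # Pr(Y | X_p)

plt.plot(t, pr_survive)
plt.xlabel("t (years)")
plt.ylabel('Pr("lives past t" | X_p)')
plt.show()
```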
Another example. Each object in a group belongs to one of several categories. Past data on similar objects and their category membership is available; all this is X. We want Y = "This object belongs to category $c$". Compute $\Pr(\mbox{Y} | \mbox{X}_c)$. This is classification. Relevance is ascertained the same way as before. Again, this is a direct prediction with no extra bounds or uncertainty. Further, it is stated as a direct probability and is easily interpreted.
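A minimal sketch of such a classification, assuming invented past counts and a uniform Dirichlet prior, with the parameters integrated out as in [eq 3] so that the answer is a direct probability.

```python
# Sketch: Pr("this object belongs to category c" | X) as a direct
# probability. Past counts and the uniform Dirichlet prior are
# assumptions made for illustration only.
import numpy as np

categories = ["a", "b", "c"]
past_counts = np.array([12, 30, 8])  # invented counts of similar objects

# Dirichlet(1,1,1)-multinomial predictive probability for a new object:
# (count_c + 1) / (N + K), the parameters having been integrated out.
pr = (past_counts + 1) / (past_counts.sum() + len(categories))

for cat, p in zip(categories, pr):
    print(f'Pr("object is {cat}" | X) = {p:.3f}')
```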
Keen observers will have noticed that there are no uncertainty bounds, "confidence" or "credible" intervals, and the like, not just in the survival analysis example, but anywhere. They are not there because they are not needed. Why? Because no parameters exist in [eq 1]. Parameters might appear in the math which facilitates calculations, but they are "integrated" out at the end, as they are in [eq 3]. We are making predictions, stating strengths of associations; we are not producing statements about unobservable and uninteresting parameters.