Class 50: Independence Versus Irrelevance

Independence implies causality; irrelevance is consistent with logic. A reminder that there is no such thing as unconditional probability.

Video

Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty

HOMEWORK: Given below; see end of lecture.

Lecture

This is an excerpt from Chapter 8 of Uncertainty.

What’s the difference between independence and irrelevance, and why does that difference matter? This typical passage from A First Course in Probability by Sheldon Ross (p. 87) is lovely because many major misunderstandings are committed in it, all of which prove “independence” a poor term. And this is a book I highly recommend as an introduction to probability calculations; for readability, I changed Ross’s notation slightly, from e.g. “$P(E)$” to “$\Pr(\mbox{E})$”, to keep in the style of this book.

The previous examples of this chapter show that $\Pr(\mbox{E}|\mbox{F})$, the conditional probability of E given F, is not generally equal to $\Pr(\mbox{E})$, the unconditional probability of E. In other words, knowing that F has occurred generally changes the chances of E’s occurrence. In the special cases where $\Pr(\mbox{E}|\mbox{F})$ does in fact equal $\Pr(\mbox{E})$, we say that E is independent of F. That is, E is independent of F if knowledge that F occurred does not change the probability that E occurs.

Since $\Pr(\mbox{E}|\mbox{F}) = \Pr(\mbox{EF})/\Pr(\mbox{F})$, we see that E is independent of F if $\Pr(\mbox{EF}) = \Pr(\mbox{E})\Pr(\mbox{F})$.

The first misunderstanding is “$\Pr(\mbox{E})$, the unconditional probability of E”. There is no such thing. No unconditional probability exists, as shown earlier. All, each, every probability must be conditioned on something, some premise, some evidence, some belief. Writing probabilities like “$\Pr(\mbox{E})$” is always, every time, an error, not only of notation but of thinking. It encourages and amplifies the false belief that probability is a physical, tangible, implicit, measurable thing. It also heightens the second misunderstanding. We must always write (say) $\Pr(\mbox{E}|\mbox{X})$, where X is whatever evidence one has in mind.

The second misunderstanding, albeit minor, is this: “knowing that F has occurred generally changes the chances of E’s occurrence.” Note the bias towards empiricism. We do not have to deal with observables in probability models, though we do in statistical and physical models. In other places, in order to judge these probabilities, Ross writes that we must imagine “An infinite sequence of independent trials is to be performed” (p. 90), which is an impossibility. Another misconception: “Independent trials, consisting of rolling a pair of fair dice, are performed” (p. 92). We already learned “fair” dice are impossible in practice. “Events” or “trials” “occur”, says Ross, echoing many other authors; these are propositions that can be measured in reality, or are mistakenly thought to be measurable. Probability is much richer than that and applies to propositions that are not observable.

Non-empirical propositions, as in logic, easily have probabilities, as we recall. Example: the probability of E = “A winged horse is picked” given X = “One of a winged horse or a one-eyed one-horned flying purple people eater must be picked” is 1/2, even though the “events” E and X will never occur. So maybe the misunderstanding, or the empirical bias, isn’t so minor at that. The bias towards empiricism partly accounts for the frequentist fallacy, about which we already know something; but there is more to say below. Notice that our example E and X have no limiting relative frequency. Instead, we should say of any $\Pr(\mbox{E}|\mbox{F})$, “The probability of E (being true) accepting F (is true).”

Those missteps are common and not the main difficulty. The third and grand-daddy of all misunderstandings is this: “E is independent of F if knowledge that F occurred does not change the probability that E occurs.” The misunderstanding comes in two parts: (1) use of “independence”, and (2) a mistaken calculation.

Number (2) first. It is a mistake to write “$\Pr(\mbox{EF}) = \Pr(\mbox{E})\Pr(\mbox{F})$” because, given the same E and F, there are times when this equation holds and times when it doesn’t. A simple example. Let E = “The score of the game is greater than or equal to 4” and F = “Device one shows 2”. What is $\Pr(\mbox{E}|\mbox{F})$? Impossible to say: we have no evidence tying the device to the game. Similarly, $\Pr(\mbox{E})$ does not exist, nor does $\Pr(\mbox{F})$.

Let X = “The game is scored by adding the total on devices one and two, where each device can show the numbers 1 through 6.” Then $\Pr(\mbox{E}|\mbox{X}) = 33/36$, $\Pr(\mbox{F}|\mbox{X}) = 1/6$, and $\Pr(\mbox{E}|\mbox{FX}) = 5/6$; thus $\Pr(\mbox{E}|\mbox{X})\Pr(\mbox{F}|\mbox{X}) \approx 0.153$, which does not equal $\Pr(\mbox{EF}|\mbox{X}) = \Pr(\mbox{E}|\mbox{FX})\Pr(\mbox{F}|\mbox{X}) \approx 0.139$. Knowledge of F in the face of X is relevant to the probability E is true. Recall these do not have to be real devices; they can be entirely imaginary.

Now let W = “The game is scored by the number shown on device two, where device one and two can show the numbers 1 through 6.” Then $\Pr(\mbox{E}|\mbox{W}) = 1/2, \Pr(\mbox{F}|\mbox{W}) = 1/6,$ and $\Pr(\mbox{E}|\mbox{FW}) = 1/2$ because knowledge of F in the face of W is irrelevant to knowledge of E. In this case $\Pr(\mbox{EF}|\mbox{W}) = \Pr(\mbox{E}|\mbox{W})\Pr(\mbox{F}|\mbox{W}).$
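The arithmetic in the two scorings above can be checked by brute enumeration of the 36 imagined outcomes. A minimal sketch (the helper `pr` and the names `E_X`, `E_W`, `F` are mine, not from the text):

```python
# Enumerate the imagined two-device game and check relevance/irrelevance.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), range(1, 7)))  # (device one, device two)

def pr(event, given=lambda o: True):
    """Exact probability of `event` among outcomes satisfying `given`."""
    pool = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in pool if event(o)), len(pool))

F = lambda o: o[0] == 2          # "Device one shows 2"

# X: score = sum of devices; E = "score >= 4"
E_X = lambda o: o[0] + o[1] >= 4
print(pr(E_X))                   # 11/12 (i.e., 33/36)
print(pr(E_X, given=F))          # 5/6
print(pr(E_X) * pr(F) == pr(lambda o: E_X(o) and F(o)))  # False: F relevant

# W: score = device two alone; E = "score >= 4"
E_W = lambda o: o[1] >= 4
print(pr(E_W))                   # 1/2
print(pr(E_W, given=F))          # 1/2
print(pr(E_W) * pr(F) == pr(lambda o: E_W(o) and F(o)))  # True: F irrelevant
```

The point survives the check: under X the product rule fails, under W it holds, with the same E and F throughout.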

The key, as might have always been obvious, is that relevance depends on the specific information one supposes.

Number (1). Use of “independent” conjures up images of causation, as if through dependence, somehow, F is causing, or causing something which is causing, E. This error often happens in discussions of time series, as if previous time points caused current ones. We have all heard, times without number, people say things like, “You can’t use that model because the events aren’t independent.” But you can use any model you like; it’s only that some models make better use of information because, usually, knowing what came before is relevant to predictions of what will come. Probability is a measure of information, not a quantification of cause.

Here is another example from Ross showing this misunderstanding (p. 88, where the author manages two digs at his political enemies):

If we let E denote the event that the next president is a Republican and F the event that there will be a major earthquake within the next year, then most people would probably be willing to assume E and F are independent. However, there would probably be some controversy over whether it is reasonable to assume that E is independent of G, where G is the event that there will be a recession within two years after the election.

To understand the second example, recall that Ross was writing at a time when it was still possible to distinguish between Republicans and Democrats. The idea that F or G are the full or partial efficient cause of E suffuses this example, a mistake reinforced by using the word “independence”. We make more sense if instead we say that knowledge of the president’s party is irrelevant to predicting whether an earthquake will soon occur, and that knowledge of this president’s policies is relevant to guessing whether a recession will occur.

This classic example is a cliche, but it is apt. Ice cream sales, we hear, are positively correlated with drownings. The two events, a statistician might say, are not “independent”. Yet it’s not the ice cream that is causing the drownings. Still, knowledge that more ice cream is being sold is relevant to fixing the probability that more drownings will be seen! The model is still good even though it is silent on cause. This point cannot be stressed too highly. Good and useful models can badly screw up causes but still make useful predictions. A woman can insist gremlins power her automobile and still get where she’s going.

The distinction between “independence” and “irrelevance” was first made by Keynes in his unjustly neglected A Treatise on Probability (pp. 59–61). Keynes argued for the latter term, correctly asserting, first, that no probabilities are unconditional. Keynes gives two definitions of irrelevance, which amplify the previous section. In my notation but his words, “F is irrelevant to E on evidence X, if the probability of E on evidence FX is the same as its probability on evidence X; i.e. F is irrelevant to E|X if $\Pr(\mbox{E}|\mbox{FX}) = \Pr(\mbox{E}|\mbox{X})$”. This is as above.

Keynes tightens this to a second definition. “F is irrelevant to E on evidence X, if there is no proposition, inferrible from FX but not from X, such that its addition to evidence X affects the probability of E.” In our notation, “F is irrelevant to E|X, if there is no proposition F$'$ such that $\Pr(\mbox{F}'|\mbox{FX}) = 1$, $\Pr(\mbox{F}'|\mbox{X}) \ne 1$, and $\Pr(\mbox{E}|\mbox{F}'\mbox{X}) \ne \Pr(\mbox{E}|\mbox{X})$.” Note that Keynes has kept the logical distinction throughout (“inferrible from”). Lastly, Keynes introduces another distinction (p. 60):
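The force of this second, stronger definition can be shown with a small example of my own devising (not Keynes’s). Let X = “One of four tickets, numbered 1 through 4, is drawn”, E = “The ticket is 1 or 2”, and F = “The ticket is 1 or 4”. F passes the first, weak test of irrelevance, yet F$'$ = “The ticket is not 2”, which is inferrible from FX but not from X alone, changes the probability of E:

```python
# Hypothetical illustration (mine, not Keynes's): F is weakly irrelevant
# to E given X, yet a proposition F' deducible from F is relevant.
from fractions import Fraction

tickets = [1, 2, 3, 4]          # X: one of four equiprobable tickets

def pr(event, given=lambda t: True):
    pool = [t for t in tickets if given(t)]
    return Fraction(sum(1 for t in pool if event(t)), len(pool))

E  = lambda t: t in (1, 2)      # "ticket is 1 or 2"
F  = lambda t: t in (1, 4)      # "ticket is 1 or 4"
Fp = lambda t: t != 2           # F': "ticket is not 2", deducible from F

print(pr(E))             # 1/2
print(pr(E, given=F))    # 1/2 -> F irrelevant by the first definition
print(pr(Fp, given=F))   # 1   -> F' is inferrible from FX...
print(pr(Fp))            # 3/4 -> ...but not from X alone
print(pr(E, given=Fp))   # 1/3 -> yet F' changes the probability of E
```

So F contains a relevant part even though F itself leaves the probability of E untouched, which is exactly what the second definition is built to catch.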

$h_1$ and $h_2$ are independent and complementary parts of the evidence, if between them they make up $h$ and neither can be inferred from the other. If x is the conclusion, and $h_1$ and $h_2$ are independent and complementary parts of the evidence, then $h_1$ is relevant if the addition of it to $h_2$ affects the probability of $x$.

This passage has the pertinent footnote (in my modified notation): “I.e. (in symbolism) $h_1$ and $h_2$ are independent and complementary parts of $h$ if $h_1 h_2 = h$, $\Pr(h_1|h_2) \ne 1$, and $\Pr(h_2|h_1) \ne 1$. Also $h_1$ is relevant if $\Pr(x|h) \ne \Pr(x|h_2)$.”

Keynes’s formulation emphasizes that it is not only the “raw” X which are premises, but also those propositions which can be deduced from them, a point mentioned above but not emphasized. Note: two (or however many) observed data points, say $x_1$ and $x_2$, are independent and complementary parts of the evidence because neither can be deduced (mathematically or logically derived) from the other. Observations are thus no different from any other proposition. In other words, every observation is of the schema X = “The value $x$ was seen”.
