There Is No Problem Of Old Evidence In Bayesian Probability

Rationalists, like those at Less Wrong (think Eliezer Yudkowsky and Scott Alexander), are prone to fetishize Bayes theorem, seeing it as the key to all thought. It isn’t. Bayes is a helpful tool, and no more, and like all tools, not always needed. But because of the perceived importance of Bayes, people think they have discovered flaws in it. These are almost always based on simple mistakes, which can go decades without anybody noticing. As in the so-called problem of old evidence.

Here’s what one prominent author (Colin Howson) thinks is the “problem” of old evidence: Can a hypothesis h be confirmed by evidence e if the evidence is old and already known?

The answer will seem obvious: yes. Howson and others say no. That “no” and the “problem” arise when people write things like this (as in the link):

Pr(h|e) = [ Pr(e|h)Pr(h) ] / Pr(e).

That might look to you like Bayes theorem, favorite of “rationalists” everywhere, but it is not. It is missing something. The missing parts are what cause the “problem.”

Howson, and many like him, says (modifying his notation so that it’s consistent with mine): “This [existence of background knowledge] has the following unpleasant consequence, however. If e is known at the time h is proposed, then e is in [the background knowledge] and so Pr(e)=Pr(e|h)= 1, giving, Pr(h|e) = Pr(h); which means that e gives no support to h.”

Before reading further, and recalling the hint about something missing, see if you can spot the flaw in this thinking.

Don’t cheat. Think.

The answer is this: There is no such thing as “Pr(h)” or “Pr(e)”. While “Pr(h|e)” and “Pr(e|h)” are fine, as such, they are incomplete in the face of the first two elements.

There is no such thing as unconditional probability: all probability is conditional. Every probability everywhere needs premises, conditions, assumptions, some evidence upon which to pass the judgement. That means “Pr(e)” is impossible. No such creature exists.

We can write, perhaps, Pr(h|K), which is the probability of h given some background knowledge K (the K is from Howson). We could also—and here comes the trouble—write Pr(e|K).

That’s fine as it stands, and it could be as Howson suggests that Pr(e|K) = 1. But that only happens when K includes the premise (or proposition, or assumption, or whatever you want to call it), “e has been observed.” That makes K = “‘bunch of other premises related to h’ & ‘e has been observed’.”

With that K, then indeed Pr(e|K) = 1. (Make sure you see this.)

Let’s rewrite the equation above properly, using this K (two letters put together mean logical “and”, so that “eK” means “e and K”):

Pr(h|eK) = [ Pr(e|hK)Pr(h|K) ] / Pr(e|K).

We have Pr(e|K) = 1, since K says e was observed, which obviously makes the probability of e equal to 1, given e was observed. Of course it does! Adding the h, unless that h says “e is impossible” or something like that, gives Pr(e|hK) = Pr(e|K) = 1. But since logically eK = K, then Pr(h|eK) = Pr(h|K). The math works! Both sides are Pr(h|K).
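To see the arithmetic with actual numbers, here is a minimal sketch in Python. The joint probabilities are invented purely for illustration (nothing in Howson's paper supplies them), and the helper name prob is my own. The only point is that once K contains “e has been observed”, conditioning on e a second time changes nothing.

```python
from fractions import Fraction

# Hypothetical joint probabilities of (h, e) given only the "other premises"
# in K. These numbers are assumptions made up for this illustration.
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

def prob(event, given):
    """Pr(event | given), found by summing the joint table over matching rows."""
    denom = sum(p for (h, e), p in joint.items() if given(h, e))
    numer = sum(p for (h, e), p in joint.items() if given(h, e) and event(h, e))
    return numer / denom

# K = "other premises" & "e has been observed", so conditioning on K fixes e.
K = lambda h, e: e
hK = lambda h, e: h and K(h, e)
eK = lambda h, e: e and K(h, e)  # logically the same condition as K

pr_e_given_K = prob(lambda h, e: e, K)     # 1
pr_e_given_hK = prob(lambda h, e: e, hK)   # 1
pr_h_given_K = prob(lambda h, e: h, K)     # 3/5
pr_h_given_eK = prob(lambda h, e: h, eK)   # 3/5

# Bayes with K: both sides reduce to Pr(h|K).
bayes_rhs = pr_e_given_hK * pr_h_given_K / pr_e_given_K
print(pr_h_given_eK, bayes_rhs)  # 3/5 3/5
```

Whatever numbers are put in the table, the ratio Pr(e|hK)/Pr(e|K) comes out as 1/1 (barring the degenerate cases), which is why K can never show e supporting h.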

And so it seems e says nothing about h. But that’s not how evidence works.

What happens with evidence in real life is this. We do indeed start with some background knowledge, or surmises, etc. about h. Call that B. B says nothing about e already having been observed. It says stuff about h. We then write:

Pr(h|eB) = [ Pr(e|hB)Pr(h|B) ] / Pr(e|B).

No change, except from K to B. Let’s look at each piece.

Pr(e|hB) is the probability that e can be observed given h is true and B (which are our assumptions). This is so even if e never is observed! Even if e remains a thought experiment. Don’t read more until you grasp this.

Since B is silent on e having been observed (and ignoring “degenerate situations” like hB = “e is impossible”), then 0 < Pr(e|hB) < 1.

Pr(h|B) is our “prior”, given by our background information. Again (and still ignoring degenerate scenarios like B = “h is impossible”), 0 < Pr(h|B) < 1.

Pr(e|B) is the probability e could be true given B, but it says nothing directly about h. We could always “expand” Pr(e|B) like this (using “total probability”):

Pr(e|B) = Pr(e|hB)Pr(h|B) + Pr(e|not-hB)Pr(not-h|B).

The first term on the right we already did. The second is similar, where “not-h” is the logical contrary of whatever h is (see note 1). We could find Pr(e|not-hB), the probability e is true given h is false and B, and then Pr(not-h|B) follows by recalling Pr(h|B) + Pr(not-h|B) = 1 (this works for every h!).
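Here is a small sketch of the total probability expansion in Python. The three input numbers are assumptions chosen only to make the arithmetic easy to follow; any values strictly between 0 and 1 would do.

```python
from fractions import Fraction

# Assumed inputs, all conditional on background B (which is silent on e).
pr_h_B = Fraction(1, 4)       # prior: Pr(h|B)
pr_e_hB = Fraction(4, 5)      # Pr(e|hB)
pr_e_nothB = Fraction(2, 5)   # Pr(e|not-hB)

# Pr(h|B) + Pr(not-h|B) = 1, for every h.
pr_noth_B = 1 - pr_h_B

# Total probability: Pr(e|B) = Pr(e|hB)Pr(h|B) + Pr(e|not-hB)Pr(not-h|B).
pr_e_B = pr_e_hB * pr_h_B + pr_e_nothB * pr_noth_B

# Bayes: Pr(h|eB) = Pr(e|hB)Pr(h|B) / Pr(e|B).
pr_h_eB = pr_e_hB * pr_h_B / pr_e_B

print(pr_e_B, pr_h_eB)  # 1/2 2/5
```

With these made-up inputs the probability of h moves from 1/4 to 2/5, and nothing in the calculation asks when, or whether, e was observed.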

So as long as

[ Pr(e|hB) / Pr(e|B) ] > 1,

which is to say, as long as the evidence e is more probable under hB than under B alone, then e supports or confirms h. Even if nobody in the world ever observes e! You must get this.

If [ Pr(e|hB) / Pr(e|B) ] < 1, then e disconfirms h. If [ Pr(e|hB) / Pr(e|B) ] = 1, then knowledge of e is irrelevant to h.
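The three cases can be sketched as a short Python function. The function name verdict and the example inputs are hypothetical, chosen only for illustration, and the sketch assumes the non-degenerate situation where Pr(e|B) > 0.

```python
from fractions import Fraction

def verdict(pr_e_hB, pr_e_nothB, pr_h_B):
    """Return (ratio, posterior, label) for evidence e and hypothesis h given B.

    ratio is Pr(e|hB) / Pr(e|B); posterior is Pr(h|eB) = ratio * Pr(h|B).
    Assumes Pr(e|B) > 0, i.e. the degenerate cases are excluded.
    """
    # Total probability gives Pr(e|B).
    pr_e_B = pr_e_hB * pr_h_B + pr_e_nothB * (1 - pr_h_B)
    ratio = pr_e_hB / pr_e_B
    posterior = ratio * pr_h_B
    if ratio > 1:
        label = "confirms"
    elif ratio < 1:
        label = "disconfirms"
    else:
        label = "irrelevant"
    return ratio, posterior, label

# e more probable under hB than under B alone: e confirms h.
print(verdict(Fraction(4, 5), Fraction(2, 5), Fraction(1, 4)))
# e less probable under hB than under B alone: e disconfirms h.
print(verdict(Fraction(1, 5), Fraction(3, 5), Fraction(1, 2)))
# e equally probable either way: e is irrelevant to h.
print(verdict(Fraction(1, 2), Fraction(1, 2), Fraction(1, 3)))
```

In each case the posterior is above, below, or equal to the prior exactly when the ratio is above, below, or equal to 1.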

That’s it. The simple solution to the “problem”. It does not matter when e is observed, or even if it is observed. It could be ancient wisdom—like apples fall onto heads and do not soar into the air. And h is “gravity attracts”. Or it could be entirely novel.

It only matters whether e is already part of the background knowledge, as in the “problem” which uses K, or is considered on its own, as with B.

There has been a lot of ink spilled on this “problem”, all of it because of bad notation. Notation that became popular because it was forgotten that all probability is conditional. Change the conditions, change the probability.

1 h is a complex proposition, usually, of the form P_1 & P_2 & … & P_q, where each P_i is some proposition; thus not-h is not-“P_1 & P_2 & … & P_q”. Only one of the P_i need be false for not-h to be true. Failure to understand this leads to much confusion about what models and theories are.

This is not the first time we tackled this subject; however, the first article was put in obscure terms in answer to a technical question, and the point was lost.


Categories: Philosophy, Statistics

19 replies »

  1. Apparently, the best way to learn Bayesian statistics is to just be patient and read Briggs’ blog. Eventually he’ll spill it all out. xD

  2. I’ve only ever seen Bayesian bs used wrong so I concluded it’s all pure bs. And I can’t be proven wrong. I have a Bayesian distribution proving that around here somewhere. I won’t show the math; they never do. Because Bayesianism is an occult religion not mathematics.

  3. What you write is logically true, and of course we all use prior observations to support the validity of whatever hypotheses we happen to believe about the world. However, in PRACTICAL terms when it comes to extending scientific theory it makes a huge difference whether a hypothesis fitting an observation was made before or after the observation. You can always cook up a “just so” model that fits the known data, and any academic worth her government grants can cook up a fancy explanation to justify that model. If you’re really good at playing the game, you make sure your original “just so” model has a few free parameters that can be adjusted to fit at least a few rounds of additional observations. Lots of additional articles, lots of additional grants. KA-CHIING!
    True, you might say that any “just so” model implicitly assumes the observations already made, but any trained sophist… I mean, academic… can conceal such implicit assumptions in multiple layers of mumbo-jumbo and arcane notation.

    Actual predictions are a lot harder to fake. Hence prediction is the only reliable gold standard in science.

    But of course the real problem doesn’t have anything to do with defective scientific procedures. The real problem is defective people. Modern academia is beset by an increasingly pervasive culture of accepted dishonesty. Heck, some forms of dishonesty have already become mandatory. People willing and able to live within such a culture cannot discover any truth, whatever procedures they may be compelled to follow.
    If science is revived one day, it will not be within current academic institutions.

  4. Then I am not sure I understand your position on pre- vs. post-diction.
    From the first article you link: “There is nothing in “prediction” that says prediction must only be about the future.”

    My own position is that only pre-diction in the proper sense, that is made before observation, is effective at revealing falsehood. Post-diction is not completely worthless of course, but much, much easier to fake.

  5. Morten,

    Yes, I have all this in my Uncertainty, even with math. I only gave you a sampling. See the Books tab up top.

  6. The trouble is that while everyone ACTS as though there is no unconditional probability, most will refuse to use those words out of dogma.

    For example in the classic die rolling problem, where you are asked what the probability is of a 1 coming up on a roll of a six sided die. The expected answer is of course 1/6, but there are plenty of situations in which every type of statistician would choose a different answer, ex. when the die is loaded or when “1” appears on more or less than one face. If you say that you cannot give an answer just from being told that there is a six sided die, they will say “you have to take as given that the die is fair and that the numbers 1 through 6 appear on the sides.” But if you say “so I am calculating the probability under those conditions, i.e. a conditional probability” they will firmly say NO. This is an unconditional probability! You just need to assume certain conditions for it to make sense!

    It’s much like the people who memorize the phrase “fail to reject the null hypothesis” instead of “accept the null hypothesis” but then immediately say “and therefore we know that there is no causal relationship whatsoever between the two quantities; the variation is just due to chance.” That is, they actually say the null hypothesis is true, but they refuse to say that in words, instead uttering the magic words “fail to reject” without understanding what they mean.

    Similarly with “significant” meaning only “p < .05", etc. The field is full of people diligently memorizing words and phrases as incantations to appease the magical statistical calculations. Honestly I don't know if there is much hope for reform in the field because so many of the practitioners don't have enough logical ability to go beyond memorizing approved phrases and algorithms.

  7. That makes K = “‘bunch of other premises related to h’ & ‘e has been observed’.”

    But since logically eK = K, then Pr(h|eK) = Pr(h|K).

    Is eK = K? Does eK represent set-theoretic intersection? If it does, then logically eK = e. Right?

  8. JH,

    Yep. In shorter form, K = ek, where k = ‘bunch of premises’; thus eK = K, since the initial is redundant.

  9. Regarding the background information –

    “The requirement of deductive closure is quite unnecessary anyway within a Bayesian theory, since the probability function P_A relativized to any body A of information will necessarily assign the value 1 to any deductive consequence of the sentences in A whether or not that consequence itself is explicitly included in A (note that P_A assigns 1 because the notion of tautology is widened in effect to include consequences of A, not because P_A(.) is regarded – it is not – as a conditional probability P(.|A)).”

  10. So as long as
    [ Pr(e|hB) / Pr(e|B) ] > 1,
    which is to say, as long as the evidence e is more probable under hB than under B alone, then e supports or confirms h.

    If [ Pr(e|hB) / Pr(e|B) ] > 1, i.e., Pr(e|hB) > Pr(e|B), then the correct conclusion is that h confirms e.

  11. Briggs, but… K = “‘bunch of other premises related to h’ & ‘e has been observed’.” 🙂

  12. JH,

    No, it’s not correct that h confirms e. It’s as I said, if [ Pr(e|hB) / Pr(e|B) ] > 1 then e confirms h, as Bayes theorem has it.

  13. Briggs,

    Given B, h confirms e if P(e|hB) > P(e|B).

    Or if you rather use different notations, see

    “In looking for an explicatum for this concept, we will be guided by the idea that E confirms H given D iff (if and only if) the degree of belief in H that is justified by E and D together is higher than that justified by D alone. The corresponding statement in terms of our explicatum p is p(H|E.D) > p(H|D).”

  14. JH,

    No, sorry, that’s not how Bayes theorem works. It starts with Pr(h|B), adds e, then forms Pr(h|eB), which is proportional to Pr(e|hB)/Pr(e|B). If Pr(h|eB) > Pr(h|B), it’s because Pr(e|hB)/Pr(e|B) > 1. If you can’t see that, I’m afraid we’ll have to disagree about this.

  15. Briggs, ah… I see. How does Bayes theorem work? It involves manipulations of probabilities of sets. That how one places the conditionals decides how it is interpreted accordingly. Definition (or math) is dentition (or math), no room for disagreement. While it is true that if Pr(e|hB)/Pr(e|B) > 1, imply that Pr(h|eB) > Pr(h|B). Now e already happened and P(e|B) = 1, which leaves P(h|e B) > 1. Now correct me, if I am wrong.

  16. (With corrections. I had to go so I submitted without reading what I had typed.):

    Briggs, ah… I see. How does Bayes’ theorem work? It involves manipulations of probabilities of sets. How one places the conditionals decides how it is interpreted accordingly. Definition (math) is definition (math); no room for disagreement. Right or wrong only.

    While it is true that Pr(e|hB)/Pr(e|B) > 1, where e, h, and B are sets, mathematically implies that Pr(h|eB) > Pr(h|B), in the context of e = evidence and h = hypothesis, how would a Bayesian interpret P(e|hB)? Is it meaningful? No, at least I don’t see it. Now e already happened and P(e|B) = 1, which leaves P(e|h B) > 1. Taking B to be the entire necessary background information set, doesn’t this circle back to the problem of old evidence? What am I missing here? (I’d appreciate it if you can correct my errors or give me answers… so I don’t have to spend time doing my own research. Thanks.)

  17. JH,

    I admit my powers of explanation are inadequate to convince you.

    This might help. In any introductory probability book, you will find Bayes rule, which has these parts:

    (1) Pr(h|eB) = (2) [ Pr(e|hB)/Pr(e|B) ] * (3) Pr(h|B).

    (1) is the posterior, the probability of the hypothesis after seeing evidence e starting from knowing only B (which e is not in).

    (2) is what statisticians call the “likelihood ratio.” I won’t explain it, but it’s in every introductory book. It’s as I say. If the LR > 1, e confirms h, and so on.

    (3) is the “prior”, what we know about h accepting all we know is B (which does not have e in it).

    That should help. I promise all this is explained in great detail in any probability book that has Bayes theorem in it.

  18. Briggs, and I have to admit that I failed to explain my point or the problem clearly since you are repeating the rudimentary (1), (2) and (3) to me. So by definition, e confirms h implies h confirms e, and it doesn’t solve the problem of old evidence. Ha. (Last comment?)
