The Lady Tasting Tea: Bayes Versus Frequentism; Part II (update)

Read Part I: Again, the text (up to this part) has been corrected and expanded.

Recall our overarching—our only—goal. We want to know whether the sweet old lady “has the ability” is true or false, or if not true or false, then with what probability it might be true. Never lose your grip on this. Repeat it to yourself after each paragraph.

To judge this probability we have the evidence of our experimental setup, and whatever facts may be deduced from these premises. We also have the evidence of the experiment itself: how many cups she got right and wrong. Can we agree that we should only use this information and no other? I mean, we should only use the evidence of what happened. What didn’t happen and what we cannot deduce from our experimental setup is information which is entirely irrelevant. So for example if we gave the lady N = 8 cups, it is irrelevant that we could have given her N = 50 cups, or whatever. We gave her 8 and we have to deal with just that information. We do not want to fool or distract ourselves.

These are of course trivial requirements, but I put them there to focus the mind on the question.

Now, if we accept that “has the ability” means “She always guesses correctly”, then the probability that the lady correctly identifies any cup placed before her is 1, or 100%. This phrase is also our model. I mean, “She always guesses correctly” is our model, our theory, our hypothesis.

Why did we assume this particular model? Well, the choice was up to us. It is one interpretation of—it naturally follows from—“has the ability.”

Given this model/hypothesis, and before putting her to the test, what is the probability distribution over her getting none right, just 1 right, just 2 right, etc., up to all N right? It is 0 (or 0%) for all numbers except for N, where it is 1, or 100%. Think about this.
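To make this concrete, here is a minimal Python sketch of the distribution implied by the model “She always guesses correctly”. The function name and the choice of N = 8 are only illustrative assumptions; the point is that all of the probability sits on N.

```python
def always_correct_distribution(n_cups):
    """Pr(k correct | "She always guesses correctly") for k = 0..n_cups.

    Hypothetical helper for illustration: the model puts probability 1
    on getting all n_cups right and probability 0 on everything else.
    """
    return [1.0 if k == n_cups else 0.0 for k in range(n_cups + 1)]


dist = always_correct_distribution(8)  # N = 8, as in the example above
for k, p in enumerate(dist):
    print(f"Pr({k} correct) = {p}")
```

Under this model, any observed count below N has probability 0.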

But suppose we run our experiment and she correctly identifies only 3 < N cups. Given just our model, what is the probability that she guesses 3 correct? Again, 0. This proves the principle that any (logical) argument can only be judged by the premises given, and by no other information. However, suppose we conjoin our model with our observation,

“She always guesses correctly” & “She guessed 3 < N correctly”,

and, conditioning on this joint statement, re-ask: what is the probability that she guesses 3 correct? It is unanswerable because we are conditioning on a contradiction, a statement which is necessarily false. Actually, given this necessary falsity, we could derive any numerical value for guessing 3 correct, but this is obviously absurd.

We have two probabilities, the first of which is:

(1) Pr(“She guesses 3 < N correctly” | “She always guesses correctly”) = 0.

But we can turn the question around and ask:

(2) Pr(“She always guesses correctly” | “She guessed 3 < N correctly”),

which is obviously 0. This is a rare instance where we have falsified a model, a situation only possible when a model says “X cannot be” yet X obtains or occurs. That cannot is dogmatic, a logical word: it means just what it says, X is impossible; not unlikely, but impossible.

Now, the question is this:

(3) Pr(“She has the ability” | “She guessed M out of N correctly” & Experimental set up),

where “has the ability” is for us to define (such as “always guesses correctly”), M and N are observations of the experiment, where we also take care to consider the Experimental set up (from this we know what N is, etc.).

Asking (3), the probability that a model is true, is a natural question in Bayesian probability, but not in frequentism, where any statement/question must be embedded in an infinite sequence of “similar, but randomly different” statements/questions. It is difficult, perhaps impossible, to discover in what unique infinite sequence this (or any) model-statement lies. I hope you understand how limiting this is. Of course, it is possible to develop non-theory-dependent rules-of-thumb for deciding a model’s truth or falsity, but any true theory of probability must be able to answer any question put to it in a non-ad hoc manner.

For example, Bayesian probability can handle the following situation, whereas frequentist probability cannot. Given the premise, “Only 1 out of all M green men from Mars are Y”, the probability that this green man from Mars is Y is 1 / M. Bayesian probability can also answer all counterfactual questions (“If Hillary did not cry at that press conference, she would have been the Democrat nominee for president”), whereas frequentist probability can answer none. In both instances, frequentism cannot because the statements cannot be embedded in a unique infinite sequence. There cannot be sequences of little green men, nor can there, by definition, be any counterfactual situations, let alone sequences of them.

What about the rest of our models/interpretations of “has the ability”? We last time outlined several possibilities, each of them consonant with the phrase “has the ability.” Which of these is the correct model and which are incorrect? That is up to us. It is an extra-logical, extra-probability question, at least with respect to the premises we have allowed ourselves in this experiment.

Now, we could go through a similar procedure as above and calculate the probability each interpretation is true. That is, if we do not have a fixed idea in advance which interpretation (model) is true, we could use the evidence from the experiment to tell us which is more likely than any of the others.

However, we must start from somewhere: some external evidence must tell us how likely each of these models is before we begin the experiment. It doesn’t matter what this external evidence is; it merely must exist. The most common evidence allows us to derive that each is equally likely (before the experiment commences). After taking observations, we could recalculate the truth of each model given this new evidence. Once more, this scheme is natural in Bayesian probability, but not in frequentism.
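As a rough illustration of this recalculation, the sketch below puts equal prior probability on a handful of candidate models and updates them on M correct out of N. It leans on assumptions that are not part of the post: each model is reduced to a fixed per-cup hit rate, cups are treated as independent (so the likelihood is binomial), and the particular hit rates listed are hypothetical.

```python
from math import comb


def model_posteriors(hit_rates, m_correct, n_cups):
    """Posterior probability of each candidate model given the data.

    ASSUMPTIONS (illustrative, not from the post): each model says every
    cup is guessed correctly, independently, with a fixed hit rate p, so
    the likelihood is binomial; the prior over models is uniform.
    """
    prior = 1.0 / len(hit_rates)
    joint = [
        prior * comb(n_cups, m_correct) * p**m_correct * (1 - p) ** (n_cups - m_correct)
        for p in hit_rates
    ]
    total = sum(joint)
    if total == 0:
        # Every listed model says the observed data are impossible.
        raise ValueError("all candidate models are falsified by the data")
    return [j / total for j in joint]


# Hypothetical candidates: always wrong, coin flip, fairly good, always right.
candidates = [0.0, 0.5, 0.75, 1.0]
after_3_of_8 = model_posteriors(candidates, 3, 8)
after_8_of_8 = model_posteriors(candidates, 8, 8)
print(after_3_of_8)
print(after_8_of_8)
```

With 3 of 8 correct, the hit-rate-1 model (“She always guesses correctly”) ends with posterior probability exactly 0, which is the falsification discussed earlier; with 8 of 8 correct it becomes the most probable of the listed candidates, though not certain.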

Let us now assume a definite model structure and see where it gets us. We suppose the lady guesses each cup correctly or not, that she knows she will see an equal number of tea-first and milk-first cups, and that she is provided no feedback about the correctness of her guesses; we assume her palate never fatigues and that her “hit rate” is the same for either cup type. We will not assume perfection, but we allow its possibility. Indeed, it might even be that she always gets every cup backwards; i.e. she is always wrong, but in a very useful way. This is as bland a set of premises as possible. In advance of the experiment, we will assume merely that she can get any number of cups right, from 0 to N.
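The claim that any score from 0 to N is attainable can be checked by brute force for N = 8. The sketch below assumes one arbitrary ordering of four tea-first and four milk-first cups (the ordering and all names are hypothetical) and enumerates every possible sequence of answers. For comparison it also looks at the narrower case in which she chooses to answer “tea first” exactly four times, which is not something the premises above require of her.

```python
from itertools import product

N = 8
# Hypothetical true ordering: four tea-first (T) and four milk-first (M) cups.
truth = ("T",) * 4 + ("M",) * 4


def score(guess, truth):
    """Number of cups on which the guess matches the truth."""
    return sum(g == t for g, t in zip(guess, truth))


all_guesses = list(product("TM", repeat=N))  # all 2**8 answer sequences

# Scores reachable when she may answer however she likes.
unconstrained_scores = {score(g, truth) for g in all_guesses}

# Scores reachable if she answers "tea first" exactly four times.
balanced_scores = {score(g, truth) for g in all_guesses if g.count("T") == 4}

# Score from answering "tea first" for every cup.
all_tea_score = score(("T",) * N, truth)

print(sorted(unconstrained_scores))
print(sorted(balanced_scores))
print(all_tea_score)
```

Unconstrained answers reach every score from 0 to 8, while answers split four and four can only produce even scores, since each wrong “tea first” forces a wrong “milk first”.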

Next installment: you didn’t think it would be easy, did you?

1. DAV

Ahh! Things go better with deux.

Your model description seems to be troubled by word play which would have been better resolved in part one IMO. The last paragraph comes close but you are still stumbling over “has the ability”. Are you being paid by the column inch?

Also missing (or maybe I missed it) is what your model is attempting to achieve. Are we merely assessing the lady’s ability today ignoring future performance or measuring her predictive power?

Interesting notation “33
Interpretation 2: able to guess N-3 of N

The most common evidence eventually allows us to derive ‘each is equally likely’ (before observations).

Yet past experience indicates all cases (i.e., 0 of N, 1 of N ….) are NOT equally likely. For example, it’s harder to predict 9 of 10 than at least 1 of 10. Not so intuitive is that exactly 1 of 10 is identical to exactly 9 of 10 as it’s just the other side of the coin so to speak implying exactly 9 of 10 incorrect guesses.

2. DAV

hmmm… the html parser changed my “3 of N gt 3″ to something else. Computers!
Preview would be nice.

Interesting notation “3 lt N”
Interpretation 1: 3 of N gt 3
Interpretation 2: able to guess N-3 of N

3. SteveBrooklineMA

DAV- I believe the interpretation intended is “She guessed 3 correctly and 3<N"

4. DAV

SteveBrooklineMA,

Thanks. Also deleted by the parser was “it’s obvious after reading further”. I’m a lazy typist and don’t see all that well. My editing suffers as a result.

5. Briggs

DAV,

Yeah, after I did the bit about embedding in infinite sequences (which is a key criticism of frequentism), I had to come back to the main thread, but realized I had run out of space.

Your last comment is not quite right. I was speaking about the likelihood of the models being true, not of outcomes of the experiment. It will turn out that many do assume equally likely outcomes (frequentists must, Bayesians usually do). But this assumption is orthogonal to model truth.

6. Briggs

Steve, DAV,

It is “3 < N”. I had originally used the native less-than sign, forgetting HTML reserves this symbol.

7. POUNCER

In the initial description “N” = 8. So it seems to me that if she conforms to expectations and identifies 4 milk and four tea, the number correct can NOT be 1, 3, 5 or 7. Each error must generate a complementary error.

Tautologically a lady is no gentleman. And axiomatically, a person who is not a gentleman must be regarded skeptically in participation in games of chance. (Some ungentlemanly persons, to be blunt, cheat.) The lady might identify, say, five of the eight cups as tea in first. But no gentleman would so confound the expectations of the players.

As it happens my no-skill emulation of the challenge produced six matching pairs. So the lady must get eight (or 7, if she cheats) to better, by her skill, my single skill-less performance.

8. Briggs

POUNCER,

Not so. She can guarantee a score of 4 (if N = 8 ) by just saying “tea first” every time, even though she knows there are only four tea-first cups. But she could easily get the first one right, say, and then get every other one wrong. She knows there are four each of tea-first and milk-first, but that does not mean that she must answer tea-first four times and milk-first four times.

9. Uncle Mike

It seems that the experiment is too easy on the lady if she is offered 4 cups of each variety. Her claim of “ability” is not predicated on a controlled experiment with an equal number of samples of each variety. She claims she can discern milk first or tea first in any social setting, with a single cuppa.

Compare it to a claimant of ESP ability. You could show him/her all the cards first, 4 each containing one of two symbols, and then perform the experiment. Or you could show him/her the back of one card without revealing what kind of symbol is on the other side. A clever (strategic) guesser might pass the first test successfully, but only a true psychic could guess the symbol in the second test.

It really does go to the meaning of “has the ability”.

10. Briggs

Uncle Mike,

Amen, brother.

11. pouncer

Briggs replies to me: ” Not so. She can guarantee a score of 4 (if N = 8 ) by just saying ‘tea first’ every time, even though she knows there are only four tea-first cups.”

No true gentleman would do so.

The person who says that 3, or 5, or any other number than 4 of the offered cups is “milk in first” is necessarily implying that the gentleman pouring is a liar.

The upshot of this is to rule out “7” correct, and agree, actually, that in the experiment as described the lady must identify all 8 cups correctly to out-perform the expected no-skill result; which for me, by experiment, is 6.

12. SteveBrooklineMA

I have a simple question. A Lady walks into a room containing a bag of a million numbered marbles. “I’ll pull out number 526218!” she declares, then reaches into the bag without looking and indeed pulls out marble 526218. “Amazing!” says a statistician nearby, “the odds of pulling that marble out were 1 in a million!” A Gentleman walks into the room, reaches into the bag, and pulls out marble 128245. “Amazing!” says a statistician nearby, “the odds of pulling that marble out were 1 in a million!” Looking at these events, we think what the Gentleman did wasn’t amazing at all, he didn’t say which marble he was going to pull out beforehand. But what is the difference from the perspective of the statistician? How should the statistician’s analysis be different for the Gentleman and the Lady? From a hypothesis testing perspective, how would a statistician test if the Gentleman has the special ability to pick marble 128245 out of a bag, based on the single event described? How would it be different than testing the hypothesis that the Lady has the ability to pick out a ball she identifies beforehand?