We have our model in hand. Our model, “has the ability” (see Part II), says that the lady can guess any number of the N cups correctly. All the lady knows is that N is divisible by 2 and that she will see an equal number of milk-first and tea-first cups. She will receive no feedback on her guesses. Thus, we do not assume (initially) she will employ an optimal guessing strategy.
What is an optimal guessing strategy? Suppose we gave the lady feedback and told her whether her guesses were right or wrong as the experiment progressed. If, say, the first four cups were all milk-first and she knew she got these all correct, even if she has no ability and did so just by guessing, then (if she was paying attention) she ought to get the last four correct, too (even before tasting!). My experience with ESP testing suggests most people do not use optimal guessing strategies, but if they did we could account for it, though it’s not easy to do so. So for ease, we’ll forbid feedback.
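To make the feedback example concrete, here is a minimal sketch of the bookkeeping an attentive guesser could do. The function names and the "milk"/"tea" labels are made up for illustration, and the sketch assumes the feedback lets her work out the true type of every cup already tasted.

```python
def remaining_cups(n_cups, revealed):
    """Count the milk-first and tea-first cups still to come.

    revealed is the list of true cup types ("milk" or "tea") that the
    lady has worked out from feedback so far (an assumption of this sketch).
    """
    half = n_cups // 2  # she knows there are equal numbers of each type
    milk_left = half - revealed.count("milk")
    tea_left = half - revealed.count("tea")
    return milk_left, tea_left


def forced_guess(n_cups, revealed):
    """Return the guess that must be right for the next cup, or None."""
    milk_left, tea_left = remaining_cups(n_cups, revealed)
    if milk_left == 0 and tea_left > 0:
        return "tea"
    if tea_left == 0 and milk_left > 0:
        return "milk"
    return None  # both types remain, so nothing is forced yet


# The example from the text: N = 8 and the first four cups were milk-first.
print(remaining_cups(8, ["milk"] * 4))  # (0, 4)
print(forced_guess(8, ["milk"] * 4))    # tea
```

With four milk-first cups already accounted for, every remaining cup must be tea-first, so no tasting is needed. This is the kind of information that forbidding feedback removes.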
Recall that, in Bayes, all probabilities are conditional, so that we need to be clear about what premises we are conditioning on. All probabilities are conditional in frequentism, too, but this is not acknowledged, so the premises are often hidden (which is one path to over-certainty).
Question 1 Given this model (and only our other premises), and before running the experiment, what is the probability the lady guesses 0 right, 1 right, 2 right, up to N right? This question is equivalent to asking what fraction of cups she will guess correctly: 0/N, 1/N, up to N/N. It is not equivalent to asking what sequence of correct and incorrect guesses she will evince. The question about the fraction is easily answered: the probability that she guesses j cups correctly is 1 / (N+1) for each of j = 0, 1, …, N.
Stated yet one more way, since we have assumed as a premise the model that she may guess any number of cups correctly, the probability that she does so is 1 divided by the number of possibilities. (That last statement is not assumed, but is derived: those who want the full-blown mathematical details may download this paper, which itself relies on this paper.)
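For readers who like to see the numbers, here is a small sketch of the Question 1 answer. The function name is invented for illustration, and N = 8 is only the example size used earlier, not something the model requires.

```python
from fractions import Fraction


def prior_correct_counts(n_cups):
    """Probability of each possible number of correct guesses, 0 to n_cups.

    Under the model there are n_cups + 1 possibilities, and each one
    gets probability 1 / (n_cups + 1).
    """
    return {j: Fraction(1, n_cups + 1) for j in range(n_cups + 1)}


probs = prior_correct_counts(8)
print(probs[0], probs[8])    # 1/9 1/9
print(sum(probs.values()))   # 1
```

Exact fractions are used so the probabilities visibly sum to 1 with no rounding.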
Question 2 Suppose we run our experiment for 2 < N cups and are interrupted (in the paper linked above, we use testing nuclear reactors instead of cups of tea, where interruptions are common). Given our model and premises, but also given her guesses up to this point, what is the probability that she guesses 0 cups right, 1 right, up to N – 2 cups right? The exact answer has a simple mathematical form (given in the first paper linked). But the real point of interest for us is that this answer exists naturally in Bayes, but not in frequentism, another major criticism.
Question 3 The experiment is finished! She has guessed M correct out of N (M is a sum of the correct milk-first and correct tea-first cups). Here is a non-trick question: Given our model and given M, what is the probability that she guessed a fraction K / N correct, where K does not equal M? It is 0, or 0%. A silly question to ask, yes, but let’s expand it. Same premises: what is the probability she guessed a fraction M / N correct? It is 1, or 100%. Another silly question, trivially answered. So why bother?
Frequentist theory would have us ask something like this: what is the probability that she guessed (M + 1) / N correct, and the probability she guessed (M + 2) / N correct, and (M + 3) / N correct, up to N / N correct? In Bayes, the sum of these probabilities is 0, as we just agreed. But not in frequentism, where the meaning of the word “guessed” is changed. It no longer means “guessed” but “Might be guessed were we to embed the experiment in an infinite series of experiments, each ‘identical’ with the first but ‘randomly’ different; we also hypothesize that if we were to average the correct guesses of this infinite stream, the result would be precisely N / 2 correct guesses.”
In other words, frequentist theory demands we calculate a probability of what could have happened, but did not, in “repeated trials” (where “repeated trials” is shorthand for “embedded in a sequence of infinite repetitions”). The theory must also hypothesize a baseline, a belief that the infinite sequence converges to some precise average (here, N / 2 correct guesses). Stated differently, frequentist theory asks the probability of seeing results “better” or “worse” than what we actually saw, given the model is true, a value for the baseline, and M.
This violates our agreement that we should use only the evidence from the experiment (and knowledge of the experimental set up) to judge the truth of our model. Frequentism does not make statements about what happened, but about what might have happened but did not, in experiments that will never be conducted.
This probability is the P-value. If the P-value is “small”, the hypothesis that the baseline is N / 2 is “rejected”, i.e., it is believed to be certainly false. I mean certainly in the sense of certainly. The P-value does not give a probability that the baseline is false: it instead asks you to believe absolutely in the truth or falsity of some contingent hypothesis (i.e. that the “baseline = N / 2”). In other words, a decision based on the P-value implies that the probability of “baseline = N / 2” is 1 or 0 and no other number. A subtle, but damning, criticism is that (except in circular arguments) no contingent hypothesis can be certainly true or false, so the use of the P-value is immediately unsound.
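The two kinds of question can be put side by side in a rough sketch. Nothing above says which sampling distribution the frequentist uses, so the code assumes, purely for illustration, independent guesses each with chance 1/2 of being right (a binomial whose long-run average is N / 2). The function names, N = 8, and the result M = 6 are all hypothetical.

```python
from math import comb


def probability_after_the_fact(k, m):
    """Given that she guessed m correct, the probability she guessed k correct."""
    return 1 if k == m else 0


def upper_tail_p_value(m, n_cups, chance=0.5):
    """Probability of m or more correct guesses in imagined repetitions.

    Assumes a binomial with the given chance per cup; this is an
    assumption of the sketch, not something stated in the text.
    """
    return sum(
        comb(n_cups, k) * chance**k * (1 - chance) ** (n_cups - k)
        for k in range(m, n_cups + 1)
    )


n_cups, m = 8, 6  # hypothetical result: 6 of 8 correct

# Probability that she guessed more than M correct, given that she guessed M.
bayes_sum = sum(probability_after_the_fact(k, m) for k in range(m + 1, n_cups + 1))
print(bayes_sum)  # 0

# Tail probability over results that did not occur (conventionally including M).
print(upper_tail_p_value(m, n_cups))  # 0.14453125
```

The first number concerns what happened. The second, 37/256, depends entirely on the assumed baseline and on results that were never observed; change the assumed distribution and it changes too, while the first stays put.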
Harold Jeffreys (homework from Part I) said, “What the use of P [values] implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.” Your homework this time is to explain this quote in the tea-tasting context. Why ask for probabilities of events that did not occur? Why is the P-value (see Parts I and II) not the answer to the question, “What is the probability she has the ability?”
In Part IV: “Hey, what about Fisher’s exact test! Surely that fixes frequentism?” It does not, and don’t call me Shirley.