Read Part I, Part II, Part III. This is the missing Part, which was promised a year ago. We’re all tired of this subject, and there are so many other things to talk about, so today is the last installment. I’ll even try to make it interesting.
It is finally time to reveal what happened. Our good lady guessed M = N = 8 cups: she got them all right. (Though some reports claim she got M = 6 right, missing one milk-first and one tea-first guess.) Remember our goal: we want to know whether or not she “has the ability.” Repeat that before reading further.
The frequentist calculates this:
(4) Pr ( T(M,N) ≥ t(M,N) | she does not have the ability ),
where T(M,N) and t(M,N) are the same mathematical function of the data, but t(M,N) is the value of the statistic we actually observed and T(M,N) is its value in repetitions of the trial, these repetitions being embedded in an infinite sequence of trials.
T(M,N) and t(M,N) are called “statistics”; they are not unique; their use is not deduced. Indeed, for this experiment we have (at least) our choice of the binomial and Fisher’s exact statistics. For the former, (4) = 0.0039 and for the latter (4) = 0.014. We could have easily expanded this list to other popular test statistics, each providing different solutions to equation (4). Fishing around for a test statistic which gives pleasing results is a popular pastime (we want the statistic or statistics which give 0.05 or less for (4), this being a magic number).
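Both quoted p-values can be reproduced with a short calculation. This is a sketch: the function names are mine, and the one-sided Fisher computation assumes (as the design requires) that she must name exactly four cups as milk-first, so her correct milk-first count is hypergeometric.

```python
from math import comb

def binom_p(n, m):
    """P(at least m of n cups correct under pure guessing, p = 1/2)."""
    return sum(comb(n, k) for k in range(m, n + 1)) / 2**n

def fisher_p(n_milk, n_tea, m_min):
    """One-sided Fisher exact p-value: P(at least m_min of the milk-first
    cups correctly identified), given she must label exactly n_milk cups
    as milk-first (hypergeometric count)."""
    total_ways = comb(n_milk + n_tea, n_milk)
    favorable = sum(comb(n_milk, k) * comb(n_tea, n_milk - k)
                    for k in range(m_min, n_milk + 1))
    return favorable / total_ways

print(binom_p(8, 8))      # 0.00390625 (the 0.0039 in the text)
print(fisher_p(4, 4, 4))  # 1/70, about 0.0143 (the 0.014 in the text)
```

Two statistics, two answers to (4), from the same eight cups.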
Which of these is the correct test statistic? Neither. Fisher’s test could be used if the lady knew she was getting exactly 4 cups of each mixture; the binomial could be used if she didn’t; but other choices exist. (It is the lady’s evidence that matters, not yours.) In any case, we have two p-values. Can they help answer our original question? They cannot. Equation (4) is not equation (3), which again is:
(3) Pr ( “She has the ability” | “M = N” & “Experimental set up”).
In no way is (4) a proxy for (3); it is even forbidden in frequentist theory to suppose that it is. Classical theory merely says that if (4) is less than the publishable limit we “reject” the theory “she does not have the ability”. That is, we claim that “she does not have the ability” is false, which necessarily makes “she has the ability” true.
But recall that “she has the ability” had multiple interpretations. Which of these is the frequentist saying is the right one? Well, none of them and all of them. Actually, the answer the frequentist will give when posed this question is usually a variant of, “Is that the bus? I must run.” However, there is still the “agnostic” model; see below.
Incidentally, if she got two wrong, (4) is 0.24 for Fisher’s and 0.14 for the binomial.
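Those two-wrong numbers check out the same way. A self-contained sketch, assuming two wrong means M = 6, i.e. at least 3 of the 4 milk-first cups correctly identified:

```python
from math import comb

n = 8
# Binomial: P(at least 6 of 8 correct by guessing), p = 1/2
binom = sum(comb(n, k) for k in range(6, n + 1)) / 2**n
# Fisher: P(at least 3 of the 4 milk-first cups identified), hypergeometric
# over the C(8,4) = 70 ways to label four cups milk-first
fisher = sum(comb(4, k) * comb(4, 4 - k) for k in range(3, 5)) / comb(8, 4)
print(round(binom, 2), round(fisher, 2))  # 0.14 0.24
```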
The Bayesian cannot answer (3) without first deciding what “She has the ability” means. If he decides, in advance, it means “She always guesses correctly” then as long as M = N this theory has probability 1, i.e. (3) = 1. If M < N then (3) = 0. And that is that.
If we decide it means “She always guesses at least N/2 correctly” then as long as M ≥ N/2, (3) = 1, else it is 0. And similarly for any other interpretation.
That means that if we have one fixed interpretation and are willing to entertain no other, then as long as the observations are consistent with this interpretation, we must continue to believe it is certainly true. And if the evidence is not consistent, we will have falsified our interpretation and must believe it is certainly false. But if we have falsified it, this does not mean we have given a boost to some other theory because, of course, we have already said that there were no other theories.
Please pause here and ensure you understand this. It is a serious and fundamental point.
In order to have non-extreme probabilities attached to a model’s truth, we must have more than one model in contention. One model alone is either true or false: this is a tautology, which is why it does not provide additional evidence (a tautology attached to the premises of any argument does not—cannot—change the probability of the conclusion).
So suppose we have decided that “has the ability” means either M1 = “always guesses correctly” or M2 = “guesses at least N/2 correctly”. Good arguments, after all, can be made for both. Before we see the experiment, based on these arguments, we must assign a probability to each being true. If our evidence is only that we have these two to pick from, then we would assign probability 1/2 to each (this can be derived through the symmetry of individual constants, a subject for another day).
Now if we see M = N – 1 (which is still ≥ N/2) then we have falsified M1; this necessarily makes the probability of M2 equal to 1. And if we see M < N/2 we have falsified both—leaving no alternative. But if M = N, then since this evidence is consonant with both models, we have not changed the probability that either is true.
This is it; this is the answer no matter how many interpretations we initially consider.
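The two-model bookkeeping above can be sketched in a few lines. Here the “likelihood” is bare consistency—1 if the data are consistent with an interpretation, 0 if not—exactly as the text treats it, and all the names are mine:

```python
def update(models, prior, M, N):
    """Update the probability of each interpretation of 'has the ability'
    given M correct out of N, treating consistency as the likelihood."""
    likes = {name: (1.0 if pred(M, N) else 0.0) for name, pred in models.items()}
    total = sum(likes[name] * prior[name] for name in likes)
    if total == 0:
        return None  # every interpretation falsified: no alternative remains
    return {name: likes[name] * prior[name] / total for name in likes}

models = {
    "M1: always correct":        lambda M, N: M == N,
    "M2: at least N/2 correct":  lambda M, N: M >= N / 2,
}
prior = {"M1: always correct": 0.5, "M2: at least N/2 correct": 0.5}

print(update(models, prior, 8, 8))  # consistent with both: still 1/2 each
print(update(models, prior, 7, 8))  # M1 falsified: M2 gets probability 1
print(update(models, prior, 3, 8))  # both falsified: None
```

The same function works for any number of interpretations: add predicates and priors, and the answer is the same in kind.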
The one possibility left is the agnostic model of Part III. Suppose the lady got M = 0 right in N = 40 cups (say). Would you say she “has the ability”? Sort of: she appears to be a perfect negative barometer. If you knew somebody who was always wrong about picking stocks, he would be as useful to you as somebody who was always right.
So we leave ourselves agnostic about her ability and say she could get anything from 0 to N right, as described in Part III. At the end, we remain agnostic, but we are able to predict how well she will do in N’ new trials. This is important because even if we are agnostic, there are different forms of agnosticism. That is, we are assuming uniform agnosticism, but a better model might be one which allows different performance for milk-first and tea-first cups (as described in Part I). Or it could be that milk-first and tea-first cups differ, and that her palate fatigues after W cups. And so on and on for all the other possible models.
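Part III’s agnostic model is not restated here, but if—as an assumption—uniform agnosticism amounts to a uniform prior on her chance of success, the prediction for new trials is the beta-binomial posterior predictive. The function name is mine:

```python
from math import comb, lgamma, exp

def posterior_predictive(M, N, m_new, n_new):
    """P(m_new of n_new future cups correct | she got M of N right),
    assuming a uniform prior on her chance of success (beta-binomial)."""
    def lbeta(a, b):
        return lgamma(a) + lgamma(b) - lgamma(a + b)
    a, b = M + 1, N - M + 1  # posterior over her chance is Beta(M+1, N-M+1)
    return comb(n_new, m_new) * exp(lbeta(m_new + a, n_new - m_new + b) - lbeta(a, b))

# Probability she gets the next single cup right, having gone 8 for 8:
print(round(posterior_predictive(8, 8, 1, 1), 3))  # 0.9
```

Note the prediction is about observables—future cups—not about an unobservable “ability” parameter; that is the point of remaining agnostic.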
Do you see? Being agnostic has not excused us from formulating a model—which we can test and verify on new data. This is natural in Bayes and not so in frequentism (see the papers linked in Part III). But enough is enough. On to something new tomorrow!