This example is derived from ongoing conversations with a colleague, and portions of this post (mathematified) might show up in a paper.
The genesis of this example is from Ron Christensen’s 2005 paper “Testing Fisher, Neyman, Pearson, and Bayes” in American Statistician (vol. 59, pp. 121–126). He says “The example involves data that have four possible outcomes, r = 1, 2, 3, 4. The distribution of the data depends on a parameter that takes on values θ = 0, 1, 2. The distributions are defined by their discrete densities f (r|θ) which are given in Table 1.”
What does this θ mean? Christensen doesn’t say, but it is a marker, a stand-in or shorthand for evidence, and not any ontological thing. One possible way to get to the marker is to imagine we have three bags, each with 1,000 balls numbered 1–4 in the proportions Table 1 indicates (multiply each entry by 1,000). Then, drawing with replacement, if θ = 0 we deduce the probability of drawing out a 1 as 0.98, and so on for the other labels. If θ = 1, we deduce 0.1, and if θ = 2 we deduce 0.098. Thus θ is a marker or proxy for a bag. The bags and balls exist, and the marker points to a bag, as a name. Of course, it does not have to be replaceable balls in bags, but it will be something like that in essence. We deduce that from the evidence Christensen provided.
There are now two possibilities: (1) We want to predict what draws from bags will look like given either we know θ or we don’t; or (2) We want to predict which bag a given sample came from assuming the sample came from one bag alone (we could also predict which two bags a sample came from, assuming the sample came from two and not three).
We predict the unknown in both possibilities. Future samples are unknown, and the bag from which a sample came is unknown. So both problems are predictions. Now you can call (2) a “test” if you like, but that makes (1) a test, too. It is better to stick with predict because we are less apt to make a mistake in what we are about.
What evidence have we about θ? One thing is that no observed value from 1–4 is impossible, regardless of θ’s value. No sample is impossible for any θ value, so that if we see any sample no θ in (2) can be rejected. Falsification will never enter the picture here, as it almost never does.
A second thing is that θ can and must take one of three values. From that, using the statistical syllogism, we deduce Pr(θ = i | Christensen’s evidence) = 1/3, for i = 0, 1, 2. We can write C = “Christensen’s evidence” for shorthand.
Prediction (1) is now easy. We don’t need to use simulations, which anyway aren’t magic. The probability of 1 is calculated with ease: 0.980 x 1/3 + 0.10 x 1/3 + 0.098 x 1/3 = 0.393. And so on for the other three labels. From this we deduce a multinomial distribution (where using the notation in that link we deduce p_1=0.393, p_2=p_3=0.0687, and p_4=0.47). Predicting future samples is thus simplicity itself.
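The averaging over bags can be sketched in a few lines of Python. Table 1 is not reproduced in this post, so the `TABLE` values below are an assumption: they agree with the r = 1 column quoted above (0.98, 0.1, 0.098) and with the stated p_1 through p_4, but should be checked against Christensen's paper. The names `TABLE`, `PRIOR`, and `predictive` are made up for this illustration.

```python
# Assumed copy of Christensen's Table 1: f(r | theta) for r = 1, 2, 3, 4.
# The table is not shown in the post; verify these numbers against the paper.
TABLE = {
    0: [0.980, 0.005, 0.005, 0.010],
    1: [0.100, 0.200, 0.200, 0.500],
    2: [0.098, 0.001, 0.001, 0.900],
}

# Pr(theta = i | C) = 1/3 for each i, from the statistical syllogism.
PRIOR = {theta: 1 / 3 for theta in TABLE}


def predictive(weights, table=TABLE):
    """Pr(r | weights) = sum_i f(r | theta = i) * weights[i], for r = 1..4."""
    return [sum(table[t][r] * weights[t] for t in table) for r in range(4)]


p = predictive(PRIOR)
print([round(x, 4) for x in p])
```

Under the assumed table the four numbers agree, to rounding, with the p_1 through p_4 given above, and they sum to 1.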
What about predicting which bag a given sample came from; i.e. which θ is correct? Also easy, if a bit more mathematically cumbersome (where S = the observed sample (n_1, n_2, n_3, n_4)):
Pr(θ = i | C, S) = Pr(S|θ = i, C)Pr(θ = i|C) / sum_j Pr(S|θ = j, C)Pr(θ = j|C)
where Pr(S|θ = i, C) is the “likelihood” computed from Table 1, for i, j = 0, 1, 2.
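A minimal Python sketch of this posterior calculation follows. As before, the `TABLE` values are an assumed copy of Table 1 (not reproduced in this post) and the function name `posterior` is made up for illustration. The multinomial coefficient is the same for every θ, so it cancels in the ratio and is left out.

```python
import math

# Assumed copy of Christensen's Table 1: f(r | theta) for r = 1, 2, 3, 4.
# The table is not shown in the post; verify these numbers against the paper.
TABLE = {
    0: [0.980, 0.005, 0.005, 0.010],
    1: [0.100, 0.200, 0.200, 0.500],
    2: [0.098, 0.001, 0.001, 0.900],
}
PRIOR = {theta: 1 / 3 for theta in TABLE}  # Pr(theta = i | C)


def posterior(sample, table=TABLE, prior=PRIOR):
    """Pr(theta = i | C, S) for counts S = (n_1, n_2, n_3, n_4)."""
    # Log of prior times likelihood, dropping the multinomial coefficient,
    # which is identical for every theta and cancels when normalising.
    log_w = {
        t: math.log(prior[t])
        + sum(n * math.log(f) for n, f in zip(sample, table[t]))
        for t in table
    }
    # Subtract the largest value before exponentiating to avoid underflow
    # when the sample is large.
    m = max(log_w.values())
    w = {t: math.exp(v - m) for t, v in log_w.items()}
    total = sum(w.values())
    return {t: v / total for t, v in w.items()}


# Example: a single draw that came up 4, i.e. S = (0, 0, 0, 1).
post = posterior((0, 0, 0, 1))
print(post)
```

For that single draw the posterior is proportional to the r = 4 column of the assumed table, (0.010, 0.500, 0.900), because the equal priors cancel.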
And that’s it. We calculate Pr(θ = 0 | C, S), Pr(θ = 1 | C, S), and Pr(θ = 2 | C, S) and then we feed those numbers into our decision process and predict which θ was right.
A tacit premise, which was after all the same in (1), is usually that we are indifferent about the kind of mistake, and that being right about one θ has the same consequences for us as any other θ. If that’s true, then we pick the largest Pr(θ = i | C, S), or if there is a tie, we are indifferent to which θ.
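Under that symmetric-loss premise the decision step is just an argmax that keeps track of ties. A small sketch, where the function name `pick_theta`, the tolerance, and the example probabilities are all hypothetical and chosen only to show the behaviour:

```python
def pick_theta(post, tol=1e-12):
    """Return every theta whose posterior probability ties for the largest.

    `post` maps each theta to Pr(theta | C, S). A list with more than one
    entry means we are indifferent between those values.
    """
    best = max(post.values())
    return sorted(t for t, p in post.items() if best - p <= tol)


# Hypothetical posteriors, for illustration only.
clear_winner = pick_theta({0: 0.7, 1: 0.2, 2: 0.1})
tied = pick_theta({0: 0.2, 1: 0.4, 2: 0.4})
print(clear_winner, tied)
```

With an asymmetric loss the rule would instead minimise expected loss, which need not pick the most probable θ.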
Nowhere did we use a p-value, Neyman-Pearson nulls and alternates, or Bayes factors¹, or any other kind of traditional test. We stuck entirely with probability: probability quantifying the uncertainty in the unknown (either (1) or (2)), where all probabilities were deduced from the evidence given. This process is thus entirely objective, and entirely justified. There is nothing ad hoc about it, nor does it contain extra-probability evidence as testing does.
It is plain that we might be wrong about our guess which bag the sample was drawn from, but that’s so much tough luck. If after we take a sample we want to make predictions about new values of the observable, then we need only replace each Pr(θ = i | C) with Pr(θ = i | C, S), and we’re back to simplicity itself. We do not need to pick which θ was right! We recognize we should make predictions that incorporate all the uncertainty we have.
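Swapping the prior weights for the posterior ones can be sketched as follows, in Python. The `TABLE` values are again an assumed copy of Table 1 (not reproduced in this post), and `posterior` and `predictive` are illustrative names, not anything from Christensen's paper.

```python
import math

# Assumed copy of Christensen's Table 1: f(r | theta) for r = 1, 2, 3, 4.
# The table is not shown in the post; verify these numbers against the paper.
TABLE = {
    0: [0.980, 0.005, 0.005, 0.010],
    1: [0.100, 0.200, 0.200, 0.500],
    2: [0.098, 0.001, 0.001, 0.900],
}
PRIOR = {theta: 1 / 3 for theta in TABLE}  # Pr(theta = i | C)


def posterior(sample, table=TABLE, prior=PRIOR):
    """Pr(theta = i | C, S) for counts S = (n_1, n_2, n_3, n_4)."""
    log_w = {
        t: math.log(prior[t])
        + sum(n * math.log(f) for n, f in zip(sample, table[t]))
        for t in table
    }
    m = max(log_w.values())  # stabilise the exponentials
    w = {t: math.exp(v - m) for t, v in log_w.items()}
    total = sum(w.values())
    return {t: v / total for t, v in w.items()}


def predictive(weights, table=TABLE):
    """Pr(next draw = r | weights), averaging over every theta."""
    return [sum(table[t][r] * weights[t] for t in table) for r in range(4)]


# Example: we saw a single 4, so S = (0, 0, 0, 1).
S = (0, 0, 0, 1)
new_p = predictive(posterior(S))
print([round(x, 4) for x in new_p])
```

No θ is picked anywhere in this calculation: every bag contributes to the prediction in proportion to its posterior probability.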
There you go. That’s the predictive method, clear and consistent.
¹Bayes factors are mathematically equivalent here. Prove that. They are not preferred because BFs are ratios, which are hard to interpret, and because the motivation for BFs is backwards: they start by speaking of the probability of samples (data) and not of the probability of interest.
You’ll have noticed the main posterior equation can be simplified (if not, notice it). Write the R code for a sample S = (n_1, n_2, n_3, n_4), and for predicting future samples given we saw S.
The equation 0.980 x 1/3 + 0.10 x 1/3 + 0.098 x 1/3 = 0.393 was given. If you don’t understand it, derive it.
Yet another controversial subject that apparently has everyone stunned speechless.
“Nowhere did we use a p-value, Neyman-Pearson nulls and alternates, or Bayes..”
But Christensen did, and I think those approaches are much more convincing to me.
“If that’s true, then we pick the largest Pr(θ = i | C, S), ..”
How do you justify that if the sample is random and you will observe different draws given different samples? Also, this is in the best case where you do know the distributions. What if you don’t know them? Where do you take all this uncertainty into account?
I’ll convert you yet. The probability approach, assuming symmetric cost-loss (which NP or p-values ignore), says, “Pick the one which has the highest chance of being correct”. Which is plain English and optimal.
Random only means unknown, and nothing more. Yes, the case I illustrated knew the distributions, which is why the choice is optimal. If you don’t know, you must assume something. About how, see this award eligible book: