Dr B asks a question about what to do about sample sizes and hypotheses after the scourge of p-values and null hypothesis significance testing is recognized for what it is. Dr B gave us a long and amusing scenario, but I’ve cut it down to the essentials.
I have been trying to understand the modern contention that p values and hypothesis testing are useless.
But what about the value of hypothesis testing in modern medicine?
Consider the following scenario. A new disease – pleiomeiosis of the albandigular sulcus (PAS) – has been reported in the past ten years.
Dr. Strangeway believes that a chemical called floccypaucynihilipificon, marketed under the name organimycin, has the potential to cure the disease.
So Dr. Strangeway proposes to test the drug organimycin in PAS patients. He believes that the drug will cure the vast majority of patients in whom it is used. But will it perform better than a placebo?
Dr. Strangeway has heard that randomized trials are no longer considered the “gold standard” for testing potential curative therapy but he can think of no other way than to use a randomized trial to compare organimycin and placebo.
He has heard that hypothesis testing like this is no longer favored as it involves “stupid” and “illogical” statistical tests and arbitrarily chosen values such as 95% confidence intervals and arbitrarily chosen “significance levels” such as p = 0.05.
Instead, Dr. Strangeway proposes to simply measure the length of life of the test group (i.e., the group of PAS patients receiving organimycin) and the placebo group.
But then Dr. Strangeway becomes aware of a problem: How large should each group of patients be? Should the test and control groups contain 10, 100 or 1000 patients?
Dr. Strangeway realizes that he must have some form of hypothesis in order to determine how large each group should be. So he is back to where he started. He needs a randomized trial with a hypothesis and statistical tests using p values to determine the value of organimycin.
Any time you can spend helping me to understand this issue would be appreciated.
I had a touch of pleiomeiosis of the albandigular sulcus a while back. I found rubbing tobacco juice on it cleared it up nicely.
But if you’re determined to try organimycin versus a placebo instead, you must necessarily have a hypothesis, whether or not p-values intrude. One is that organimycin leads to greater survival times than placebo. If you didn’t believe that that hypothesis was at least possible, you’d never bother doing the test.
Here’s where it gets weird.
How big a sample do you need to prove organimycin lengthens lives compared to placebos? Two. One patient to give organimycin, and one to hit with the placebo.
If the organimycin patient lives longer, you have proven your hypothesis. Further, you can count how much more time he lives, and announce, “Organimycin lengthens lives of PAS sufferers X years.”
Simple, yes. If the only—only as in only—difference in the causes of life length is taking organimycin or not, then all we need is this sample of two.
Alas, we can with ease imagine many causes of life length beside organimycin, and we can even imagine that there are many causes we cannot imagine.
So what can we do? Assume. Guess. Hope. Suppose that the causes, the ones we can imagine and those we cannot, are found in equal measure or strength in the two experimental groups we propose.
This is why randomization does nothing for you. If you have known causes, you can (and should) form groups for every known cause. This is not randomization, but control—real control, and not the word as it is used in statistics, which isn’t control but something else. Control, real control, happens all the time in physics and chemistry experiments.
But if you have unknown causes, which of course you will, there is no way—as in no way—to know “randomizing” has split the unknown causes equally. Indeed, the more unknown causes you have, the greater the chance the groups will be uneven in causes.
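A toy simulation can illustrate this claim. Below is a sketch with entirely made-up numbers: each patient carries some number of independent, unknown, 50/50 binary causes, and we “randomize” 50 patients per arm. We then check how often at least one of those causes ends up imbalanced between arms by 20 percentage points or more. Nothing here comes from any real trial; it is only the arithmetic of the point.

```python
import random

def max_imbalance(n_per_group, n_causes, rng):
    """Randomize 2*n_per_group patients, each carrying n_causes independent
    50/50 binary 'unknown causes'; return the largest between-group
    difference in the prevalence of any single cause."""
    worst = 0.0
    for _ in range(n_causes):
        a = sum(rng.random() < 0.5 for _ in range(n_per_group))
        b = sum(rng.random() < 0.5 for _ in range(n_per_group))
        worst = max(worst, abs(a - b) / n_per_group)
    return worst

rng = random.Random(42)
trials = 500
fracs = {}
for k in (1, 10, 100):
    # fraction of simulated trials where SOME cause differs by >= 20 points
    fracs[k] = sum(max_imbalance(50, k, rng) >= 0.2 for _ in range(trials)) / trials
    print(f"{k:>3} unknown causes: {fracs[k]:.2f} of trials have a badly uneven cause")
```

With one unknown cause a bad split is uncommon; with a hundred, some cause is almost always badly split. More unknown causes, more chance of uneven groups.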
Randomization is only to keep you from cheating, to stop you from letting some of the known, or suspected, causes juice the results. But controlled experiments are better than observations because at least some of the known causes are controlled. So the “C” part of RCT is good.
Which brings us to how many to test? Well, one answer is: the more the better. Because the more you test, the sharper your predictive ability. By sharper I don’t mean more accurate. There may be so many other causes, known and unknown, that even if organimycin was a cause, knowledge a person took it only barely increases accuracy of predictions of life length. But with more observations you will come to a better grasp of the uncertainty involved.
“More” is not satisfying when plans must be made. If you want to plan and formally quantify how many, a model necessarily intrudes.
Usually this model will be ad hoc—normality, or some such thing—but on occasion one can be deduced. More are not deduced because folks aren’t used to doing that. We’ll leave that subject for another day, and suppose you have a model of the uncertainty in life lengths—for that is your stated observable.
This model will be parameterized in some way to indicate drug type (organimycin or placebo), and perhaps other ways for controls of known causes.
With me so far?
When the data from your experiment comes in, it will be analyzed in accord with this model. The result will be statements like this:
Pr(life length organimycin > life length placebo | MD) = p,
where “MD” are your model (including “priors”) and data premises, including all the tacit and unspoken premises you didn’t bother writing down, but which are always there. (Statements like these in non-deduced ad hoc models are based on things called posterior predictive distributions.)
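To make the mechanics concrete, here is a minimal Monte Carlo sketch of such a statement. The normal forms and every parameter value are pure placeholders standing in for whatever your fitted model MD actually implies; only the shape of the computation matters.

```python
import random

rng = random.Random(1)

# Hypothetical posterior-predictive draws of a single future patient's
# life length (years) under each arm. Placeholder distributions only.
def draw_organimycin():
    return rng.gauss(12.0, 4.0)

def draw_placebo():
    return rng.gauss(10.0, 4.0)

# Estimate Pr(life length organimycin > life length placebo | MD)
# by simulating many future patient pairs and counting.
n = 100_000
p = sum(draw_organimycin() > draw_placebo() for _ in range(n)) / n
print(f"Pr(life organimycin > life placebo | MD) ~ {p:.3f}")
```

The answer is a plain probability of an observable, conditional on the model and data, not a statement about any parameter.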
That’s not necessarily an interesting statement, because the length of life greater under organimycin could be only a few hours—even if p is (say) 99%! A few hours is technically longer, but not interestingly or usefully longer. So why not write statements which are interesting!
Pr(organimycins live at least 10 years longer than placebos | MD) = p,
where p is set by you as worthy. Your patients may select a different p than you, and a different number of years, and probably should.
But it’s your experiment, and you have to pick values you think are important. Call p, say, 80%. Whatever.
Or you could (and this is best) plot up the probability distributions of life lengths for the two groups, to which anybody can put their own questions. I.e. make predictions which anybody can verify.
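As a sketch of both the “interesting” statement and the plot-it-up advice, the toy below computes Pr(at least 10 years longer | MD) and prints a few quantiles of each predictive distribution, to which anybody can put their own questions. The distributions are invented placeholders, not the output of any real model, and life lengths are crudely floored at zero.

```python
import random

rng = random.Random(7)

# Placeholder posterior-predictive draws (years); stand-ins for
# whatever the fitted model MD actually implies.
organ = [max(0.0, rng.gauss(12.0, 4.0)) for _ in range(50_000)]
placebo = [max(0.0, rng.gauss(10.0, 4.0)) for _ in range(50_000)]

# The "interesting" question: at least 10 years longer.
p10 = sum(o - pl >= 10 for o, pl in zip(organ, placebo)) / len(organ)
print(f"Pr(organimycin >= 10 years longer | MD) ~ {p10:.3f}")

# Quantiles of each predictive distribution, so readers can ask
# their own questions of the same output.
for name, draws in (("organimycin", organ), ("placebo", placebo)):
    q = sorted(draws)
    print(name, [round(q[int(f * len(q))], 1) for f in (0.1, 0.5, 0.9)])
```

Note how “longer at all” and “usefully longer” come apart: the drug can be probably better while being improbably ten years better.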
All right, suppose you have all that in hand. There are then two things you can do.
(1) The easy way, but not the usual way. Compute for every new patient (called “i”) this:
Pr(life length organimycin > life length placebo | MD_i) = p_i.
This p_i will (experience says) likely change for every additional i. But eventually it will settle down, if you were right about your assumptions of causes, and the other non-drug causes don’t change on you midstream in any appreciable way; or they can change, but the host of them together don’t change in an important way. Which you will never know with the unknown causes. Tough luck.
Anyway, once p_i settles down, stop. Close enough. You have your predictions.
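Approach (1) can be sketched like so: as each new patient pair arrives, recompute the running estimate of Pr(organimycin > placebo | data), and stop once it has stopped moving by more than some tolerance. The data generator, the tolerance, and the minimum sample below are all arbitrary choices made purely for illustration; real p_i would come from your model and real patients.

```python
import random

rng = random.Random(3)

def next_patient_pair():
    # Hypothetical stand-in for the next organimycin/placebo pair;
    # real numbers would come from the trial, not a generator.
    return rng.gauss(12.0, 4.0), rng.gauss(10.0, 4.0)

wins = 0
history = []          # running estimates p_i
tol, window = 0.01, 50

for i in range(1, 5001):
    o, p = next_patient_pair()
    wins += o > p
    p_i = wins / i
    history.append(p_i)
    # "settled": estimate moved less than tol over the last `window` pairs
    if i >= 200 and abs(p_i - history[i - 1 - window]) < tol:
        break

print(f"stopped after {i} pairs; Pr(organ > placebo | data) ~ {p_i:.2f}")
```

The stopping rule is yours to pick, and the usual warning applies: if the unseen causes drift midstream, p_i can look settled while being wrong.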
(2) The hard way, and more like the usual way. Same as the first, but you have to introduce a second model of how the data will be collected, and what that data will look like. Which you can’t know, just as with the first model if that was ad hoc, but whose uncertainty you can model—again, usually in an ad hoc way.
Then you can model how that p_i will behave as the sample increases. You then pick the “n” (sample size) at which p_i has settled down “enough”—where enough means something to you, the model user. Like p, it may or may not mean anything to me. Same cautions and caveats about changing causes.
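Approach (2), sketched: posit a second, ad hoc model of what the data will look like, simulate many hypothetical trials at each candidate n, and pick the n at which the estimate wobbles less than a threshold you deem “settled enough”. Every number below is an illustrative assumption, not a recommendation.

```python
import random
import statistics

rng = random.Random(11)

def simulate_trial(n):
    # Second, ad hoc model: a guess at what n patient pairs will look like.
    wins = sum(rng.gauss(12.0, 4.0) > rng.gauss(10.0, 4.0) for _ in range(n))
    return wins / n

def spread(n, reps=400):
    # How much the estimated probability wobbles trial-to-trial at size n.
    return statistics.pstdev(simulate_trial(n) for _ in range(reps))

target = 0.03   # "settled enough" for this model user; purely arbitrary
for n in (25, 50, 100, 200, 400, 800):
    s = spread(n)
    print(f"n = {n:>3}: trial-to-trial spread of p_i ~ {s:.3f}")
    if s < target:
        print(f"n = {n} per comparison is enough for this tolerance")
        break
```

A different user with a different tolerance, or a different guess at the data model, gets a different n, which is the point: the sample size falls out of your premises, not out of a significance level.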
The second approach is similar to the frequentist approach of assuming you know all about the parameters and working backwards with some statistic (this has a formal definition) in mind, but with complete confusion about causes.
We don’t care a whit about some stinkin’ parameters or statistics! We want answers to real questions, questions that have some use, and say something about observables people care about.
So, to answer your meta-question, no. We don’t need p-values, we don’t need hypothesis testing, we only need plain language questions and probability.