Dr B asks a question about what to do about sample sizes and hypotheses once p-values and null hypothesis significance testing are recognized for the scourge they are. Dr B gave us a long and amusing scenario, but I’ve cut it down to the essentials.

I have been trying to understand the modern contention that p values and hypothesis testing are useless.

But what about the value of hypothesis testing in modern medicine?

Consider the following scenario. A new disease – pleiomeiosis of the albandigular sulcus (PAS) – has been reported in the past ten years.

Dr. Strangeway believes that a chemical called floccypaucynihilipificon, marketed under the name organimycin, has the potential to cure the disease.

So Dr. Strangeway proposes to test the drug organimycin in PAS patients. He believes that the drug will cure the vast majority of patients in whom it is used. But will it perform better than a placebo?

Dr. Strangeway has heard that randomized trials are no longer considered the “gold standard” for testing potential curative therapy, but he can think of no other way than to use a randomized trial to compare organimycin and placebo.

He has heard that hypothesis testing like this is no longer favored as it involves “stupid” and “illogical” statistical tests and arbitrarily chosen values such as 95% confidence intervals and arbitrarily chosen “significance levels” such as p = 0.05.

Instead, Dr. Strangeway proposes to simply measure the length of life of the test group (i.e., the group of PAS patients receiving organimycin) and the placebo group.

But then Dr. Strangeway becomes aware of a problem: How large should each group of patients be? Should the test and control groups contain 10, 100 or 1000 patients?

Dr. Strangeway realizes that he must have some form of hypothesis in order to determine how large each group should be. So he is back to where he started. He needs a randomized trial with a hypothesis and statistical tests using p values to determine the value of organimycin.

Any time you can spend helping me to understand this issue would be appreciated.

I had a touch of pleiomeiosis of the albandigular sulcus a while back. I found rubbing tobacco juice on it cleared it up nicely.

But if you’re determined to try organimycin versus a placebo instead, you must necessarily have a hypothesis, whether or not p-values intrude. One is that organimycin leads to greater survival times than placebo. If you didn’t believe that that hypothesis was at least possible, you’d never bother doing the test.

Here’s where it gets weird.

How big a sample do you need to prove organimycin lengthens lives compared to placebos? *Two*. One patient to give organimycin, and one to hit with the placebo.

If the organimycin patient lives longer, you have proven your hypothesis. Further, you can count how much more time he lives, and announce, “Organimycin lengthens lives of PAS sufferers X years.”

Simple, no?

Simple *yes*. If the *only*—*only* as in *only*—difference in cause of life length is taking organimycin or not, then all we need is this sample of two.

Alas, we can with ease imagine *many* causes of life length beside organimycin, and we can even imagine that *there are many causes we cannot imagine*.

So what can we do? Assume. Guess. Hope. Suppose that the causes, the ones we can imagine and those we cannot, are found in equal measure or strength in the two experimental groups we propose.

This is why randomization does nothing for you. If you have known causes, you can (and should) form groups for every known cause. This is not randomization, but control—real control, and not the word as it is used in statistics, which isn’t control but something else. Control, real control, happens all the time in physics and chemistry experiments.

But if you have unknown causes, which of course you will, there is no way—as in *no way*—to know “randomizing” has split the unknown causes equally. Indeed, the more unknown causes you have, the greater the chance the groups will be uneven in causes.

Randomization is only to keep you from cheating, to stop you from letting some of the known, or suspected, causes juice the results. But *controlled* experiments are better than observations because at least some of the known causes are controlled. So the “C” part of RCT is good.
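The claim that more unknown causes make imbalance more likely can be checked with a toy simulation. This is a sketch under invented assumptions: each patient carries k hidden binary causes, and we count how often simple randomization leaves at least one of those causes noticeably unbalanced between two groups of 50.

```python
import numpy as np

rng = np.random.default_rng(3)

def any_cause_unbalanced(n_per_group, k, rng, gap=0.2):
    """Randomize 2*n patients into two groups; each patient carries k hidden
    binary causes (present/absent with probability 1/2). Return True if any
    cause's prevalence differs between the groups by more than `gap`."""
    causes = rng.integers(0, 2, size=(2 * n_per_group, k))
    order = rng.permutation(2 * n_per_group)          # the "randomization"
    g1 = causes[order[:n_per_group]]
    g2 = causes[order[n_per_group:]]
    return bool(np.any(np.abs(g1.mean(axis=0) - g2.mean(axis=0)) > gap))

trials, fracs = 2000, {}
for k in (1, 5, 20, 100):
    fracs[k] = np.mean([any_cause_unbalanced(50, k, rng) for _ in range(trials)])
    print(f"{k:3d} hidden causes: share of randomizations "
          f"with an unbalanced cause ~ {fracs[k]:.2f}")
```

With one hidden cause, bad splits are rare; with a hundred, nearly every randomization leaves some cause lopsided.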

Which brings us to how many to test? Well, one answer is: the more the better. Because the more you test, the sharper your predictive ability. By *sharper* I don’t mean more accurate. There may be so many other causes, known and unknown, that even if organimycin were a cause, knowledge that a person took it would only barely increase the accuracy of predictions of life length. But with more observations you will come to a better grasp of the uncertainty involved.

“More” is not satisfying when plans must be made. If you want to plan and formally quantify how many, a model necessarily intrudes.

Usually this model will be *ad hoc*—normality, or some such thing—but on occasion one can be deduced. More are not deduced because folks aren’t used to doing that. We’ll leave that subject for another day, and suppose you have a model of the uncertainty in life lengths—for that is your stated observable.

This model will be parameterized in some way to indicate drug type (organimycin or placebo), and perhaps other ways for controls of known causes.

With me so far?

When the data from your experiment comes in, it will be analyzed in accord with this model. The result will be statements like this:

Pr(life length organimycin > life length placebo | MD) = p,

where “MD” are your model (including “priors”) and data premises, including all the tacit and unspoken premises you didn’t bother writing down, but which are always there. (Statements like these in non-deduced *ad hoc* models are based on things called predictive posterior distributions.)
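For concreteness, here is a minimal sketch of how such a predictive statement can be computed. Everything in it is assumed: the life lengths are invented, and the model is the ad hoc choice of a normal with a flat prior, under which the posterior predictive for a new patient is a shifted, scaled Student-t.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical life lengths (years after diagnosis); invented for illustration.
organimycin = np.array([8.1, 12.4, 9.7, 15.2, 11.0, 10.3, 13.8, 9.9])
placebo     = np.array([6.2,  7.9, 5.4,  9.1,  8.3,  6.7,  7.5, 10.2])

def predictive_draws(data, size, rng):
    """Posterior-predictive draws for one NEW patient under an ad hoc normal
    model with a flat prior: a shifted, scaled Student-t with n-1 df."""
    n = len(data)
    xbar, s = data.mean(), data.std(ddof=1)
    return xbar + s * np.sqrt(1 + 1 / n) * rng.standard_t(n - 1, size)

N = 200_000
new_o = predictive_draws(organimycin, N, rng)
new_p = predictive_draws(placebo, N, rng)

# Pr(life length organimycin > life length placebo | MD)
p_longer = float((new_o > new_p).mean())
# The more interesting question: Pr(at least 10 years longer | MD)
p_10yr = float((new_o - new_p >= 10).mean())

print(f"Pr(O > P | MD)          ~ {p_longer:.2f}")
print(f"Pr(O - P >= 10 yr | MD) ~ {p_10yr:.3f}")
```

Note how the two probabilities answer different questions from the same model and data: “longer at all” can be high while “usefully longer” stays small.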

That’s not necessarily an interesting statement, because the length of life greater under organimycin could be only a few hours—even if p is (say) 99%! A few hours is technically longer, but not interestingly or usefully longer. So why not write statements which *are* interesting!

Pr(organimycins live at least 10 years longer than placebos | MD) = p,

where p is set by you as worthy. Your patients will select a different p than you, and a different number of years, and probably should.

But it’s your experiment, and you have to pick values you think are important. Call p, say, 80%. Whatever.

Or you could (and this is best) plot up the probability distributions of life lengths for the two groups, to which anybody can put their own questions. I.e. make predictions which anybody can verify.

All right, suppose you have all that in hand. There are then two things you can do.

**(1)** The easy way, but not the usual way. Compute for every new patient (called “i”) this:

Pr(life length organimycin > life length placebo | MD_i) = p_i.

This p_i will (experience says) likely change for every additional i. But eventually it will settle down, if you were right about your assumptions of causes, and the other non-drug causes *don’t change* on you midstream in any appreciable way; or they can change, but the host of them together don’t change in an important way. Which you will never know with the unknown causes. Tough luck.

Anyway, once p_i settles down, stop. Close enough. You have your predictions.
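The easy way can be sketched in code. All numbers and the stopping tolerance are invented for illustration; the model is an ad hoc normal with a flat prior, and patients are fed in one at a time until p_i stops wiggling.

```python
import numpy as np

rng = np.random.default_rng(7)

def pred_prob_longer(drug, plac, rng, draws=50_000):
    """Pr(new drug patient outlives new placebo patient | model, data-so-far),
    under an ad hoc normal model with a flat prior (predictive is Student-t)."""
    def pred(data, size):
        n = len(data)
        xbar, s = data.mean(), data.std(ddof=1)
        return xbar + s * np.sqrt(1 + 1 / n) * rng.standard_t(n - 1, size)
    return float((pred(drug, draws) > pred(plac, draws)).mean())

# A simulated stream of patients. The true generating process is unknown in
# real life; these numbers are invented purely for illustration.
drug_stream = rng.normal(11.0, 3.0, size=200)
plac_stream = rng.normal(8.0, 3.0, size=200)

history, tol, window = [], 0.01, 10
for i in range(3, 201):            # need a few patients before s means anything
    history.append(pred_prob_longer(drug_stream[:i], plac_stream[:i], rng))
    # Stop once p_i has "settled down": little wiggle over the last `window` updates.
    if len(history) >= window and max(history[-window:]) - min(history[-window:]) < tol:
        print(f"Settled after {i} patients per arm: p_i ~ {history[-1]:.2f}")
        break
```

The tolerance and window are the experimenter’s call, just as the text says: “enough” means something to you, and may mean nothing to me.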

**(2)** The hard way, and more like the usual way. Same as the first, but you have to introduce a *second model* of how the data will be collected, and what that data will look like. Which you can’t know, just as with the first model if that was *ad hoc*, but whose uncertainty you can model—again, usually in an *ad hoc* way.

Then you can model how that p_i will behave as the sample increases. You then pick the “n” (sample size) at which p_i has settled down “enough”—where enough means something to you, the model user. Like p, it may or may not mean anything to me. Same cautions and caveats about changing causes.
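The hard way might be sketched like this: pick a second, admittedly ad hoc model of what the data will look like (here, normals with a known sigma, so q has a closed form), simulate many imagined trials at each candidate n, and choose the smallest n at which q has settled down enough across those imagined trials. All parameter values are invented.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Second, ad hoc model of what the data WILL look like (a guess, by definition):
# drug lives ~ Normal(11, 3^2), placebo ~ Normal(8, 3^2), sigma treated as known.
MU_D, MU_P, SIGMA = 11.0, 8.0, 3.0

def q_of_imagined_trial(n, rng):
    """q = Pr(new drug patient outlives new placebo patient | model, imagined data),
    in closed form for a known-variance normal model with flat priors on the means.
    The group means are drawn directly from their sampling distributions."""
    xbar_d = rng.normal(MU_D, SIGMA / sqrt(n))
    xbar_p = rng.normal(MU_P, SIGMA / sqrt(n))
    z = (xbar_d - xbar_p) / (SIGMA * sqrt(2 * (1 + 1 / n)))
    return 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF at z

spreads = {}
for n in (10, 50, 100, 500):
    qs = np.array([q_of_imagined_trial(n, rng) for _ in range(2000)])
    spreads[n] = np.percentile(qs, 95) - np.percentile(qs, 5)
    print(f"n={n:4d} per arm: q spans {np.percentile(qs, 5):.2f} .. "
          f"{np.percentile(qs, 95):.2f} across imagined trials")
```

As n grows the spread of q across imagined trials shrinks; where it is narrow “enough” is, again, the model user’s decision.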

The second approach is similar to the frequentist approach of assuming you know all about the parameters and working backwards with some statistic (this has a formal definition) in mind. But in the frequentist version there is a complete confusion about causes.

We don’t care a whit about some stinkin’ parameters or statistics! We want answers to *real questions*, questions that have some use, and say something about observables people care about.

So, to answer your meta-question, no. We don’t need p-values, we don’t need hypothesis testing, we only need plain language questions and probability.


Well that’s all fine and good, but does Dr Strangeway still have a Wee-P? Enquiring minds want to know …

Left to statistics the universe wouldn’t exist, much less the exponential improbability of matter that thinks. Try to remember we all got here in test tubes and we’re leaving on PCR swabs.

INVOICE

TO: Dr. B

FROM: Dr. Briggs Consulting, LLC

ITEM(S): Deliver blueprint for conceptual framework of clinical trial investigating new drug organimycin.

1.25 hours @ $875/hour = $1,093.75

Payable on receipt.

Thank you.

Briggs ==> A fine essay and a good description of why “length of life” (often considered as “All Cause Mortality”) is a lousy measure of the effectiveness of medical treatments (and nutrition studies).

Literally EVERYTHING in the world affects how long a person lives — and under normal circumstances, the list is at least very very long.

Far better to actually test for “does this treatment cure this disease?” or “does this treatment substantially reduce symptoms to the satisfaction of the patient?”

Interesting example. I’ve never done clinical testing, so I imagine I know roughly how hard it is, but have no experience to test that against. That said, what do you think of this approach:

1 – assuming I think O works and want to test that belief, it is realistic to assume that someone else will think it does not and want to test that. So step one: imagining myself having access to a large clinic or network of clinics, find someone with similar access who disagrees.

2 – I give O to every patient for whom this seems appropriate. Doc X never gives O.

3 – X and I agree on some quality of life (outcome) indicators and, once a month or so, compare results for my patients to his.

—

Obviously: no n, no P (wee or otherwise), no ethical issues in denying a patient a treatment I believe will work. Equally obviously, more is better, and the measurement is real-world rather than theory, the only model hanging around being the one embedded in the quality-of-life indicators used to evaluate outcomes – and they can be changed at any time.

Equally, but less obviously, the duration issue largely goes away, along with the problem of controlling for lifestyle and related issues differentiating the drug and placebo groups. The latter happens because the two patient populations will differ in many ways, but that won’t matter (it may actually be a plus) because we’re not testing to see if O works for pat_i; we’re testing to see if it might work for the population from which pat_i is drawn.

The probability distribution used to calculate “p” – where p is the probability that patients on organowhatever live 10 years longer than patients not on it – is that distribution calculated in the same way as the probability distribution used to calculate the other kind of “p” (the kind that may or may not be wee)?

The doctor gets done with his trial and his wife asks, “Well? Does it work?”

Will he calculate the average life of each group along with the variance and use those numbers to estimate:

1. How much longer a patient on the drug might expect to live on average, given what he knows.

2. The chance that the extension of life is at least 10 years (Briggs’s p).

3. The chance that the extension is less than 0 hours (traditional p).

It seems like we’re doing the same math, just understanding it in a different way. Am I right?

Darin,

No, not the same. My fault, I think, for using “p” to designate the numerical value of a probability. I’ll change to a “q”, maybe, in future versions.

The Wee p of infamy is this:

Pr(statistic larger than we saw given infinite repetitions of experiment | there really is no difference in life length).

Whereas we are asking for our (now) “q”:

Pr(life length at least 10 years longer for drug takers | model, data).

Not even in the same frame.
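To make the contrast concrete, here is a toy computation of both quantities from the same invented data: a permutation test standing in for the wee p, and the predictive q under an ad hoc normal model with a flat prior. All numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same invented data for both calculations.
drug = np.array([8.1, 12.4, 9.7, 15.2, 11.0, 10.3, 13.8, 9.9])
plac = np.array([6.2, 7.9, 5.4, 9.1, 8.3, 6.7, 7.5, 10.2])

# Wee p: chance of a mean difference at least as big as observed, supposing
# there is really no difference -- here via a permutation-test stand-in.
obs = drug.mean() - plac.mean()
pooled = np.concatenate([drug, plac])
perm_diffs = []
for _ in range(20_000):
    s = rng.permutation(pooled)
    perm_diffs.append(s[:8].mean() - s[8:].mean())
wee_p = float((np.abs(perm_diffs) >= obs).mean())

# q: Pr(a NEW drug patient outlives a NEW placebo patient by 10+ years | model, data),
# under an ad hoc normal model with a flat prior (predictive is Student-t).
def pred(data, size):
    n = len(data)
    return data.mean() + data.std(ddof=1) * np.sqrt(1 + 1 / n) * rng.standard_t(n - 1, size)

q = float((pred(drug, 100_000) - pred(plac, 100_000) >= 10).mean())

print(f"wee p = {wee_p:.3f}  (about a statistic under imagined repetitions)")
print(f"q     = {q:.3f}  (about actual new patients' life lengths)")
```

The two numbers condition on different things and answer different questions; they are not two estimates of one quantity.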

Damnit. No matter how many times I ask that question, I always get the same answer!

Brilliant. Like any great teacher, you do your best work when you’re answering a question.