The Difference In Means & Why P-Values Should Not Be Used — Excerpt From The Lake Michigan Dialogues

“Say, Briggs. Since you’re Statistician to the Stars!, explain to me how I can tell if two means are different in a simple way that even I can understand.”

You got two groups of some measurable observable? And measures from items in each group?


Cinch. Calculate the mean for both groups. Got them?


Are they the same?


Then you know they are different.

“No. I mean I want to know if they are really different.”

Are they the same?

“They are not.”

Then they are really different.

“Yeah, okay, ha ha. But how can I tell if they are truly not the same? Isn’t there some kind of test?”

Sure. Look at them.

“I don’t understand this at all.”

I’m not sure how I could make it any simpler. If they’re the same, they’re the same; if they’re different, they’re different. End of story.

“Wait. This isn’t what other statisticians tell me.”

Ain’t it? I’ll be dogged. What do these other, classical statisticians tell you?

“Look. I want to know if the populations are different.”

Aha. You don’t want to know if these means are different. You want to know if the means in samples you haven’t yet collected will also be different.

“Yes. I think.”

And you want to know if they are different now because something caused them to be different?


I got you now. Yes, I understand. I follow you. No, sorry. Those other statisticians were right. They can’t tell you these things, or don’t. Best they can do is to calculate some bizarre number that tells you whether what you have already observed is different if you assume they are not different.

“Wait, what? They’re different if we assume they are the same?”

That’s it.

“Oh, wait. You mean the p-value.”

Yes, utterly useless.


Doesn’t answer any question anybody wants answered.

“They say it does.”

And politicians say they want what is best for you. Those statisticians are just as wrong. The p-value can’t tell you if the differences you saw have different causes. And it can’t tell you the chance new samples might be different. It also can’t tell you the chance they will be the same. It assumes they are the same.

“Then why do people calculate them?”

Why do people do anything? Magic, habit, custom, appeal to authority, ignorance of alternatives. Getting a wee p-value is like winning a science lottery. It is always a cause for celebration, but nobody knows why. It is silent on Reality. The p-value has no bearing on the two questions you want to know. At all.

“There has to be more to them than that. There wouldn’t be so many smart people using them otherwise. And no smart ass remarks about smart people, please.”

Well, think of it like this. If you have a fixed unchangeable unalterable determined set population, where every object in each group has a measurable number that will not change no matter what, then there are actual means for each group. Got that?


Okay, then if you take a sample, as you did, you can make a guess of what the remaining observations would be; and also what the means would be. Here’s an example. You know there are 10 in each group (this is a fixed unchangeable number: it will and must be 10 forevermore) and you sample 9 of each. The two means you calculate in your sample will likely be close to the actual means of all 10 each. Yes?

“Yes, I can see that.”

You’re just one number away from knowing the true actual means, right?

“That’s so.”

All right. Most of the time, in real numbers and real situations the resulting means of all the population won’t be the same. Take ages of your 10 closest male and female relatives, for example. Yes?

“Not exactly the same. No. But they may be close.”

We’re not talking “close”. We’re talking the same or different. Close wins no prizes.

“I’m not following you.”

Is it that hard? You have two fixed populations that can be measured, and their actual means, experience shows, likely won’t be the same. Not precisely the same. That’s easy enough, isn’t it?

“I guess so.”

All right. Now calculate the p-value on the 9 you observed in each group. That, experience also shows, will produce a non-wee p-value.

“Non wee?”

One bigger than the magic number. Do I have to tell you what this magic number is?

“No, I guess not.”

Okay. That means the p-value, since it assumes the means are the same, forces you to say they are the same, even when we see they are different.
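The 10-per-group setup can be made concrete with a small sketch. The ages below are made up for illustration, and the test is an exact permutation test on the difference in means, which is one common choice and not one the dialogue names. Only the first 9 of each group count as observed.

```python
from itertools import combinations

# Hypothetical ages: 10 fixed members per group; only the first 9 are sampled.
group_a = [34, 41, 29, 55, 62, 47, 38, 51, 44, 36]
group_b = [33, 45, 30, 52, 60, 49, 40, 50, 43, 39]
sample_a, sample_b = group_a[:9], group_b[:9]


def exact_permutation_p(x, y):
    """Two-sided exact permutation p-value for a difference in means.

    Assumes equal group sizes, so comparing means is equivalent to comparing
    integer sums, which avoids floating point ties.
    """
    assert len(x) == len(y)
    pooled = x + y
    total = sum(pooled)
    observed = abs(2 * sum(x) - total)  # proportional to |mean(x) - mean(y)|
    hits = count = 0
    # Enumerate every way of relabelling the pooled values into two groups.
    for idx in combinations(range(len(pooled)), len(x)):
        relabelled = sum(pooled[i] for i in idx)
        hits += abs(2 * relabelled - total) >= observed
        count += 1
    return hits / count


pop_mean_a = sum(group_a) / len(group_a)
pop_mean_b = sum(group_b) / len(group_b)
p_value = exact_permutation_p(sample_a, sample_b)

print("actual means of all 10:", pop_mean_a, pop_mean_b)
print("p-value from the 9 observed in each group:", p_value)
```

With these invented numbers the actual means of the two groups are not equal, yet the p-value computed from the samples is far above the usual threshold.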

“Yeah, but sometimes means in these small samples can be equal.”

Make those populations smaller still. And look, even better, we can specify them in advance. The population is 2 in each group, and there are no more. Ever. You observe only 1 in each. You observe the number 5 in group A and 10 in group B. The other numbers, which are not yet observed, are 5 again in group A and 9 in group B. But the p-value doesn’t know or see these. The p-value, if you can even calculate it, and you might not be able to, will certainly be non-wee. We know the means are different. Yet the p-value insists they are the same. It is absurd.
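A minimal sketch of the two-per-group example, using only the numbers given above. The variable names are illustrative. It also shows why the p-value might not be calculable: a standard two-sample t statistic needs a sample variance, and that is undefined for a single observation.

```python
from statistics import StatisticsError, mean, variance

population_a = [5, 5]    # the full, fixed group A
population_b = [10, 9]   # the full, fixed group B
observed_a = population_a[:1]  # only one item seen in each group
observed_b = population_b[:1]

# The actual means of the complete populations.
print("actual means:", mean(population_a), mean(population_b))

# A t-test would need the spread within each sample, which does not
# exist when the sample holds a single number.
try:
    variance(observed_a)
    spread_available = True
except StatisticsError:
    spread_available = False

print("sample variance available:", spread_available)
```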

“Yeah, okay, maybe. But your examples are tiny. What about large examples?”

It makes no difference. That’s the point. All samples and all populations in real life are in actuality finite. They are fixed. The same criticism thus applies. It’s worse, even, because as you increase the population or sample the chance of the means being equal with real numbers on real measurable things decreases fast. You see that, don’t you?

“Maybe. They still might be the same. And don’t p-value people say something different? I don’t think they’d agree with you.”

You’re right, they wouldn’t. Because they believe, and must believe, all populations are infinite, at least potentially. Then, being infinite, they are imbued with “real” or “true” means, which can never be known, but only “estimated.” The p-value is making statements about these forever unobservable so-called true means.

“Wait. Now I’m remembering. That sounds more like what I was taught.”

It’s even better, because these classical statisticians have the power of the gods. They create one of these “true” means every time they imagine a new population or use a p-value, which must have one of these “true” means to justify its use.

“They don’t say that.”

They don’t, but that’s what the theory they claim to believe insists upon. That theory is frequency and it must have infinite sequences.

“I really don’t follow that at all.”

Doesn’t matter. Ignore it if you don’t get it. The point remains the p-value says nothing about the sample you have in hand. It says nothing about the chance future samples will be the same or different. And it says zippety-do-dah about what caused any numbers.

“So what do I do?”

Investigate the cause or calculate the chance new samples are different.

“That’s it?”

That’s it.

“So I’ll ask again: what do I do?”

About what?

“About telling if the two means are different.”

I must be doing a poor job. I have already told you. Are they the same?


Then they are different. Look. It’s not complicated. That’s the answer. There is no other.

“Oh, right. No, I mean, how do I know they will be different in new observations.”

You could reason that since they’re different now, they’ll likely be different again. And be done with it. No need to quantify the uncertainty.

“But I want to have a solid number behind this. I want to do science.”

All right. Are you sure you only want to know if the means will be different? Or are you only asking about means because everybody else does?

“Let’s start with saying I really want to know about the means.”

It’s your party. It’s easy any which way. Propose some probability model for the two groups, condition it on the observations you already made, then calculate the probability new observations will be different.
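The recipe above (propose a model, condition on the observations, compute a probability about new observations) can be sketched by simulation. Everything specific here is an assumption made for illustration: a normal model with a flat prior for each group, the invented data, and the helper names. Under a continuous model new means are never exactly equal, so the sketch reports the probability that the mean of m new observations in group B exceeds the mean of m new observations in group A.

```python
import random
from statistics import mean, variance


def draw_new_mean(data, m, rng):
    """Draw the mean of m future observations, conditioned on `data`.

    ASSUMPTION: a normal model with a flat (noninformative) prior.
    """
    n = len(data)
    ybar, s2 = mean(data), variance(data)
    # sigma^2 | data ~ (n-1) s^2 / chi-square(n-1);
    # chi-square(k) is a Gamma with shape k/2 and scale 2.
    sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2, 2)
    mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
    return sum(rng.gauss(mu, sigma2 ** 0.5) for _ in range(m)) / m


def prob_new_b_exceeds_a(a, b, m, n_sims=20000, seed=0):
    """Estimate Pr(mean of m new B > mean of m new A | observed a, b)."""
    rng = random.Random(seed)
    wins = sum(
        draw_new_mean(b, m, rng) > draw_new_mean(a, m, rng)
        for _ in range(n_sims)
    )
    return wins / n_sims


# Invented measurements for the two groups.
observed_a = [5.1, 4.9, 5.0, 5.2, 4.8]
observed_b = [9.8, 10.1, 10.0, 9.9, 10.2]
p = prob_new_b_exceeds_a(observed_a, observed_b, m=5)
print("Pr(new mean of B > new mean of A):", p)
```

The number that comes out is a statement about observables not yet seen, which is the question being asked. It is only as good as the assumed model.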

“It’s that simple?”

It’s that simple.

“And the p-value doesn’t do that?”

Nope, no way. The p-value has nothing to do with anything, except for the belief the two means you already measured are the same, even if they’re different. Like we just saw, it’s so bizarre that nobody ever remembers what it means.

“Well, what about the Bayesian posterior? Isn’t that the same idea?”

Nope. Nuh-uh. The posterior says something about model innards. There may be technical, mechanical reasons to look at these, but they are of no real interest to man or beast. The posteriors don’t answer your questions either. They aren’t as nutty as p-values, but they are just as misleading. Don’t play with them.

“Let me get this straight. All I have to do to quantify the probability the means will be different in new observations is to use a probability model to just, what, give me that probability?”

Amazing, ain’t it? And not only that. You don’t have to do just means. You can do any function of the data—don’t forget the mean is just one function of an infinite number of them you can apply to observed data. Everything works in this universal probability scheme. Data can differ in more ways than the means. Like maximums, minimums, chance over or under this or that value. Whatever you want. Don’t look at means unless it’s means you really want.
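The same idea works for any function of the data, not only the mean. The sketch below is an illustration under a deliberately crude assumption: new observations are drawn with replacement from the values already seen. That resampling model is a stand-in chosen for brevity, not a recommendation from the dialogue, and the function name and data are hypothetical.

```python
import random


def prob_stat_b_exceeds_a(a, b, stat, m, n_sims=10000, seed=0):
    """Estimate Pr(stat(new B) > stat(new A)) for any function `stat`.

    ASSUMPTION: future observations resemble past ones, modelled here by
    drawing m values with replacement from each observed group.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        new_a = rng.choices(a, k=m)
        new_b = rng.choices(b, k=m)
        wins += stat(new_b) > stat(new_a)
    return wins / n_sims


def share_over_8(xs):
    """Fraction of values above 8, one example of 'chance over this value'."""
    return sum(x > 8 for x in xs) / len(xs)


# Invented observations for two groups.
observed_a = [1, 2, 3]
observed_b = [10, 11, 12]

for name, stat in [("max", max), ("min", min), ("share over 8", share_over_8)]:
    print(name, prob_stat_b_exceeds_a(observed_a, observed_b, stat, m=3))
```

Swapping in a different `stat` changes the question without changing the machinery.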

“I want means. How do I do this?”

You mean, mechanically?


Oh, it’s easy enough. Lots of software out there. I can give you some tips once you show me what the data looks like.

“Isn’t there a true probability model I can use?”

Maybe. Depends. If you knew the causes of the measures, you have your true probability model. If you don’t, you still may be able to deduce one based on other considerations. Stuff you know about the observables, things like that. Or you can do like everybody else: Don’t think about it and use a standard model.

“That works?”

Maybe. Only way to find out is to try. If you can’t deduce a model, though, don’t get too excited about the results. You’re flying blind. Like everybody else.

“Okay, fine. I’ll wait until you show me which software. What about knowing the cause?”

We’ll leave that for another day.



  1. Jan Van Betsuni

    The Lake Michigan Dialogues || DAY TWO || Heteroscedasticity For Anglers

  2. Pk

    . . . and all the means are above average.

  3. john b()

    This is from a long time ago far far away…

    But I thought one used something called a t-test?

    (Or does pee, wee or otherwise come into it?)

    I believe there was a proposed confidence? 1 – confidence is pee? or 2 – confidence is s*h*i*t?

  4. Incitadus

    Most little kids want to grow up to be firemen, policemen, or nurses; statisticians, on the other hand, wanted to be stage magicians. The oedipal overtones of ‘watch me pull a rabbit outta my hat’ are not always obvious to the lay observer.

  5. Rudolph Harrier

    P-values continue to get used because it is very easy to think of them improperly unless you are careful, and in practice people will be sloppy and use a worse definition than they could if they were careful, especially if the worse definition allows for publishable results. The levels go something like this:

    -The worst level, but the one where the majority of people are at, is that it doesn’t matter what they mean. They are magic calculations that you can use to justify your conclusion somehow. Even people with a very fine understanding of p-values can end up in this level in practice if they are desperate to publish something. People who are at this level but aren’t fudging things usually have a mystical usage of p-values. For example they might refuse to use the word “significant” in any situation where the threshold of less than .05 is not reached, regardless of whether the context is statistical or not.

    -The next level is the belief that they measure “the chance that the results occurred due to chance.” This is completely wrong, but at least it’s a recognition that they measure a probability. Note that even at this level p-values can still be attacked: the usual threshold for significance is 5% which would mean under this mistaken understanding (but not in reality) that around 5% of “significant” results are just due to chance, which over millions of papers would be tens of thousands of results. But even here people will only object until they need to publish a result where p = .048.

    -The next level of understanding is saying that with a p-value calculation we assume that the data was generated by a random process and then calculate the probability that the data would be “at least as extreme” as the observations. That is, the p-value will be treated as a criterion for evaluating the statement “If the null hypothesis is true, then the observed data could happen.” A sign that someone is at this level is when he insists on using the terminology “we fail to reject the null hypothesis” rather than “we conclude the null hypothesis is true” and he can explain why (many people at the worst level of understanding will also use that language, but only as a mystical incantation that ensures that the results are valid). The idea is that if we assume P is true and see that Q is false, then this must mean that P is false (reductio ad absurdum), but there is no way to assume P is true and conclude that P is true. Of course in practice people will still treat the null hypothesis as true due to a p-value test whenever it becomes convenient.

    -The next level of understanding is to realize the bigger weaknesses of the p-value test. First, that the random distribution and test statistic are chosen before the p-value is calculated. Changing either would change the value of the p-value. Thus the idea that it can allow us to reject the idea that the “results occurred due to chance” is ridiculous. Our “null hypothesis” is that the results came from a specific random distribution so at best we could conclude that they didn’t come from that distribution, but maybe they came from one of the infinitely many other distributions. But since we can manipulate the test statistic too we can arrange things to “reject” or “fail to reject” any hypothesis we want. And things are worse still: we can’t even really reject things. When the p-value is small it means that the probability of Q happening given that P is true is small, but this doesn’t mean that P is false or unlikely. Examples abound. The simplest is this: if we flipped a coin 100 times and asked the probability of getting that exact sequence of flips (you can even set up a test statistic for this) then the chance would be incredibly small. But that hardly means that it was impossible to flip the coin or even that it was unlikely to flip the coin; after all we knew that we flipped it. But even if you want to make the test statistic less arbitrary you can find plenty of situations where Q given P is unlikely, Q happened, but P by itself isn’t really “unlikely.” People who really know their stuff are at this level, but they never stay here. They either get to the next level, or they just wave their hands and say things like “while arbitrary test statistics can lead to whatever results you want, and while the p-value doesn’t measure the probability we are interested in, if you use the common tests then them being low will suggest that the null hypothesis is unlikely, even if the exact numeric level is different.”

    -The next and final level is to realize that the question of “what is the probability that these observations occurred due to chance” is nonsensical without additional context, and that in most real world situations we can be sure that it wasn’t due to “chance” in the sense of being caused by a random variable. There were real causes for those observations, but we want to know if they were caused by the specific thing we are looking for, or at least if they were “unusual.” But probability can’t detect cause in this way even on a philosophical level, ignoring the math. So at best we can use statistics to narrow down where to look for causes, but not to find them. Once you start viewing things in this way it’s impossible to go back, but few people ever get close to this level.

  6. brad tittle

    If I were wise, I wouldn’t say anything, so I admit at the front that I am unwise. One should not listen to me.

    But I feel a need to try and say the same thing differently in the hopes that saying it differently might help someone who wants to hear things differently.

    All studies are worthless unless the study tells you that there isn’t anything there. A study that links cutting your finger nails to hang nails is worthless. The RR might be significant, but there is no RR big enough to make the link really mean anything. If a study discovers that there is no link between cutting your finger nails and getting hang nails, it might be worth keeping.

    But then covid enters the equation. And then injections that might save you from the dreaded doom show up. People start dying soon after getting the defense against the doom and suddenly it sounds like I am saying that a big RR is worth looking at.

    But I am not looking at the RR. I am looking at the BODIES SHOWING UP IN THE MORGUE.

    My local mortuary is at twice the usual number of bodies. They are not stamped COVID. All covid deaths are stamped COVID.

    I am caught in a conundrum. I think relative risk and p-values are worthless while at the same time saying “these bodies stacking up in the mortuaries are not usual!”

    Save me…

    But my father and I are still talking even though we are on opposite ends of the injection debate. I think I managed to tell him today “some tell me that I need to do x to survive. Some tell me I need to not do x to survive. I do not want you to die. My data points at the opposite of your belief.”

  7. Hypothesis testing and p-values worked just fine in studying the Dahn yoga claims Briggs did:

    “One kid did 7 trials, the other two did 6 before the experiment was stopped. They were scheduled to do 12 trials each. They got 4 hits during these 19 trials, right what chance would predict: kid one got 1, kid two got 1, kid three got 2.

    Recall that before the trial started, KIBS staff members were confident each kid would get at least 10 out of 12 hits.”


  8. Joy

    Brad, yes, perhaps try a second opinion. Who’s not censored by the alt right conspiracy thought police:

    This country now has 91% first vaccinated and rising. 67% have had three vaccines. Over 75’s are having a fourth. Autumn, over 50’s are going to be offered a seasonal vaccine along with flu.

    “something said another way”:
    Steve Macintyre has had all his jabs up to date. See his Tweet from December 3rd if it’s still there.
    I expect his work in other areas of science meets with people’s approval around here.
    I’ve been listening closely to our commissioned experts, who are in command of all the facts in the case of coronavirus. Nothing they’ve said so far has been shown to be untrue. They have neither exaggerated nor overstated risk. They were vindicated. All the evidence is available on public record.
