What Are P-Values?

Contributor William Raynor sets us a task:

Hi Matt, …I’d still like to see if you can write a column about what an empirical p-value really is without mentioning the word probability.

Since Raynor is a contributor, I consider this a consulting gig. It’s a difficult job because the ‘p’ in p-value stands for the-word-that-must-not-be-spoken. But let’s try. I won’t cheat, either, and use euphemisms like I just did.

We’re going to do this in two parts. The first is philosophical, and pertains to p-values as they are most often used in science and research. This should appeal to everybody. The second part takes on the “empirical” part of the request, where sometimes p-values seem to make more sense.

The p-value says something about what did not happen and, if the p-value is wee, asks you to believe that what did happen was caused the way you believe it was caused.

Way it works is like this. Guy thinks that the new drug profitol causes an improvement in health. Or he believes carbon dioxide caused temperature to increase. Or he thinks men and women respond differently to a question, the differences being caused by the sex of the people answering.

In any and all of these cases—you’ve seen a million new-research-has-shown headlines—the p-value is used to prove a cause has been found. Which cause? The cause the researcher thought was a cause, and not something else.

That means if the p-value is wee, he’ll say the drug caused the improvement, that carbon dioxide caused the temperature to increase, that sex differences caused the answers on a questionnaire.

Before you get too excited, understand that everybody who uses a p-value knows that p-values can’t prove cause. This is what the math says, and the math is right. P-values have nothing to say about cause. But everybody uses p-values to prove cause anyway. They can’t help themselves. P-values are magic.

This knowledge that a p-value can’t find cause has nothing on the overwhelming desire that it should, though. And it’s only natural in our society that desire wins over truth.

All right. So what is a p-value? It is a mathematical construction that takes for its premise the belief that the cause envisioned by the researcher does not exist. That is, it is assumed in the math that the cause the researcher wants to believe is false or a fiction.

Thus, it wasn’t profitol that caused the improvements, but something else. It wasn’t carbon dioxide that caused the temperature to increase, but something else. It wasn’t sex that caused the differences in the questionnaire, but something else.

The something else is nearly always “chance”. What is chance? Nothing. Chance doesn’t exist, as we have seen many times. Chance is not like gravity or electricity. You can’t use chance as you could gravity or electricity to bring about an effect. It can’t be measured. It can’t cause anything. It is a state of mind, relative to a set of beliefs.

Even if you don’t follow (or believe) that, never mind. It’s not important to understand p-values.

What is crucial is that the math behind the p-value assumes the cause the researcher was thinking of is nonexistent, non-operative, not around or weak or anemic to the point of vanishing.

This odd belief in the lack of the desired cause is called the “null hypothesis.” You’ve heard it. “The null is that profitol didn’t cause the improvement” and so on. We are supposing the improvement was observed. We are also supposing that temperature increased—but increase is tricky because of definitions of “trend”. And we are supposing there were observed differences in the scores by sex.

The math needs this null. This null premise is fed into the math, along with the data, and if the p-value is less than the magic number, a value so ubiquitous I don’t need to mention it, then everybody believes the null has been disproved.

I’ll be blunter. People believe the null is false when the p is wee, unless they really, really want their cause to be true, in which case the p-value is ignored. Here it’s a good thing our patron has banned a certain word, because it’s utterly inapplicable here. The null isn’t maybe or perhaps false if the p is wee; it is decided it is false, or it’s decided to be false.

Well, if the null is false, what then? That the improvement, temperature increase and score difference were observed means they must have been caused by something. And the only things on the researcher’s mind are the causes profitol, carbon dioxide, and sex.

If you accept the premise “Either profitol was the cause, or it was something else”, and you believe the p-value has proved the something else is false, you must believe it was profitol that was the cause.

But this is absurd. This is the p-value fallacy. The p-value calculation says nothing about cause. It assumes the something else is the cause. It is therefore impossible to move to believing profitol was the cause on the strength of a calculation that assumes it wasn’t, regardless of how wee the p-value is.

Well, what if the p is not wee and the observed changes were ambiguous? Does that prove that profitol, carbon dioxide, and sex were not causes?

No. Obviously no. Why? Because the p-value assumes that profitol etc. were not causes! That’s not proof, that’s assumption.

The thing with a non-wee p-value is that profitol could have sometimes caused, or aided in the cause of, an improvement for some patients. Or carbon dioxide could have caused, in conjunction with other things, part of the temperature change. And sex could have caused some differences.

There! The forbidden word was never used but I hope you now have some idea of what a p-value is. To know it truly requires the forbidden word, and some math. You can read a series of proofs—iron-clad unbustable rigorous proofs—why every use of a p-value is a mistake or a fallacy, in the papers linked here. Next week we try and remove the p-value fallacy in those times where the causes in the “null” are known.



  1. Michael Dowd

    Tentative conclusion: p-values are fraudulent? Statisticians are criminals?

  2. I like the effort. I like Dowd’s conclusion. I will not send the police to your house because of Dowd’s conclusion because there is a p-value hidden in it.

    My conclusion: The only studies that actually tell me anything are studies that show that an effect is not there.

    Studies that show something is there are studies I can safely ignore. Maybe there are some studies I shouldn’t ignore that fall into this category, but statistically that number is so small, I really don’t need to worry about it.

    How many of the fully qualified statisticians who haunt this site feel the tug of a wee pee telling them they shouldn’t do x when they read a study? I am not fully qualified. I just came up with those rules of thumb to keep me from chasing after every little improvement that might show its head in the headlines of the news.

    It does not win me many friends. It does not help me influence people. It does point at the yokels who mutter words like “EFFING SCIENTISTS, BUNCH OF MORONS” and suggest that I might be one.

  3. Ye Olde Statistician

    The real problem is not the p-value, but the sampling. It is often the case that there are more differences between the treatment group and the control group than the treatment itself. For example, a study comparing women who worked in front of CRTs all day vs. those who did not found a higher rate of cancer in the former. In line with the madness identified by Dr. Briggs, the studiers [or the media] then panicked over the cancer-causing cathode rays. However, the pink-collar workers in the former group [secretaries and clerks] also smoked more, ate fattier foods, exercised less than the white-collar women [lawyers, doctors, et al.] in the latter group. The statistics only said there was a real difference between them. It did not say what the cause was.

    That is why in quality control work, a significant fluctuation only indicates a need to search for a cause.
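The CRT example above can be sketched as a small simulation. This is only an illustration under made-up assumptions: the smoking rates, the illness rates, and the names `simulate_group`, `screen_workers` and `others` are all hypothetical and do not come from the study mentioned. Illness here depends on smoking alone, yet the two groups still end up with clearly different rates.

```python
import random

def simulate_group(n, smoking_rate, rng):
    """Return illness indicators for n people; illness depends on smoking only."""
    ill = []
    for _ in range(n):
        smokes = rng.random() < smoking_rate
        # Hypothetical illness rates: 30% for smokers, 10% for non-smokers.
        # Screen exposure plays no part anywhere in this function.
        ill.append(rng.random() < (0.30 if smokes else 0.10))
    return ill

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Assumed smoking rates: 60% among the screen workers, 20% among the others.
screen_workers = simulate_group(5000, smoking_rate=0.6, rng=rng)
others = simulate_group(5000, smoking_rate=0.2, rng=rng)

rate_screen = sum(screen_workers) / len(screen_workers)
rate_others = sum(others) / len(others)

print(f"illness rate, screen workers: {rate_screen:.3f}")
print(f"illness rate, others:         {rate_others:.3f}")
```

Any test comparing the two rates would flag the gap as real, and it is real. What the comparison cannot say is that the screens had anything to do with it, since by construction they did not.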

  4. “Hi Matt, …I’d still like to see if you can write a column about what an empirical p-value really is without mentioning the word probability.”

    I’d say the p-value is: a scaling of the distance your test statistic is from what you’d expect under your model.

    I don’t think about “cause” or “proof” (nor “magic”) at all, but only in terms of evidence for an effect, and as always, based off of stated assumptions and model(s).

    “…but increase is tricky because of definitions of “trend”.”

    This is not tricky at all. You state your model for trend, whether nonparametric, or using X13-ARIMA methods.

    Also, it helps to not think of data --> analysis --> p-value, but experiment --> data --> analysis --> p-value,
    i.e., you get more sense of “cause” when you have well-designed experiments.


  5. C-Marie

    Am starting to understand this. Oddly, perhaps, I very much enjoy reading about the p-value. Guess I really like exercising my brain.
    God bless, C-Marie

  6. PK

    How do you distinguish between an effect and a cause? For example, you might get a very good correlation and low P value between X and Y. Presumably there is an effect. Did X cause Y? Did X affect Y? I’m just looking to clarify.


  7. Yonason

    Nutrition studies are some of the worst offenders. Quit red meat and live longer (if you restrict carbohydrates and stop smoking, instead of quitting meat.)

  8. Bill_R

    some preliminaries:

    1. In my experience, the data represent things or events, and are usually not “randomly sampled”.

    2. My statistic is some meaningful aggregation/reduction of those events. (Means on ratings are convenient but silly.)

    3. The “null” (as in a starting region) represents the status quo or indifference region, how I’d act absent any data. (e.g. “use the current system” or “sell the current drug formulation” or “the tau correlation between IQ index and yearly earned income after controlling for age is less than 0.4”)

    4. I construct a reference set that tells me what could happen under that assumption, typically at a boundary value (the worst case), a.k.a “the sampling distribution.” (by permutation and combinations of observations)

    @justin & I agree that the p-value is just a simple scaling of the percentile rank of the sample statistic compared to that reference set. Likewise that index is only a piece of the evidence, not proof. As @YOS points out, a “significant” result simply says “That’s weird, I wonder what caused it” or “Can I do this again?”

    @pk: I differentiate “cause” and “effect” simply because I’m the one that does the manipulation/treatment. (“I am the one who knocks”).

    @YOS: I eliminate/reduce other possible causes by design (multiple blocks/aliquots as internal replicates) and by repetition under different circumstances. (internal and external control).
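The reference-set idea in this comment (shuffle the observations under the "null", then see where the observed statistic ranks) can be sketched roughly as follows. This is a minimal sketch, not the commenter's own procedure: the data are invented, the function name `permutation_p_value` is hypothetical, and the difference in means is used as the statistic only because it is short to write. Any other aggregation could be swapped in.

```python
import random

def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Rank the observed difference in means within a shuffled-label reference set."""
    rng = random.Random(seed)
    n_t, n_c = len(treated), len(control)
    observed = sum(treated) / n_t - sum(control) / n_c

    pooled = list(treated) + list(control)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        # Under the null the group labels are arbitrary, so reassign them at random.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_t]) / n_t - sum(pooled[n_t:]) / n_c
        # Small tolerance so floating-point rounding does not break exact ties.
        if diff >= observed - 1e-12:
            at_least_as_large += 1

    # The add-one correction counts the observed arrangement as part of the reference set.
    return observed, (at_least_as_large + 1) / (n_shuffles + 1)

# Invented measurements for illustration only.
treated = [12.1, 14.3, 11.8, 15.2, 13.9, 14.8]
control = [11.2, 12.0, 10.9, 12.4, 11.7, 12.9]

observed, p = permutation_p_value(treated, control)
print(f"observed difference in means: {observed:.3f}")
print(f"share of shuffles at least as large: {p:.4f}")
```

The returned share is one-sided: it only reports how far out in the upper tail of the shuffled results the observed difference sits. It says nothing about why the difference is there.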

  9. PK

    “How do you distinguish between an effect and a cause? For example, you might get a very good correlation and low P value between X and Y. Presumably there is an effect. Did X cause Y? Did X affect Y? I’m just looking to clarify.”

    If I had good correlation and low p-value (no matter how low) between X and Y, I’d want to make sure the experimental design was good, and moreover that it was repeated several times, before claiming anything.

    I don’t really think in terms of X caused Y or Y caused X when I do this. I would think one would have to take background knowledge into consideration, like X=gender, Y=grades in school, and we understand Y can’t cause X. If X=height and Y=weight, then who knows. You just try to put everything in your model you understand to be pertinent and test accordingly. If, as Briggs rightly harps on, there could be other possible causes you did not consider (which no statistician denies could be the case), this is not a show-stopper. If you can think up other possible causes then you can design your next experiment to test for these.

    There’s also work by Rubin, Pearl, and others, on “causality”, which I don’t have much knowledge about.


  10. Bill_R

    “@justin & I agree that the p-value is just a simple scaling of the percentile rank of the sample statistic compared to that reference set. Likewise that index is only a piece of the evidence, not proof. As @YOS points out, a “significant” result simply says “That’s weird, I wonder what caused it” or “Can I do this again?””

    Exactly. And a major point is to actually do it again. A low p-value, no matter how low, from a single experiment, no matter how well-designed, is only an “indication”. There needs to be repetition of well-designed experiments and small p-values before claiming something like an effect is established. (Fisher mentioned this ~80 years ago).
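A rough simulation can show why a single small p-value is only an "indication". The sketch below runs many pairs of experiments in which both groups come from the same source, so there is nothing to find. The two-sided z-test with a known spread of 1, the 0.05 cutoff, the group size of 30, and the name `null_experiment_p` are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean

def null_experiment_p(n, rng):
    """Two-sided z-test p-value for two groups drawn from the same N(0, 1) source."""
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    # With a known spread of 1, the difference in means has standard error sqrt(2 / n).
    z = (fmean(a) - fmean(b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(42)  # fixed seed so the sketch is repeatable
trials = 5000
single_hits = 0   # first experiment alone falls below the cutoff
repeat_hits = 0   # experiment and its repetition both fall below the cutoff

for _ in range(trials):
    first = null_experiment_p(30, rng) < 0.05
    second = null_experiment_p(30, rng) < 0.05
    single_hits += first
    repeat_hits += first and second

single_rate = single_hits / trials
repeat_rate = repeat_hits / trials
print(f"one experiment below cutoff:       {single_rate:.4f}")
print(f"experiment and repeat both below:  {repeat_rate:.4f}")
```

With nothing going on at all, a lone experiment slips under the cutoff fairly regularly, while an experiment and its independent repetition both doing so is far rarer. That is the case for doing it again before claiming anything.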

