American Statistical Association Statement On Statistical Significance & P-values, With My Comments

It’s magic! You must have a wee p-value!

Editorial Note: I had this originally scheduled to go Monday, but due to Stream’s Creatorgate piece showing up yesterday, I delayed this by a day.

I think I only have the email announcing the statement, which is incomplete. I got the thing before the embargo date, but you are seeing it after. Hat tip to Steve Malloy for alerting us to it. In any case, here we go, interspersed with my comments.

For the final cut, and the much-awaited death of p-values, see my upcoming book Uncertainty. (Last week the editor said it went into production, which means copy editing first, etc.)

“The p-value was never intended to be a substitute for scientific reasoning,” said Ron Wasserstein, the ASA’s executive director. “Well-reasoned statistical arguments contain much more than the value of a single number and whether that number exceeds an arbitrary threshold. The ASA statement is intended to steer research into a ‘post p<0.05 era.'”

Amen, brother Ron, amen. But wee p-values are in most places taken as magic. I mean that word in its literal sense. Studies with wee p-values are blessed, those without cursed. Studies which produce wee p-values are “significant”. And what is “significance”? Wee p-values. Hello, Mr Circular.

“Over time it appears the p-value has become a gatekeeper for whether work is publishable, at least in some fields,” said Jessica Utts, ASA president. “This apparent editorial bias leads to the ‘file-drawer effect,’ in which research with statistically significant outcomes are much more likely to get published, while other work that might well be just as important scientifically is never seen in print. It also leads to practices called by such names as ‘p-hacking’ and ‘data dredging’ that emphasize the search for small p-values over other statistical and scientific reasoning.”

Amen, sister Jessica, amen. Again, p-values are magic. I have seen grown men cry and grown women grunt when their study does not produce a wee p-value.

My comments for this next block are inside each bullet point [in square brackets like this].

The statement’s six principles, many of which address misconceptions and misuse of the p-value, are the following:

  1. P-values can indicate how incompatible the data are with a specified statistical model. [No, they can’t. They can only say what the probability of some statistic taking some value is conditional on accepting the model and on (usually) setting certain parameters of that model to fixed values. See the sketch after this list.]
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. [The first half of the sentence is true, the second half is wrong. Nothing in the universe is “produced by random chance.” “Random chance” isn’t actual and cannot actualize potentials, i.e. it can’t be a cause.]
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. [Not only that, they should not be based on p-values at all. Unless you’re a urologist, ignore all p-values.]
  4. Proper inference requires full reporting and transparency. [Amen. Which is why the Third Way I advocate is the only way to report uncertainty. I have the full theory in my book. I have an abstract here.]
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. [Exactly so. And what it does measure is of no interest to man or beast.]
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. [Indeed, by itself, it provides no measure of evidence regarding a model. It assumes not only that a model is true, but that some parameters of that model take fixed values.]
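
Here is a minimal sketch, in Python with made-up data (mine, not the ASA’s), of the only thing a p-value is: a probability computed after a model has been accepted and its parameters fixed.

```python
# Minimal sketch (made-up data): the p-value is a probability computed
# only AFTER accepting a model (normality, equal variances, independence)
# and fixing a parameter (difference in means = 0).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=10.7, scale=2.0, size=30)

t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# What the number means: Pr(|T| >= |t| | model true, means equal).
# What it does not mean: the probability the hypothesis is true, the
# probability the data arose "by chance alone", or the size of the effect.
```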

More:

In light of misuses of and misconceptions concerning p-values, the statement notes that statisticians often supplement or even replace p-values with other approaches. These include methods “that emphasize estimation over testing such as confidence, credibility, or prediction intervals; Bayesian methods; alternative measures of evidence such as likelihood ratios or Bayes factors; and other approaches such as decision-theoretic modeling and false discovery rates.”

Likelihood ratios and Bayes factors make some of the same mistakes p-values do, and also should not be used. All these methods are parameter-centric and not observable-centric. Thus at a minimum they all produce over-certainty, and at a maximum outright fallacy, usually the fallacy of ascribing cause based on statistical measures. See that Third Way paper linked above, or see the paper “The Crisis Of Evidence: Why Probability And Statistics Cannot Discover Cause”.
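
To see the parameter-centric versus observable-centric difference in miniature, here is an illustrative sketch with made-up numbers (it is not the model from the Third Way paper): the same conjugate normal setup can be asked a question about an unobservable parameter, or about the next observable value.

```python
# Illustration only (made-up numbers, data s.d. assumed known): the same
# conjugate normal model answers a parameter-centric question or an
# observable-centric (predictive) one. Only the latter is about something
# anybody will ever see.
import numpy as np
from scipy import stats

x = np.array([2.9, 3.1, 3.4, 2.7, 3.3, 3.0])   # made-up observations
n, xbar = len(x), x.mean()
sigma = 0.3                                     # assumed known spread
mu0, tau0 = 3.0, 1.0                            # prior on the mean parameter

# Parameter-centric: posterior for the unobservable mean (the home turf
# of Bayes factors, likelihood ratios, and parameter posteriors).
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + n * xbar / sigma**2)

# Observable-centric: predictive distribution of the NEXT observation,
# with the parameter integrated out.
predictive = stats.norm(loc=mu_n, scale=np.sqrt(tau_n2 + sigma**2))
print("Pr(next observation > 3.5 | data, model) =",
      round(1 - predictive.cdf(3.5), 3))
```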

Or see my book! (Maybe May, June?)

23 Comments

  1. JH

    Your third way uses a Bayesian setup, examines the model at an individual data value for a particular variable (there might be an infinite number of possible values, countable or not), and then makes a decision based on the probability value of an arbitrary interval (and the choice of the interval can be subject to misuses worse than the p-value). It becomes messier and messier as the number of explanatory variables increases in the data set. And, it definitely cannot discover cause.

    Why you call it the third way is a mystery, as is why you love your third way so.

  2. AC

    I’ve decided on a common-sense definition (that people like me can easily understand) for what a very low p-value means:

    “A low p-value means that we should be suspicious about the proposed explanation or the assumptions that were made.”

    The proposed explanation is usually the treasured null hypothesis (e.g. “there is no difference between groups A and B”), which is then rejected on the basis of a low p-value, but the assumptions are rarely challenged in a study. Except for very controlled situations, such as testing whether one is flipping a “fair” coin or tossing a “fair” die, it is nearly impossible to address all the assumptions and be certain they aren’t unduly affecting the result.

    If scientists were more honest about all the assumptions that went into their conclusions, it would be easier to not fool themselves about the consequences, and thus not attempt to fool us.

  3. Briggs

    JH,

    It’s linked at the top, but see this video or this paper for proofs showing no probability model can discover cause. You’ll have to define what you mean by “explanatory variables”. Sounds like you mean “cause”.

    I don’t rely on Bayes. See also this article: Bayes isn’t what you think.

    As far as the remaining criticisms, I’m not sure I understand them. Can you explain that “choice of interval” business?

    And I notice you did not trouble to rebut any of the criticisms I made about p-values. I take it you agree with these?

  4. JH

    Mr. Briggs,

    It’s tiresome to rebut some of your comments on p-values. I have done so throughout the years when I see something objectionable. Ron’s comments (I’ve met Ron. A great guy.) are known by statisticians.

    I am not sure why you refer me to your so-called proof of “no probability model can discover cause.” Still, your third way cannot discover cause.

    There are many statistics articles on cause and explanatory power. I quoted a widely discussed one in this blog before. And no, not going to argue or define what an explanatory variable is here.

    In your paper, you calculate the probability that the GPA falls in an interval, and then you make inferences accordingly. The setup employed in the paper is Bayesian – a prior, a likelihood, and a posterior. Period. Not going to play semantic games here.

  5. Briggs

    JH,

    Problem is you’re using terms differently than I am. An instance is “explanatory variable.” What I mean by it is well spelled out here and in many articles. But what you mean by it I don’t know. I’m guessing you’ll use causal language when defining them. I do not.

    I do not advocate Bayes per se, which is why I linked that Bayes article. The Third Way is not Bayes nor frequentist. It is a Third Way. Get it?

    And then I’m not quite sure what you can possibly mean by saying things like this: “I am not sure why you refer me to your so-called proof of ‘no probability model can discover cause.’ Still, your third way cannot discover cause.” Yes, true, the Third Way cannot discover cause. That’s repeating what I said. The Third Way is just a probability model, and probability models cannot discover causes, so etc.

    Your comments about GPA are wrong. The “posterior” in that paper is not the parameter posterior. It is a predictive distribution, a quantifying of uncertainty of an observable, not a parameter. Don’t get fooled by thinking in the classic language. I do not need Bayes to speak of probabilities of observables (again see the linked Bayes article). But Bayes can be used.
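
    A toy sketch of the distinction, not the model from the paper, with made-up numbers and the parameter integrated out by brute-force Monte Carlo:

    ```python
    # Toy sketch only (not the model from the paper): posterior draws of a
    # parameter are integrated out to give the predictive distribution of
    # a new observable, here a made-up "next GPA".
    import numpy as np

    rng = np.random.default_rng(1)

    # Pretend these are posterior draws of the mean-GPA parameter.
    mu_draws = rng.normal(loc=3.0, scale=0.1, size=50_000)
    sigma = 0.4   # spread of the observable, assumed known for simplicity

    # One simulated future GPA per parameter draw: the parameter is
    # thereby integrated out.
    gpa_new = rng.normal(loc=mu_draws, scale=sigma)

    # A statement about an observable...
    print("Pr(new GPA > 3.5 | data, model) ~",
          round(float((gpa_new > 3.5).mean()), 3))
    # ...versus a statement about an unobservable parameter.
    print("Pr(mu > 3.5 | data, model)      ~",
          round(float((mu_draws > 3.5).mean()), 3))
    ```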

  6. Briggs

    Lee,

    As opposed to, say, you?

  7. Nate

    Related to p-values and over-certainty, I was looking into a book I was recently pointed to:

    “Willful Ignorance: The Mismeasure of Uncertainty”
    http://www.amazon.com/gp/product/0470890444/

    Has anyone here read it and would recommend it?

  8. Briggs

    Lee,

    Although it’s best to avoid appeals to authority, I, like Mayo, have published in the field. Indeed, as I say above, I have a whole book on the subject coming out.

  9. The most important data acquired in data collection fields are the data that have no correlation and wickedly high p values. They are what is NOT.

    Pile up all the what-is-NOTs and what remains is still a big pile of to-be-determined what-is-nots.

    I can never quite know what is.

    I can sort of see better what is from the big giant pile of what is not that I stand on. Unfortunately, what is not, generally resides in file drawers or more likely circular files.

  10. James

    I enjoy the backlash whenever someone suggests that maybe prediction is more important than parameters with arbitrary cutoff limits. I think it indicates how some statisticians and scientists have bought into the idea that they are selling certainty. Telling them that there isn’t that much certainty really rustles the jimmies. Also, telling them that you don’t trust them until they predict out of sample just amps up the rustles.
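
    A toy version of that demand, with made-up data: hold some of it back and judge the model only by what it predicts for data it never saw.

    ```python
    # Toy sketch with made-up data: fit on one chunk, judge only on the
    # predictions for the held-out chunk.
    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.uniform(0, 10, size=200)
    y = 2.0 * x + rng.normal(scale=3.0, size=200)

    train, test = slice(0, 150), slice(150, 200)

    # Fit a straight line on the training portion only.
    slope, intercept = np.polyfit(x[train], y[train], deg=1)

    # Out-of-sample check: how good are the predictions for unseen data?
    y_pred = slope * x[test] + intercept
    rmse = float(np.sqrt(np.mean((y[test] - y_pred) ** 2)))
    print(f"out-of-sample RMSE = {rmse:.2f}")
    ```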

  11. JH

    Your comments about GPA are wrong. The “posterior” in that paper is not the parameter posterior. It is a predictive distribution, a quantifying of uncertainty of an observable, not a parameter.

    (1) Your response has no basis at all. (So S****. ) What are my comments about GPA? You make no justification as to how the interval of interest is chosen, and then make conclusions based on the probabilities. Did I say the probability of GPA falling in some interval (given…) is the posterior probability of the parameter? No. You imagined it. (Again, So S****. Hahaha.)

    (2) Your example clearly is set up using a Bayesian framework, in which a hand-waving prior and an incorrect normal likelihood are used. You do need the posterior distribution of the parameter given data to compute the (posterior) predictive distribution. The (posterior) predictive distribution is a result of the Bayesian framework. Part of Bayesian analysis.

    Well, you can call it the third way. It contains calculations of probabilities (whether you want to call it a predictive distribution or a posterior predictive distribution doesn’t matter) via a Bayesian framework and then makes conclusions based on arbitrary comparisons of probability values. Get it?

    Yes, I am repeating what you said. Let me repeat again, probabilities cannot discover cause. The third way cannot discover cause. No need to refer me to your proof of such a profound conclusion.

  12. JohnK

    What’s fascinating to me is the apparent number of working statisticians who are plainly incapable of reading Matt’s words and comprehending them even a little.

    Yet (for example) years ago (YEARS ago!), Matt pointed us to: Berger JO, Sellke T. Testing a point null hypothesis: the irreconcilability of p-values and evidence. JASA 1987;82:112-122, which proved that a low p-value does NOT necessarily mean what we think it means.

    But much of the framework for classical statistics relies on the belief that if a p-value is very low, then it’s very unlikely that the given assumptions (about the assumed parameters, etc.) are incorrect. Yet this 1987 paper by Berger and Sellke proved that a low p-value might mean that – but also, that it might not. And for me, this was the kicker: there is NO, ZERO mathematics by which the p-value, within itself, can tell you which is which. We flat cannot tell, using whatever mathematics you wish, whether any given p-value actually tells us anything at all.

    And a p-value tells us hardly anything, to begin with. Matt also, YEARS ago, gave us the actual definition of a p-value, the same definition implicitly referred to and used in all classical statistics texts: “The probability of seeing the statistic T(x) as large or larger (in absolute value) than the statistic t(x) I did see given µA = µB, if I repeated the experiment an infinite number of additional times.”

    These were a few of the things that began to open my eyes to the farrago of misconceptions being foisted on credulous, innocent researchers.

    I wondered: Why haven’t we been told of the true definitions, and the implications thereof, of things like p-values and confidence intervals? Why haven’t we been told of the gigantic difference between the value of a parameter and actual predictive uncertainty? Because we’re not smart enough? Because we don’t need to know? Because we shouldn’t trouble ourselves with these mythic niceties that are obviously delusional, according to all the proper authorities?

    Today, what really busted my gourd was the evidence that at least some working statisticians apparently don’t see the jaw-dropping difference between probabilities that are nothing more than values of parameters, and probabilities where all the parameter values have been ‘integrated out’.

    I just don’t get it. The working statistical ‘priests’ won’t tell us the truth, can’t see the truth, scorn the evidence and the argumentation, won’t even do the simple reading and thinking that somebody like me did, and tell Matt that he’s an unprofessional crackpot loon.

    So who’s on our side? Who’s on the side of people who want more than an artfully contrived tissue of mathematical half-truths, unexamined or ignored assumptions, blank appeals to authority, and bullying the innocent?

    Who’s on our side?

  13. Anon

    JohnK—–Amen.

  14. Joy

    My Comment earlier, thanks to my resident grue, was wiped when the page refreshed.

    “grasping activities of daily living”
    There are no other types of grasping activities.
    Let’s be clear there are grasping activities only.

    Articles make a Rubik’s Cube out of the simple and others shamelessly simplify the complex.
    I particularly hate over-complex explanations for the simple. There’s no excuse.

    Mechanical engineering is a repetitive process. Computer aided design masks the repetitive nature of design and people take design for granted.

    Design of a simple mechanism or housing of a mechanism takes place with redesign and tweaking to more than one component of the design to accommodate a change.
    There are many moving parts of not just the hand but a single complex joint that are precision engineered in appearance. They do not imply piecemeal step change because the rubbing out and starting again feature doesn’t appear in the natural selection process. Perhaps for lengths and colours and simple shapes just gross differences.
    The hand is but one example and an old favourite because of thumb opposition.

    If you know you’re designing a hand to knit or play guitar, crack an egg?
    That’s bad enough. If this took place by a mindless unguided process where the outcome was not preplanned but accidental, and all accidents which were positive were retained, this is to me fanciful.
    Furthermore, some of the aspects of human design appear to be brilliantly convenient but often trivial. It’s hard to think of a reason for natural selection to have retained these small not life or death features.

  15. Joy

    Oops! Wrong room.

  16. JH

    John K,

    What’s fascinating to me is the apparent number of working statisticians who are plainly incapable of reading Matt’s words and comprehending them even a little.

    Apparent number? So, what is the number of such incapable working statisticians who read this blog? Evidence? You know, under President Trump’s libel laws, you might be sued for making such a statement.

    There have been many statisticians participating in the interesting and helpful discussions of p-values here throughout the years – An Example. And Another. If you want more, just search.

    Below is what Briggs wrote https://www.wmbriggs.com/post/9338/
    Given… and assuming the null hypothesis (tied to a parameter) is true, the p-value is the probability of seeing a test statistic larger (in absolute value) than the one actually seen if …
    And you wrote
    “The probability of seeing the statistic T(x) as large or larger (in absolute value) than the statistic t(x) I did see given µA = µB, if…”
    Both definitions are inaccurate. A precise definition is just a click away or can be had by combining both of the above. I seem to have corrected Briggs’ definition of the p-value more than a few times. Tiresome.

  17. Joy

    JH, if you have an apple, or access to one in a shop, have a listen to that comment with voice over on and see if you can understand what you’re saying. Just for a laugh!

  18. Joy

    With ‘voice over on”
    seems the same thing happened to me! The typing changed after i pressed post.

  19. Joy

    “W i t h v o i c e o v e r o n ”
    Oh dear!

  20. Scotian

    The lower case “I”s are gone from all comments. Some sort of vIrus?
