The Rise Of Bayes

Thanks to reader Frank Kristeller we learn that the far left New York Times yesterday ran an article by F.D. Flam praising the rise of Bayesian statistics: The Odds, Continually Updated.

The replacement of frequentist statistics is, if true, moderately cheering news. And Bayes is the next step in the removal of magical and loose thinking from statistics. But far from the destination. That, I argue, is logical probability, which you can think of as Bayes sans scientism and subjectivism.

However, baby steps:

Bayesian statistics are rippling through everything from physics to cancer research, ecology to psychology. Enthusiasts say they are allowing scientists to solve problems that would have been considered impossible just 20 years ago. And lately, they have been thrust into an intense debate over the reliability of research results.

Nothing like a little hyperbole, eh? I don’t think our frequentist friends would agree they couldn’t solve the same problems as Bayesians. And of course they can. But so can storefront psychics solve problems. What we’re after is good solutions.

Flam got this right:

But the current debate is about how scientists turn data into knowledge, evidence and predictions. Concern has been growing in recent years that some fields are not doing a very good job at this sort of inference. In 2012, for example, a team at the biotech company Amgen announced that they’d analyzed 53 cancer studies and found it could not replicate 47 of them.

This is what happens when you base your decisions on p-values, little mystical numbers which remove the responsibility of thinking. P-values aren’t the only scourge, of course: willful transgressive thinking (especially in fields like sociology) and false quantification are just as degrading, and probably even more so.

False quantification? That’s when numbers are put to non-numerical things, just so statistics can have a go at them. Express your agreement with that statement on a Likert scale from 1 to 5.

Again:

“Statistics sounds like this dry, technical subject, but it draws on deep philosophical debates about the nature of reality,” said the Princeton University astrophysicist Edwin Turner, who has witnessed a widespread conversion to Bayesian thinking in his field over the last 15 years.

This is true. But just try to get people to believe it! Most academics, even their Bayesian variety, feel the foundations are fixed, that most or all that need be known about our primary premises is already known. Not true. Philosophy in a statistician’s education is put last, if at all. The error here is to assume probability is only a branch of mathematics.

One downside of Bayesian statistics is that it requires prior information — and often scientists need to start with a guess or estimate. Assigning numbers to subjective judgments is “like fingernails on a chalkboard,” said physicist Kyle Cranmer, who helped develop a frequentist technique to identify the latest new subatomic particle — the Higgs boson.

This isn’t really so. The problem here is blind parameterization, which is the assigning of probability models for the sake of convenience without understanding where the parameters of those models arise. This is an area of research that most statisticians are completely unaware of, so used are they to taking the parameters as a given. Logical probability removes the subjectivism and arbitrary quantification here, so that the true state of knowledge at the beginning of a problem is optimally stated.

Others say that in confronting the so-called replication crisis, the best cure for misleading findings is not Bayesian statistics, but good frequentist ones. It was frequentist statistics that allowed people to uncover all the problems with irreproducible research in the first place, said Deborah Mayo, a philosopher of science at Virginia Tech. The technique was developed to distinguish real effects from chance, and to prevent scientists from fooling themselves.

Mayo (our friend) is wrong. It was the discordance between scientists’ commonsensical knowledge of causality and the official statistical results that allowed us to see the mistakes. Statisticians do causality very, very badly. Indeed, frequentism is based on a fallacy of mixing up ontology (what is) with epistemology (our knowledge of what might be). Bayes does slightly better, but errs by introducing arbitrary subjective opinion.

Uri Simonsohn…exposed common statistical shenanigans in his field — logical leaps, unjustified conclusions, and various forms of unconscious and conscious cheating.

He said he had looked into Bayesian statistics and concluded that if people misused or misunderstood one system, they would do just as badly with the other. Bayesian statistics, in short, can’t save us from bad science.

Simonsohn (whom I don’t know) is right, mostly. The problems are deep. But you notice he left out p-values.

Flam missed that resistance to Bayes is still strong in many traditional fields, like medicine, where p-values are demanded. Still, that Bayes is becoming more available is good. But since we’re at the start, let’s try to do it right and not, say, re-introduce old notions (like p-values!) into new theory.

Categories: Philosophy, Statistics

26 replies »

  1. One downside of Bayesian statistics is that it requires prior information — and often scientists need to start with a guess or estimate.

    Complaining about the priors in Bayesian methods is quite odd, given that you can’t do any kind of science or statistics without prior knowledge or assumptions. What scientist hasn’t started with a guess or an estimate? Isn’t that the hypothesis formation step in the scientific method? Make the hypothesis, test it, use it for prediction, and then re-evaluate.

  2. Sometimes Bayes’ theorem is conflated with Bayesian statistics to the detriment of understanding statistics as a discipline. Bayes’ theorem reaches a logical conclusion but Bayesian statistics is plagued by the need for a prior probability density function (PDF). Bayes and Laplace thought the uniform PDF was the proper one to use when there was no empirical data, but it is usually true that an infinite number of non-uniform prior PDFs are equally non-informative. To pick one of them is to violate the law of non-contradiction. The violation of non-contradiction invalidates much of what is called “Bayesian statistics” in the literature of statistics and necessitates an approach to model building which, while consistent with Bayes’ theorem, is inconsistent with much of what is taught to students about Bayesian statistics.

  3. A blog article on statistics here. Such a novel idea.

    Paul Murphy,
    if it isn’t, Flam likely acquired it as a nickname in grade school.

    James,
    If past experience says one outcome is more likely then give it a higher prior. E. T. Jaynes remarked that he would set the prior probability against ESP quite high. After an iterative update process, the prior is what you got from the previous iteration. Often though, when starting out with little to go on, the prior should be uniform for all outcomes. The problem is there are those who think the prior can be anything.

    Terry Oldberg,
    The way I look at it, Bayes’ theorem describes how to convert P(E|H) to P(H|E). Given a table of outcome counts where the rows represent H states and the columns the E states, the row sums become the counts for estimating the prior, P(H). Frankly, I think it should be taught using this illustration before going on to anything else. Then perhaps people would view the prior as less mystical.
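A minimal Python sketch of the count-table view described in the comment above. The counts, the labels `H1`, `H2`, `E1`, `E2`, and the function names are made up for illustration; they are not from the comment.

```python
from fractions import Fraction

# Hypothetical outcome counts: rows are H states, columns are E states.
counts = {
    "H1": {"E1": 30, "E2": 10},
    "H2": {"E1": 20, "E2": 40},
}

def prior(h):
    """P(H): the row sum for h divided by the grand total."""
    total = sum(sum(row.values()) for row in counts.values())
    return Fraction(sum(counts[h].values()), total)

def likelihood(e, h):
    """P(E|H): the cell count divided by its row sum."""
    return Fraction(counts[h][e], sum(counts[h].values()))

def posterior(h, e):
    """P(H|E) via Bayes' theorem, built only from prior and likelihood."""
    numerator = likelihood(e, h) * prior(h)
    evidence = sum(likelihood(e, k) * prior(k) for k in counts)
    return numerator / evidence

print(prior("H1"), likelihood("E1", "H1"), posterior("H1", "E1"))
```

With these invented counts, `posterior("H1", "E1")` comes out to 3/5, the same as reading the E1 column directly (30 out of 50), which is the sense in which the prior is just the row totals of the table.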

  4. DAV,

    That many of the other posts here aren’t seen as posts about statistics is yet another indication of the problem!

  5. DAV:

    I’m aware of only one way in which a unique prior PDF can be generated without violation of non-contradiction. It is based upon a property of a sequence of Bernoulli trials. In one trial the relative frequency of a specified outcome will surely be 0 or 1. In two trials the relative frequency will surely be 0 or 1/2 or 1. In N trials, it will be 0 or 1/N or 2/N or… or 1. Let N increase without limit. The relative frequency becomes known as the “limiting relative frequency.” As information about the limiting relative frequency is missing, it is appropriate to maximize, without constraints, the entropy of the distribution function over the limiting relative frequency possibilities. This procedure yields a uniform prior PDF over the interval between 0 and 1 in the limiting relative frequency. Use of this prior leads to Laplace’s rule of succession and to generalizations of this rule that are of use in the assignment of values to probabilities.

    This procedure may be contrasted to the one used by climatologists in generating posterior PDFs over the equilibrium climate sensitivity. Uninformative prior PDFs are of infinite number, each generating a different posterior PDF. Consequently, non-contradiction is violated.
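The rule of succession mentioned in the comment above can be checked with a short Python sketch. It assumes the standard statement of the rule: with a uniform prior on the limiting relative frequency and s successes in n trials, the probability of success on the next trial is (s + 1) / (n + 2). The function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Exact integral of t**a * (1 - t)**b over [0, 1] for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def predictive_uniform_prior(successes, trials):
    """P(next trial succeeds | data) under a uniform prior on [0, 1].

    This is the posterior expected value of the limiting relative frequency:
    the integral of t * likelihood divided by the integral of the likelihood.
    """
    failures = trials - successes
    return beta_integral(successes + 1, failures) / beta_integral(successes, failures)

def rule_of_succession(successes, trials):
    """Laplace's closed form, (s + 1) / (n + 2)."""
    return Fraction(successes + 1, trials + 2)

print(predictive_uniform_prior(7, 10), rule_of_succession(7, 10))
```

Both functions return 2/3 for 7 successes in 10 trials, and with no data at all the predictive probability is 1/2.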

  6. Briggs,
    Agreed, but the posts on philosophy and religion in the last year or two seem to be more predominant. I admit I never counted them though.

  7. “False quantification? That’s when numbers are put to non-numerical things, just so statistics can have a go at them. Express your agreement with that statement on a Likert scale from 1 to 5.”

    Amen, brother! This statement (properly attributed of course) is going into every one of my lectures that refers to ordinal data. Then I’ll ask “What’s the average of two second lieutenants, three captains, one major, two colonels, and a major general?”

  8. Terry,

    Assigning anything but a uniform prior without any experience is introducing a bias that may not be warranted and if too far off from “reality”, this bias would be difficult to overcome. Wasn’t this the gist of your posts on J. Curry’s blog?

    One of the problems in a regression (say, temperature vs. time) is trying to quantify the uncertainty in outcomes when the number of outcomes is huge. For some problems, a uniform prior doesn’t seem reasonable. For example, we expect temperatures to cluster around certain values. Too often a normal distribution of the uncertainty is used, largely because it is convenient for calculation, even though it may allow for values impossible to achieve (such as a negative absolute temperature). Using a Bernoulli process isn’t much different from assuming a normal distribution. What would be the justification for using it to represent uncertainty in temperature values other than that it provides a convenience for calculation?

  9. DAV:

    I’m confused at your response to my comment. I never said anything about the process of picking a prior, just that using the need for prior information as a negative (against Bayesian methods) is odd since science needs and builds from prior information. The downside isn’t the need for prior information, there’s just difficulty in some circumstances picking a prior that everyone could agree upon.

    In my view, that last part shouldn’t be seen as a difficulty, but rather part of the process of building a hypothesis, then testing for predictive value, and discussing/iterating with others. Additionally, as long as a prior doesn’t disallow certain values, then the likelihood (with enough data) can ‘overcome’ a prior.

  10. James,

    I misread the intent of your comment. Sorry.

    Yes, it’s true with enough data one can overcome a bad prior but (given a bad prior) it may require a LOT of data. The one most easily overcome is a uniform one.
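A rough Python sketch of the "enough data overcomes a bad prior" point made in the last two comments, assuming a Beta prior on a success rate (a conjugate choice made here for convenience, not something the commenters specify). The true rate of 0.2, the badly placed prior Beta(50, 2), and the function names are all invented for illustration.

```python
def posterior_mean(a, b, successes, trials):
    """Mean of the Beta(a + successes, b + failures) posterior."""
    return (a + successes) / (a + b + trials)

def error_after(a, b, trials, true_rate=0.2):
    """Distance of the posterior mean from the true rate, using idealized
    data in which exactly true_rate * trials of the trials are successes."""
    successes = round(true_rate * trials)
    return abs(posterior_mean(a, b, successes, trials) - true_rate)

# Columns: trials, error under uniform Beta(1, 1), error under bad Beta(50, 2).
for n in (10, 100, 1000, 10000):
    print(n, round(error_after(1, 1, n), 4), round(error_after(50, 2, n), 4))
```

Under these assumptions both priors converge on the true rate, but the badly placed prior needs far more trials to get as close as the uniform one. A prior that assigns zero probability to the true value would never recover, which is the caveat about not disallowing values.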

  11. DAV:

    Maximization of the entropy assigns equal values to probabilities NOT probability densities. As the distance between adjacent temperature values is not a constant, an infinite number of prior PDFs over the temperature value possibilities are non-informative. As the distance between adjacent limiting relative frequency values IS a constant there is only one prior PDF over the limiting relative frequency value possibilities. Thus, the Bayesian method works without violation of non-contradiction for limiting relative frequency values but not for temperature values. I hope this helps.

  12. Terry,

    Not sure if it helps. You seem to be saying that the Bayesian method doesn’t work for some types of values such as temperature. Are you sure about that? If you are merely saying the Bayesian results are only about the uncertainties, well then, I agree. But in any regression the values of Y spit out are the most probable values and are based on the uncertainties inherent in the model, and I thought we were talking about the uncertainty in Y values.

  13. DAV:

    I claim that the Bayesian method doesn’t work when the variable is temperature but does work when the variable is limiting relative frequency. You may be able to convince yourself of this by drawing a pair of histograms. The vertical axis of the first histogram is the probability density while the horizontal axis is the limiting relative frequency. The histogram consists of a set of vertical bars. Each bar has the same width. The probability of the limiting relative frequency that is associated with each bar is the width of this bar times its height. The width of each bar is the change in the limiting relative frequency. It is a property of the limiting relative frequency that the change in it is a constant; thus the widths of the various bars are identical. Because the entropy is maximized the heights of the various bars are identical. Thus, the probability density is uniform in the interval between 0 and 1. You have built the entropy maximizing (uninformative) prior PDF over the limiting relative frequency and proved that it is uniform on the interval between 0 and 1.

    Now draw a similar histogram but replace the limiting relative frequency by the temperature. Keep the area of each bar constant, thus keeping the values of the probabilities constant, but vary the widths. If you reduce the width of a bar you increase the height. If you increase the width you decrease the height. The height is the probability density. By varying the widths you can create a bunch of different entropy maximizing prior PDFs over the temperature. You have proved that the uniform prior PDF over the temperature is not uniquely uninformative. Prior PDFs that are non-uniform and uninformative are of infinite number.
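The two histograms described in the comment above can be sketched in a few lines of Python. This only illustrates the arithmetic of the construction (equal probability per bar, height equal to probability divided by width); it does not settle whether the argument built on it is sound. The bin edges and the function name are made up.

```python
def bar_heights(edges):
    """Densities of histogram bars that each carry the same probability.

    Every bar gets probability 1 / (number of bars); its height is that
    probability divided by the bar's width.
    """
    n_bars = len(edges) - 1
    prob_per_bar = 1.0 / n_bars
    return [prob_per_bar / (hi - lo) for lo, hi in zip(edges, edges[1:])]

equal_widths = [0.0, 0.25, 0.5, 0.75, 1.0]    # constant spacing
unequal_widths = [0.0, 0.1, 0.3, 0.6, 1.0]    # spacing varies from bar to bar

print(bar_heights(equal_widths))    # flat: every bar has the same height
print(bar_heights(unequal_widths))  # same bar probabilities, different heights
```

With constant spacing the equal-probability assignment gives a flat density. With varying spacing the same equal-probability assignment gives a density that is no longer flat, and each choice of spacing gives a different one.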

  14. To Mike Anderson:

    re: I’ll ask “What’s the average of two second lieutenants, three captains, one major, two colonels, and a major general?”

    Hint: the answer is a kind of cluster.

  15. Did anyone notice that Flam misidentified odds? He said:

    “A Bayesian calculation would start with one-third odds that any given door hides the car, then update that knowledge with the new data: Door No. 2 had a goat. The odds that the contestant guessed right — that the car is behind No. 1 — remain one in three. Thus, the odds that she guessed wrong are two in three”

    He is conflating the probability of 1/3 with odds. Odds are p / (1 – p), or 1/3 divided by 2/3 here, so the odds are 1 to 2.
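The probability-to-odds conversion in the comment above, as a small Python sketch using exact fractions. The function names are illustrative.

```python
from fractions import Fraction

def odds_in_favor(p):
    """Convert a probability p (0 <= p < 1) to odds in favor, p / (1 - p)."""
    return p / (1 - p)

def probability_from_odds(o):
    """Convert odds in favor back to a probability, o / (1 + o)."""
    return o / (1 + o)

p_right = Fraction(1, 3)   # probability the first guess was right
p_wrong = 1 - p_right      # probability the first guess was wrong

print(odds_in_favor(p_right))   # 1/2, i.e. odds of 1 to 2
print(odds_in_favor(p_wrong))   # 2, i.e. odds of 2 to 1
```

So "one in three" is a probability; the matching odds are 1 to 2 in favor, or 2 to 1 against.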

  16. In teaching radiology residents and technologists the elements of statistics (in the country of the blind, the one-eyed man is king?), I’ve found that Bayes’ Theorem is best illustrated by diagnostic tests, particularly the case of a rare disease with a highly sensitive and highly specific test. Let D+ be positive for disease, D- negative, T+ positive for test of disease, T- negative, P(T+|D+) the sensitivity (= 0.99, say), P(T-|D-) the specificity (= 0.98, say), and let the disease be rare, with a small prevalence, P(D+) = 0.001. In pedagogical practice this is best worked out with a 2×2 table and some arbitrary population size, N = 100,000, say.
    All you folks here can work out P(D+|T+), the probability of having the disease if one tests positive (even with a very fine test, high sensitivity and specificity), and you’ll see it’s remarkably small. It’s even nicer if you use a 2×2 table, calculating
    N(D+), N(D-), N(T+|D+), N(T-|D-), N(T+|D-), N(T-|D+) and the marginal quantities N(T+), N(T-). You’ll see that, using either the table or Bayes’ Theorem, the probability of having the disease if you test positive is small, about 5% (if I didn’t goof on the calculations).
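A short Python sketch that works the numbers given in the comment above (sensitivity 0.99, specificity 0.98, prevalence 0.001, N = 100,000) both ways, by Bayes' theorem and by the 2×2 table. The function and variable names are illustrative.

```python
def p_disease_given_positive(sensitivity, specificity, prevalence):
    """P(D+|T+) by Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def two_by_two(sensitivity, specificity, prevalence, n):
    """Expected cell counts of the 2x2 table for a population of size n."""
    diseased = prevalence * n
    healthy = n - diseased
    return {
        ("D+", "T+"): sensitivity * diseased,
        ("D+", "T-"): (1 - sensitivity) * diseased,
        ("D-", "T+"): (1 - specificity) * healthy,
        ("D-", "T-"): specificity * healthy,
    }

table = two_by_two(0.99, 0.98, 0.001, 100_000)
positives = table[("D+", "T+")] + table[("D-", "T+")]   # marginal N(T+)
from_table = table[("D+", "T+")] / positives
from_bayes = p_disease_given_positive(0.99, 0.98, 0.001)
print(round(from_table, 4), round(from_bayes, 4))
```

The table has roughly 99 true positives against 1,998 false positives, so both routes give 99/2097, about 4.7%, in line with the "about 5%" quoted.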

  17. Here’s a nice quote on whether Bayesian methods are useful in scientific argumentation:
    “One question in science is not ‘ is this hypothetical model true’ but ‘is this model better than the alternatives’…If we believe dogmatically in a particular view, then no amount of contradictory data will convince us otherwise…” John Skilling, “Foundations and Algorithms” in Bayesian Methods in Cosmology

  18. “What’s the average of two second lieutenants, three captains, one major, two colonels, and a major general?”

    The Warrant Officer who takes all the practical decisions and organises all the actual work….:-)

  19. Terry,

    I claim that the Bayesian method doesn’t work when the variable is temperature but does work when the variable is limiting relative frequency.

    How would you apply that to one-time events such as the Lakers winning their next game? The temperature tomorrow is a one-time event. There are no relative frequencies.

    You could perhaps change the questions to: Lakers winning their next 10 games or temperature being such and such for the next N periods but this seems to be the frequentist notion of probability which would indeed be a relative frequency.

    In any case, we (I think) are discussing how to represent the initial uncertainty (the prior) while you seem to be discussing the uncertainty of the result and whether the result implies the uncertainty in prediction was justified. No one except the modeler cares about this. It’s just like the p-value of a parameter. More useful would be a prediction that says the result at X will be Y ± e and if it is not then the model is incorrect.

  20. DAV:

    For the Lakers, the probability of winning their next game is the expected value of the limiting relative frequency. I’m foggy on the rest of your question. Please clarify.

  21. Terry,

    I didn’t ask one beyond how you would apply what you’ve said to a one-time event, but I am curious. The frequency of what exactly is being limited?

  22. DAV:

    The spread in the probability distribution function over the various possible limiting relative frequency values is constrained by the available information about the limiting relative frequency. In the limit of perfect information, this function becomes a delta function at a particular value of the limiting relative frequency.
