Classical hypothesis testing is founded on the fallacy of the false dichotomy. The false dichotomy says of two hypotheses that if one is false, the other must be true. Thus a sociologist will say, “I’ve decided my null is false, therefore the contrary of the null must be true.” This statement is close, but it isn’t the fallacy, because classical theory supports his pronouncement, but only because so-called nulls are stated in such impossible terms that nulls for nearly all problems are necessarily false, thus the contraries of nulls are necessarily true. The sociologist is stating something like a tautology, which adds nothing to anybody’s stock of knowledge. It would *be* a tautology were it not for his *decision* that the null is false, a decision which is not based upon probability.

To achieve the fallacy, and achieve it effortlessly, we have to employ (what we can call) the fallacy of misplaced causation. Our sociologist will form a null which says, “Men and women are no different with respect to this measurement.” After he rejects this impossibility, as he should, he will say, “Men and women are different” with the implication being this difference is *caused* by whatever theoretical mechanism he has in mind, perhaps “sexism” or something trendy. In other words, to him, the null means the cause is not operative and the alternate means that it is. This is clearly a false dichotomy. And one which is embraced, as I said, by entire fields, and by nearly all civilians who consume statistical results.

Now most statistical models involve continuity in their objects of interest and parameters. A parameterized model is Y ~ D(X, theta)—the uncertainty in the proposition Y is quantified by model D with premises X and theta—where theta in particular is continuous (and usually a vector). The “null” will be something like theta_j = 0, where one of the constituents of theta is set equal to a constant, usually 0, which is said to be “no effect” and which everybody interprets as “no cause.” Yet given continuity (and whatever other premises go into D) the probability that theta_j = 0 is itself 0, which means nulls are always false. Technicalities in measure theory are added about “sets of measure 0”, which make no difference here. The point is, on the evidence accepted by the modeler, the nulls can’t be true, thus the alternates, that theta_j does not equal 0, are always true. Meaning the alternative of “the cause I thought of did this” is embraced.
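To see the continuity point concretely, here is a minimal sketch (in Python, with made-up location and scale values) showing that any single point carries zero probability under a continuous distribution, while even a tiny interval does not:

```python
import math

def norm_cdf(x, mu, sigma):
    """CDF of a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(theta_j = 0) = F(0) - F(0) = 0 for ANY continuous distribution;
# the normal here is purely illustrative.
p_point = norm_cdf(0, 0.3, 1.2) - norm_cdf(0, 0.3, 1.2)

# By contrast, a small interval around 0 has positive probability.
p_interval = norm_cdf(0.01, 0.3, 1.2) - norm_cdf(-0.01, 0.3, 1.2)

print(p_point)         # 0.0
print(p_interval > 0)  # True
```

The same holds whatever continuous D is chosen: the point null is assigned probability zero by the modeler’s own premises.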

If the alternates are always true, why aren’t they always acknowledged? Because decision has been conflated with probability. P-values, which have nothing to do with any question anybody in real life ever asks, enter the picture. A wee p-value allows the modeler to *decide* the alternate is true, while an unpublishable one makes him *decide* the null is true. Of course, classical theory strictly forbids “accepting”, which is to say deciding, a null is true. The tortured Popperian language is “fails to reject”. But the theory is like those old “SPEED LIMIT 55 MPH” signs on freeways. Everybody ignores them. Classical theory forbids stating the probability a hypothesis is true or false, a bizarre restriction. That restriction is the cause of the troubles.

Invariably, hunger for certainty of causes drives most statistical error. The false dichotomy used by researchers is a rank, awful mistake to commit in the sense that it is easily avoided. But it isn’t avoided. It is welcomed. And the reason it is welcomed is that this fallacy is a guaranteed generator of research, papers, grants, and so on.

Suppose a standard, out-of-the-box regression model is used to “explain” a “happiness score”, with explanatory variable sex. There will be a parameter in this model tied to sex with a null that the parameter equals 0. Let this be believed. It will then be announced, quite falsely, that “there is no difference between men and women related to this happiness score”, or, worse, “men and women are equally happy.” The latter error compounds the statistical mistake with the preposterous belief that some score can perfectly measure happiness—when all that happened was that a group of people filled out some arbitrary survey. And unless the survey, for instance, were of only one man and one woman, and the possible faux-quantified scores few, then it is extremely unlikely that men and women in the sample scored equally.
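A quick simulation, with entirely invented numbers, illustrates that last point: even when both groups are generated from the *same* distribution, i.e. when “no difference” is true by construction, the sample means essentially never come out exactly equal:

```python
import random

random.seed(1)

# Hypothetical survey: 50 men and 50 women answer the same "happiness"
# questions. Both groups are drawn from the SAME distribution, so any
# "no difference" premise is true by construction here.
men = [random.gauss(5.0, 1.5) for _ in range(50)]
women = [random.gauss(5.0, 1.5) for _ in range(50)]

mean_m = sum(men) / len(men)
mean_w = sum(women) / len(women)

# Even with identical generating distributions, the sample means
# are almost surely not exactly equal.
print(mean_m == mean_w)  # False
```

Equality of observed means is a measure-zero event on a continuous scale; only coarse, few-valued scores with tiny samples could plausibly produce it.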

Again, statistics can say nothing about *why* men and women would score differently or the same. Yet hypothesis testing always loosely implies causes were discovered or dismissed. We *should* be limited to statements like, “Given the natures of the survey and of the folks questioned, the probability another man scores higher than another woman is 55%” (or whatever number). That 55% may be ignorable, or again it may be of great interest. It depends on the uses to which the model is put, and these are different for different people.
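Here is a sketch of how such a statement might be computed, assuming (purely for illustration) normal predictive distributions for a new man’s and a new woman’s score; the means and standard deviations are made up:

```python
import math

def prob_m_gt_w(mu_m, sd_m, mu_w, sd_w):
    """P(M > W) for independent normal M and W: the difference M - W is
    normal with mean mu_m - mu_w and sd sqrt(sd_m**2 + sd_w**2)."""
    diff_mu = mu_m - mu_w
    diff_sd = math.sqrt(sd_m**2 + sd_w**2)
    z = (0.0 - diff_mu) / (diff_sd * math.sqrt(2.0))
    return 1.0 - 0.5 * (1.0 + math.erf(z))

# Hypothetical predictive distributions for one new man and one new woman:
p = prob_m_gt_w(5.2, 1.5, 5.0, 1.5)
print(round(p, 3))  # ≈ 0.538, i.e. modestly above a coin flip
```

The output is a plain-language probability about observables, which anyone can judge as interesting or ignorable for their own purposes.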

Further, statements like these do not as strongly imply that it was some fundamental difference between the sexes that caused the answer. It keeps us honest. Though, given my past experience with statistics, it is likely many will still fixate on the possibility of cause. Why isn’t sex a cause here? Well, it may have been that some difference besides sex in the two groups was the cause *or causes*. Say the men were all surveyed coming out of a bar and the women coming out of a mall. Who knows? *We* don’t. Not if all we are told are the results.

It is the same story if the null is “rejected”. No cause is certain or implied. Yet everyone takes the rejection as proof positive that causation has been demonstrated. And this is true, in its way: some thing or things still *caused* the observed scores. It’s only that the cause might not have been related to sex.

If the null were accepted we might still say, “Given the natures of the survey and of the folks questioned, the probability another man scores higher than another woman is 55%”. And it could be that, after gathering a larger sample, we reject the null but the difference probability is now 51%. The hypothesis test moves from lesser to greater certainty, while the realistic probability moves from greater to lesser. I have often seen this, particularly in regressions. Variables which were statistically “significant” according to hypothesis tests barely cause the realistic probability needle to nudge, whereas “non-significant” variables can make it swing wildly. That is because hypothesis testing often misleads. This is also well known, for instance in medicine under the name “clinical” versus statistical “significance.”
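A toy calculation, with invented numbers, shows how the two measures can pull in opposite directions: a huge sample makes a trivial difference wildly “significant” while the realistic probability sits next to a coin flip:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Made-up numbers: a true mean difference of 0.05 score points
# (sd 1.5 in each group), with 200,000 people per group.
diff, sd, n = 0.05, 1.5, 200_000

# Two-sided z-test of the "no difference" null.
se = math.sqrt(2 * sd**2 / n)
p_value = 2 * (1 - norm_cdf(diff / se))

# "Realistic" probability a new man out-scores a new woman.
p_m_gt_w = norm_cdf(diff / math.sqrt(2 * sd**2))

print(p_value < 0.001)     # True: wildly "significant"
print(round(p_m_gt_w, 3))  # ≈ 0.509: barely different from a coin flip
```

This is the clinical-versus-statistical-significance gap in four lines of arithmetic.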

It may be—and this is a situation not in the least unusual—that the series of “happiness” questions are *ad hoc* and subject to much dispute, and that the people filling out the survey are a bunch of bored college kids hoping to boost their grades. Then if the result is “Given the natures of the survey and of the folks questioned, the probability another man scores higher than another woman is 50.03%”, the researcher would have to say, “I couldn’t tell much about the difference between men and women in this situation.” This is an admission of failure.

The researcher was *hoping*—we have to admit it—to find a difference. He did, but it is almost surely trivial. How much better for his career would it be if instead he could say, “Men and women were different, p < 0.001”? A wee p provided freedom to speculate about what *caused* this difference. It is a good question for you, dear reader, whether the realistic approach advocated here will be preferred by researchers.

**Update** Forgot to mention this is a reworked introduction to a section on hypothesis testing in my book. My deadline for finishing it is fast approaching (mid June).


Fantastic. You’ve put your finger on it. The points you make seem to me to be both fundamental and unassailable. They can be wished-away, but not refuted. Those who have ears to hear, will hear. It may still be a long road even for those, but you will definitively have planted the seed.

I remember in one of my first stats classes, when covering hypothesis testing, and the teacher was talking about how the null was mu_0 = 0. I asked a question about how that could be true, given that we had just talked about how the probability of any point on a continuous distribution was zero.

The teacher gave some answer about how we were going to look for mu_0 ‘sufficiently likely to be around 0’ (or something like that). That didn’t sit well with me, but I figured it wasn’t worth pressing the matter on. I wish I had this article back then!

James,

That “sufficiently likely to be around 0” is nonsense, except in the measure theoretic sense of “sets of measure 0”. But like I said, that distinction is of no value in any real decision.

No. The real way classical theorists get out of the pickle is to forbid forming probabilities of parameter values. They say, “We can’t say the probability of theta_j = 0 is any number, because that is to put a probability on a parameter.”

With a mighty leap, Jack was free (of his self-dug pit)!

JohnK,

“In the long run, we shall all be dead.”

But Briggs:

When I first was exposed to frequentist analysis, I was propagandized that one either rejected the Null Hypothesis, or failed to reject the Null Hypothesis…

Sort of like a Scottish jury: the verdict allows for “Proven” and “Not Proven.” But a verdict of “Not Proven” does not mean “Innocent.”

Much of this seems to work well enough in quality control. But in most quality control, the quality to be controlled is required to remain on a target value, so the question of interest is whether the central tendency of the process has remained at the target value (or “the past average”). When points are found beyond the statistical control limits, one searches for the cause. (And the way in which the process is out of control can help narrow the search: things that cause sudden shifts are different from the things that cause gradual trends or repetitive cycles, etc.)

The more I think on’t, the more I’m inclined to suppose that the problem lies less with frequentist statistics than with treating sociology et al. as if they were sciences. When we would find, for example, a difference in corrosion between pipes coated with a new experimental formula versus those coated with the original formula, we could not ipso facto conclude that it was the coating that accounted for the difference. However, if the pipes were buried side by side and each pair buried in different soils, some factors could be ruled out. Even so, the results of any experiment had to be confirmed by repeated trials under the alleged superior conditions to see if the results were replicated. My old boss, Ed Schrock, used to say that you had to confirm the cause by turning it on and off a couple of times.

A great deal that we took for granted in electrical-mechanical-chemical testing, including the use of controls, randomized trials, and so on, is simply not done in the “soft sciences”. A friend of mine recently noted that in “education testing,” there is often no control group and the comparison is made to a prediction of what the results “would have been” had the “treatment” not been applied.

YOS,

Exactly so. In QC you’re constantly exposed to causes, and so you know when things change that the causes have changed. Experimentation in the hard sciences: causes are sought.

But in the intellectual hinterlands, any correlation is said to be “caused” by whatever takes your fancy.

YOS,

I think that guilty, not guilty, and not proven are the three verdicts in Scotland. Not proven meaning that we know you did it, scum, but the evidence isn’t there. From what I’ve read there is political pressure to remove the third option in order to increase conviction rates in rape cases.

Fisher liked to say that a small p value meant that either the null hypothesis is false or the data are atypical. This seems just about OK, but extremely unhelpful, because it does not tell us which is more likely. To make that judgement one needs to take into account background information, either in the form of Bayesian priors or, if you prefer to use logical probability, by conditionalizing on any relevant data.

One simple example is to consider whether a particular pound coin in my pocket has a significant bias when tossed. Suppose I follow Fisher methodology and form a null hypothesis of no bias, devise an experiment to test it, say by tossing it 10 times and counting the heads, and use a simple model to calculate a p value for the result. Then I toss it 10 times, and say I get 9 heads. This gives a p value of slightly more than 1%, or a bit over 2% if you consider the two-tail case (i.e. you consider that it would have been equally interesting if I had obtained 1 head). This is a small p value, but should I reject the null hypothesis? Of course not. This coin was made by the Royal Mint, which has quality control procedures to help ensure defective coins don’t get into circulation. I’ve no idea what the prior probability is of a pound coin being biased, but I would guess somewhere between 10^-6 and 10^-8. Factor that in and it is still very unlikely that the coin is biased.
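The arithmetic can be checked directly. The sketch below computes the exact binomial tails, then folds in a guessed prior of bias; the 10^-6 prior is the commenter’s own guess, and the 90%-heads bias model is an added assumption for illustration:

```python
from math import comb

# Exact one-tailed binomial p value: P(9 or more heads in 10 fair tosses).
n, k = 10, 9
p_one_tail = sum(comb(n, j) for j in range(k, n + 1)) / 2**n
print(round(p_one_tail, 4))      # 0.0107: slightly more than 1%
print(round(2 * p_one_tail, 4))  # 0.0215: the two-tailed version

# Fold in a guessed prior that a Mint coin is biased at all (1e-6),
# crudely modeling a biased coin as showing heads 90% of the time.
prior_biased = 1e-6
like_biased = sum(comb(n, j) * 0.9**j * 0.1**(n - j) for j in range(k, n + 1))
like_fair = p_one_tail
post_biased = (prior_biased * like_biased) / (
    prior_biased * like_biased + (1 - prior_biased) * like_fair)
print(post_biased < 1e-4)  # True: still very unlikely the coin is biased
```

The small p value survives intact; it is the prior background knowledge about the Royal Mint that keeps the conclusion sane.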

I’m fond of this example, because I remember some correspondence in the letters column of a major British newspaper back in the early 90s, after the England cricket team had toured Australia, played 10 matches, and lost the toss 9 times out of ten. There was much discussion about whether the Aussies had cheated. I was highly amused to note that not one of the correspondents even knew how to calculate the probability of this result assuming the null hypothesis of no cheating and an unbiased coin. Of course, given how many international tours of this kind have happened in the history of cricket, the result is not unusual.
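The multiplicity point can be made numeric too. The 200-tour figure below is a made-up round number purely for illustration:

```python
from math import comb

# P(losing the toss 9 or 10 times out of 10 with a fair coin).
p_run = sum(comb(10, j) for j in (9, 10)) / 2**10  # ≈ 0.0107

# Across, say, 200 ten-match tours in cricket history (an invented
# round number), the chance that AT LEAST ONE tour shows so lopsided
# a result:
n_tours = 200
p_somewhere = 1 - (1 - p_run)**n_tours
print(round(p_somewhere, 2))  # ≈ 0.88: more likely than not
```

What looks damning for one tour in isolation is nearly guaranteed to happen somewhere, given enough tours.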

Nor when we remember the coin has no memory and each toss is still 50/50, no matter how many heads have previously been obtained. (Heck, it isn’t even the same coin each time!)

Again, in QC work we might use six in a row on the same side of the median, or 7 out of 8, as in the Western Electric Handbook; but these are triggers to investigate the process, not to conclude to a scientific law of nature.

Example: a pasting machine applies lead oxide paste to a lead grid. The paste weight is measured by first weighing the grid, running it through the pasting machine, then re-weighing the pasted grid. Samples were taken in subgroups of five every fifteen minutes. During a study of the machine’s capability, the mean weights of the first 12 subgroups were all below the grand mean while those of the last 12 were above. The transition was abrupt, not gradual. This was sufficiently unlikely to be worth looking into. The assignable cause was found to be that the oxide paste batch from the paste mill had been changed and the new batch had a higher density than the first.
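A minimal sketch of such a run-rule check (the function name and the 12-point window are illustrative, echoing the pasting-machine example; real charts would use Western Electric-style rules on control limits as well):

```python
def run_on_one_side(means, center, run_length=12):
    """Return the index of the first window of `run_length` consecutive
    subgroup means all on the same side of the center line, or None.
    A trigger to investigate, not a conclusion about causes."""
    for i in range(len(means) - run_length + 1):
        window = means[i:i + run_length]
        if all(m < center for m in window) or all(m > center for m in window):
            return i
    return None

# Toy data echoing the story: 12 subgroup mean weights below the grand
# mean, then 12 above, after the abrupt paste-batch change.
means = [9.8] * 12 + [10.2] * 12
grand_mean = sum(means) / len(means)
print(run_on_one_side(means, grand_mean))  # 0: flagged at the first subgroup
```

Twelve consecutive points on one side by luck alone has probability about 2 × (1/2)^12 ≈ 0.0005, which is why it is worth a look for an assignable cause.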

It seems to me that these methods work well enough when dealing with inanimate objects (like battery grids) and not so well when dealing with things that might talk back.

False dichotomy… hmmm… You mean like, “Revenues increased because taxes went down?”

JMJ

JMJ,

No. “Revenues increased after taxes went down” is an observation, and observations are contingently true. Here is a real false dichotomy: “The globe must be warming because of pollution.”

briggs:

I thought the false dichotomy was more like… “he is either stupid or greedy”.

David,

That’s a good example, sure; but conditional.

The false dichotomy is also this: “I can’t think of any other reason except that X caused Y, therefore X caused Y.”

Briggs,

An even more common version is “I don’t want there to be any reason other than X, so it is X”.

When the null and alternative hypotheses divide all possibilities into two non-overlapping sets, there is no false dichotomy. In this case, indeed, when one is false, the other must be true. An alternative cannot be acknowledged as true because data evidence cannot prove its absolute truth, which has nothing to do with p-values. P-values come into play when discussing ways to assess the uncertainty or possible error, which is a different story.

FYI – women have surpassed men on IQ test scores in their averages. This is a descriptive conclusion.

True that Dr Briggs.

True that.

JH,

Briggs said as much two sentences after that quote.

James,

Please quote those words for me.

The null and alternative hypotheses are not a false dichotomy. The classical theory doesn’t support the following

And the following statement

is wrong. A mean effect is either positive or non-positive. The setup of the null and alternative hypotheses doesn’t make the contraries of nulls necessarily true. Statistics clearly teaches that data evidence does not prove the null is true!

The socialist doesn’t know statistics, and he or she is an imaginary idiot.

The “sociologist”, not “socialist”.

This is pretty ironic. Hypothesis tests involve exhausting the parameter space, whereas it’s the Bayesians who play with Bayes factors and the like that not only consider two exhaustive hypotheses, they’re prepared to give lumps of prior such as .5 to each! What happened to everything in between? And giving a lump of prior to the null is strictly Bayesian. Even in special cases where Fisherian tests have a single null with a directional alternative, interpreting them according to which hypotheses are severely warranted leads to considering the alternative to H (namely, not-H). Moreover, since Bayesian posteriors require a “something else” or Bayesian catchall, and different people invariably identify “something else” differently, their posteriors are not comparable over different examples or agents. This was Barnard’s point to Lindley and Savage in Savage 1962.

Oh, and anyone who takes statistical significance as indicating causes is a statistical ignoramus.

Mayo,

Another false dichotomy: if frequentist theory is wrong, classical Bayesianism must be right.

But I’m 100% with you when you say, “anyone who takes statistical significance as indicating causes is a statistical ignoramus.” A group which is the vast majority of users of statistics.

I meant to say that Bayesians

“not only consider two non-exhaustive hypotheses”