Classical hypothesis testing is founded on the fallacy of the false dichotomy. The false dichotomy says of two hypotheses that if one is false, the other must be true. Thus a sociologist will say, “I’ve decided my null is false, therefore the contrary of the null must be true.” This statement comes close to the fallacy, but isn’t quite it, because classical theory supports his pronouncement; and it does so only because so-called nulls are stated in such impossible terms that the nulls of nearly all problems are necessarily false, and thus the contraries of those nulls are necessarily true. The sociologist is stating something like a tautology, which adds nothing to anybody’s stock of knowledge. It would be a tautology were it not for his decision that the null is false, a decision which is not based upon probability.
To achieve the fallacy, and achieve it effortlessly, we have to employ what we can call the fallacy of misplaced causation. Our sociologist will form a null which says, “Men and women are no different with respect to this measurement.” After he rejects this impossibility, as he should, he will say, “Men and women are different”, with the implication that the difference is caused by whatever theoretical mechanism he has in mind, perhaps “sexism” or something equally trendy. In other words, to him the null means the cause is not operative and the alternate means that it is. This is clearly a false dichotomy, and one which is embraced, as I said, by entire fields, and by nearly all civilians who consume statistical results.
Now most statistical models involve continuity in their objects of interest and in their parameters. A parameterized model is Y ~ D(X, theta), where the uncertainty in the proposition Y is quantified by model D with premises X and theta, and where theta in particular is continuous (and usually a vector). The “null” will be something like theta_j = 0, where one of the constituents of theta is set equal to a constant, usually 0, which is said to be “no effect” and which everybody interprets as “no cause.” Yet given continuity (and whatever other premises go into D), the probability that theta_j = 0 is 0, which means nulls are always false. Technicalities from measure theory about “sets of measure 0” are sometimes added, but they make no difference here. The point is that, on the evidence accepted by the modeler, the nulls can’t be true; thus the alternates, that theta_j does not equal 0, are always true. Meaning the alternative of “the cause I thought of did this” is embraced.
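The point about continuity can be seen numerically. Here is a minimal sketch, taking a standard normal for theta_j purely as a stand-in for whatever continuous distribution the model implies: the probability that theta_j lands within eps of 0 shrinks with eps, and the limiting probability of the point theta_j = 0 exactly is 0.

```python
from scipy.stats import norm

# Stand-in only: any continuous distribution for theta_j will do;
# a standard normal is used here for illustration.
for eps in (1.0, 0.1, 0.01, 0.001):
    # Probability that theta_j lies within eps of 0:
    p = norm.cdf(eps) - norm.cdf(-eps)
    print(eps, p)
# As eps -> 0 this probability goes to 0: the point null
# theta_j = 0 has probability 0 under continuity.
```

The “sets of measure 0” technicality mentioned above is exactly this: single points carry zero probability under a continuous distribution.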
If the alternates are always true, why aren’t they always acknowledged? Because decision has been conflated with probability. P-values, which have nothing to do with any question anybody in real life ever asks, enter the picture. A wee p-value allows the modeler to decide the alternate is true, while an unpublishable one makes him decide the null is true. Of course, classical theory strictly forbids “accepting”, which is to say deciding, that a null is true. The tortured Popperian language is “fails to reject”. But the theory is like those old “SPEED LIMIT 55 MPH” signs on freeways: everybody ignores them. Classical theory also forbids stating the probability a hypothesis is true or false, a bizarre restriction. That restriction is the cause of the troubles.
Hunger for certainty about causes drives most statistical error. The false dichotomy used by researchers is a rank, awful mistake in the sense that it is easily avoided. But it isn’t avoided. It is welcomed. And it is welcomed because this fallacy is a guaranteed generator of research, papers, grants, and so on.
Suppose a standard, out-of-the-box regression model is used to “explain” a “happiness score”, with sex as the explanatory variable. There will be a parameter in this model tied to sex, with a null that the parameter equals 0. Suppose this null is believed. It will then be announced, quite falsely, that “there is no difference between men and women related to this happiness score”, or, worse, “men and women are equally happy.” The latter error compounds the statistical mistake with the preposterous belief that some score can perfectly measure happiness, when all that happened was that a group of people filled out some arbitrary survey. And unless the survey were, for instance, of only one man and one woman, and the possible faux-quantified scores few, it is extremely unlikely that the men and women in the sample scored equally.
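That last claim is easy to check by simulation. In this sketch the scores are invented, with no sex effect built into the generating mechanism at all, yet the sample means still do not come out exactly equal (all numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical survey: 50 men, 50 women, scores drawn from the SAME
# distribution, i.e. no sex effect is built in. Numbers are invented.
men = rng.normal(5.0, 2.0, size=50)
women = rng.normal(5.0, 2.0, size=50)

# Even with identical generating distributions, the sample means
# essentially never come out exactly equal:
diff = men.mean() - women.mean()
print(diff)
```

With continuous (or finely faux-quantified) scores, an exact tie in sample means has, for practical purposes, probability zero, which is why “men and women scored equally” is almost never literally true of any sample.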
Again, statistics can say nothing about why men and women would score differently or the same. Yet hypothesis testing always loosely implies causes were discovered or dismissed. We should be limited to statements like, “Given the natures of the survey and of the folks questioned, the probability another man scores higher than another woman is 55%” (or whatever the number turns out to be). That 55% may be ignorable, or it may be of great interest. It depends on the uses to which the model is put, and these are different for different people.
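A probability of that kind can be read straight off the data. A minimal empirical sketch, with invented scores and sample sizes, estimates it by comparing every man’s score with every woman’s:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented survey scores, for illustration only.
men = rng.normal(5.2, 2.0, size=100)
women = rng.normal(5.0, 2.0, size=100)

# Empirical probability that a new man scores higher than a new
# woman: the fraction of all (man, woman) pairs in which the man's
# score exceeds the woman's.
p_higher = (men[:, None] > women[None, :]).mean()
print(round(p_higher, 2))
```

A fuller treatment would use the model’s predictive distribution rather than the raw pairwise fraction, but the pairwise fraction conveys the kind of answer being advocated: a plain probability about observables, not a statement about an unobservable parameter.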
Further, statements like these do not as strongly imply that some fundamental difference between the sexes caused the answers. They keep us honest. Though, given my past experience with statistics, it is likely many will still fixate on the possibility of cause. Why isn’t sex a cause here? Well, it may be that some difference besides sex between the two groups was the cause or causes. Say the men were all surveyed coming out of a bar and the women out of a mall. Who knows? We don’t. Not if all we are told are the results.
It is the same story if the null is “rejected”. No cause is certain or implied. Yet everyone takes the rejection as proof positive that causation has been established. And this is true, in its way: some thing or things still caused the observed scores. It’s only that the cause might not have been related to sex.
If the null were accepted, we might still say, “Given the natures of the survey and of the folks questioned, the probability another man scores higher than another woman is 55%”. And it could be that, after gathering a larger sample, we reject the null but the difference probability is now 51%. The hypothesis test moves from lesser to greater certainty, while the realistic probability moves from greater to lesser. I have often seen this, particularly in regressions. Variables which were statistically “significant” according to hypothesis tests barely cause the realistic-probability needle to nudge, whereas “non-significant” variables can make it swing wildly. That is because hypothesis testing often misleads. This is well known in medicine, for instance, under the name of “clinical” versus statistical “significance.”
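The divergence between a wee p-value and a barely-moved realistic probability is easy to manufacture. In this sketch the true difference is deliberately tiny and the sample enormous; all numbers are invented:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# Invented: a trivially small true difference in means (0.05 against
# a standard deviation of 2), but an enormous sample.
men = rng.normal(5.05, 2.0, size=200_000)
women = rng.normal(5.00, 2.0, size=200_000)

# The hypothesis test delivers a wee p-value...
p_value = ttest_ind(men, women).pvalue

# ...while the realistic probability that a new man out-scores a new
# woman barely budges from 50% (estimated here from random pairings).
p_higher = (rng.choice(men, 50_000) > rng.choice(women, 50_000)).mean()
print(p_value, p_higher)
```

The test screams “significant” while the probability of practical interest sits just above a coin flip, which is the clinical-versus-statistical-significance point in miniature.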
It may be—and this is a situation not in the least unusual—that the series of “happiness” questions is ad hoc and subject to much dispute, and that the people filling out the survey are a bunch of bored college kids hoping to boost their grades. Then if the result is “Given the natures of the survey and of the folks questioned, the probability another man scores higher than another woman is 50.03%”, the researcher would have to say, “I couldn’t tell much about the difference between men and women in this situation.” This is an admission of failure.
The researcher was hoping, we have to admit it, to find a difference. He did, but it is almost surely trivial. How much better for his career would it be if instead he could say, “Men and women were different, p < 0.001”? A wee p provides the freedom to speculate about what caused this difference. It is a good question for you, dear reader, whether the realistic approach advocated here will be preferred by researchers.
Update: I forgot to mention that this is a reworked introduction to a section on hypothesis testing in my book. My deadline for finishing it is fast approaching (mid-June).