We discussed this before, but since it has come up recently in personal discussions, I wanted to offer this clarification.
Suppose we’re in a standard epidemiological situation, or even a planned experiment, where we have two groups, (a) exposed to some horrid thing, and (b) not exposed. It should be clear that the people in group (b) were not exposed. Where by “not” I mean “not”. We track the incidence of some dread malady, or maybe even maladies, in the two groups.
We collect data and “submit” it to some software, which spits out a wee p-value for whatever “test” you like between the groups; and where we can even grant—and upon this I insist—group (a) shows the higher rate of whatever dis-ease we track. It also shows a healthy predictive probability difference between the groups.
Lo, all statisticians would say the exposure and malady are “linked”, by which all of them would at least secretly mean “cause”. Whatever it was those in group (a) were exposed to caused the malady or maladies.
If you press them, and tell them they will be quoted and held accountable for their judgement, the statisticians may well lapse into “link”, and shy away from “cause”. But they will secretly believe cause.
Now here is what happened. Not everybody in group (a), the exposed group, will have developed the malady (or maladies; hereafter I use the singular to save typing), and some people in (b), the not-exposed group, will have the malady.
It thus cannot be that the people in group (b) had their disease caused by the exposure. It then necessarily follows that their malady was caused by something other than the exposure. This is a proof that at least one cause other than the exposure exists. There is no uncertainty in this judgement. Not if it is true that none of the people in the not-exposed group were exposed.
Of course, it could be that every person in the not-exposed group had a different cause of their malady. All that we know for certain is that none of these causes were from the exposure.
It’s worse, because even though we have proved beyond all doubt that there must exist a cause that was not the exposure, we have not proved that any people in the exposed group had their malady caused by the exposure. Why?
Because it could be that every person in the exposed group had their disease caused by whatever caused the disease in the not-exposed group—or there could even be new causes that did not affect anybody in the not-exposed group but that, somehow, caused disease in the exposed group.
It could be that exposure caused some disease, but there is no way to tell, without artificial and unproven assumptions, how many maladies, if any, were caused by the exposure.
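To make the point concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the group size of 100, the case counts, and the cause labels "exposure", "other" and "new_unknown" are all assumptions, not data from any study. It builds two hypothetical worlds in which we pretend to know each person's true cause, then throws the cause labels away, as real data do, and compares what is left.

```python
from collections import Counter

def build_world(exposed_causes, unexposed_causes, n_per_group=100):
    """Return per-person records (group, cause); cause is None for the healthy."""
    people = []
    for group, causes in (("exposed", exposed_causes),
                          ("unexposed", unexposed_causes)):
        sick = [(group, cause) for cause, k in causes.items() for _ in range(k)]
        healthy = [(group, None)] * (n_per_group - len(sick))
        people.extend(sick + healthy)
    return people

def observed_table(people):
    """Collapse to what the data show: counts by group and malady status."""
    return dict(Counter((group, cause is not None) for group, cause in people))

def caused_by_exposure(people):
    """Count cases whose (hypothetical, normally unobservable) cause is the exposure."""
    return sum(1 for _, cause in people if cause == "exposure")

# World A (hypothetical): the exposure causes 20 of the 30 cases in the exposed group.
world_a = build_world({"exposure": 20, "other": 10}, {"other": 10})

# World B (hypothetical): the exposure causes nothing; some unknown cause that
# happens to be confined to the exposed group accounts for the excess.
world_b = build_world({"other": 10, "new_unknown": 20}, {"other": 10})

print(observed_table(world_a) == observed_table(world_b))  # same 2x2 table
print(caused_by_exposure(world_a), caused_by_exposure(world_b))
```

Both worlds yield the identical table of counts, so any test or model fed only that table returns the identical answer for each, yet the number of maladies actually caused by the exposure differs. Picking between the worlds takes information or assumptions that are not in the counts.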
It’s worse still for those who hold that statistical models (in which I include all artificial intelligence and machine learning algorithms) can discover cause. For what about all those people in either group who did not develop the disease?
Even if exposure causes disease sometimes, and the other (unknown-but-not-exposure) cause which we know exists only causes disease sometimes, we still do not know why the exposure, or the non-exposure cause, produces disease only sometimes.
Why did these people develop the malady and these not? We don’t know. We can “link” various correlations as “cause blockers” or “mitigators”, but we’re right back where we started from. We don’t know, from the data alone, what is a cause and what is not, and what blocks these (at least) two causes in some people but not in others.
Once again I claim over-certainty in medicine, and in epidemiology in particular, is rampant.
This is why I insist that cause is in the mind and not the data.