Our title, which is indistinguishable from a flood of others[1], might read, “Reading Articles About The Misuse Of Statistics Increases Risk Of Apoplexy.”
Yes, for every article like this one that you read, your risk of becoming apoplectic over the improper use of statistics increases 2.0 times.
What does that “2.0-fold increase in risk” mean? Not just for this finding, but for any finding that reports results in the form of “increased risk” of suffering from a malady after being exposed to some “risk factor.” In this study, “exposure” is reading this blog, which is the risk factor, and “non-exposure” is not reading.
Suppose (somehow) you knew the probability of developing the malady given you were not exposed to the “risk factor.” Call it prob_not_exposed. You also have to know (somehow) the probability of developing the malady given you were exposed; call it prob_exposed. Relative risk is
RR = prob_exposed / prob_not_exposed.
You could also calculate the odds ratio. First know that odds are a one-to-one function of probability, viz:
Odds = prob / (1 – prob).
The odds ratio is like the risk ratio, but the ratio of the odds, not probabilities:
OR = odds_exposed / odds_not_exposed.
Now suppose that prob_not_exposed = 0.000001, which is a one in a million chance of developing the malady given you were not exposed. If you then hear that being exposed “increases the risk by 2.0-fold”, then this means the risk ratio must be 2.0. Back-solving gives the probability of developing the malady after exposure as 0.000002. (Similar calculations can be done for odds ratios.)
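The back-solving above can be sketched in a few lines of Python. The function and variable names here are illustrative choices, not anything from a particular library.

```python
def odds(prob: float) -> float:
    """Convert a probability to odds: prob / (1 - prob)."""
    return prob / (1.0 - prob)


def risk_ratio(prob_exposed: float, prob_not_exposed: float) -> float:
    """Ratio of the two probabilities."""
    return prob_exposed / prob_not_exposed


def odds_ratio(prob_exposed: float, prob_not_exposed: float) -> float:
    """Ratio of the two odds."""
    return odds(prob_exposed) / odds(prob_not_exposed)


# Baseline chance of the malady without exposure: one in a million.
prob_not_exposed = 0.000001
reported_rr = 2.0

# Back-solve RR = prob_exposed / prob_not_exposed for prob_exposed.
prob_exposed = reported_rr * prob_not_exposed

print(f"prob_exposed = {prob_exposed:.6f}")
print(f"risk ratio   = {risk_ratio(prob_exposed, prob_not_exposed):.4f}")
print(f"odds ratio   = {odds_ratio(prob_exposed, prob_not_exposed):.4f}")
```

When both probabilities are this tiny, the odds ratio and the risk ratio are nearly identical, because odds and probability barely differ near zero.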
In this case, exposure drove your risk from one in a million to just two in a million. We can already see that presenting results in raw probability will not be as pulse-pounding as speaking in terms of risk or odds. Information is also lost in giving only the risk ratio: the customer has no idea what the risk is in the control group. So one fix would be to give emphasis to the actual probabilities of suffering, and not just the risk ratio.
But even if that is done, something would still be wrong. Can you spot what?
For the apoplexy finding, we do not know what the probability of apoplexy is for this blog’s readers. Nor do we know what the probability of apoplexy is for non-readers. Therefore, we cannot know the risk. We can, through a statistical formula, estimate it. But that estimate will exaggerate the true risk.
For example, suppose that we witnessed 18 cases of sputtering apoplexy in 40 readers of this blog, but found it only 9 times in 40 non-readers. That gives an estimated “statistically significant” risk ratio of 2.0. But this exaggerates risk in the following sense.
Now, we can guess prob_exposed is about 0.45 and prob_not_exposed is about 0.225 for new groups of readers and non-readers “similar” to the ones sampled here. (Incidentally, those probabilities and that RR are, however, exact for these 80 readers and non-readers.)
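For anyone who wants to check the raw arithmetic, here is a minimal sketch using exact fractions. The variable names are illustrative. It also computes the odds ratio for the same counts, which shows how far odds and risk ratios drift apart once the probabilities are no longer tiny.

```python
from fractions import Fraction

# Counts from the fictional study.
cases_readers, n_readers = 18, 40
cases_non_readers, n_non_readers = 9, 40

# Raw (observed-frequency) estimates.
prob_exposed = Fraction(cases_readers, n_readers)              # 18/40
prob_not_exposed = Fraction(cases_non_readers, n_non_readers)  # 9/40


def odds(prob: Fraction) -> Fraction:
    """Convert a probability to odds: prob / (1 - prob)."""
    return prob / (1 - prob)


raw_rr = prob_exposed / prob_not_exposed
raw_or = odds(prob_exposed) / odds(prob_not_exposed)

print(f"prob_exposed     = {float(prob_exposed):.3f}")
print(f"prob_not_exposed = {float(prob_not_exposed):.3f}")
print(f"raw risk ratio   = {float(raw_rr):.3f}")
print(f"raw odds ratio   = {float(raw_or):.3f}")
```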
We are not interested in these folks anymore, but in new ones. That is the point, after all, of doing this study. The actual probability[2] that the next, new blog reader develops apoplexy is 0.452, which is close to, but just over, the raw estimate of 0.45. And the actual probability that the next, new non-reader develops apoplexy is 0.232, which is also higher than the raw estimate of 0.225. This puts the actual risk ratio at 1.95, which is under the raw estimate of 2.0.
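One common way to get predictive probabilities of this kind is a beta-binomial model, where the probability that the next person is a case is (cases + a) / (n + a + b) for a Beta(a, b) prior. This is a sketch under that assumption. The post does not say which prior produced its figures, so the two priors tried here (uniform and Jeffreys) are guesses, and the output need not match the numbers above to the last digit. The function name is illustrative.

```python
def predictive_prob(cases: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """Posterior predictive probability that the next person is a case,
    under a Beta(a, b) prior and a binomial likelihood (an assumed model)."""
    return (cases + a) / (n + a + b)


# Two assumed priors: uniform Beta(1, 1) and Jeffreys Beta(0.5, 0.5).
priors = {"uniform": (1.0, 1.0), "Jeffreys": (0.5, 0.5)}

for name, (a, b) in priors.items():
    p_reader = predictive_prob(18, 40, a, b)
    p_non_reader = predictive_prob(9, 40, a, b)
    print(
        f"{name:8s} reader={p_reader:.3f} "
        f"non-reader={p_non_reader:.3f} "
        f"ratio={p_reader / p_non_reader:.3f}"
    )
```

Under either prior, both predictive probabilities sit slightly closer to 0.5 than the raw estimates of 0.45 and 0.225, and the ratio comes out below the raw 2.0.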
Not a huge difference in this fictional example, to be sure, but the gap between the raw and actual risk ratios will always be in the direction of exaggerating the risk. Taken over the tens of thousands of studies reporting risk, the overall effect is large.
These differences exist because the traditional method reports parameter estimates, and not actual probabilities or actual risk ratios. Parameters are the internal, unobservable parts of the probability models which are used to quantify uncertainty in the data. They are also the focus of nearly all statistical methods (because of inertia, custom, and lack of knowledge of alternatives).
Reporting in terms of actual observables not only gives a true impression of the probabilities and risks, but allows us to answer more complicated questions about the data and to provide richer information. For example, reporting on observables, we can picture the probability that 0, 1, 2, …, 40 out of 40 new readers or non-readers develop apoplexy. That’s done in the picture.
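A distribution like the one pictured can be sketched with the beta-binomial predictive distribution. This assumes a binomial model with a uniform Beta(1, 1) prior, which may not be the exact model behind the picture. The function names are illustrative.

```python
from math import comb, exp, lgamma


def log_beta(x: float, y: float) -> float:
    """Log of the Beta function, via log-gamma for numerical stability."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def predictive_pmf(k: int, m: int, cases: int, n: int,
                   a: float = 1.0, b: float = 1.0) -> float:
    """Probability that k of m NEW people are cases, having seen `cases`
    in `n` people, under an assumed Beta(a, b) prior (beta-binomial)."""
    post_a = cases + a
    post_b = n - cases + b
    log_p = log_beta(k + post_a, m - k + post_b) - log_beta(post_a, post_b)
    return comb(m, k) * exp(log_p)


m = 40  # size of the new group
readers = [predictive_pmf(k, m, 18, 40) for k in range(m + 1)]
non_readers = [predictive_pmf(k, m, 9, 40) for k in range(m + 1)]

# Crude text picture: one row per count, bars scaled by probability.
for k in range(m + 1):
    bar_r = "#" * round(200 * readers[k])
    bar_n = "o" * round(200 * non_readers[k])
    print(f"{k:2d} readers {bar_r:<20s} non-readers {bar_n}")
```

Each list holds 41 probabilities that sum to one, so questions such as "what is the chance that at least 20 of 40 new readers sputter?" reduce to summing the relevant entries.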
This kind of picture is extraordinarily important because it will give superior estimates for cost and benefit analyses, which are guaranteed to be exaggerated using parameter-based methods.
[1] Search for the terms “increase(s)(ed) risk of”; millions of hits.
[2] See the “modern stats” at this link for how to calculate these. The actual probabilities will always move closer to 0.5 than the raw parameter estimates.