No, the title of today’s post is not a joke, even though it has often been used that way in the past. The title was inspired by yesterday’s Wall Street Journal article “Analytical Trend Troubles Scientists.”
Thanks to the astonishing fecundity of the p-value and our ridiculous practice of reporting on the parameters of models as if those parameters represented reality, we have stories like this:
In 2010, two research teams separately analyzed data from the same U.K. patient database to see if widely prescribed osteoporosis drugs [such as Fosamax] increased the risk of esophageal cancer. They came to surprisingly different conclusions.
One study, published in the Journal of the American Medical Association, found no increase in patients’ cancer risk. The second study, which ran three weeks later in the British Medical Journal, found the risk for developing cancer to be low, but doubled.
How could this be!
Each analysis applied a different methodology and neither was based on original, proprietary data. Instead, both were so-called observational studies, in which scientists often use fast computers, statistical software and large medical data sets to analyze information collected previously by others. From there, they look for correlations, such as whether a drug may trigger a worrisome side effect.
And, surprise, both found “significance.” Meaning publishable p-values below the magic number, which is the unquestioned and unquestionable 0.05. But let’s not cast aspersions on frequentist practices alone, as problematic as these are. The real problem is that the Love Of Theory Is The Root Of All Evil.
Yes, researchers love their statistical models too well. They cannot help thinking reality is their models. There is scarcely a researcher or statistician alive who does not hold up the parameters from his model and say, to himself and us, “These show my hypothesis is true. The certainty I have in these equals the certainty I have in reality.” Before I explain, what do other people say?
The WSJ suggests that statistics can prove opposite results simultaneously when models are used on observational studies. This is so. But it is also true that statistics can prove a hypothesis true and false with a “randomized” controlled trial, the kind of experiment we repeatedly hear is the “gold standard” of science. Randomization is a red herring: what really counts is control (see this, this, and this).
There are three concepts here that, while known, are little appreciated. The first is that there is nothing in the world wrong with the statistical analysis of observational data (except that different groups can use different models and come to different conclusions, as above; but this is a fixable problem). It is just that the analysis is relevant only to new data that is exactly like that taken before. This follows from the truth that all probability, hence all probability models (i.e. statistics), is conditional. The results from an observational study are statements of uncertainty conditional on the nature of the sample data used.
Suppose the database is one of human characteristics. Each of the human beings in the study has traits that are measured and a near infinite number of traits which are not measured. The collection of people that makes up the study is thus characterized by both the measured traits and the unmeasured ones (which include time and place etc.; see this). Whatever conclusions you make are thus only relevant to this distribution of characteristics, and only relevant to new populations which share—exactly—this distribution of characteristics.
And what is the chance, given what we know of human behavior, that new populations will match—exactly—this distribution of characteristics? Low, baby. Which is why observational studies of humans are so miserable. But it is why, say, observational astronomical studies are so fruitful. The data taken incidentally about hard physical objects, like distant cosmological ones, is very likely to be like future data. This means that the same statistical procedures will seem to work well on some kinds of data but be utter failures on others.
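The contrast can be put in a minimal sketch (pure Python, simulated numbers of my own invention, not from any study): a “model” fit on one sample predicts well only when the new data share the old sample’s distribution of characteristics.

```python
# Illustrative simulation: a model fit on old data is only relevant to
# new data drawn from the same distribution. All numbers are made up.
import random
import statistics

random.seed(1)

def fit_mean(sample):
    """The 'model' here is just the sample mean used as a point prediction."""
    return statistics.mean(sample)

# Old (observational) data: outcomes from a population centered at 10.
old = [random.gauss(10, 1) for _ in range(1000)]
model = fit_mean(old)

# New data, case 1: same distribution -- like stable physical objects.
same = [random.gauss(10, 1) for _ in range(1000)]
# New data, case 2: shifted distribution -- like a new human population.
shifted = [random.gauss(12, 1) for _ in range(1000)]

err_same = statistics.mean(abs(x - model) for x in same)
err_shifted = statistics.mean(abs(x - model) for x in shifted)

print(round(err_same, 2))     # near the irreducible noise level (~0.8)
print(round(err_shifted, 2))  # much larger: the procedure fails on new people
```

The identical procedure succeeds in one case and fails in the other; nothing about the arithmetic changed, only the match between old and new populations.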
Our second concept follows directly from the first. Even if an experiment with human beings can be controlled, it cannot be controlled exactly or precisely. There will be too many circumstances or characteristics which will remain unknown to the researcher, or the known ones will not be subject to control. However well you can design an experiment with human beings, it is just not good enough for your conclusions to be relevant to new people, because again those new people will be unlike the old ones in some ways. And I mean, above and here, in ways that are probative of or relevant to the outcome, whatever that happens to be. This explains what a sociologist once said of his field, that everything is correlated with everything.
If you follow textbook statistics, Bayesian or frequentist, your results will be statements about your certainty in the parameters of the model you use and not about reality itself. Click on the Start Here tab and look to the articles on statistics to read about this more fully (and see this especially). And because you have a free choice in models, you can always find one which lets you be as certain about those parameters as you’d like.
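That “free choice in models” is easy to demonstrate with a hedged toy example (pure Python, simulated noise; the setup is mine, not anything from the WSJ story): feed a pure-noise outcome to enough candidate models and one of them will produce a parameter that looks convincingly far from zero.

```python
# Toy demonstration of model shopping: the outcome is pure noise, yet
# searching 100 irrelevant predictors yields one whose correlation with
# the outcome looks impressive. All data are simulated.
import random

random.seed(2)
n = 30

def corr(xs, ys):
    """Pearson correlation, computed from scratch."""
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

outcome = [random.gauss(0, 1) for _ in range(n)]  # pure noise: no signal at all
# 100 candidate "models", each using a different irrelevant predictor.
best = max(
    abs(corr([random.gauss(0, 1) for _ in range(n)], outcome))
    for _ in range(100)
)
print(round(best, 2))  # the winning correlation looks far from zero
```

The certainty in that winning parameter is manufactured by the search itself; it says nothing about reality, in which there is no signal whatsoever.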
But that does not mean, and it is not true, that the certainty you have in those parameters translates into the certainty you should have about reality. The certainty you have in reality must always necessarily be less, and in most cases a lot less.
The only way to tell whether the model you used is any good is to apply it to new data (i.e. never seen by you before). If it predicts that new data well, then you are allowed to be confident about reality. If it does not predict well, or you do not bother to collect statistics about predictions (which is 99.99% of all studies outside physics, chemistry, and the other hardest of hard sciences), then you are not allowed to be confident.
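The predictive check can be sketched in a few lines (pure Python, invented data; 1-nearest-neighbor is my illustrative stand-in for any model flexible enough to flatter its author). The model looks perfect on the data used to build it, because it simply memorizes that data; the honest number comes from data it has never seen.

```python
# Sketch of in-sample flattery vs. the honest predictive check.
# 1-nearest-neighbor memorizes its training set, so its in-sample error
# is exactly zero; only error on new, unseen data tells the truth.
import random

random.seed(3)

def make_data(n):
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

train_x, train_y = make_data(50)
new_x, new_y = make_data(50)  # "new data, never seen by you before"

def predict_1nn(x):
    # Predict with the outcome of the single closest training point.
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

mse_in = sum((predict_1nn(x) - y) ** 2 for x, y in zip(train_x, train_y)) / 50
mse_new = sum((predict_1nn(x) - y) ** 2 for x, y in zip(new_x, new_y)) / 50

print(round(mse_in, 2))   # exactly 0: the model "knows" its own data
print(round(mse_new, 2))  # the honest number: error on unseen data
```

Report only the first number and you may be as certain as you like; report the second and your certainty shrinks to what reality actually permits.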
Why don’t people take this attitude? It’s too costly and time consuming to do statistics the right way. Just look how long it takes and how expensive it is to run any physics experiment (about genuinely unknown areas)! If all of science did its work as physicists must do theirs, then we would see about a 99 percent drop in papers published. Sociology would slow to a crawl. Tenure decisions would be held in semi-permanent abeyance. Grants would taper to a trickle. Assistant Deans, whose livelihoods depend on overhead, would have their jobs at risk. It would be pandemonium. Brrr. The whole thing is too painful to consider.