How to Fool Yourself—And Others—With Statistics

See the news box to the left. I wrote this long ago and never used it. I do not love it. But since I am so busy, I haven’t the time to write something new. Feel free to disparage.

Remember how much you hated your college statistics course? It made little sense. It was confusing, even nonsensical. It was an endless stream of meaningless, hard-to-remember formulas.

All that is true—it was awful—but you were wrong to hate it. Because it has been a balm and a boon to mankind, especially to researchers in need of a paper. Publish or perish rules academia, and no other tool has been as useful in generating papers as statistics has.

Statistics is so powerful that it can create positive results in nearly any situation, including those in which it shouldn’t. For example, this week we read in the newspaper that “statistics show mineral X” is good for you, only to read next week that “statistics show” it isn’t. How can statistics be used to simultaneously prove and disprove the same theory? Easy.

But first note that I am talking about statistics as she is practiced by the unwary or unscrupulous. Statisticians themselves, as everybody knows, are the most conscientious and honest bunch of people on the planet.

How to prove your theory

Step 1: Start with a theory or hypothesis you want to be true.

Step 2: Gather data that might be related to that theory; more is better.

Step 3: Choose a probability model for that data. Remember the “bell-shaped curve”? That’s a model, one of hundreds at your disposal.

Step 4: These models have knobs called parameters which are tuned—via complex mathematics—so that the model fits.

Step 5: Now it gets tricky. Pick a test from that set of formulae you were made to memorize. This test must say how your theory relates to the model’s parameters. For example, you might declare, “If my theory is true, then this certain knob cannot be set to zero.” The test then calculates a statistic, which is some mathematical function of your data.

You then calculate the probability of seeing a statistic as large as you just calculated given that the relevant knob is set to zero. That is, the test says how unusual the observed statistic is given that the probability-parameter statement about your theory is true—and given the model you picked is correct.

You might dimly recall that the result of this calculation is called a p-value. Its true definition is so difficult that nobody can remember it. What people do remember is that a small one, less than 0.05, is good.

If that level is reached, you’re allowed to declare statistical significance. This is not the same as saying your theory is true, but nobody remembers that, either. Significance is vaguely meaningful only if both the model and the test being used are true and optimal. It gives no indication of the truth or falsity of any theory.
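As a rough sketch of the calculation in Step 5, suppose the model is the bell curve and the knob is its mean. The function name `two_sided_p_value`, the simulated data, the normal approximation, and the choice of a two-sided tail are all assumptions made for illustration; they are not taken from any particular study.

```python
import math
import random
import statistics

def two_sided_p_value(data):
    """p-value for the statement 'the mean knob is zero' (normal approximation)."""
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    # The test statistic: how many standard errors the sample mean sits from zero.
    z = mean / (sd / math.sqrt(n))
    # Probability of a statistic at least this extreme if the knob really is zero.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical data: 50 draws whose true mean is 0.3 (an assumed value).
rng = random.Random(0)
data = [rng.gauss(0.3, 1.0) for _ in range(50)]
p = two_sided_p_value(data)
print(f"p-value = {p:.4f}")
```

Note that the number printed depends on the model and the test chosen, which is exactly the point of what follows.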

Statistical significance is easy to find in nearly any set of data. Remember that we can choose our model. If the first doesn’t give joy, pick another and it might. And we can keep going until one does.

We also must pick a test. If the first doesn’t offer “significance”, you can try more until you find one that does. Better, each test can be tried for each model.
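Test shopping can be simulated. The sketch below assumes two groups drawn from the same bell curve, so there is nothing to find, and then tries three tests on each data set. The three tests, the sample sizes, and every function name here are hypothetical choices for illustration only.

```python
import math
import random
import statistics

def normal_two_sided(z):
    """Two-sided tail probability for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def mean_test(a, b):
    """Compare the two group means (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return normal_two_sided((statistics.fmean(a) - statistics.fmean(b)) / se)

def spread_test(a, b):
    """Compare the means of the squared values, a crude test of spread."""
    return mean_test([x * x for x in a], [x * x for x in b])

def rank_sum_test(a, b):
    """Wilcoxon rank-sum test (normal approximation, assumes no ties)."""
    combined = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    rank_sum_a = sum(rank for rank, (_, group) in enumerate(combined, start=1) if group == 0)
    n, m = len(a), len(b)
    expected = n * (n + m + 1) / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    return normal_two_sided((rank_sum_a - expected) / sd)

TESTS = [mean_test, spread_test, rank_sum_test]

def shopping_rates(n_studies=2000, n=30, alpha=0.05, seed=1):
    """Return (rate using only the first test, rate using whichever test 'works')."""
    rng = random.Random(seed)
    first_hits = 0
    any_hits = 0
    for _ in range(n_studies):
        # Both groups come from the same distribution: no real effect exists.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        p_values = [test(a, b) for test in TESTS]
        first_hits += p_values[0] < alpha
        any_hits += min(p_values) < alpha
    return first_hits / n_studies, any_hits / n_studies

honest_rate, shopped_rate = shopping_rates()
print(f"one test: {honest_rate:.3f}   best of three: {shopped_rate:.3f}")
```

By construction the best-of-three rate can never be lower than the one-test rate, and adding more tests or more models only pushes it higher.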

If that sounds like too much work, there’s a trick. Due to a quirk in statistical theory, for any model and any test, statistical “significance” is guaranteed as long as you collect enough data. Once the sample size reaches a critical level, small p-values practically rain from the data.
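One way to see the sample-size trick: assume the knob is not exactly zero but off by a practically irrelevant amount, then watch the p-value as the data pile up. The effect size of 0.05, the sample sizes, and the function name `p_value_mean_zero` are assumptions chosen for this sketch.

```python
import math
import random
import statistics

def p_value_mean_zero(data):
    """Two-sided p-value for 'the mean is zero' (normal approximation)."""
    z = statistics.fmean(data) / (statistics.stdev(data) / math.sqrt(len(data)))
    return math.erfc(abs(z) / math.sqrt(2))

TRUE_MEAN = 0.05  # assumed: a tiny effect, one-twentieth of a standard deviation

rng = random.Random(2)
results = {}
for n in (100, 10_000, 200_000):
    data = [rng.gauss(TRUE_MEAN, 1.0) for _ in range(n)]
    results[n] = p_value_mean_zero(data)
    print(f"n = {n:>7}: p-value = {results[n]:.3g}")
```

The effect is the same size in every run; only the amount of data changes.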

But if you’re impatient, you can try subgroup analysis. This is where you pick your way through the data, keeping only what’s pretty, trying various tests and models until such a time as you find a small p-value.
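A small simulation gives a feel for subgroup hunting. It assumes a study with no real effect anywhere, split into 20 arbitrary subgroups, and asks how often at least one subgroup comes up "significant". The number of subgroups, the group sizes, and the function names are hypothetical.

```python
import math
import random
import statistics

def mean_test(a, b):
    """Two-sided p-value comparing two group means (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))

def subgroup_hit_rate(n_studies=300, n_subgroups=20, n_per_arm=20, alpha=0.05, seed=3):
    """Fraction of no-effect studies with at least one 'significant' subgroup."""
    rng = random.Random(seed)
    hits = 0
    for _study in range(n_studies):
        found = False
        for _subgroup in range(n_subgroups):
            # Treated and control come from the same distribution in every subgroup.
            treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
            control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
            if mean_test(treated, control) < alpha:
                found = True
        hits += found
    return hits / n_studies

rate = subgroup_hit_rate()
print(f"studies with a 'significant' subgroup: {rate:.0%}")
```

If each subgroup had exactly a 5% chance of a false alarm and the subgroups were independent, the chance of at least one in 20 would be 1 - 0.95^20, or roughly 64%.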

The lesson is that it takes a dull researcher not to be able to find statistical “significance” somewhere in his data.

Boston Scientific

About two years ago the Wall Street Journal (registration required) investigated the statistical practices of Boston Scientific, which had just introduced a new stent called the Taxus Liberté.

Boston Scientific did the proper study to show the stent worked, but analyzed their data using an unfamiliar test, which gave them a p-value of 0.049, which is statistically significant.

The WSJ re-examined the data, but used different tests (they used the same model). Their tests gave p-values from 0.051 to about 0.054, which are, by custom, not statistically significant.

Real money is involved, because if “significance” isn’t reached, Boston Scientific can’t sell their stents. But the WSJ is quibbling, because there is no real-life difference between 0.049 and 0.051. P-values do not answer the only question of interest: does the stent work?
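To see how thin the line is, the two p-values can be converted back into test statistics. This sketch assumes a two-sided test whose statistic follows a standard normal distribution, which may not match the tests either party actually used; the function name is hypothetical.

```python
from statistics import NormalDist

def z_for_two_sided_p(p):
    """Size of a standard normal statistic that yields the two-sided p-value p."""
    return NormalDist().inv_cdf(1 - p / 2)

z_significant = z_for_two_sided_p(0.049)
z_not_significant = z_for_two_sided_p(0.051)
print(f"p = 0.049 needs |z| = {z_significant:.3f}")
print(f"p = 0.051 needs |z| = {z_not_significant:.3f}")
```

Under that assumption, the statistics on either side of the line differ by less than two hundredths of a standard error.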

The moral of the story

No theory should be believed because a statistical model reached “significance” on a set of already-observed data. What makes a theory useful is that it can predict accurately never-before-observed data.
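The difference between fitting old data and predicting new data can be sketched as follows. The setup is assumed for illustration: 50 candidate predictors that are pure noise, an outcome that is also pure noise, and a researcher who keeps whichever predictor correlates best with the data in hand. All names and sizes are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(4)
N, N_CANDIDATES = 30, 50

# Already-observed data: the outcome and every candidate predictor are noise.
outcome = [rng.gauss(0, 1) for _ in range(N)]
candidates = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(N_CANDIDATES)]

# Keep the predictor that looks best on the data in hand.
in_sample = [pearson_r(c, outcome) for c in candidates]
best = max(range(N_CANDIDATES), key=lambda i: abs(in_sample[i]))

# Never-before-observed data: fresh draws of the winning predictor and the outcome.
new_predictor = [rng.gauss(0, 1) for _ in range(N)]
new_outcome = [rng.gauss(0, 1) for _ in range(N)]
out_of_sample = pearson_r(new_predictor, new_outcome)

print(f"best in-sample correlation: {in_sample[best]:+.2f}")
print(f"same predictor on new data: {out_of_sample:+.2f}")
```

The in-sample winner was selected for looking good, so its correlation is inflated; the new data carry no such selection.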

Statistics can be used for these predictions, but it almost never is.

I think predictions are avoided on the principle that when ignorance is bliss, ’tis folly to know that your theory can’t be published.

Incidentally, we statisticians have heard every version of “liars figure”, “damned lies”, etc., so you’ll pardon me for not chuckling when in response you whip out your Disraeli.

Update: If you thought this post was bad, you might try watching this video (I can think of at least two good reasons to): A Strange Tale About Probability.


  1. DAV

    You’re right that there is little difference between 0.049 and 0.051 p-factors. They’re just a matter of judgment anyway so any value should suffice. In fact, I once came across an NIH study that used 0.2. Must be that Bayesian stuff: anything goes. You forgot to mention this as a last-straw technique.

    I conducted many tests as a youngster and determined that jellied bread falls jelly side down a significant number of times (p-factor 0.5). In subsequent tests, this occurred at the same rate as predicted by my model so I know it’s not just an illusion. Pretty darn good, huh?

  2. DAV

    hmmm … one chance in a bouillon and a model with two modes? I presume ‘Latch’ is Paul’s stage name although he did look quite at home.

  3. Neural networks and the closely related PCA method can be used to fit almost any dataset. Getting them to properly predict things is a “trick” much more difficult than the ones used by the researchers at CRU. I use MLP networks for classifiers and found that a full understanding of the statistics and also what is in the datasets is necessary to get good results. I’ve seen PhDs run away from the MLP because it was too tough to tame. I say not so if you are careful how you use it. I find it interesting that most of the text books I’ve read don’t go into some of the “tricks” I’m using, like they don’t care about real world accuracy and are more interested in the theory than practical aspects.
