Still at the conference, so just a short plug for learning about that of which you speak.
Stephan Lewandowsky, who believes JFK shot at the moon landings and that’s why the globe has passed the tippling point, or something like that, has said a few words about statistics:
However, our conclusion that the effect [in yet another silly study] is “real” and not due to chance is inevitably accompanied by some uncertainty.
Here is the rub: if the significance level is .05 (5%), then there is still a 1 in 20 chance that we erroneously concluded the effect was real even when it was due to chance—or put another way, out of 20 experiments, there may be 1 that reports an effect when in fact that effect does not exist. This possibility can never be ruled out (although the probability can be minimized by various means).
In his favor, a lot of people who publish too many papers aimed at audiences who are eager to nod their heads sagely at the foibles of their inferiors make the same errors Lewandowsky does. They are widely replicated errors. Which proves that replication in science can often reinforce distortions.
If the significance level is 0.05, it only means that if the p-value is less than or equal to that number, you are allowed to declare “success” for your experiment, no matter how silly it is (see the Statistics section on this page for some doozies). What is a p-value? Unfortunately, the definition of this destructive beast is very difficult to remember, so difficult that it is easier to remember what it isn’t.
The p-value is the probability of seeing a statistic at least as large (in absolute value) as the one you actually did see, given: (1) the values of certain parameters in a model you are using to quantify uncertainty in the numbers are set to a pre-specified number (usually 0), (2) the model itself is unambiguously true, (3) the experiment that generated the data is replicated indefinitely, and (4) the data at hand is measured without error (or if it is measured with error, this error is modeled).
Each word of this cumbrous definition counts, which is why it is so difficult to memorize and to use properly.
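To make the definition concrete, here is a minimal sketch, not anybody's official method. It assumes a toy model in which the data are Normal with standard deviation 1, the parameter pinned down in condition (1) is the mean (set to 0), and the statistic is the absolute sample mean. The observed mean of 0.7 from 10 observations, the function names, and the seed are all made up for illustration.

```python
import math
import random

def simulated_p_value(observed, n, reps=20000, seed=1):
    """Share of replicated experiments whose |mean| is at least |observed|,
    assuming the model Normal(mean=0, sd=1) is exactly true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Condition (3): replicate the experiment, with (1) mean = 0
        # and (2) the normal model taken as true, (4) no measurement error.
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if abs(mean) >= abs(observed):
            hits += 1
    return hits / reps

def formula_p_value(observed, n):
    """The same quantity from the model's formula (two-sided z-test, sd known to be 1)."""
    z = abs(observed) * math.sqrt(n)
    return math.erfc(z / math.sqrt(2.0))

observed_mean, n = 0.7, 10  # hypothetical experiment
p_sim = simulated_p_value(observed_mean, n)
p_formula = formula_p_value(observed_mean, n)
print(f"simulated p = {p_sim:.3f}, formula p = {p_formula:.3f}")
```

Both numbers answer the same narrow question: how often would a statistic this large turn up if the model and its pinned parameter were true? Neither is the chance that the effect is "due to chance".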
You are free to choose the model, the truth of which is usually unknown. For example, you are free to model your data using a hockey stick, even when that’s absurd. You will get a different p-value for every model. One model can give a non-publishable (i.e. non-significant) p-value, while a second model can give a publishable one. In statistics, there are many models one may choose in any situation. Their name is legion. Many scientists, psychologists in particular, tend to choose poorly.
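A rough illustration of the model dependence, under loudly stated assumptions: the ten data points are made up, the statistic is the absolute sample mean in both cases, and the only thing that changes is the model assumed true (a standard normal versus a Laplace with scale 1, which has heavier tails). The helper names are hypothetical.

```python
import random

# Made-up data for illustration only; the sample mean is about 0.7.
DATA = [1.9, -0.4, 0.8, 1.5, 0.2, -0.6, 1.1, 2.0, 0.3, 0.2]

def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

def draw_laplace(rng):
    # The difference of two unit exponentials is Laplace(0, 1):
    # centered at 0 like the normal, but with heavier tails.
    return rng.expovariate(1.0) - rng.expovariate(1.0)

def p_value(data, draw, reps=20000, seed=2):
    """Monte Carlo p-value for |sample mean| when `draw` is the assumed model."""
    rng = random.Random(seed)
    n = len(data)
    observed = abs(sum(data) / n)
    hits = 0
    for _ in range(reps):
        mean = sum(draw(rng) for _ in range(n)) / n
        if abs(mean) >= observed:
            hits += 1
    return hits / reps

p_normal = p_value(DATA, draw_normal)
p_laplace = p_value(DATA, draw_laplace)
print(f"normal model: p = {p_normal:.3f}; Laplace model: p = {p_laplace:.3f}")
```

Same data, same statistic, and the p-value falls on opposite sides of 0.05 depending on which model was picked.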
Now, once you have the model in hand, you still have to pick a statistic. For any given model, there are many. Each statistic will give a different p-value. One statistic (inside a model) will give a non-publishable p-value, another statistic will give a publishable one.
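The same point can be sketched with one model and two statistics. This assumes a standard normal model with the mean set to 0, made-up data, and two hypothetical helper functions: one statistic is the absolute sample mean, the other the absolute sample median.

```python
import random
import statistics

# Made-up data for illustration only.
DATA = [1.9, -0.4, 0.8, 1.5, 0.2, -0.6, 1.1, 2.0, 0.3, 0.2]

def abs_mean(xs):
    return abs(sum(xs) / len(xs))

def abs_median(xs):
    return abs(statistics.median(xs))

def p_value(data, statistic, reps=20000, seed=3):
    """Monte Carlo p-value for `statistic`, assuming Normal(0, 1) data."""
    rng = random.Random(seed)
    n = len(data)
    observed = statistic(data)
    hits = 0
    for _ in range(reps):
        fake = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if statistic(fake) >= observed:
            hits += 1
    return hits / reps

p_mean = p_value(DATA, abs_mean)
p_median = p_value(DATA, abs_median)
print(f"mean statistic: p = {p_mean:.3f}; median statistic: p = {p_median:.3f}")
```

With these particular numbers the mean clears the 0.05 bar and the median does not, though nothing about the data or the model changed.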
On top of all this is the enormous latitude the scientist has to claim that the model/statistic pair he used is relevant to the hypothesis he announces. It could be, and often is, that this relationship is tenuous and that a direct reading of the model has little bearing on the “public” hypothesis. Almost always, the hypothesis about the real-life thing is confused and conflated with different hypotheses about the parameters of the model picked. This is not a small error: it is enormous and leads to wild over-certainty. Again, see that page for examples.
And then you are free to manipulate the data itself, tossing away “outliers”, usually defined as data that does not fit your preconceptions. You can do “sub group” analysis. You can say your hypothesis is true only for certain parts of your data. Oh my, it goes on and on.
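A sketch of what this latitude buys, under assumptions that are mine and not taken from any particular study: every simulated experiment has 30 observations from a Normal(0, 1) model with no effect at all, the test is a two-sided z-test with the standard deviation treated as known, and the "fiddler" is a hypothetical analyst who tries the full data, the data minus one inconvenient "outlier", and two sub groups, then reports the best p-value.

```python
import math
import random

def z_test_p(xs):
    """Two-sided p-value for mean 0, treating sd = 1 as known."""
    z = abs(sum(xs) / len(xs)) * math.sqrt(len(xs))
    return math.erfc(z / math.sqrt(2.0))

def honest(xs):
    # One pre-planned analysis of all the data.
    return z_test_p(xs)

def fiddled(xs):
    """Smallest p-value over several plausible-looking analyses."""
    mean = sum(xs) / len(xs)
    # Toss the single "outlier" that most contradicts the direction of the mean.
    trimmed = sorted(xs)[1:] if mean > 0 else sorted(xs)[:-1]
    half = len(xs) // 2
    candidates = [xs, trimmed, xs[:half], xs[half:]]  # full, trimmed, two sub groups
    return min(z_test_p(c) for c in candidates)

def false_positive_rate(analysis, n=30, experiments=4000, alpha=0.05, seed=4):
    """Share of no-effect experiments that the analysis declares a success."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(experiments):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no real effect anywhere
        if analysis(xs) <= alpha:
            wins += 1
    return wins / experiments

rate_honest = false_positive_rate(honest)
rate_fiddled = false_positive_rate(fiddled)
print(f"honest: {rate_honest:.3f}, fiddled: {rate_fiddled:.3f}")
```

The honest rate sits near the advertised 0.05. The fiddled rate lands well above it, and this fiddler only tried four analyses.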
So in addition to getting the technical definition wrong, Lewandowsky got the practical, boots-on-the-ground definition wrong. He would do well to read “Inappropriate Fiddling with Statistical Analyses to Obtain a Desirable P-value: Tests to Detect its Presence in Published Literature” by Gadbury and Allison for wisdom on this topic.
Conclusion: Especially in dicey areas, and psychology is certainly one of them, there is much more than a 1 in 20 chance that the finding does not confirm the stated hypothesis (about the real-life thing).
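For a back-of-the-envelope feel, here is a small piece of arithmetic, offered as an assumption-laden sketch and not a measurement of any field. It supposes a hypothetical literature in which 90% of tested effects do not exist, and in which experiments detect a real effect half the time. Both figures are invented for illustration, and the sketch ignores every form of fiddling described above.

```python
def share_of_false_findings(alpha, power, prior_null):
    """Among results declared significant, the share where no effect exists.

    alpha:      significance level (chance of 'success' when there is no effect)
    power:      chance of 'success' when the effect is real (assumed)
    prior_null: share of tested effects that do not exist (assumed)
    """
    false_positives = alpha * prior_null
    true_positives = power * (1.0 - prior_null)
    return false_positives / (false_positives + true_positives)

share = share_of_false_findings(alpha=0.05, power=0.5, prior_null=0.9)
print(f"{share:.2f}")  # prints 0.47
```

Under those made-up inputs nearly half of the "significant" findings are false, which is a long way from 1 in 20.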
Epilogue: Lewandowsky advocates, as do we all, replication to smoke out queer p-values. As an example, Lewandowsky indicates the infamous climate hockey stick has been “replicated,” a sure sign, he claims, that the p-values are leading us down a flowery path. Unfortunately, our man has forgotten to include the multiple studies that show the hockey stick is malarkey, as crazy Uncle Joe would say.
There is a psychological term for emphasizing only the evidence which supports your belief and ignoring everything else, but I’ve forgotten what it is.
Thanks to Dr K.A. Rodgers for alerting me to this topic.