Quick review: people’s heads were stuck in an fMRI; inside, they read questions to which they answered “true”, “false”, or “undecided” (these last were summarily tossed, as were other data). Questions were either “religious”, i.e. of a Christian nature, or “neutral” (many were anything but). Fifteen Christians and fifteen non-Christians took part. The time it took to answer was the key variable.
The first results:
Response time data were submitted to a repeated-measures ANOVA with belief (true, false) and statement content (religious, nonreligious) as within-subject variables, and group (nonbeliever, Christian) as a between-subject variable. Response times were significantly longer for false (3.95 s) compared to true (3.70 s) responses (F (1,28) = 33.4, p<.001)…
The language “data were submitted” is often used in papers. It is usually harmless, but other times it betrays a misunderstanding of what classical statistical results imply. Those who use this language act as if a statistical package is a truth machine: stick your data in and out comes the answer “Your hypothesis is right” or “Your hypothesis is wrong.” No thinking required!
For those who don’t know, ANOVA is regression in other words; it concerns itself only with the central parameters (not the means of the data) of the normal distributions used to quantify uncertainty in the response, which is here time. That means Harris’s results are an approximation, because time cannot be normally distributed (time cannot be less than 0 in this experiment). How good or loose an approximation we never learn, because Harris never checks.
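To see the “regression in other words” point concretely, here is a minimal sketch with invented response times (not Harris’s data): for two groups, the one-way ANOVA F-statistic is exactly the square of the two-sample t-statistic, which is itself the slope test in a regression on a 0/1 dummy variable.

```python
# A sketch with made-up response times (in seconds) showing that a
# two-group one-way ANOVA is just regression on a 0/1 dummy variable:
# the ANOVA F-statistic equals the square of the pooled t-statistic.

def mean(xs):
    return sum(xs) / len(xs)

true_rt  = [3.5, 3.8, 3.6, 3.9, 3.7]   # hypothetical "true" response times
false_rt = [3.9, 4.1, 3.8, 4.2, 4.0]   # hypothetical "false" response times

n1, n2 = len(true_rt), len(false_rt)
grand = mean(true_rt + false_rt)
m1, m2 = mean(true_rt), mean(false_rt)

# Between-group and within-group sums of squares
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
ss_within  = (sum((x - m1) ** 2 for x in true_rt) +
              sum((x - m2) ** 2 for x in false_rt))

df_between, df_within = 1, n1 + n2 - 2
F = (ss_between / df_between) / (ss_within / df_within)

# Pooled two-sample t-statistic (the regression-on-a-dummy slope test)
sp2 = ss_within / df_within
t = (m1 - m2) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

print(F, t ** 2)   # the two agree: F == t**2
```

Nothing in this machinery knows that time cannot be negative; the normal model is imposed, not checked.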
Harris then says what cannot be claimed: “Response times were significantly longer for false (3.95 s) compared to true (3.70 s) responses (F (1,28) = 33.4, p&lt;.001).” Again, this language is typical, but it is unwarranted given the statistical method used. What the results actually mean, and what the test means, is this: that the average of the “false” answers (from those whose results were not tossed) was 3.95 seconds (and not 3.94 s or 3.96 s), and that the average of the “true” answers (from those whose results were kept) was 3.70 seconds (and not 3.69 s or 3.71 s).
These averages we must assume were themselves averaged across all questions, of which there were 61; but Harris never says how many each person answered, nor how many were tossed for not fitting Harris’s preconceptions. There is a clue about the averaging in the notation “F (1,28)”: the second number, in technical language the “degrees of freedom”, tells us that a sample size of 30 was used (any introductory statistics textbook will tell you how we know this). This must mean that each person’s “true” answers were averaged, then his “false” answers averaged. Then the 30 “true” averages were compared to the 30 “false” averages. That is to say, the raw “true” and “false” times were not pooled and averaged directly; each person’s “trues” were averaged first, then his “falses”. Odd, no?
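Here is a sketch of that collapsing, with invented numbers (we are never told the real question counts): however many raw times each of the 30 subjects contributes, the analysis sees only one “true” average and one “false” average per person, and the error degrees of freedom come out as 30 − 2 = 28 regardless.

```python
# A sketch (invented data) of the within-person averaging implied by
# F(1, 28): each of 30 subjects is collapsed to two averages before
# anything else happens. The raw question counts vanish from view.

import random
random.seed(1)

n_subjects = 30          # 15 Christians + 15 nonbelievers
per_subject_averages = []
for s in range(n_subjects):
    n_questions = random.randint(20, 61)   # the real counts are never reported
    true_times  = [random.uniform(3.0, 4.5) for _ in range(n_questions // 2)]
    false_times = [random.uniform(3.2, 4.7) for _ in range(n_questions // 2)]
    # all of this subject's raw times collapse to just two numbers
    per_subject_averages.append((sum(true_times) / len(true_times),
                                 sum(false_times) / len(false_times)))

# Whatever the raw counts were, the test sees 30 pairs of averages,
# with 1 df for the effect and 30 - 2 = 28 df for the error term.
df_effect, df_error = 1, n_subjects - 2
print(len(per_subject_averages), df_effect, df_error)
```

The point: the “(1,28)” is consistent with 30 subjects, but it is silent about how many questions fed each average, or how many answers were discarded along the way.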
We have seen that these two averages of averages were indisputably different: 3.70 s vs. 3.95 s. The probability they are different is 1: because they are different! The F-test p-value (“p<.001”) says that if the central parameters of the normal distributions quantifying uncertainty in the averages of averaged times for “true” and “false” responses were exactly equal, then if the experiment were repeated an indefinite number of times, the chance of seeing an F-statistic larger than the one Harris actually saw is less than 0.001. That’s all it means, and nothing more.
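That frequentist reading can be made tangible by simulation. The toy below (pure invention, not Harris’s data, and for simplicity a plain two-group F rather than the repeated-measures version, though the logic of the p-value is identical) repeats a null experiment many times: both sets of averages are drawn from the same distribution, and we count how often the F-statistic exceeds the 33.4 Harris reports.

```python
# A toy simulation of what "p < .001" asserts: IF the central parameters
# were exactly equal, and IF the experiment were repeated endlessly, an
# F-statistic as large as the observed one would almost never occur.
# All numbers below are invented for illustration.

import random
random.seed(0)

def f_two_groups(a, b):
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    grand = (sum(a) + sum(b)) / (n1 + n2)
    ssb = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
    ssw = sum((x - m1) ** 2 for x in a) + sum((x - m2) ** 2 for x in b)
    return ssb / (ssw / (n1 + n2 - 2))

observed_F = 33.4                      # the value Harris reports
trials = 20_000
exceed = 0
for _ in range(trials):
    # "true" and "false" averages drawn from the SAME distribution (the null)
    a = [random.gauss(3.8, 0.4) for _ in range(15)]
    b = [random.gauss(3.8, 0.4) for _ in range(15)]
    if f_two_groups(a, b) > observed_F:
        exceed += 1

p = exceed / trials
print(p)   # tiny: under the null, an F of 33.4 is very rare
```

Notice what the simulation does not tell you: anything about the probability that the hypothesis itself is true. It speaks only of imaginary repetitions under an assumption of exact equality.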
Harris cannot claim that response times for answering “false” were longer than for answering “true”; all he can say is that the averages across people were different. Now, his was a bad way to do statistics: this averaging across all people’s “true” and “false” answers. In reality, each person will have had a distribution of response times for his “true” and for his “false” answers. Harris should have examined how these distributions differed, if at all, between the “true” and “false”. But if he did that, he would have had to reveal just how many questions each person got and how many were tossed. That was either too much effort or something he thought better left unexplained.
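A two-line illustration of why averaging throws information away (two invented subjects, nothing to do with the real data): their response-time distributions differ wildly, yet their averages are identical, so an analysis of averages alone is blind to the difference.

```python
# A sketch of the averaging complaint: two invented subjects whose
# response-time DISTRIBUTIONS differ can produce identical averages,
# so a method that sees only averages cannot tell them apart.

subject_a_true = [3.7, 3.7, 3.7, 3.7]   # steady responder
subject_b_true = [2.0, 2.0, 5.4, 5.4]   # erratic responder

avg_a = sum(subject_a_true) / len(subject_a_true)
avg_b = sum(subject_b_true) / len(subject_b_true)
print(avg_a, avg_b)    # both about 3.7: identical averages

# A measure of spread tells a very different story per person
def spread(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

print(spread(subject_a_true), spread(subject_b_true))  # ~0 vs ~1.7
```

Comparing the full per-person distributions, as suggested above, would have kept this information; it would also have forced the question counts into the open.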
[Response times were] also significantly longer for religious (3.99 s) compared with nonreligious (3.66 s) stimuli (F (1,28) = 18, p<.001).
The same critique can be given. That same within-person averaging must have been done. Plus recall the surmise that Christians would take longer to answer, given the imbalance in the question wording.
The two-way interaction between belief and content type did not reach significance, but there was a three-way interaction between belief, content type, and group (F (1,28) = 6.06, p<.05).
In other words, even given the (excessively) loose criteria of classical statistics, the main hypothesis that there would be a difference in response times between “belief and content type” was not confirmed. Harris could have stopped there. But he added that the “three-way interaction between belief, content type, and group” gave a publishable p-value. This means that at least one of the 8 parameters in the model containing the three-way interaction produced the large F-statistic: that is all it means and nothing more. We do not even know which of the parameters!
Time for apologies
I stated at the beginning that I would go into exhaustive detail in this criticism. All this work and we have still not got to the fMRI results. These are coming next. But I warn you: it gets thick fast. Not because fMRI is difficult to understand, but because the statistical methods that Harris used are so involved. I will have to assume a certain level of familiarity with these methods, since I cannot explain in detail both advanced statistics and their misuse.