This is from Gerd Gigerenzer’s “Mindless statistics” in *The Journal of Socio-Economics*, **33**, (2004) 587–606.

Have a go before looking at the answers (I’m giving my own, not quoting Gigerenzer). Send this to anybody you see using null hypothesis significance testing.

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is significant (t = 2.7, d.f. = 18, p = 0.01). Please mark each of the statements below as “true” or “false.” “False” means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.

1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).

[] true/false []

2. You have found the probability of the null hypothesis being true.

[] true/false []

3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

[] true/false []

4. You can deduce the probability of the experimental hypothesis being true.

[] true/false []

5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

[] true/false []

6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

[] true/false []

No cheating.

No cheating.

No cheating.

No cheating.

No cheating.

No cheating.

No cheating.

1. FALSE. Obviously you have proved nothing about any “null” hypothesis. We don’t even know what the means of the two groups were, but we can *deduce* they were not equal: if they were equal, we’d have t = 0 (if you can’t recall why, look up the t-test formula).
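To see why equal sample means force t = 0, here is a minimal sketch of the pooled two-sample t-statistic. The function and the toy data are my own illustration, not from the post:

```python
# Pooled independent-samples t-statistic, written out to show that
# equal sample means make the numerator, and hence t, exactly zero.
from statistics import mean, variance

def t_statistic(a, b):
    """Two-sample t with pooled variance (equal-variance form)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = (sp2 * (1 / na + 1 / nb)) ** 0.5
    return (mean(a) - mean(b)) / se

# Two toy samples with identical means (both 2): numerator is 0, so t = 0.
print(t_statistic([1, 2, 3], [0, 2, 4]))  # → 0.0
```

Whatever the spreads of the two samples, the statistic only moves away from zero when the sample means differ.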

The obvious confusion begins here, thinking this is a “sample” from some “universe”, which it might be, or it might not be. Either way, we know everything about this data, so we don’t need to make any probability judgments about it. Unless we want to make predictions about new observations. If we’re not going to have new observations, again, we don’t need probability.

We don’t know the cause of every value. We can guess the treatment might be a cause. If it is, it joins the list of all the other causes operating on the measurement (whatever it is).

2. FALSE. The standard null hypothesis is that certain parameters, those representing the central values of normal distributions (parameters which have a *real existence* in some kind of Platonic realm), are equal to one another. The real existence of these parameters is taken as a given. They are never observed. Not ever. They cannot be observed. Not ever. So there is no way, not ever, to know whether they are equal or unequal. That they exist is a pure matter of faith.

3. FALSE. Here we go. By “population means” the authors mean those Platonic parameters of the normal distributions representing the uncertainty in the measure. We have proven nothing about them, even if perchance they do exist. We don’t know if they’re equal, unequal, or anything.

The p-value in particular says nothing about their value.

*Memorize this*: The p-value is the probability of seeing a t-statistic larger than 2.7, or less than -2.7, if the same experiment were repeated an infinite number of times, and if those Platonic parameters existed, and if they were equal to one another. Only that, and *nothing more*.
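That definition can be checked by brute force. The sketch below is my own simulation, not from the post: draw both groups of 20 from the *same* normal, so the null scenario holds by construction, and count how often |t| meets or exceeds 2.7. The long-run fraction comes out near the quoted p of 0.01.

```python
# Monte Carlo check of the p-value's frequentist definition: with the
# null true by construction, the fraction of |t| >= 2.7 approximates
# the two-sided p-value. Seed and rep count are arbitrary choices.
import random
from statistics import mean, variance

def t_statistic(a, b):
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

rng = random.Random(42)
reps, hits = 20_000, 0
for _ in range(reps):
    a = [rng.gauss(0, 1) for _ in range(20)]  # both groups from the
    b = [rng.gauss(0, 1) for _ in range(20)]  # same population: null true
    if abs(t_statistic(a, b)) >= 2.7:
        hits += 1
print(hits / reps)  # a fraction near 0.01
```

Note what the simulation does *not* do: it says nothing about whether the parameters are equal in any actual experiment; it only reports how often a certain statistic exceeds a threshold in a world where we forced them to be equal.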

4. FALSE. You cannot deduce it universally, but you can do it locally, in the following sense. If you assume the only two causes operating on the measure are your treatment and whatever is operating on the control, then, because the sample means were not equal (which we know because t does not equal 0), the treatment is a cause.

Which isn’t learning much, because we started by assuming the treatment is a cause.

5. FALSE. The null, we saw, was the equality of two unobservable parameters. Rejecting this, and saying they are unequal, is an error when they are in fact equal.

We do not know the probability the two parameters are equal; thus we cannot know the probability they are unequal. The p-value is silent on both these probabilities. Thus we don’t know the probability of our mistake, assuming we made one, nor do we know the probability of a correct decision, assuming we made one.

6. FALSE. We know nothing about the reliability of the observations. Given how badly much of today’s The Science is done, we can make no assumptions, either. But, assuming scrupulosity on the part of the experimenters, the statement is still false.

It is not “a great number of times” the experiment has to be repeated to get to that 0.01 (which only holds *if the null is true*); it must be an infinite number of times. Frequentist theory is silent on all finite sequences. It only works at the limit.

Also, it must be true that the Platonic realm holding the parameters is real, as are the parameters themselves.

**Conclusion** Do not use p-values or testing of any kind.


It doesn’t require a Platonic realm; it requires a model that fits decently, which can be easily checked. It doesn’t require literal infinity; it requires large enough n (see the CLT, the strong law of large numbers, and various discussions of finite frequentism). We don’t need “absolutely” proved/disproved for science, just evidence at this time, subject to error and change (not actually unable to change, like religious stories, which are pretend-absolute anyway). The p-value is just a (rescaled) distance the observed test statistic is away from what is expected under the model. The most important things in your example are whether the experimental design was good and whether the experiment gets replicated. And the Briggsian analysis, in this example, of a randomly assigned treatment and control group from a random sample showing some difference is... what exactly? I’m afraid if you just answer that there is some observed difference, that is being descriptive only. Most people would want to know how a medicine (or political poll, or anything else) applies to a population (which we have pseudorandomly sampled), not only the results of those observed in the sample.

-Justin
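Justin’s “large enough n” point can be illustrated without any appeal to literal infinity. This little sketch (my own, with an arbitrary seed, not part of his comment) shows the empirical mean of fair coin flips settling toward 0.5 as n grows:

```python
# Law-of-large-numbers sketch (my illustration): running means of
# fair-coin flips drift toward 0.5 as the sample size grows.
import random

rng = random.Random(1)  # arbitrary seed, for repeatability
for n in (10, 1_000, 100_000):
    flips = [rng.random() < 0.5 for _ in range(n)]
    print(n, sum(flips) / n)  # fractions near 0.5, closer for larger n
```

Whether a finite-n approximation answers Briggs’s philosophical objection is exactly what the exchange above is about; the code only shows the practical convergence Justin appeals to.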

I’m surprised that in relation to point 2 you didn’t talk about the way a p-value is actually calculated. That is, it is the probability that a certain statistic from the data would take on its observed value or one more extreme, under the assumption that “the null hypothesis is true” (or, more precisely, that the observed data were generated from some specific random variable). Since it assumes that the null hypothesis is true, it can’t possibly be used to say how likely it is the null hypothesis is true. This also affects points 5 and 6.

I was hesitant to take the test because I never studied statistics and can’t follow what you talk about half the time (I still don’t know what a p value is, although I now know I am not supposed to assume a small one is “good”), but I decided to anyway. Passed 100%. With 40 subjects and nothing else known about the possible number of variables, the sort of thing being measured, how well the experiment was run, or anything, it’s impossible to draw any conclusions other than it doesn’t disprove anything and “indicates further study may be warranted.”

Some days the comments on Matt’s site make me wish comment sections had some sort of institutional memory.

As Matt has said out loud more than once, neither mere logic nor pointing out that p-values have a definite mathematical meaning which, literally by definition, can have no relation to anything real, will stop “hypothesis testing.” “Hypothesis testing” continues not because it’s right (it’s provably incoherent) but because it’s lucrative.

Anyways, one of my pet peeves about p-values is that, since they’re “not even wrong,” to quote some guy, they can “mean” anything.

Apparently, everything “has” a “p-value.” Multitudinous are the published papers I alone have seen that assign p-values even to things like the count of participants in the study. As in, “there were 522 participants in this study (P < .05).”

And yes, in this year of our Lord 2021, at least some reviewers (one’s “peers”) insist on p-values in your Table I. You just sprinkle p-values over everything, because Science.

There is no way to get one’s head around any of this.

Gail – me too, I didn’t understand the givens but I got 100% correct anyway. Either I’m a humble genius, or the quiz is poorly written, or both.

Sorry.. I cheated. I didn’t scan below and get the answers. After the first question, I was able to use my WMBRIGGS model and predict that all of the answers would be false. As I progressed through the questions, my model continued to hold.

Here is my new question to ask people. Why is the job you are highly trained for through education rather than experience worthless?

Do I need an engineering degree to build a house? How much better is a carefully architected school vs a school that is just a General Steel frame with rooms thrown in?

Can a team of MIT engineers outperform a team of mechanics at the Daytona 500? Can the brains from MIT, within the framework of the rules of NASCAR, beat the mechanics who deal with it day in and day out? Who will win? Will we have to bring statistics to do the analysis? If we do, those engineers will have to start scratching their heads about their degrees.

Brad Tittle – engineering school was very, very hard, and students would ask basically the same question: am I really going to use this stuff I’m studying? Some profs would give the standard answer about needing a broad background; others would simply say “you’ve got to pay your dues first”. And to be perfectly honest, there’s more than a little crony capitalism at work, keeping fully capable non-degreed engineers out, warping the supply curve. But the supply curve is nowhere near as warped as it is for other fields, such as medicine or longshoring; the vast majority of engineering doesn’t require a license, and unions have been unable to make significant inroads (not for lack of trying).

Engineering is a very diverse profession; I can only speak to my little corner of it. My designs use hundreds of parts and require thousands of small decisions. For >99% of the parts in a design, the part used is vastly overrated for the application, a “jellybean” part. It’s picked from a list of readily available parts with an investment of perhaps ten seconds of thought. Picking these parts is 100% experience and 0% training. For that remaining <1% of the parts, I might spend an hour, a day, a month or longer on each one. The value of experience drops to near zero, and the value of the training really kicks in. This is why engineers are hired in the first place: to push the boundaries of what’s been done before. Fresh out of school, I used to think “wow, I bet what I’m working on has never been done before, that’s pretty cool”. But after a while, you realize that there are a zillion useful things that have never been done before, it’s not like you are curing cancer or anything, and the wonderment diminishes.

Anyway, that’s my personal answer to your question, 99% versus 1%. For your Daytona 500 example, the only way the team of MIT engineers would come out on top is if there was some sort of engineering breakthrough that significantly pushes the state of the art, and even then it would take ten years and require the team of mechanics to implement it (although some engineers I know are also amazing mechanics). Unless the MIT team was funded by an eccentric billionaire, this wouldn’t be done for a single car; engineering is expensive and the ROI would be pitiful.

Since you also mentioned architecture: I recently stumbled across this amazing (to me, anyway) CAD tool called “Chief Architect”. This video gives a flavor:

https://www.homedesignersoftware.com/videos/watch/10071/kitchens-baths.html

Engineers have been using CAD tools for decades, but this architectural design tool takes things a few steps further, allowing a creative person with no formal architectural training to be very productive. I can’t help but wonder if this is a glimpse into the future of engineering, too. I won’t be around to experience it, though.

(A little more searching reveals that there are many architectural programs besides the one I stumbled across; I have no idea how they compare to each other.)

Passed the questions as well. Had heavy statistics education – but a long time ago. I would love to see the experiment of MIT engineers versus mechanics of NASCAR. Experimentation versus theory.