This is part II of contributor William Raynor’s request to define p-values without using the word probability. This assumes you read and assimilated Part I. First, a clarification from Raynor:
A context if you want it: Product Development. I (We) have a product that works as is, but we’d like to improve it if possible (like Fisher at Rothamsted.) So the “null” is a real, working product, not some flight of academic fantasy. In my case, it was a real profitable working product that had been continuously optimized for decades. We do not want to mess it up. Product Developers are trying to find improvements, in an intensely repetitive cycle. (Tinker, Test, Repeat.) The test subjects are not “random” samples from anywhere, so the designs are usually blocked, balanced, and blinded before the test products go out the door.
This is a terrific example because the causes, most but not all of them, are known in a manufacturing process. The widget is made from certain materials, put together in known ways, packaged according to set rules, so that the main causes are not a mystery.
The small causes that are responsible for the small widget-to-widget variations are not as well known, or are unknown altogether. Perhaps the weather influences the assembly line in a more-or-less known way, but one that can’t be tracked perfectly.
Measures will be taken on the widget. For the purposes of example, suppose it’s weight. (It doesn’t matter what it is.) The known causes make the widget what it is, are responsible for its nature, its expected weight. The small untracked, or rather unmanipulatable, causes are responsible for the variations in weight. If it weren’t for these small causes, every widget would have identical weights, because of the known major causes.
(For the record, I don’t know what Raynor’s product is or what measure(s) he tracks.)
Anyway, there will be a characteristic weight due to the major causes and small departures due to the small causes. This characteristic weight is easily itself tracked or measured.
One fine day somebody says, “Why don’t we try X?” in an effort to improve the widget. Somebody in charge says okay. It costs something to do X, which may or may not bring out some benefit.
X introduces new causes: if it did not, it would be null. Some of the effects of X can be known in advance, deduced via external evidence. Suppose a new paint will be used, which has known properties. These will change the weight in mostly predictable ways. Still, surprises are possible. Or perhaps the new effects aren’t really well delineated, so experiments are performed. Weight is measured with and without X.
Has X caused changes in the characteristic weight?
This should be easy to tell. Check the characteristic weight before and after X. If these differ, X is responsible. Assuming no other causes intervened. This is not a shifty or weird assumption. You use it constantly in judging how the world works. It wasn’t gremlins that caused your car to start this time apart from all the other times, though it might have been, if you allow for the possibility. We don’t allow for that possibility most times, which is sane.
If the characteristic weight under X is the same as under no-X, then X is no better than an unmeasurable minor cause. If the characteristic weights are different, then X is a major cause. We deduce this on the assumption there were no other causes besides the known major and usual small causes, and X. Of course, there is the possibility that the assumed mechanism of X is not right, and something else is causing the changes under the “X regime”, something which is not the assumed X but only associated with it. That’s not important for us, because either the characteristic weight changed or it didn’t because of X or the X regime.
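The direct comparison just described is almost embarrassingly simple. A minimal sketch, using invented weight numbers purely for illustration:

```python
# Hypothetical widget weights in grams; these numbers are invented
# for illustration only.
non_x = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7]
x     = [10.6, 10.2, 10.5, 10.8, 10.1, 10.4, 10.7, 10.3]

# The characteristic weight: here taken as the mean of the measures.
char_non_x = sum(non_x) / len(non_x)
char_x     = sum(x) / len(x)

# Any change is, by assumption, due to X (or the X regime).
change = char_x - char_non_x
print(round(change, 3))
```

Whether a change of this size matters is, as said above, a cost-benefit question, not a statistical one.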
Now how much change is change enough? There is no answer to that, no general one. That depends on the cost and benefit of the weight changes, which are not statistical questions. The same is true for the measured changes caused by X (or the X regime). How much is enough is not a question any statistical model can tell you. The answer is: it depends.
What about the empirical p-value? The reasoning is like that in the first part. The “null hypothesis” is that X is not a cause, big or small. If that’s so, then all the measures of the widget are due to the old known causes. So far, so good; no flaws in logic yet.
Second step in empirical p-values: some aspect of the characteristic weight will be singled out, like the mean. We can take the mean of the widgets with known causes, and the mean of the widgets under X. There will be some difference in weight (which may even be 0). Memorize this difference.
Third step: many do something like this. They’ll lump all the widget measures together, non-X and X, and then grab out samples from this mixture of the sizes of non-X and X, compute the means of both of these, and the difference in those means. This will be done repeatedly, the difference in these means being saved each time. The justification for this sampling is the idea that the “distribution” of actual means inside non-X and X are real things, and the picking mechanism is supposed to make this distribution come alive. Seriously. See the gremlins link above. This sampling makes no reference whatsoever to causes per se.
After a while, the observed difference in actual means (which you memorized) is compared to the distribution of differences you got in the sampling. The fraction of differences greater than your own (in absolute value, usually) is the empirical p-value.
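The three steps above can be sketched in code. This is a minimal sketch of the empirical p-value procedure being criticized, with invented weight numbers; nothing here comes from Raynor’s actual product:

```python
import random

# Hypothetical weights (grams), invented for illustration.
non_x = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7]
x     = [10.6, 10.2, 10.5, 10.8, 10.1, 10.4, 10.7, 10.3]

def mean(xs):
    return sum(xs) / len(xs)

# Step 2: memorize the observed difference in means.
observed = mean(x) - mean(non_x)

# Step 3: lump all the measures together, repeatedly re-draw two
# groups of the original sizes, and save the difference in means.
pooled = non_x + x
rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_sims = 10_000
count = 0
for _ in range(n_sims):
    rng.shuffle(pooled)
    fake_non_x = pooled[:len(non_x)]
    fake_x = pooled[len(non_x):]
    diff = mean(fake_x) - mean(fake_non_x)
    # Count differences at least as large, in absolute value,
    # as the one actually observed.
    if abs(diff) >= abs(observed):
        count += 1

# The empirical p-value: the fraction of resampled differences
# exceeding the observed one.
p_value = count / n_sims
```

Note that the loop spends all its effort on shuffled data that never existed, which is the point made next.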
The idea is that if this is wee, then the “null” is false, and X has been proved to be a cause. If the p is not wee, then X has been proved to not be a cause.
Talk about doing it the hard way!
Of course, we don’t have to use the mean. We could have used, say, the interquartile range. This will give a different empirical p-value. We could have also used the standard deviation. A different empirical p-value. And so on. None of these is “the” correct measure, unless one of them is the main or sole measure that plays in the cost and benefit.
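To see that the choice of statistic changes the answer, the resampling can be parameterized by the statistic. A hedged sketch, again with invented numbers; running it with mean, interquartile range, and standard deviation will generally yield three different p-values from the same data:

```python
import random
import statistics

def iqr(xs):
    # Interquartile range: Q3 minus Q1.
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

def empirical_p(non_x, x, stat, n_sims=10_000, seed=1):
    """Empirical p-value for whichever statistic `stat` you fancy."""
    observed = stat(x) - stat(non_x)
    pooled = non_x + x
    rng = random.Random(seed)
    count = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)
        diff = stat(pooled[len(non_x):]) - stat(pooled[:len(non_x)])
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_sims

non_x = [10.2, 9.8, 10.1, 10.4, 9.9, 10.0, 10.3, 9.7]
x     = [10.6, 10.2, 10.5, 10.8, 10.1, 10.4, 10.7, 10.3]

# Same data, three candidate "the" p-values.
p_mean  = empirical_p(non_x, x, statistics.mean)
p_iqr   = empirical_p(non_x, x, iqr)
p_stdev = empirical_p(non_x, x, statistics.stdev)
```

Which, if any, of these is relevant depends entirely on which measure figures in the cost and benefit, as said above.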
As in the first part, we use a part of the data that did not happen, i.e. the fraction of differences larger than we observed in that odd sample, to say something about the causes that were actually in play. This is bizarre.
We could have bypassed all of this by just comparing the characteristic weights: any change is assumed due to X (or X regime), a good assumption. The size of the change that’s important depends on the uncertainty we have in the characteristic weight, and in the measured difference between X and non-X. It also depends on what the weight means to the cost and benefit. The forbidden word helps us with the uncertainty. It does not help with the cost and benefit, which are unique to your situation.