A Simple (But Not Short) Stats Question

From reader JH:

Thank you again for taking the time to write your blog. It is exceptionally thought-provoking.

My job involves much of the six sigma “capability” studies and such. I have lots of tools available to “quantify” our measurement data. But I’m wondering now if much of this approach isn’t baloney.

Given my questioning of our corporate orthodoxy, I decided to try a different approach to testing one aspect of a part. I’m curious what embedded fallacies I may have indulged in doing so?

Let’s say I want to demonstrate that the torque required to damage something is > 80. I have a capable measuring system and a representative sample and my uncertainties are (to our knowledge) randomly distributed (or presumed to be, given they are unknown).

The first measurement comes in at 280. Statistical value? Worthless. It’s a single data point.

The second data point comes in at 270. Ok, perhaps slightly less worthless. I could theoretically calculate my mean (275) and standard deviation (5) from a whopping two data points. Indulging corporate orthodoxy, I could say that my goal of > 80 is a whopping 39 standard deviations below my mean, and generate some impressively high probability that all future parts are going to be > 80. From two little points.

Has anything actually been demonstrated? If I shift the “mean” down to the lower-95% T-value (which is ~230.077 for these two points), can I claim ‘there is at least 95% probability that the population mean is > ~230.077’? If so, that still doesn’t let me calculate P{x>80} unless I assume some kind (Gaussian kind?) of distribution.

It feels wrong, but I can’t clearly articulate the source of error beyond assuming a “normal” distribution just because. Who knows what the actual distribution is?
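
The arithmetic in the letter can be checked directly. The sketch below uses only the Python standard library and is an illustration, not part of the original exchange; the variable names are made up for the example. It suggests the standard deviation of 5 comes from the population formula (dividing by n), while the sample formula (dividing by n - 1) gives about 7.07. The 230.077 figure appears to be reproduced by a two-sided 95% t multiplier with one degree of freedom applied to the population value.

```python
import math
import statistics

torques = [280.0, 270.0]
n = len(torques)
target = 80.0

mean = statistics.mean(torques)
pop_sd = statistics.pstdev(torques)    # divides by n
sample_sd = statistics.stdev(torques)  # divides by n - 1

# Distance from the mean down to the target, in standard deviations.
z_pop = (mean - target) / pop_sd
z_sample = (mean - target) / sample_sd

# A Student t with 1 degree of freedom is a Cauchy distribution,
# so its quantile has the closed form tan(pi * (p - 1/2)).
t_975 = math.tan(math.pi * (0.975 - 0.5))

# Lower end of a two-sided 95% interval for the mean, both ways.
lower_with_pop_sd = mean - t_975 * pop_sd / math.sqrt(n)
lower_with_sample_sd = mean - t_975 * sample_sd / math.sqrt(n)

print(f"mean={mean}, pop_sd={pop_sd:.3f}, sample_sd={sample_sd:.3f}")
print(f"sds above target: {z_pop:.1f} (pop), {z_sample:.1f} (sample)")
print(f"lower bound: {lower_with_pop_sd:.3f} (pop), "
      f"{lower_with_sample_sd:.3f} (sample)")
```

With the sample standard deviation the target sits roughly 27.6 rather than 39 standard deviations below the mean, and the lower bound drops to about 211.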

Terrific set of questions. First, there is no actual distribution. Statistical, or rather probability, distributions are purely epistemological, i.e. measures of information, and are not properties of actual things.

Now this torque you are measuring. It actually was, once, 280, and another time 270. You have thus conclusively demonstrated the torque is, or can be, greater than 80. That is, given the measurements and assuming the measurements are without error, the probability the torque can be greater than 80 is 1. You are certain.

Questions like this next one are entirely different: M = “The next measured torque will be greater than 80.” The probability of that given just your two measurements—and nothing else—is, as is obvious, greater than 0. How much greater than 0 cannot be quantified unless you are to make (what are probably) ad hoc assumptions. Or the probability doesn’t have to be quantified, but can be made sharper if you were to add implicit information about these kinds of torques. Something like, “Given my experience of these kinds of machines, 280 and 270 are way above 80”, and then the probability of M with that implicit premise and given the two measurements is “high”.

Again, saying how high “high” is requires ad hoc assumptions. Saying a normal distribution represents uncertainty in torque measurements is one such assumption. Then you can state the probability of M given this ad hoc assumption and the two measurements, but leaving out the implicit expert knowledge from your experience.

This is all fine because, as is proved in this one-day best seller, all probability is conditional. Probability is not a property of any system, which is why there are no correct distributions to use—unless the probability can be derived from information you know or assume to be true about the process at hand. That kind of information appears to be missing here.

So, yes, Pr(M > 80 | x = 280, 270, and assuming a normal distribution with certain known central and spread parameters 275 and 7.07) = 1 (or, rather, .999 with about 160 or so 9’s). That probability will be less if you assume you don’t know the parameters and they are instead estimated from the data (something like .986 if you use the standard noninformative prior, under which the prediction follows a Student t distribution with one degree of freedom).

These are the correct numbers given these assumptions—and no other assumptions.
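
The two cases can be sketched numerically. Both calculations rest on assumptions layered on top of the text. The first treats 275 and 7.07 as known. The second assumes the standard noninformative prior for a normal model, which the post does not specify, so that the prediction follows a Student t with n - 1 = 1 degree of freedom (a Cauchy distribution). Under that particular choice the second probability comes out near 0.986; a different way of handling the unknown parameters would give a different number.

```python
import math

mean = 275.0
sd = math.sqrt(50.0)   # sample standard deviation of 280 and 270, about 7.07
n = 2
threshold = 80.0

# Case 1: normal distribution, parameters treated as known.
# Work with the tiny lower-tail probability; 1 minus it rounds to 1.0.
z = (mean - threshold) / sd
p_below_known = 0.5 * math.erfc(z / math.sqrt(2.0))
nines = math.floor(-math.log10(p_below_known))

# Case 2: parameters estimated from the data (assumed noninformative prior).
# The predictive distribution is Student t with n - 1 = 1 degree of freedom,
# centered at the mean with scale sd * sqrt(1 + 1/n). With 1 degree of
# freedom the CDF has the closed form 1/2 + atan(t) / pi.
scale = sd * math.sqrt(1.0 + 1.0 / n)
t = (mean - threshold) / scale
p_above_estimated = 0.5 + math.atan(t) / math.pi

print(f"known parameters: P(next <= 80) = {p_below_known:.3e}, "
      f"so P(next > 80) has about {nines} leading nines")
print(f"estimated parameters: P(next > 80) = {p_above_estimated:.4f}")
```

The gap between the two results shows how much of the apparent certainty comes from pretending the parameters are known.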

Instead of a normal, you could have used your ad hoc freedom to use any of hundreds of other standard distributions, and none of these would be any more correct. That is, they all would have been correct. Conditionally correct. Since the distribution is not derived, or deduced, from known causal principles about the process, that’s the best you can do.
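
To make the point concrete, here is a minimal sketch comparing three standard distributions, each matched to the same center (275) and spread (7.07) and each treated as having known parameters. The choice of the logistic and Laplace distributions is arbitrary and mine, not the reader's; any of hundreds of others would serve.

```python
import math

mean = 275.0
sd = math.sqrt(50.0)   # about 7.07
threshold = 80.0
gap = mean - threshold

# Normal: lower tail via the complementary error function.
p_normal = 0.5 * math.erfc(gap / (sd * math.sqrt(2.0)))

# Logistic: variance is s^2 * pi^2 / 3, so s = sd * sqrt(3) / pi.
s = sd * math.sqrt(3.0) / math.pi
p_logistic = 1.0 / (1.0 + math.exp(gap / s))

# Laplace: variance is 2 * b^2, so b = sd / sqrt(2).
b = sd / math.sqrt(2.0)
p_laplace = 0.5 * math.exp(-gap / b)

for name, p in [("normal", p_normal),
                ("logistic", p_logistic),
                ("Laplace", p_laplace)]:
    print(f"{name:>8}: P(next <= 80) = {p:.3e}")
```

All three say a torque at or below 80 is wildly improbable, yet the stated probabilities differ by well over a hundred orders of magnitude. Each figure is correct given its own assumption, and nothing in the two measurements picks one over the others.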

Unless you bring in outside, expert knowledge. We saw above how that works: and it works well. Problem is, the hunger for quantification. Management wants a hard number, not an opinion. It rarely matters where the hard number comes from; that it is hard is what counts. This is why Six Sigma is so beloved. It gives precision where precision is desired. Not that it is giving useful precision, or numbers from which excellent decisions will be made.

The final answer is—drum roll, please—there is no answer. Unless you’re willing to live with expert knowledge and non-quantified probabilities, there is no way to come to numerical probabilities without making ad hoc assumptions.

You can use history, too, as expert knowledge. For instance, you’ve found normal distributions to work well in the past, hence you use them again. This is weakly ad hoc.


  1. DAV

    I want to demonstrate that the torque required to damage something is > 80.

    Seems to me that the test is binary. The value is greater than 80 or it is not.
    P(M > 80 | x = 280, 270, and no evidence to the contrary) = 1

    Allowing for some future value <= 80, the beta distribution is more appropriate than a normal distribution (IMO). Setting a=number of measurements le 80 & b those gt 80, initially, a=b=1 but after the two samples, a=1 & b=3.

    P(M > 80 | x = 280, 270, and using the beta distribution) = 0.75

  2. DAV

    Hmmm…. Setting a=number measurements 80, initially, a=b=1

    Should have read: Setting a=number measurements 80, initially, a=b=1

  3. DAV

    Doesn’t like my symbols. Try again:
    Hmmm…. Setting a=number measurements 80, initially, a=b=1

    Should have read: Setting a=number measurements le 80 & b those gt 80, initially, a=b=1

  4. Ye Olde Statistician

    Like every other managerial fad, Six Sigma has been reduced to a mindless algorithm with “steps” to be (mindlessly) performed in the hopes that by imitating the outward behaviors we may reproduce the inner understandings of the original thinkers. This is called the Turing Fallacy. I saw this happen with one fad after another, from Statistical Quality Control to Zero Defects to Six Sigma. Each and every one of these contained Truth, but the hope that by merely repeating the externals one could achieve satori was pure delusion. I know of one major corporation that went into civil war because different departments adopted the strategies of different gurus: We follow Deming’s Methods, said one. No, Taguchi is best, said another. No, Juran is the Man, said a third. Crosby had it all, cried a fourth. They called in my now-late boss to negotiate a truce, and he had them actually study what each man was saying and do a point by point comparison. And Lo! They all said the same thing, only with differences of emphases and differences in packaging. Years later, I was able to show that Motorola’s “Six Steps to Six Sigma” was the same as Ford’s 8-D, Juran’s Two Journeys, Kepner-Tregoe, Ed Schrock, Kaizen, Joiner’s 7-steps, and so on. Also my own “Triads.”

    For one thing, what gets called “capability study” by many Six Sigma practitioners or [worse!] software packages, is nothing like what was meant by Western Electric when they invented the idea: cf. http://www.contesolutions.com/Western_Electric_SQC_Handbook.pdf
    Described at A-1 pp 45-62 with an example on A-7 66-72
    Aside: Note on the copyright page that the handbook was printed at the Mack Printing Company of Easton PA (1958). I used to work summers in the pressroom there, and my father was superintendent. The Western Electric Works were up the river in Allentown, PA. Now all of it, the printing company, Western Electric, and even the very technology of the printing, are defunct. The old Allentown Works now houses the Lehigh County Welfare office. But I digress.

    So anyone who collects a one-time sample and calculates a Cpk index number should be horsewhipped.

    Or tries to calculate anything from two test results…

    I want at least eight data points, plotted on probability paper to see if they can be faired by a distribution in the first place, how good the fit might be, and how wide the prediction interval. [Or perdition interval, as some put it.]

    Old School CQE

  5. Kip Hansen

    JH’s questions are interesting, but…..

    The best part is the realization that NOTHING useful — as far as prediction of the quality of the future products — can be derived from the results of testing one or two examples.

    This cognitive error — that prediction or generalization can be performed using the results of very small data sets — has produced entire scientific fields whose broad base of foundational facts (research results accepted as true and valid) are in actuality probably false. See much of “social science” , fMRI, social psychology ……

  6. Michael 2

    If the parts came from the same batch even less can be predicted or assumed from it. In the world of computers, when building a RAID disk array one ought not to use disks from the same production batch. Of course I’m not the purchaser so I was completely unsurprised to get a box of disks with nearly sequential serial numbers. In such circumstances Mean Time Between Failure is almost meaningless; the entire lot could fail tomorrow.

  7. Ken

    IF one has good information regarding the material variability of the things being torqued (e.g. dimensional, metallurgical, etc.) one can often calculate a very good range within which a given torque might, or almost certainly will not, cause damage (for the scenario given). At least up to some point (in aerospace, tolerances for all variables can be maintained within very narrow bands, so tolerance stack-ups can be estimated somewhat reliably; for most items, the aggregation of these is such that the diversity itself can cause a variety of compounding other issues…).

    Still, actual testing is 2nd to nothing. I’m kind of old-fashioned that way.

  8. Ye Olde Statistician

    If all the sample parts come from the same batch, then one can make exquisitely precise predictions of the qualities of the remaining parts from that batch. But as to parts from other batches, that depends on the batch-to-batch variation — which has not yet been tested! One time the chemists where I worked ran a qualification test on a new glue formulation that consisted of the following test runs: lab tests, sample of 100 bottle carriers glued by hand, 1-hour on-line run, one-shift test run (8-hours), “steady state” (three shift) run, one-week run, and two-week run. Glue bonds on the bottle carriers were tested each time, to the tune of hundreds of such tests.

    But so little glue was used on each carrier — a little dab will do you — that all this testing did not even use up a single pot of glue. Variation from pot-to-pot or batch-to-batch was left for long term production to discover; which they did.

    Basic Overlooked Principle of Sampling: The results of a sample apply only to the population from which the sample was taken, and can be extended only by means of non-sampling information.

  9. Clay Marley

    I am one of those “Six Sigma Blackbelt” graduates, trained by a [former] large and respected US telecom company. I won’t fault the training that much, but they made two huge mistakes in this program.

    First, in order to graduate, each candidate had to devise a process improvement project that would save the company a minimum of $200,000. So what did we learn? Just how easy it was to design an analysis scheme, collect the data, analyze it with wee P-values, and present it, such that the final answer is what you wanted before you started.

    Second, they pulled out all graduates from their organizations and moved them to a new manager (a Master Blackbelt), and physically removed them from their teams into a separate building. We called it The Citadel. This segregated the Blackbelts from real-world problems and turned them into a useless think tank that generated P-values like widgets on an assembly line.

  10. Milton Hathaway

    I was taught that if you have a stationary process, and collect some number of data points output from that process, you can calculate estimates of things like the mean and standard deviation, but they are always just estimates, and you should also calculate the expected error in those estimates. (At that point, I wondered if one should also calculate the expected error in the calculation of the expected error of the estimates, and so on and on, which would create an infinite series of expected error calculations. If this infinite series converged, I suppose one could calculate it, but to what end, I couldn’t say.)

    In practice, I usually just collect enough data that the statistics calculated from sub-blocks of the data are ‘close enough’ from block-to-block that I trust that I have characterized that aspect of the process adequately for my purposes. Since there are usually dozens of data streams from a process, and no real-world process is stationary anyway, and next week it will be a different process needing attention, and management just wants us to make it work and not bother them with details, I’ll keep doing things this way.

    Anyway, I guess my point is that if you estimate the standard deviation from three data points, the error bounds on that estimate are very large. Common sense, right?

    YOS: Thanks for the pointer to the WECo SQC Handbook – awesome!

  11. Ye Olde Statistician

    Bonnie Small, who led the committee that prepared the SQC Handbook, once received a standing-O at the ASQC convention. The emphasis on practicality was unparalleled. No one was in it to discover fundamental laws of nature, just to get their processes to run smoothly and to understand when it was reasonable to search for a specific cause of variation and when it was not.

    Putting all your eminences grises away in a citadel is like Hercules lifting the giant Antaeus into the air. Once he lost contact with earth, the giant lost his strength. I hired and trained quality engineers for each of our product lines, but I posted them in the buildings where the products were made, where they worked with the production, QC, and others. Black-belt shmack belt. Like they say in The Music Man, ‘Ya gotta know the territory.’

    I always get a big laugh watching the news when they tell us that this or that fluctuation in the stock market was “due to” this or that event in the outside world. They have never heard of assignable causes and common causes.

    Of course, I never had a head for high finance. I had a client one time — I was but a small cog in a much larger, lemming-like machine — charged with creating process maps for the AS-IS system. The client was engaged in buying and selling the mortgage debts of other people. How you make money buying what other people owe always did escape me. Anyhow, they were creating new financial ‘instruments’ to deal with new fed rules that said they could not let their reserve account run down to zero at the end of the day. And once I had mapped these instruments, I decided on my own to run a FMEA on them: what could go wrong at each step in the process. The sound you heard was that of jaws dropping all over the institution. They were awestruck. They had literally never heard of such an analysis before in their lives. Judging by what happened a year later in the mortgage industry, I can believe it.

    It was difficult to pry apart their judgements of likelihood and criticality. We settled on a three-note scale: low, medium, and high probability. Anything more finely tuned than that was a pipe dream — as Dr. Briggs has rightly noted.

  12. Kip, that happens in medicine, too. How many fads have come and gone there? Medicine, at least, does have a reality check, even though lawyers, in the USA, seem to want to override it with their own judgments or opinions.

  13. Bill

    Well, if you want a bound on the probability that the torque will be >80, there is always the sample version of Chebyshev’s inequality. Avoids all the distributional issues.

    Of course, as YOS notes, you typically need to consider lot-to-lot effects.

  14. JH

    From reader JH:

    I know this comment won’t be published. However, just for the record, I did not email you the message with insufficient data information for the objective stated and with incorrectly calculated sample standard deviation value.
