Discrimination Found By Statistics!

By Briggs, October 25, 2014 (42 Comments)

Was it Justice Ginsburg who popularized the fallacy that statistics could prove discrimination? Somebody check me on that. Busy day here.

The fallacy is that statistical models which have “statistically significant” findings identify causes. Which they sometimes can, in an informal way, but probably don’t. Anyway, that’s for another day. Point now is that discovering “disparities” and “gaps” and “discrimination” via statistics is silly.

Headline from the New York Post: Goldman Sachs’ differentiating stats dominate sex suit.

Standard story. A couple of dissatisfied women accused Goldman Sachs of being boy-friendly. “They are suing the financial powerhouse, alleging a pattern of underpaying women and promoting men over them.” The ladies’ lawyer and Sachs each hired their own statistician. The fallacy is already in place, ready to be called upon.

If there was real discrimination against women because they were women what should happen is that it should be proved. How? Interviews with employees, managers, ex-employees, examination of emails, memos, that sort of thing. Hard work, which, given the nature of human interactions, may ultimately be ambiguous, useless to prove anything.

Statistics certainly can’t prove discrimination, because statistics don’t identify causes. And it’s what causes the alleged discrimination that is the point in question. Since statistics can’t answer that question—which everybody should know—why would anybody ever use it?

Laziness, for one. Who wants to do all that other work? For two, it’s easy to get people to accept “discrimination” happened because math. Lawyers working on commission therefore love statistics.

The Post said, “The bank’s expert, Michael Ward of Welch Consulting, said the pay disparities between men and women are statistically insignificant and said Farber [the ladies’ expert] was overly broad in his analysis.” “Significance” and “insignificance” are model and test dependent, so it’s easy for one expert to say “insignificant” and another “significant.” The data can “prove” both conclusions.

But something else is going on here, I think. Note that according to Farber, “Female vice presidents at Goldman made an average of 24 percent less than their male counterparts”. Ah, means. An easily abused statistic.

There’s more. Here are the final two paragraphs (ellipsis original). See if you can spot the probable error. Hint: the mistake, if there is one, appears to be Farber’s. Of course, there might be no error at all. We’re just guessing.

Farber looked at divisions across the bank, rather than at smaller business units, which, according to Ward, muddied the statistical data.

“Breaking it up into these little pieces means you just won’t find these pay gaps,” Farber said. “It’s always a trade-off in this kind of analysis in getting lost in the trees…or saying something about the forest.”

Get it? Take a moment and think before reading further. It’s more fun for you to figure it out than for me to tell you.

This sentence is to fill space so you don’t easily see the answer.

So is this one.

This might be a case of Simpson’s (so-called) paradox. This happens when data looked at in the aggregate, such as mean pay for men and women at the Division level, shows (say) men with higher means, but when the same data is examined at finer levels like business units, it can show women with higher or the same means as men in each unit (or a mix).

The reason this happens is that the percent of men and women isn’t the same inside each of the finer levels, and the mean pay differs by level (no surprise). This link shows some easy examples. It’s more common than you’d think.
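To make the mechanism concrete, here is a minimal sketch with made-up numbers (two hypothetical business units, pay in thousands of dollars; none of it is Goldman Sachs data). Women have the higher mean in each unit, yet the lower mean in the aggregate, purely because most of the women sit in the lower-paying unit.

```python
# Hypothetical (headcount, mean pay in $1000s) per unit and sex.
# These figures are invented for illustration only.
units = {
    "unit_A": {"women": (80, 52.0), "men": (20, 50.0)},   # lower-paying unit
    "unit_B": {"women": (20, 105.0), "men": (80, 100.0)},  # higher-paying unit
}


def aggregate_mean(sex):
    """Headcount-weighted mean pay across all units for one sex."""
    total_pay = sum(n * mean for n, mean in (u[sex] for u in units.values()))
    headcount = sum(u[sex][0] for u in units.values())
    return total_pay / headcount


# Within every unit, the women's mean exceeds the men's mean.
women_higher_in_every_unit = all(
    u["women"][1] > u["men"][1] for u in units.values()
)

agg_women = aggregate_mean("women")
agg_men = aggregate_mean("men")

print("women higher in every unit:", women_higher_in_every_unit)
print("aggregate mean, women:", agg_women)
print("aggregate mean, men:", agg_men)
```

Swap the headcounts so that both sexes are spread identically across the units and the reversal disappears, which is the whole point: the aggregate gap here comes from the mix, not from pay within any unit.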

Farber looked at aggregates and Ward (properly) examined smaller units, a practice which Farber calls “muddying” the data. Well, it’s a strategy. The name-calling, I mean. Judges looking for an excuse appreciate it.

Even though it’s still looking at statistics, and to be discouraged, it’s better to look at the entire pay distributions, not just means, and at even finer levels, say business units and various years of experience in the same job title. But it’s chasing fairies. It can never prove anything.

And even if the data show a difference everywhere it could be that women in each unit are paid less, but not because they are women, but because women in negotiating their “packages” might do so inefficiently compared to men. Who knows?

Statistics is no substitute for hard work.

Briggs

Briggs is an internationally reviled thoughtcriminal, listed as One Of The Top 7 Dangerous Minds by the Hague.

42 Comments

John B

October 25, 2014, 9:11 am

How do you feel about using Benford’s (Frank) Law to indicate or even “prove” fraudulent accounting or tax cheats?

I searched your site in case you’d previously covered it, but all I found was Benford’s Law of Controversy. Certainly statistics could be used as an indicator of transgression, but shouldn’t be relied on without doing the leg work.

I might have confused you with a post on NumbersWatch.
Ray

October 25, 2014, 10:36 am

People still believe you can determine causality by statistics.
mike anderson

October 25, 2014, 11:22 am

“Statistics is no substitute for hard work.”

Ah, but it LOOKS like hard work, and we’ve convinced all our B school grads that it IS hard work, and if we’re clever, we get them to hire us and PAY us like it’s hard work. And we aren’t messing about with dead birds or frog entrails when we prognosticate, just scribbling some Greek letters on the board while muttering the latest incantation (“…ensemble estimate, support vector machine, meta-analysis, p-value, ….”). Ain’t technocracy grand?
DAV

October 25, 2014, 11:25 am

Ray,

I submit causality can ONLY be determined through statistics but not with two variables and most certainly not with p-values.
Uncle Mike

October 25, 2014, 2:50 pm

Pay discrimination by gender is confounded by numerous other variables including age, seniority, capability, effectiveness, efficiency, profit-generation, and other utility to the employer, as well as unit size.

So how about a least squares multiple regression with sensitivity analysis? Generate scads of peevees. I recommend Goldman Sachs (aka The Anti-Christ) and/or the Aggrieved Women (aka The Demonic Horde) handsomely hire The Statistician To The Stars, who could crunch the heck out of the data until there is nothing left but ashes.

But seriously, who cares which jackal wins the carcass? And anyway, I’m just guessing the AC can outbid the DH when it comes to buying judges.
Bob Kurland

October 25, 2014, 2:58 pm

Hey Uncle Mike, you’re giving me a chance to expound on my pet peeve–misuse of the word “gender” when what is actually meant is “sex”. Gender refers, or should refer, to the forms of articles, nouns and adjectives in languages where there is a distinction, “masculine”, “feminine” and (sometimes) “neuter”. Sex refers to whether a human (or animal) is complicated on the outside or on the inside (to paraphrase Mr. Rogers), i.e. whether there are XX or XY chromosome pairs.

Granted that the misuse of gender is common now; nevertheless, should we succumb?
Uncle Mike

October 25, 2014, 4:53 pm

Thanks, Bob, for the edumacashun.

My point, though lost in translation, was that if the PC gnatzies want to have a judicial snit (using statistics), it ought to be over whether the prison cells for the men and women of Goldman Sachs are equally devoid of amenities.
Bob Kurland

October 25, 2014, 5:19 pm

I agree with the point you were making Uncle Mike, and please join the crusade against “gender”.
Lynn Clark

October 25, 2014, 6:04 pm

Regarding the alleged pay gap. A Fox News contributor recently made the observation that if the pay gap is real, why are any men employed anywhere? Wouldn’t employers save a lot of money by employing only women?

Things that make you go, “Hmmmm……”
DAV

October 25, 2014, 6:21 pm

crusade against “gender”

Gosh! A “gender” bigot. I’m certainly not. Some of my best “genders” are friends. Anyway, the War on “Gender” could be the next government “effort”. I can just see the ads: “This is your brain. This is your brain on ‘gender’. Just say ‘Hir’.”

My father must have been an earlier crusader. His favorite replacement was s/he/it.
Brandon Gates

October 25, 2014, 6:30 pm

Bob,

… whether there are XX or XY chromosome pairs.

What do you call 47 XXY or 45 X?

DAV,

His favorite replacement was s/he/it.

[chortle]
Bob Kurland

October 25, 2014, 6:46 pm

Brandon, “what do you call XXY or 45X?”
strange!

DAV, I like s/he/it!
How is that transposed to gender identity?
Brandon Gates

October 25, 2014, 8:17 pm

Lynn,

A Fox News contributor recently made the observation that if the pay gap is real, why are any men employed anywhere?

It need not be all or nothing, and indeed it’s probably not possible: the first thing I thought of is that there is more demand for labor than women alone could satisfy. That’s especially true in jobs which require physical strength — some women are as strong or stronger than the average man, but a lot more men are stronger than the average man.

Wouldn’t employers save a lot of money by employing only women?

Again, it needn’t be all or nothing. There is some evidence in employment/unemployment figures suggesting that employers prefer to hire women (and/or lay men off) as a cost-cutting measure in downturns:

https://drive.google.com/file/d/0B1C2T0pQeiaSdm1FcldBZVRYQnM

In absolute terms, there are more men than women in the workforce, but over the past 10 years men also have a higher unemployment rate than women. As overall unemployment rate rose in 2008-09, so did the gap between men and women increase. Of course this could also mean that men are lazier than women.

These graphs use the non-seasonally adjusted data for men and women 16 years and older from here: http://www.bls.gov/webapps/legacy/cpsatab1.htm
Brandon Gates

October 25, 2014, 8:30 pm

Bob,

Perhaps strange in the sense of rare. Passing by on the street, you probably wouldn’t notice. Even if you did look twice, you’d probably call the former a male and the latter a female as a gender assignment. But the Rogerian complicated bits, while nominally functional, don’t lead to procreation when exercised in practice — so both karyotypes are sexually ambiguous. I find it difficult to argue that such occurrences are not “natural”. YMMV.
Brandon Gates

October 25, 2014, 8:35 pm

DAV,

I submit causality can ONLY be determined through statistics but not with two variables and most certainly not with p-values.

Here are some more variables to consider: http://www.forbes.com/sites/tykiisel/2013/03/20/you-are-judged-by-your-appearance/

There are likely wee pee values associated with all 7, so obviously these must also all be spurious correlations.
DAV

October 25, 2014, 8:44 pm

How is that transposed to gender identity?

Got me. I think it’s supposed to convey gender anonymity or ambiguity.

Like most PC rantings, it’s likely a product of that great corporate giant, General Semantics, which claims words control how one thinks. I made a vague reference to it back on the GRUE post. I suppose they might for those who think in words but I don’t so can’t really say. For those, perhaps words are reality. I do note that GS and its founder were institutionalized somewhere near NYC.

—
An alternate word might have been it/s/he but I guess that sounds like something needing to be scratched — like sex. Something to avoid apparently.
DAV

October 25, 2014, 8:46 pm

There are likely wee pee values associated with all 7, so obviously these must also all be spurious correlations.

p-values tell you absolutely nothing about the relationships between the variables.
DAV

October 25, 2014, 8:50 pm

Ironically, the slow Forbes link referred to something at fast.forbes.com. Hung for a long time getting what was wanted there.
Brandon Gates

October 26, 2014, 1:49 am

Briggs,

See if you can spot the probable error. Hint: the mistake, if there is one, appears to be Farber’s. Of course, there might be no error at all. We’re just guessing.

We went from “probable error” to “might be no error at all … just guessing.” Which is it?

The reason this happens is that the percent of men and women isn’t the same inside each of the finer levels, and the mean pay differs by level (no surprise) … Farber looked at aggregates and Ward (properly) examined smaller units, a practice which Farber calls “muddying” the data.

Except that Farber didn’t look at aggregate data. The issue here is that Ward’s models effectively created multiple regressions over much smaller sample sizes:

http://www.goldmangendercase.com/pdf/20140812-plaintiffs-reply-memo.pdf

Ward analyzes 190 division/business unit/rank groupings. 62% of these groupings have sample sizes of less than 60,” the threshold below which regression analysis ceases to reliably produce statistical significance. Farber Reb. Rep., ¶ 73. “As a result,” Dr. Farber summarizes, “it is not possible to learn about the presence or absence of systemic discrimination from these analyses. Dr. Ward has constructed them in such a way that if statistical evidence of systemic discrimination exists in these data, it would not be detected by his tests because his tests have so little statistical power.” Id., ¶ 58. The sample size problem predictably renders Dr. Ward’s analysis unreliable. Id., ¶¶ 59-69, 73. “The sizes of the groups Dr. Ward analyzes are important because with such small sample sizes, he would not be able to observe statistical significance even if there were very substantial pay differences between men and women.” Id., ¶ 61. “These data show that it is not possible to learn about the presence or absence of systemic discrimination from Dr. Ward’s analyses.” Id., ¶ 69. 28 Dr. Ward agrees with this point: “[S]ome of these Business Units are small, and when the number of observations is too small, a regression model may not be mathematically feasible; alternatively, the model may be unduly affected by a few observations.” Ward Rep. at 16, n.30.

The takeaway here is that “slice and dice” can result in both false positives and negatives.
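The small-sample point can be illustrated with a rough power calculation. The sketch below is an assumption-laden simplification: it uses a two-sided two-sample z-test with a known standard deviation rather than the regressions either expert actually ran, and the pay gap and spread (in thousands of dollars) are invented. It only shows how the chance of detecting a fixed, real gap depends on group size.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(n_per_group, true_gap, sd, alpha=0.05):
    """Power of a two-sided two-sample z-test (known sd, equal group sizes).

    All inputs are hypothetical; this is not either expert's model.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    standard_error = sd * sqrt(2.0 / n_per_group)
    shift = true_gap / standard_error
    # Probability the test statistic lands in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# A real gap of 10 with a spread of 40, checked at several group sizes.
for n in (15, 30, 500):
    print(n, round(power_two_sample_z(n, true_gap=10.0, sd=40.0), 3))
```

Under these assumed numbers, a grouping of 60 people (30 per sex) would miss a real gap most of the time, while a pooled sample of 1,000 would almost always flag it. That is the trade-off both experts are arguing about, seen from the low-power side.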

Even though it’s still looking at statistics, and to be discouraged, it’s better to look at the entire pay distributions, not just means, and at even finer levels, say business units and various years of experience in the same job title.

Both Ward and Farber looked at individual data, and both controlled for variables that would normally be expected to affect compensation. An oddity of Ward’s models is that he controlled for manager performance evaluations … sort of a no-no since gender discrimination in performance reviews is one of the plaintiffs’ key complaints.

But it’s chasing fairies. It can never prove anything.

One often wonders if you think statistics are good for doing anything.
Brandon Gates

October 26, 2014, 2:30 am

DAV,

p-values tell you absolutely nothing about the relationships between the variables.

On this blog, the smaller the p-value, the harder the researcher worked to build a model supporting the desired conclusion.
Sheri

October 26, 2014, 12:35 pm

Statistics are good for many things, properly used. Just like hammers are good for a lot of things, but when you start using a hammer as a screwdriver, well, there goes the utility of the hammer. (You probably want examples of where the “hammer” should be used, right?)
DAV

October 26, 2014, 2:45 pm

On this blog, the smaller the p-value, the harder the researcher worked to build a model supporting the desired conclusion.

Not just on this blog. For some, it might as well be their job description. Their entire career is devoted to finding small ones.

Low p-values support no conclusions but are often used as springboards for leaping at one. Saying a low p-value supports the conclusion is a lot like saying the car you just built will perform very well because you used the best screws in its construction. Just plain silly.

Regardless, it just might perform well but no one will ever know because it is never driven to find out.
Brandon Gates

October 26, 2014, 6:14 pm

Sheri,

You probably want examples of where the “hammer” should be used, right?

Oh dear, you’re on to me!
Brandon Gates

October 26, 2014, 6:25 pm

DAV,

Saying a low p-value supports the conclusion is a lot like saying the car you just built will perform very well because you used the best screws in its construction.

The way I know it, a p-value below the (hopefully predetermined) significance threshold only allows one to reject the null hypothesis.
DAV

October 26, 2014, 7:05 pm

Then you know it incorrectly. The p-value is a statement about a model parameter and not a statement about the model validity. It is not the P(model|parameter, data) but, at best, it is P(parameter|model,data). In the frequentist world, you aren’t even allowed to call it a probability because a frequentist is not supposed to assign probabilities to unobservables. Since it says nothing about P(model|parameter, data) then obviously it cannot tell you anything about the hypothesis behind the model.

There are many posts here concerning this. Go read them.
Brandon Gates

October 26, 2014, 10:11 pm

DAV,

The p-value is a statement about a model parameter and not a statement about the model validity.

Mmm hmm. That’s implicit in my previous statement. Perhaps it would have been better to say that a p-value above the pre-determined significance level only means the null hypothesis should not be rejected.

Since it says nothing about P(model|parameter, data) then obviously it cannot tell you anything about the hypothesis behind the model.

Which was emphasized numerous times in my stats classes and I have never forgotten it. The question in contention here is what a p-value does or does not say about the null hypothesis.

There are many posts here concerning this. Go read them.

I have, several times. This one is a fairly comprehensive list of why to not use p-values, ever: https://www.wmbriggs.com/blog/?p=9338 The relevant quote to this discussion: Few remember its definition, which is this: Given the model used and the test statistic dependent on that model and given the data seen and assuming the null hypothesis (tied to a parameter) is true, the p-value is the probability of seeing a test statistic larger (in absolute value) than the one actually seen if the experiment which generated the data were run an indefinite number of future times and where the milieu of the experiment is precisely the same except where it is “randomly” different. The p-value says nothing about the experiment at hand, by design.

But then this comment from a different post: https://www.wmbriggs.com/blog/?p=4687#comment-53478 I googled “definition of p-value” and the first site that came up was Wikipedia. According to them, “the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.” This is false.

Why is the Wikipedia definition false?
DAV

October 27, 2014, 12:52 am

You are missing what I said above.

Assume the Wikipedia definition is accurate (although it is missing key elements, — reread the Briggs quote more carefully), you are getting the quality of a model component and not the quality of model performance. It’s like measuring the quality of the screws used in building a car then claiming knowledge about the car’s performance from that. It doesn’t work.

Also, it’s specific to the data used in building (i.e. establishing the parameters of) the model. It amounts to determining how good of a fit you got. Goodness of fit tells you nothing about future performance.

Truth is, you can hunt around and nearly always find a model that will fit. Not to mention that if you have enough data the p-value is guaranteed to be low for nearly any model as it is dependent on sample size. Try plotting the chi-squared densities for different sample sizes and look at what happens to it as sample size is increased. Almost every realistically obtainable statistic value has a p-value < 0.05 if N is large enough. For fun, try using sample sizes of 100,000 or more. Most software can't handle that. It's still a pointless measure though.
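A quick way to see the sample-size dependence described above, without plotting anything: hold a small observed difference fixed and let N grow. This is a sketch under assumptions (a two-sided z-test on a difference in means, expressed in standard-deviation units, with a made-up effect of 0.02), not the chi-squared exercise suggested in the comment.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_in_sd, n_per_group):
    """Two-sided z-test p-value for a fixed observed difference in means.

    effect_in_sd is the observed difference divided by the (assumed known)
    standard deviation; both groups have n_per_group observations.
    """
    z = effect_in_sd * sqrt(n_per_group / 2.0)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same tiny observed difference at increasing sample sizes.
for n in (100, 10_000, 100_000):
    print(n, two_sided_p(0.02, n))
```

The difference never changes, only N does, and the p-value goes from unremarkable to tiny. The one exception is an observed difference of exactly zero, where the p-value stays at 1 no matter how large N gets.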

Assuming the model won't work because it has low quality parameters might not be completely unwarranted but it's still a guess. You really don't know. A car built with lousy screws could still be a great performer. Using P(rain tomorrow) for P(model|parameter, data) would be just as good as P(parameter|model, data) when judging model performance. Neither works.

The ONLY way to get a handle on model validity (and thus the hypothesis) is to test the model against never-before-seen-by-the-model data. It is a never-ending process.
Brandon Gates

October 27, 2014, 4:02 am

DAV,

I recognized off the bat that the Wikipedia definition did not have all of the elements that Briggs included. However, I did not find anything in the Wiki definition that directly conflicted with Briggs, so if anything I’d call the Wiki definition “incomplete” not “false”.

I more than get it that a wee pee value does not mean that the model is valid. Again, the way I learned it, p > α only means that the null hypothesis cannot be rejected at the given level of significance. p is NOT the probability that the null hypothesis is false. 1 - p is NOT the probability that the experimental hypothesis is true. p is NOT a measurement of model skill or statistical power of the results.

I think the best argument against p-values is that they’re misunderstood and/or wrongly used as THE one metric which says, “Eureka!”. Which is more a critique on statistical education and/or practice, especially since Fisher himself (attempted to) make clear that p-values were more for researchers to have a measurable way of ruling out non-promising variables/hypotheses, or identifying things warranting further research. That little nuance was something I don’t remember being hammered into me, so this has been a good review for me.

The ONLY way to get a handle on model validity (and thus the hypothesis) is to test the model against never-before-seen-by-the-model data. It is a never-ending process.

Well sure. If you’ve got a large enough number of observations, you can “randomly” divvy it up into multiple training/validation populations. As temptation to cheat will (does) get the best of most researchers, independent reproduction by a different team over different data is preferable.
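For readers who have not seen the training/validation split mentioned here, a minimal sketch follows. Everything in it is assumed for illustration: the data are simulated from a known straight line plus noise, the seed is fixed so the run is repeatable, and the model is an ordinary least-squares line fitted by hand.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared prediction error of the fitted line on (xs, ys)."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed: simulated, hypothetical data
data = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)))

# "Randomly" divvy the observations into training and validation halves.
rng.shuffle(data)
train, valid = data[:100], data[100:]

slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# The model never saw the validation half while being fitted.
valid_mse = mean_squared_error(
    [x for x, _ in valid], [y for _, y in valid], slope, intercept
)
print("slope:", slope, "intercept:", intercept, "validation MSE:", valid_mse)
```

The number that matters is the error on the held-out half. A split like this only guards against fooling yourself with one data set; it is not a substitute for a different team checking the model against genuinely new data.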

It’s not always possible to perform multiple, iterative, independent research cycles. This particular court case is one such example. Drug trials are another. The question then becomes, what statistical approach might tend to give better answers than a Fisherian style null hypothesis test?
DAV

October 27, 2014, 5:48 am

Again, the way I learned it, p > α only means that the null hypothesis cannot be rejected at the given level of significance.

If p-value tells you nothing about the validity of the model (which can ONLY be judged by its performance) then how can it tell you anything about the validity of the hypothesis that the model is supposed to encapsulate? P(parameter) ≠ P(model validity) ≠ P(hypothesis) except in rare exceptional cases where they coincidentally are equal.

It’s not always possible to perform multiple, iterative, independent research cycles.

In which case, you wasted time creating a model. Why build something if you aren’t going to use it? For fun? Busy work to justify a paycheck? Amaze your friends and colleagues? I don’t see the point.
Sheri

October 27, 2014, 7:59 am

Brandon:
Wiki is not a definitive source of information, considering anyone can edit it and many politically motivated people do. Also, people who just want to be right and hope intellectually lazy people (not saying that’s you, but if you keep using wiki…) will just read and believe. Stay away from Wiki. Only bad things come from Wiki.

Much of the time, you keep trying to shortcut the ONLY way to a good model and good science: repeat, repeat, repeat. Science can only be verified through repeating experiments and introducing new data to models that accurately predict. Sure, you can short-cut but it changes everything and it’s not science anymore. Years ago, I saw a lot of broken down homes in Southern Arizona where the houses had no mortar between the bricks and were crumbling. Seems a time-honored tradition just to stack the bricks. Mortar was a foreign concept to these individuals. It seems to be becoming the same thing with science: just stack up the data, spit out a fast and furious response, throw in a p value and declare victory. Except that’s no more science than the bricks without mortar were walls. You lose science that way. There has to be repetition and prediction: patience and hard work.
Ray

October 27, 2014, 5:16 pm

DAV,
Causality is a deterministic process. How can you determine if a process is deterministic using statistics?
DAV

October 27, 2014, 8:54 pm

Ray.

That could take a very long answer. The short version: All of the evidence for causality requires statistics, in particular, correlation and independence of variables. Suggest Judea Pearl’s book Causality: Models, Reasoning, and Inference for the long version.
Brandon Gates

October 27, 2014, 9:04 pm

DAV,

If p-value tells you nothing about the validity of the model (which can ONLY be judged by its performance) then how can it tell you anything about the validity of the hypothesis that the model is supposed to encapsulate?

I keep writing that p-values don’t say anything about the research hypothesis. Ever.

Why build something if you aren’t going to use it?

Who said anything about not using the model? In the context of this lawsuit, we have two competing statistical models. The question for the jury, if this thing ever reaches trial, is which one is the more credible.
Brandon Gates

October 27, 2014, 9:05 pm

Sheri,

Wiki is not a definitive source of information, considering anyone can edit it and many politically motivated people do.

Anyone can publish a blog too. We would know, yes? I vet Wikipedia the same way I do anything I read.

Much of the time, you keep trying to shortcut the ONLY way to a good model and good science: repeat, repeat, repeat.

The only time I can remember saying something along the lines of “It’s not always possible to perform multiple, iterative, independent research cycles” is “we can only work with the data at hand”. Then, as now, those statements were contextual, as in not to be taken as generally applicable. Operating within the constraints of reality is not a shortcut, it’s a necessity.
DAV

October 27, 2014, 9:38 pm

I keep writing that p-values don’t say anything about the research hypothesis. Ever.

Yet you keep claiming you can use them to reject an hypothesis. How can this be if they don’t say anything about the hypothesis?

Who said anything about not using the model? In the context of this lawsuit, we have two competing statistical models.

If p-values don’t say anything about the validity of the models how can they indicate which is more credible? Are they preferred over entrails because they are cleaner?
—
Really! Look at your contradictions. You aren’t making any sense.
—
what statistical approach might tend to give better answers than a Fisherian style null hypothesis test?

There is none other than examining the results of repeated model runs against new data. If you can’t do that then you are being misled and/or being misleading by using p-values. Sheri was right: there are no short cuts.

A jury is definitely not the proper venue for this and the lawyers in particular don’t really care which is better. For examples see the Vioxx transcripts for what lawyers do with the data. They only need to convince a jury using any means they think will work. The truth just gets in the way. Mathematical voodoo magic using p-values fits their goals quite well and looks ever so sciencey compared to smoke and mirrors.
Brandon Gates

October 28, 2014, 1:16 am

DAV,

Yet you keep claiming you can use them to reject an hypothesis.

Not just any hypothesis, the null hypothesis. Here is an interesting perspective:

http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf

… for Fisher, null hypothesis testing was the most primitive type in a hierarchy of statistical analyses and should be used only for problems about which we have very little knowledge or none at all (Gigerenzer et al., 1989, chap. 3). Statistics offers a toolbox of methods, not just a single hammer. In many (if not most) cases, descriptive statistics and exploratory data analysis are all one needs. As we will see soon, the null ritual originated neither from Fisher nor from any other renowned statistician and does not exist in statistics proper. It was instead fabricated in the minds of statistical textbook writers in psychology and education.

The six-item true/false quiz at the top of pg. 3 is interesting. If you take it, I predict you’ll get a perfect score.

If p-values don’t say anything about the validity of the models how can they indicate which is more credible?

I asked an open-ended question pursuant to the quotation cited above: What statistical approach might tend to give better answers than a Fisherian style null hypothesis test? Your answer is: There is none other than examining the results of repeated model runs against new data. That wasn’t the answer I was hoping for, but it is an answer.

Look at your contradictions. You aren’t making any sense.

A potentially confounding factor here is that I am distinguishing between null and experimental hypotheses and you are apparently not. So, in order of appearance, here is specifically what I’ve said about p-values and the null hypothesis:

1. The way I know it, a p-value below the (hopefully predetermined) significance threshold only allows one to reject the null hypothesis.

2. Perhaps it would have been better to say that a p-value above the pre-determined significance level only means the null hypothesis should not be rejected.

3. Again, the way I learned it, p > α only means that the null hypothesis cannot be rejected at the given level of significance. p is NOT the probability that the null hypothesis is false. 1 - p is NOT the probability that the experimental hypothesis is true. p is NOT a measurement of model skill or statistical power of the results.

4. I think the best argument against p-values is that they’re misunderstood and/or wrongly used as THE one metric which says, “Eureka!”. Which is more a critique on statistical education and/or practice, especially since Fisher himself (attempted to) make clear that p-values were more for researchers to have a measurable way of ruling out non-promising variables/hypotheses, or identifying things warranting further research.

Note that (2) is a deprecation of (1), i.e., I caught myself having written something that might be incorrect. (3) is a restatement of (2), with an elaboration of what p-values DO NOT say about the null hypothesis, experimental hypothesis, and statistical model. (4) is a statement of my understanding of how p-values should be used by researchers who calculate them.

I echo Willis Eschenbach’s standard request: If you disagree with someone, please quote the exact words you disagree with. This allows everyone to understand what you think is incorrect.

Sheri was right: there are no short cuts.

Her comment was misdirected; I’m not advocating for shortcuts.

A jury is definitely not the proper venue for this and the lawyers in particular don’t really care which is better.

Ok fine. If you were on a jury, I take it that no output of any statistical model would influence your decisions on the case. What do you think constitutes valid evidence in court cases such as this?

For examples see the Vioxx transcripts for what lawyers do with the data.

Compelling, but anecdotal. Do you believe the Vioxx case is representative? More to the point; why or why not?
DAV

October 28, 2014, 2:11 am

A potentially confounding factor here is that I am distinguishing between null and experimental hypotheses and you are apparently not.

Because the distinction exists only in the world of hypothesis testing. Think about it. In that world Y was caused by X or something else where something else is the null. You can’t determine this using just X and Y. It takes a minimum of three variables in the simplest cases and often more and certainly not by getting some statistic on the unobservable components of a model. The basis behind hypothesis testing this way is based on a fallacy.

If you disagree with someone, please quote the exact words you disagree with.

I have no intention of quoting everything you have said. That’s your shtick.

I’m not advocating for shortcuts.

Yes you are. You want a magic number that will answer a very tough question that can only be answered by hard work over time.
Sheri

October 28, 2014, 7:51 am

Brandon: The question is: How gullible is the jury and how persuasive are the lawyers. The jury has no expertise in statistics and will choose based on their lack of knowledge and love of a specific lawyer, not actual math. (Barring, of course, a miracle where one juror actually can do math and the other 11 will go along.) DAV makes a point on this also–read how people sue over drugs and how uninformed those outcomes are. The jury sees a sick person and mean Big Pharma with too much money and it’s not “fair”. One suspects cases of discrimination are decided much the same way–who looks like a victim and who looks like the rich, mean company with all males at the top.

If you are “vetting” Wiki, how about just using the information you found to support it??? Yes, anyone can write a blog and people who do write blogs spend a great deal of time supporting what they write with links and answering questions. I must have missed that “ask questions” part on Wiki.

Actually, if I had the time, I’d go back and find all the instances where you have told me that it is not possible to do repeat after repeat of an experiment, that it costs too much, etc. Luckily, I haven’t the time and I really don’t care that much. If you doubt me, go back over your comments on this and on my blog and find them yourself. You are definitely on the side of short-cutting when it suits your fancy (most often on the subject of global warming).
Brandon Gates

October 28, 2014, 9:55 am

DAV,

It takes a minimum of three variables in the simplest cases and often more and certainly not by getting some statistic on the unobservable components of a model.

Farber’s report isn’t available without registering for PACER, so I haven’t read it. From the other filings on that docket, I have been able to gather that he and Ward both looked at multiple Xs. Aside: I’ll be darned if I can find where I said that one X gets it.

The basis behind hypothesis testing this way is based on a fallacy.

At what point does any method of statistical inference become NOT based on a fallacy? See again also the paper I cited last post and note that I pretty much agree with all of it.

I have no intention of quoting everything you have said.

You also apparently have no intention of reading what I actually write. To wit:

You want a magic number that will answer a very tough question that can only be answered by hard work over time.

Nice fantasy. Now how about you address the questions you ducked from last post: Do you believe the Vioxx case is representative? More to the point; why or why not?

While I’m at it, here are a few others: Which non-fallacious hypothesis test did you use? How does your model do against never before seen data? How much hard work have you put in on it?
Brandon Gates

October 28, 2014, 9:58 am

Sheri,

The jury has no expertise in statistics and will choose based on their lack of knowledge and love of a specific lawyer, not actual math.

I’m not inclined to disagree, but this being a blog read by potential jurors I think an actual discussion might be instructive. But hey, if you and DAV just want to put words in my mouth and argue against strawmen on principle it’s no sweat off my brow.

DAV makes a point on this also–read how people sue over drugs and how uninformed those outcomes are.

That’s awfully broad and subjective. But I’ll return the favor: if you want some flimsy stats, look at the clinical drug trials themselves. My opinion is this — what the pharmaceutical industry most cares about is how many people their latest designer molecule is going to kill or disable, and what their exposure is. I have little faith in efficacy trials … basically I look to see that the drug probably won’t give me an immediate coronary, and if it works as advertized without side-effects worse than the ailment when I take it, great. If not, oh well.

One suspects cases of discrimination are decided much the same way–who looks like a victim and who looks like the rich, mean company with all males at the top.

That’s a great get out of jury duty card right there.

If you are “vetting” Wiki, how about just using the information you found to support it???

In this case, I was asking DAV what the conflict between Briggs’ and Wiki’s definition was … which is one technique in my vetting process. I did do my own research prior to asking, which I’m happy to share:

http://www.perfendo.org/docs/BayesProbability/twelvePvaluemisconceptions.pdf

The definition of the P-value is as follows–in words: The probability of the observed result, plus more extreme results, if the null hypothesis were true; in algebraic notation: Prob(X>=x|Ho), where “X” is a random variable corresponding to some way of summarizing data (such as a mean or proportion), and “x” is the observed value of that summary in the current data.

http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf

One only needs to understand that a p-value is the probability of the data (or more extreme data), given that the H0 is true.

http://mathworld.wolfram.com/P-Value.html

The probability that a variate would assume a value greater than or equal to the observed value strictly by chance: P(z>=z(observed)).

I found lots of wrong ones too.
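The three definitions quoted above all amount to Prob(X >= x | H0). As a rough illustration, here is a Python sketch of an exact permutation test, where X is the difference in group means and H0 is that the group labels are interchangeable. The two groups of numbers are invented for the example and have nothing to do with the Goldman data; the function name is hypothetical too.

```python
from fractions import Fraction
from itertools import combinations


def exact_permutation_p_value(group_a, group_b):
    """P(mean difference >= observed | labels are exchangeable), by enumeration.

    Inputs are sequences of integers. Every way of relabelling the pooled
    values into groups of the original sizes is counted once.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def mean_diff(sum_a):
        # Mean of relabelled group A minus mean of relabelled group B.
        return Fraction(sum_a, n_a) - Fraction(total - sum_a, n_b)

    observed = mean_diff(sum(group_a))
    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        n_splits += 1
        if mean_diff(sum(pooled[i] for i in idx)) >= observed:
            as_extreme += 1
    return Fraction(as_extreme, n_splits)


# Made-up values, purely for illustration.
group_a = [12, 15, 14, 16]
group_b = [10, 11, 13, 9]
p_value = exact_permutation_p_value(group_a, group_b)
print(f"one-sided p-value = {p_value} ({float(p_value):.4f})")
```

For these invented numbers only 2 of the 70 possible relabellings give a mean difference at least as large as the observed one, so the p-value is 1/35. Swapping in a different summary statistic (medians, say) or a two-sided comparison would generally give a different number from the same data, which is one reason two experts can disagree about "significance".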

I must have missed that “ask questions” part on Wiki.

It’s called the talk page. Every article’s got one.

You are definitely on the side of short-cutting when it suits your fancy (most often on the subject of global warming).

You’re definitely on the side of doing a very poor job reading my mind when it suits your fancy. I do have you at somewhat a disadvantage … I also know what I was thinking when I wrote those comments. I’m real sorry the “Brandon is lazy and doesn’t care about truth” argument isn’t going so well for you so far, but if you work hard enough maybe, just maybe, you’ll “prove” it some day.
Sheri

October 28, 2014, 10:31 am

Brandon: Excuse me–no one is putting words in your mouth. You said “The question for the jury, if this thing ever reaches trial, is which one is the more credible.” Maybe try remembering what you wrote and stop invoking a straw man claim instead of answering what we actually wrote. If you can prove that juries actually have statistical expertise and are qualified to judge the mathematics of the argument, you should have presented the data.

No argument on the clinical drug trials–they are seriously flawed, but you’re bailing off the cliff again in an apparent effort to avoid what is being discussed. Plus, I seriously hope you’re not arguing two wrongs make a right–bad drug trials make bad jury awards legitimate?

I don’t need a “get out of jury duty” card–rational people who make decisions based on facts are routinely excused from jury duty.

There’s a talk page on Wiki? I don’t know whether to be impressed or horrified. I guess that demonstrates my complete contempt for Wiki is indeed complete. I had no idea–not that I care, of course. I’m still not using Wiki to justify or define. “Anyone can write here” references are worthless. I do appreciate the additional references from sources that could be written by a complete fraud but probably aren’t (assuming there’s the possibility of actually calculating the probability that a web writer is legit).

If you are too lazy to look up your own comments, I see no reason why I would bother looking. You’re becoming obnoxious and I’m finished here. I don’t waste my time on people who will not actually pay attention to what they write. I can guarantee you that you have said it is too expensive to repeat experiments and you have done so more than once. If your mind and your typing are not matching, that is a personal problem of yours and one I cannot deal with. I go by what you type.

The End.
Brandon Gates

October 28, 2014, 1:27 pm

Sheri,

If you can prove that juries actually have statistical expertise and are qualified to judge the mathematics of the argument, you should have presented the data.

WT[bleep], I’m not attempting to “prove” that juries have statistical expertise. Did you not read where I wrote, “I’m not inclined to disagree …”?!!? Try reading what I write instead of telling me to remember what I wrote.

No argument on the clinical drug trials–they are seriously flawed, but you’re bailing off the cliff again in an apparent effort to avoid what is being discussed.

Ok … what is it that I’m supposed to be discussing? I wasn’t the one who brought up the pharmaceutical industry … I simply ran with it and offered my opinion that the larger problem is the vast number of drugs on the market with dubious benefit, some of which might be more harm than help. Ergo, litigation against the pharmaceutical industry, which according to you results in uninformed outcomes. Not a whiff of evidence to support your statement of opinion masquerading as fact, I might add. How do you know this???? The irony here is absolutely killing me. Do they make a pill for that? I need, like, ten of ’em, stat.

Plus, I seriously hope you’re not arguing two wrongs make a right–bad drug trials make bad jury awards legitimate?

What is this … trot out every negative stereotype in the book and hope it sticks day? Guess again … by the law of very large numbers you’re bound to get one correct eventually.

I don’t need a “get out of jury duty” card–rational people who make decisions based on facts are routinely excused from jury duty.

At my most cynical, I often lament that the people most qualified for jury duty are the ones best equipped to get out of it.

There’s a talk page on Wiki?

Every article.

I guess that demonstrates my complete contempt for Wiki is indeed complete.

Interesting thing to loathe. I’m curious … was there a particular article that set you off? Negative perception of it on principle?

I’m still not using Wiki to justify or define.

I normally don’t either for the very reason that it has credibility issues, and in this thread I certainly wasn’t. For the life of me I don’t understand why you’re hammering on this point.

I do appreciate the additional references from sources that could be written by a complete fraud but probably aren’t (assuming there’s the possibility of actually calculating the probability that a web writer is legit).

Ahhh, well see, there isn’t. That goes for anything that has ever been published by any method, especially when it was first written. Some things age better than others.

If you are too lazy to look up your own comments, I see no reason why I would bother looking.

LOL. You apparently have not learned how this works: if you make a claim, it is your burden to provide evidence of it. Too lazy to remember what I wrote. I’m DYING here. HALP!!
