Over-Certainty Of Polygenic Scores

This is a revamped and updated classic post with new material. It first appeared on 6 February 2020. It’s time to revisit this topic, as we’ll see below.

Polygenic scores (PS), sometimes called polygenic risk scores (PRS), are not difficult to understand. Statistically, anyway.

Let’s start with this picture from The Atlantic article “An Enormous Study of the Genes Related to Staying in School”. Pictures like this in the literature are hard to come by. Most researchers don’t enjoy showing their data in this raw, and criticizable, form. Usually we only see the results of models or of (see below) certain kinds of p-value plots. If you’ve seen more plots like this, let us know in the comments.

This “plots years of schooling against subjects’ polygenic scores”, but “controlled” for Sex and other things. “Control” means dumped into a regression and is not the English word control. It will be almost impossible, but at first try and ignore the blue line, which does not exist, and which is not the data. We don’t need to know yet what polygenic scores are except to understand they are measures taken on each person.

The polygenic scores were normalized, meaning subtracting the means and dividing by the standard deviation of the measures, which accounts for the x-axis note “in standard deviations”. This is just a simple transform and not of interest. The polygenic score is used to predict years of education (sort of; that “control” screws things around, but we’ll ignore it).

Let’s learn how to read this graph. Pick a PS of -2. Years of education (modified) for that level run about 4-5 to 18-19. Now pick a PS of +2. Years of education run from about 7 to about the same top, maybe 20. In other words, given a PS of -2 you’d predict, conditional on this data, there’s a 99% (or whatever) chance of years of education in the interval 4-19. Given +2 you’d predict the same chance for 7-20 years or so. A slightly, and only slightly, higher chance for greater years of education by moving from -2 to 2, or 4 standard deviations. Not enough of a difference to make a difference for most decisions.
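Here is a minimal sketch of that way of reading the graph, on simulated dots and not the study's data. The mean, slope and scatter below are numbers I made up to give a cloud of roughly the same weak shape; `central_interval` is a hypothetical helper that looks only at people whose score sits near a chosen value.

```python
import random

random.seed(1)

# Assumed, illustrative numbers only (not taken from the study):
# standardized scores, a weak linear relation, and wide scatter.
N = 20000
MEAN_YEARS = 13.5   # hypothetical average years of education
SLOPE = 0.9         # hypothetical years gained per 1 SD of score
NOISE_SD = 2.6      # hypothetical scatter around the line

scores = [random.gauss(0, 1) for _ in range(N)]
years = [MEAN_YEARS + SLOPE * s + random.gauss(0, NOISE_SD) for s in scores]


def central_interval(x0, width=0.25, coverage=0.95):
    """Empirical interval of years for people with a score near x0."""
    ys = sorted(y for s, y in zip(scores, years) if abs(s - x0) <= width)
    tail = (1 - coverage) / 2
    lo = ys[int(tail * (len(ys) - 1))]
    hi = ys[int((1 - tail) * (len(ys) - 1))]
    return lo, hi, len(ys)


for x0 in (-2, 2):
    lo, hi, n = central_interval(x0)
    print(f"score near {x0:+d}: n={n}, 95% of years between {lo:.1f} and {hi:.1f}")
```

The two intervals overlap heavily even though the scores are 4 standard deviations apart, which is the point of reading the dots and not the line.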

What about PS scores of -3 versus +3? At +3 there are far fewer dots, and they run from about 11 to 20 years. There could be few dots because the sample size is small, or because not that many people really have PS scores this high. Similarly for the -3. There’s no way to tell the story of the missing dots until we take this model and try it to make actual predictions in real life. Just like any statistical model.

Look at the blue line now: it’s a regression with PS and years and many other things. Regular readers know better than to rely on R^2, which exaggerates evidence. That “11% of the variation” is statistical arcana which produces over-certainty. It is a measure of model fit and exaggerates predictive accuracy. In any case, you can I hope plainly see this is not a wonderful model. It might—just possibly might—have some utility at very high and very low PSs. But since predictive error always swells at the extremes where the sample is small, the uncertainty in the prediction would be large.
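A quick piece of arithmetic shows why "11% of the variation" sounds better than it predicts. This is a sketch that treats the 11% as the R^2 of a simple linear fit, which is my simplifying assumption since the real model has other terms in it.

```python
import math

r_squared = 0.11  # the "11% of the variation" quoted for the blue line

# In a simple linear regression the residual spread is
# sd(Y) * sqrt(1 - R^2), so this ratio is what is left after using the score.
remaining = math.sqrt(1 - r_squared)
print(f"Spread left after using the score: {remaining:.3f} of the original")
print(f"Reduction in spread: {100 * (1 - remaining):.1f}%")
```

Under that assumption, knowing the score trims the plus-or-minus on a person's years of education by only about 6%.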

I mean predictive and not parametric uncertainty here! It would be an obvious error to use that line as a substitute for the prediction without its plus-or-minus stated in real terms. Or to say the plus-or-minus is in the line and not the prediction. Or to say the line is real: that is the Deadly Sin of Reification. Or to use the plus-or-minus of the parameter of the regression and not of the prediction of the observable (the most common generator of massive over-certainty).
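The difference between the two kinds of plus-or-minus can be sketched with a one-predictor least-squares fit on simulated data. Everything here is assumed for illustration: the data are made up, and the intervals use the normal value 1.96 where a t-quantile would strictly belong.

```python
import math
import random

random.seed(2)

# Simulated stand-in data; the slope and scatter are assumptions.
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [13.5 + 0.9 * xi + random.gauss(0, 2.6) for xi in x]

# Ordinary least squares for one predictor, written out by hand.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
intercept = y_bar - slope * x_bar
resid_sd = math.sqrt(
    sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
)


def half_widths(x0, z=1.96):
    """Approximate 95% half-widths at x0: (for the line, for a new person)."""
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    line = z * resid_sd * math.sqrt(leverage)
    new_obs = z * resid_sd * math.sqrt(1 + leverage)
    return line, new_obs


for x0 in (0, 2):
    line, new_obs = half_widths(x0)
    print(f"score {x0}: line +/- {line:.2f} years, new person +/- {new_obs:.2f} years")
```

The uncertainty in the line shrinks as the sample grows; the uncertainty in the observable does not, and it is the second that matters for any decision about a person.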

This is not an unusual outcome, this model, for polygenic scores against some external measure. Some are better, many are worse. If we were only to go by the predictive performance of models like this, which are underwhelming to say the least, there wouldn’t be much interest in polygenic scores. But there is huge interest.

The reason is that polygenic scores are statistical encapsulations of genetic measures, and most people think genes cause things like “IQ” or years of education. Or, more tangibly, they cause things like body height or heart disease. And so on. If this is so, if genes are direct causes, then it is thought that polygenic scores express the amount of “cause” certain genes have on an outcome of interest, like score on an “intelligence test”.

Regular readers will not be surprised to learn that I doubt all this, and while I think polygenic scores have some use, the evidence related to them is, as with many statistical measures, hyped and over-certain. I do not say wrong: I say over-certain. The reasons for being skeptical will come later. For now, let’s look at what these polygenic scores are. I’ll skip all niceties, caveats and cautions and give only the rough statistical outline.

Snip-snip

A single-nucleotide polymorphism (SNP) is a change at some specific position on the genome in which at least a certain fraction of the population have a different nucleotide than the others. Just one guy with a change out of 7 billion isn’t enough: they say. Most people have, for example, A in a certain position on the genome, but a bunch instead have C. There are other considerations about SNP types that aren’t of direct interest. We’re also ignoring measurement error. The SNP is in the end just a measure: a yes/no per person, a count/fraction per sample.

Enter the genome-wide association study, or GWAS, which looks at SNP variations between people. These go one of two ways. The variation in SNPs in one group with, say, a certain disease is compared against a control group without the disease. Or an enormous number of people are sampled, and the variation in SNPs is statistically related to some outcome, like scores on an “intelligence test” or height or far looser things like years of education.

In the simplest disease-control case, a single SNP is used to produce an odds ratio (direct probabilities work, too). If a larger share of people in the disease group than in the control group have A rather than C at the SNP, then this SNP gives some evidence of “association” with the disease. Some call (at least implicitly) this association a cause. But if it were a cause, then every person with A would have the disease (unless something exterior blocks it) and every person with C would not, unless there were other causes of the disease besides this gene.
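For concreteness, here is a small sketch of the odds ratio for one SNP. The counts are invented, and `odds_ratio` is just a hypothetical helper name.

```python
def odds_ratio(cases_a, cases_c, controls_a, controls_c):
    """Odds of carrying A (vs C) among cases divided by the same odds among controls."""
    return (cases_a / cases_c) / (controls_a / controls_c)


# Hypothetical counts, invented for illustration.
cases_a, cases_c = 60, 40
controls_a, controls_c = 50, 50

ratio = odds_ratio(cases_a, cases_c, controls_a, controls_c)
print(f"odds ratio = {ratio:.2f}")

# The same counts as direct probabilities of disease given the allele
# (only meaningful if cases and controls were sampled in population proportion).
p_disease_given_a = cases_a / (cases_a + controls_a)
p_disease_given_c = cases_c / (cases_c + controls_c)
print(f"Pr(disease | A) = {p_disease_given_a:.2f}, Pr(disease | C) = {p_disease_given_c:.2f}")
```

With these made-up counts the odds ratio is above 1, yet plenty of people with A are healthy and plenty with C are sick, which is what "association" without cause looks like in a table.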

This kind of analysis is no different whatsoever from noticing, say, that people in the disease group ate more bananas than people in the control group. Same statistics, same vague notions of cause.

Distressingly, the SNPs said to be “important” are identified by wee Ps. Wee p-values, that is, with all the over-certainty and mistakes typical of them. Why not use predictive probabilities instead? Why not indeed? P-values cannot discern cause and they certainly generate massive over-certainty, and this is true even with genetic measures.

Now, instead of looking at SNPs one by one, they can be combined in a regression-like fashion to produce a polygenic score or polygenic risk score, in the following way.

We first start with an outcome Y, such as disease presence, score on “intelligence test”, or height. Anything that can be measured on people is a candidate Y. A weight relating each SNP from the GWAS to the outcome Y is then produced via some kind of regression.

The weights are then combined:

$$S = \sum_i (X_i * \beta_i),$$

where $\beta_i$ is the weight and $X_i$ the presence of the marker genotype SNP. The S is usually normalized, as above. Now there are usually too many Betas and not enough data, so the regression is often some form of LASSO or Ridge regression. These are nice because they shrink the Betas toward 0, and LASSO sets many of them to exactly 0. All that is of technical interest. All we need remember is that S is a statistical measure of the state of comparative biology of a person. Blood pressure is such a measure, too, so there is nothing strange or wrong about biological measures.
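The sum is simple enough to sketch in a few lines. The weights, the people and the 0/1 coding below are all hypothetical; a real score would use weights from a fitted GWAS regression and far more SNPs.

```python
import statistics

# Hypothetical weights for five SNPs (the zeros mimic what LASSO does).
betas = [0.12, 0.0, -0.05, 0.30, 0.0]

# Hypothetical people: 1 if the marker genotype is present, else 0.
genotypes = {
    "person_1": [1, 0, 1, 0, 1],
    "person_2": [0, 1, 0, 1, 0],
    "person_3": [1, 1, 1, 1, 0],
    "person_4": [0, 0, 0, 0, 1],
}


def polygenic_score(x, beta):
    """S = sum_i X_i * beta_i for one person."""
    return sum(xi * bi for xi, bi in zip(x, beta))


raw = {name: polygenic_score(x, betas) for name, x in genotypes.items()}

# Normalize: subtract the sample mean and divide by the sample standard deviation.
raw_values = list(raw.values())
mean = statistics.mean(raw_values)
sd = statistics.stdev(raw_values)
normalized = {name: (s - mean) / sd for name, s in raw.items()}

for name in genotypes:
    print(f"{name}: raw S = {raw[name]:+.2f}, normalized S = {normalized[name]:+.2f}")
```

Note that the normalized score of any one person depends on who else is in the sample, which is worth remembering when scores are carried from one population to another.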

An excellent graphic of this process is at this site, about which more in a moment. In the end, we have for each individual an S and a Y. The S are used to predict the Y. Above, the S was given and Y was years of education (sort of).

There are, of course, lots of SNPs in human DNA. Too many to use all at once, even with LASSO; usually only some are analyzed. Which to choose?

The strategy is to find the fewest X_i that give the best association, via some measure, between S and Y. Many use R^2, and not predictive skill. Again we have over-certainty. Anyway, after these SNPs are gathered, people stare at the X_i and say “Oho! Gene X_72F24BG has a high Beta and is therefore responsible for partly causing IQ!” Meaning ability, or score, on a certain test.

Everybody recognizes more than one gene is “associated” with or causes complex things like test scores or body height. But, except for a handful of people, everybody also believes these associations are causes (take Charles Murray, who says it outright). Well, they have to be, right? We’ve heard forever of selfish genes and heritability and evolutionary psychology with selection pressure (a cause) on genes. It is said genes are the causes of phenotypes. And brains are computers and we’re nothing but machines designed to promulgate our genes. On all that, see Limitations Of Biological Determinism: Ideas In Our Reenchantment & Rectification.

Can You See Me Up Here?

That site with the clever polygenic-score infographic has the article “New Turmoil Over Predicting the Effects of Genes”.

A key breakthrough was the recent development of genome-wide association studies (GWAS, commonly pronounced “gee-wahs”). The genetics of simple traits can often be deduced from pedigrees, and people have been using that approach for millennia to selectively breed vegetables that taste better and cows that produce more milk. But many traits are not the result of a handful of genes that have clear, strong effects; rather, they are the product of tens of thousands of weaker genetic signals, often found in noncoding DNA. When it comes to those kinds of features — the ones that scientists are most interested in, from height, to blood pressure, to predispositions for schizophrenia — a problem arises. Although environmental factors can be controlled in agricultural settings so as not to confound the search for genetic influences, it’s not so straightforward to extricate the two in humans.

Note that “the product of” is causal language.

What had also emerged from that research as an “obvious, beguiling offshoot,” according to Nick Barton, an evolutionary biologist at the Institute of Science and Technology Austria, was a specific prediction known as a “polygenic score.” Beyond the associations themselves, GWAS could provide estimates of how individual variants in the genome corresponded to measurable changes in a trait; polygenic scores constituted the sum of all those tiny effects. For instance, with height, having a guanine base instead of a cytosine one in a particular DNA region might correlate with being 0.1 millimeter taller than average. The polygenic score would take all those approximations, add them up and spit out a prediction for some individual’s actual height.

This was done to “explain” (cause again) the differences in heights between northern and southern Europeans. Or so everybody thought. Recall that p-values and other traditional statistical measures not based on prediction produce a lot of false signals.

Then came larger databases and recalculations and the signal for height differences disappeared.

“The new studies are really quite disconcerting,” Barton said, because they demonstrated that scientists had been mistaking biases in the polygenic score calculations for something biologically interesting. Their statistical methods of accounting for population structure were not so adequate after all…

Barton agreed. “The whole thing is tricky, because the origins of genetic variation in any population are really complicated,” he said. “Now you really can’t take at face value any of these methods over the last four or five years that use polygenic scores.”

“Maybe the Dutch just drink more milk, and this is why they’re taller,” Sunyaev added. “We can’t say otherwise with this analysis.”

The paper is here: “Reduced signal for polygenic adaptation of height in UK Biobank”. Nice title. Not for the first time, and surely not the last, I’m reminded of early work in parapsychology. Early results showed big effects and had everybody juiced, but the closer people looked, the more the results faded into the distance.

What about predictive skill?

Given that some experts want to roll out polygenic scores in the clinic, it’s already clear that this flaw could deepen the disparity in health care. In a study published last month, researchers found that trying to translate insights gleaned from European data to make health predictions in people of African descent led to as much as a 4.5-fold drop in accuracy. Others have tried using polygenic scores to make poorly supported claims about differences in behavioral and social traits between populations (such as IQ and education attainment, which are far more difficult to define and unpack than height is, yet are being used to potentially inform future policymaking decisions). “It’s kind of scary,” said Sarah Tishkoff, a geneticist at the University of Pennsylvania who emphasized how critical it is to collect more underrepresented genomic information.

And cause?

“The methods developed so far really think about genetics and environment as separate and orthogonal, as independent factors. When in truth, they’re not independent. The environment has had a strong impact on the genetics, and it probably interacts with the genetics,” said Gil McVean, a statistical geneticist at the University of Oxford. “We don’t really do a good job of … understanding [that] interaction.” [ellipsis original]

Here is a quote from the Abstract of the Reduced-signal paper, noting a phenomenon common in published statistical findings. That is, wonderful results are claimed, but when tested independently, the signal fades away. This is because it is too damned easy to create models, but shockingly few ever think to test them, especially if their results have been blessed with the magic of a wee P. And it is magic. If the threshold is met, cause is claimed.

Here, we describe a new analysis based on the UK Biobank (UKB), a large, independent dataset. We find that the signals of selection using UKB effect estimates are strongly attenuated or absent. We also provide evidence that previous analyses were confounded by population stratification. Therefore, the conclusion of strong polygenic adaptation now lacks support. Moreover, these discrepancies highlight (1) that methods for correcting for population stratification in GWAS may not always be sufficient for polygenic trait analyses, and (2) that claims of differences in polygenic scores between populations should be treated with caution until these issues are better understood.
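Confounding by population stratification is easy to reproduce in a toy simulation. This is a sketch under loud assumptions: two made-up groups, an allele that by construction does nothing to height, and group means and frequencies I picked out of the air.

```python
import random

random.seed(3)


def simulate_group(n, allele_freq, mean_height):
    """Return (has_allele, height) pairs; the allele has NO effect on height."""
    people = []
    for _ in range(n):
        has_allele = 1 if random.random() < allele_freq else 0
        height = random.gauss(mean_height, 6.0)  # cm, independent of the allele
        people.append((has_allele, height))
    return people


def mean_gap(people):
    """Mean height of carriers minus mean height of non-carriers."""
    carriers = [h for a, h in people if a == 1]
    others = [h for a, h in people if a == 0]
    return sum(carriers) / len(carriers) - sum(others) / len(others)


# Hypothetical groups: the allele is commoner in the north, and northerners
# are taller for some other reason (milk, say).
north = simulate_group(5000, allele_freq=0.7, mean_height=178.0)
south = simulate_group(5000, allele_freq=0.3, mean_height=174.0)

print(f"pooled gap:       {mean_gap(north + south):+.2f} cm")
print(f"within north gap: {mean_gap(north):+.2f} cm")
print(f"within south gap: {mean_gap(south):+.2f} cm")
```

Pooled, the allele looks like it "adds" height; inside either group the gap hovers around zero. The allele is only a marker of where you come from.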

We’re not nearly done. We have to look at that interaction and at such things as “IQ.” Height can at least be unambiguously measured (or near enough). We’ve seen it’s not so easy with something as difficult as intelligence.

Which Genes Do You Pick?

Suppose you are convinced that more years of education are good. (Don’t laugh.) You and the missus are thinking of conceiving (good pun!) a child. You head down to the nearest Design-Ur-Baby (same parking lot as Costco) where both you and your good lady surrender some gametes, which took the magic out of mating, but some things are more important, like utility maximization. Your gametes are thrust into the Gene-o-Matic 3000™.

The biologist trainee (who used to work at Costco) takes your request for a baby that will maximize its years of education. He feeds the request into the machine. Which genes, or SNPs, does it pick? Which does it cull? Hundreds and hundreds and even more hundreds of genes go into these scores. And did I mention the ones picked to go into PSs are chosen by wee Ps? Like this:

A transform of p-values is on the y-axis, where wee becomes WEE (large). P-values! The worst measure ever invented, and cause of more scientific grief than any other. The measure that says, via logical fallacy and only accidentally correctly, “My correlation is causation”, because that is all PSs are. Correlation, snapshots in time, where who knows what gene is producing what proteins, in a body which has a long and complex history. Then there’s another problem.
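The y-axis transform is just minus the base-10 logarithm of the p-value. The sketch below applies it to SNPs that, by construction, have nothing to do with the outcome. The sample sizes and the 0.05 cutoff are assumptions for illustration (real GWAS use far stricter cutoffs), and the p-value uses a normal approximation.

```python
import math
import random

random.seed(4)

N_PEOPLE = 200
N_SNPS = 2000
THRESHOLD = 0.05  # conventional cutoff, used here only for illustration

# An outcome that, by construction, depends on none of the SNPs.
outcome = [random.gauss(0, 1) for _ in range(N_PEOPLE)]


def p_value_for_random_snp():
    """Two-sided p-value (normal approximation) comparing carriers to non-carriers."""
    carrier = [random.random() < 0.5 for _ in range(N_PEOPLE)]
    a = [y for y, c in zip(outcome, carrier) if c]
    b = [y for y, c in zip(outcome, carrier) if not c]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((y - mean_a) ** 2 for y in a) / (len(a) - 1)
    var_b = sum((y - mean_b) ** 2 for y in b) / (len(b) - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
    return math.erfc(abs(z) / math.sqrt(2))


p_values = [p_value_for_random_snp() for _ in range(N_SNPS)]
heights = [-math.log10(p) for p in p_values]  # the y-axis transform

wee = sum(p < THRESHOLD for p in p_values)
print(f"{wee} of {N_SNPS} SNPs with no effect have p < {THRESHOLD}")
print(f"tallest spike on the plot: -log10(p) = {max(heights):.2f}")
```

Test enough SNPs and some will spike on the plot with no help from biology at all. A stricter threshold thins the spikes but does not turn the survivors into causes.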

It seems none of your paired gametes have SNPs which give a polygenic score greater than 2 (from the figure above). You can get a 2.1, but that’s for a girl baby, and you want a boy, which is max 1.7. Turns out selecting boys versus girls is easy. Picking a high PS for years of education is not. Hold up. If you are willing to go low on the PS for height, you can tweak up education PS to nearly 2. Do you want a short well educated son?

Nine months and a large number of discarded zygotes later, you decide to adopt. Turns out none of the critters meeting your criteria were viable, and all crapped out. Maybe the best, according to your utilitarian criteria, would have been conceived by very slow or damaged sperm. And there are, and this is no joke, scientists designing crutches, as it were, for these damaged sperm, the ones men have an excess of.

Or—you do get a kid, and it’s a boy, because that’s easy, but he turns out average height, not as pretty as you would wish, two left feet (metaphorically, but beware), and, many years later, gains a useless Communications “degree” from the local state college. You head down to Costco to demand a refund, but you discover Design-Ur-Baby went out of business ten years ago.

It ought to be obvious by now that PSs are over-certain, and that finding genes which maximize any given score is anything but straightforward, though it is also beyond clear that many amazing claims will be made. There will be great hype, a degree of glee and hersteria meeting this hype in equal measure, much debate, but not much progress.

One reason, which we discussed at length before, is that, with only small exceptions, there aren’t many single genes responsible exclusively for particular traits and for none other. Sex is an exception with chromosomes, and so are the odd diseases which are gene-singular. Genes don’t exist as separate things in you anyway: you are not a machine and therefore not built like one. Genes and you and everything else are you. There are tremendous redundancies and feedbacks built into the whole genetic apparatus, which is to say, into you. Knock out one gene and others step up. This is not to say, or even suggest, genes are not important, because of course they are. But teasing out causality is most difficult, and control is not always going to be possible.

This isn’t the place to rehash all those arguments: go to the link for that. The takeaway point is that things are far more complex than generally believed. Over-certainty, as in AI, abounds. It would, however, be far worse if there was no uncertainty, and we were able to pick just those traits we thought we wanted. One minute’s surfing the internet ought to convince you of the abominations that await us if we really could design babies to our fickle and bizarre liking. “I’d like a dozen fat furry deaf non-white trans lesbians, please.”

Passing by the morality of ruthlessly killing all the unpopular, non-wee-P proto-babies, a true utilitarian act, there is anyway a superior and well tested version of eugenics to get the kind of babies you want that you can practice at home. For free! And a lot more enjoyable. Find yourself a mate that is as hot, healthy, intelligent, sane and compatible as you can manage. Eat well and have kids young. Don’t wait until you’re pushing forty, ladies.


Discover more from William M. Briggs


1 Comment

hudbwu

June 11, 2025, 7:22 pm

I stopped midway to write this comment.

What exactly is the difference between polygenic scores and two-layer neural networks? Other than the difference in training, specifically that polygenic scores aren’t trained and neural networks are.
