
Why Principal Component Analysis Ain’t All That

My enemies managed to change the spelling of Principal in the title. They are everywhere.

Principal Component Analysis

There is an extremely popular analytic technique called Principal Component Analysis, or PCA. It is simple to understand, if you recall your linear algebra; I’ll assume you do. I’m going to wave my hands about the math, and focus on the philosophy.

A square matrix X can, under mild conditions (always met by the symmetric matrices we’ll use below), be written in its eigendecomposition. Eigen is German for “funny math”. Non-square matrices can be similarly treated with singular value decomposition (SVD), which for us is philosophically the same thing, though the math is slightly different.

So square X = Q x D x Q^-1, where Q is the matrix of eigenvectors, and D the diagonal matrix of eigenvalues. It’s not important how this is done. The neat property of the eigenvectors is that they are mutually orthogonal. In probability and statistics orthogonal means uncorrelated. A bell should be ringing. The eigenvalues are weights of a kind, with bigger meaning the corresponding eigenvector “explains” more of X than the others.

In P&S we say “explains the variance of X.” The bell should be louder.
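To see the decomposition concretely, here is a minimal sketch in Python with numpy; the little symmetric matrix is my own made-up example, nothing from any paper.

```python
import numpy as np

# A made-up symmetric matrix standing in for X.
X = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, Q = np.linalg.eigh(X)  # columns of Q are the eigenvectors
D = np.diag(eigenvalues)

# The decomposition: X = Q D Q^-1 (and Q^-1 = Q^T for symmetric X).
print(np.allclose(Q @ D @ np.linalg.inv(Q), X))  # True

# The eigenvectors are mutually orthogonal.
print(np.allclose(Q.T @ Q, np.eye(2)))           # True
```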

Now in certain areas, like crime or population genetics or whatever, we measure things (little) w. For crime these might be w_1 = murder rate, w_2 = number of assaults, w_3 = number of rapes and so on. For genetics, we’d measure, say, w_1 = expression level of gene (or rather snip) 1, w_2 = expression level of gene 2, and so on. You get it.

These measures are repeated across individuals or places or whatever. Voila. We have a matrix, (big) W, composed of these measures as columns and their repetitions as rows. It will only be luck that W is square. But we can always get a square matrix if, instead of W itself, we take the variance-covariance matrix of W. This pairs every little w with every other little w, even each with itself.

This leads to either C, which is composed of the variances-covariances, or R, which is just the correlations. Both are (obviously) square matrices.

Then either C or, more usually, R is given its eigenvalue decomposition. And there you are. Except in probability and statistics it’s called the principal components decomposition, or PC analysis. Why not call it eigendecomposition analysis? Don’t know. Maybe whoever named it didn’t like Germans. Skip it.
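In code the whole pipeline is a few lines. A minimal sketch, assuming numpy, with made-up data standing in for the crime or SNP measures (the sizes and seed are my inventions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up W: 50 repetitions (rows) of 3 measures (columns),
# e.g. w_1 = murder rate, w_2 = assaults, w_3 = rapes.
W = rng.normal(size=(50, 3))

C = np.cov(W, rowvar=False)       # variance-covariance matrix: square (3 x 3)
R = np.corrcoef(W, rowvar=False)  # correlation matrix: also square

# "Principal components analysis" is the eigendecomposition of R (or C).
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]  # biggest "weights" first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("share of variance 'explained':", eigenvalues / eigenvalues.sum())
```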

People then start plotting the PCs, usually just the first against the second, and they get pretty agitated about it, too. This is where it becomes interesting because suddenly the PCs take on a life of their own. Yet what do they mean?

Recall that correlations are linear correlations. Which, after backing everything out, makes the PCs nothing but linear combinations of the original (little) w. (The same is true for PCs in singular value decomposition.)

That’s it. Linear combinations of the w, with weights for each w given by the decomposition. The weights, like the eigenvalues, change from PC to PC. The complete set of w’s appears in every PC, only with different weights. The weights are used to say how much of the variance of W is “explained” by the linear combinations of w.

Point should be obvious: there is nothing in the PCA of W that isn’t already in W. There is nothing hidden that was extracted by some occult formula.
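One way to convince yourself: the decomposition is an invertible rotation, so W is recovered exactly from its PCs. A sketch, again with made-up data:

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(50, 3))        # made-up measures again

Wc = W - W.mean(axis=0)             # center the columns
_, Q = np.linalg.eigh(np.cov(Wc, rowvar=False))

scores = Wc @ Q                     # the PCs: linear combinations of the w's

# Rotate back and restore the means: W returns exactly.
W_recovered = scores @ Q.T + W.mean(axis=0)
print(np.allclose(W_recovered, W))  # True; nothing occult was added
```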

It’s not that you can’t learn from linear combinations of measurements. Of course you can. But you’re not discovering anything that wasn’t always there.

Again, each PC contains, in different weights, each w_j. That may or may not be the best way to use the data to predict some Y, or to explain some phenomenon (there are things like “rotating PCs”, which I’ll ignore).

For one, linear correlation isn’t the be-all end-all of analysis. For another, some of the w you measured might be useless, wrong-headed, or more noise than signal. Some of that uselessness is compensated for by (eventual) low weights, but not all. And who said you need every w every time? And what about non-linear combinations of w? And so on.

Anyway, I don’t want to go on and on about that, because people begin sniping about minutiae and forget the main point: there is nothing in the PCA of W that isn’t already in W.

The Paper

Allow me to introduce Eran Elhaik, who wrote the paper “Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated”, in Nature: Scientific Reports.

If you haven’t understood anything I’ve said, read his paper. He explains PCA in great depth, and in a population genetics context. I am only going to give the briefest review. His gist (with which I am in complete agreement): “PCA results may not be reliable, robust, or replicable as the field assumes.”

Here’s his running example, simple and elegant, and worked out mathematically; you can even do it by hand.

“Expression levels” of colors are measured (as SNPs, or “snips”). His w_1, w_2, and w_3 are “R”, “G”, “B”. He has several samples of these expression levels. Below the picture is the figure legend.

Applying PCA to four color populations. (A) An illustration of the PCA procedure (using the singular value decomposition (SVD) approach) applied to a color dataset consisting of four colors (n_All = 1). (B) A 3D plot of the original color dataset with the axes representing the primary colors, each color is represented by three numbers (“SNPs”). After PCA is applied to this dataset, the projections of color samples or populations (in their original color) are plotted along their first two eigenvectors (or principal components [PCs]) with [simulated sample sizes] (C) n_All = 1, (D) n_All = 100, and (E) n_All = 10,000. The latter two results are identical to those of (C). Grey lines and labels mark the Euclidean distances between the color populations calculated across all three PCs.

Now you can see I hid something from you in my description. The repetitions of measurements are always given labels. That’s the real secret to the mysticism of PCAs.

Here, the labels are color names: Red, Green, Blue, and Black. But they could be anything. If these were real SNPs then the row labels could be, say, country of origin of the person providing the sample. Or their race. Or it could be each person’s favorite fruit pie flavor. It could be the amount of change they had in their pockets. It could be anything.

Let me repeat that last sentence: The row labels could be anything.

When PC1 is plotted against PC2, the labels always come along. That’s what you see in Fig. 1 (C) (or the other two, which are the same but for larger sample sizes). The Green value of the pair (PC1, PC2) is separated nicely from the Red value of (PC1, PC2). In real life, the distances between the colors are uniform. Some of that information is lost in plotting only two PCs, so the plotted distances aren’t completely uniform. Well, no one expected just two of the PCs to represent all three (there are always as many PCs as w’s).
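For those who want to poke at it, here’s a sketch of the color example in numpy. One caveat: the top two eigenvalues of this dataset are tied, so the loadings and signs your implementation returns may differ from the paper’s figures while spanning the very same plane.

```python
import numpy as np

# Rows: one sample each of Red, Green, Blue, Black.
# Columns: the three measured "SNPs" (R, G, B expression levels).
A = np.array([[1.0, 0.0, 0.0],   # Red
              [0.0, 1.0, 0.0],   # Green
              [0.0, 0.0, 1.0],   # Blue
              [0.0, 0.0, 0.0]])  # Black

Ac = A - A.mean(axis=0)          # center, as PCA requires
U, S, Vt = np.linalg.svd(Ac, full_matrices=False)

# Columns of U hold each sample's coordinates along PC1, PC2, PC3
# (up to sign and, with tied eigenvalues, up to rotation in the plane).
for label, row in zip(["Red", "Green", "Blue", "Black"], U):
    print(f"{label:6s} PC1 = {row[0]:+.2f}  PC2 = {row[1]:+.2f}")
```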

So far it looks like PCA is giving us good results. Let’s now see how it all goes wrong.

Start with PC1. It’s a linear combination of -0.71 Green and +0.71 Blue and 0 for Red and Black. PC2 is +0.82 Red, -0.41 Green, -0.41 Blue, 0 Black. Here we know not to try and interpret these, because we know, for instance, Green is green all by itself, with no admixture of red or blue or black. But, in ordinary PCA the temptation to interpret is overwhelming.

It is never resisted.

So let’s not resist, either. Let’s see. PC1 is -0.71 Green and +0.71 Blue. Those are like opposites. And no Red or Black? What’s up with that? Racism? Maybe we call PC1 Blue supremacist-Greenophobic. PC3 is 0.14 for all but Black, which is -0.43. Certainly that is anti-Black! Have a go at interpreting PC2 in the comments.

Okay, laugh or dismiss. But this kind of interpretation always happens. And, do not forget, we could have labeled the rows something else. We knew—this is key—to pick color names because we knew in advance what the results should be.

But we could have picked anything for those row names. From what lab did the color samples arrive? What technician measured them? What machine? What races like which colors best? Maybe more Asians like Red more than Australians, who like Blue. Anything can be chosen—if you don’t already know the right answer.
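A sketch to drive the point home: rerun the decomposition with any labels you fancy and the numbers are identical, because PCA never sees the labels at all. (The alternative labels below are my own jokes, not the paper’s.)

```python
import numpy as np

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
Ac = A - A.mean(axis=0)
U, S, Vt = np.linalg.svd(Ac, full_matrices=False)

# The decomposition used no labels; they get pasted on at plotting time.
labels_we_chose = ["Red", "Green", "Blue", "Black"]
labels_we_could_have = ["Lab A", "Technician 2", "Machine 3", "Likes pie"]

for chosen, possible, row in zip(labels_we_chose, labels_we_could_have, U):
    print(f"{chosen} (or {possible}): PC1 = {row[0]:+.2f}, PC2 = {row[1]:+.2f}")
```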

There’s more. Our author again (with my paragraphifications; the picture he alludes to is his Fig. 4; I beg you to stick with this):

Studying the origin of Black using the primary colors

Three research groups sought to study the origin of Black. A previous study that employed even sample-sized color populations alluded that Black is a mixture of all colors (Fig. 1B–D). A follow-up study with a larger sample size (nRed = nGreen = nBlue = 10) and enriched in Black samples (nBlack = 200) (Fig. 4A) reached the same conclusion.

However, the Black-is-Blue group suspected that the Blue population was mixed. After QC procedures, the Blue sample size was reduced, which decreased the distance between Black and Blue and supported their speculation that Black has a Blue origin (Fig. 4B).

The Black-is-Red group hypothesized that the underrepresentation of Green, compared to its actual population size, masks the Red origin of Black. They comprehensively sampled the Green population and showed that Black is very close to Red (Fig. 4C). Another Black-is-Red group contributed to the debate by genotyping more Red samples.

To reduce the bias from other color populations, they kept the Blue and Green sample sizes even. Their results replicated the previous finding that Black is closer to Red and thereby shares a common origin with it (Fig. 4D).

A new Black-is-Green group challenged those results, arguing that the small sample size and omission of Green samples biased the results. They increased the sample sizes of the populations of the previous study and demonstrated that Black is closer to Green (Fig. 4E).

The Black-is-Blue group challenged these findings on the grounds of the relatively small sample sizes that may have skewed the results and dramatically increased all the sample sizes. However, believing that they are of Purple descent, Blue refused to participate in further studies. Their relatively small cohort was explained by their isolation and small effective population size.

The results of the new sampling scheme confirmed that Black is closer to Blue (Fig. 4F), and the group was praised for the large sample sizes that, no doubt, captured the actual variation in nature better than the former studies.

This is hilarious. If you’ve understood everything so far, you get the joke. If not, start over and try again until you laugh.

If you don’t see it, then just know that it’s easy to get the colors in PC1 and PC2 to land just about anywhere you like. In turn, PCA “confirms” Black is “really” Red, then it is “really” Green, then it is “really” Blue. PCA is a terrific tool for “proving” things.
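Here is a sketch of the trick, using my own made-up sampling schemes rather than the paper’s exact Fig. 4 designs: replicate one color’s samples, rerun the PCA, and the distances from Black in the (PC1, PC2) plane shift obligingly. The helper black_distances is my invention.

```python
import numpy as np

def black_distances(n_red, n_green, n_blue, n_black):
    """PCA the color data under a given sampling scheme; return the
    distance from Black to each other color in the (PC1, PC2) plane."""
    base = {"Red": [1.0, 0.0, 0.0], "Green": [0.0, 1.0, 0.0],
            "Blue": [0.0, 0.0, 1.0], "Black": [0.0, 0.0, 0.0]}
    sizes = {"Red": n_red, "Green": n_green, "Blue": n_blue, "Black": n_black}
    rows, labels = [], []
    for color, snps in base.items():
        rows += [snps] * sizes[color]
        labels += [color] * sizes[color]
    X = np.asarray(rows)
    U, S, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    proj = U[:, :2] * S[:2]  # every sample's spot in the PC1-PC2 plane
    labels = np.array(labels)
    centroid = {c: proj[labels == c].mean(axis=0) for c in base}
    return {c: round(float(np.linalg.norm(centroid["Black"] - centroid[c])), 3)
            for c in ("Red", "Green", "Blue")}

# Even sampling: Black sits equidistant from all three colors.
print(black_distances(10, 10, 10, 10))
# Oversample one color and the "distances of origin" move around.
print(black_distances(100, 10, 10, 10))
print(black_distances(10, 100, 10, 10))
```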

Elhaik continues the gag by adding in a secondary “admixed” color population. Again, he gets any results he wants, and even easier (see Fig. 8). And then, again, using “primary and multiple mixed colors” he does the trick again. And…but you get the idea.

Well, I LOLed, as the kiddies say.

This is a rich, deep paper, dissecting PCA and population genetics and the plethora of over-statements and over-certainties that plague the field. If you are in this area you must read it. Maybe another day we’ll come back to that.

Part of his conclusion (with which I agree):

PCA did not produce correct and/or consistent results across all the design schemes, whether even-sampling was used or not, and whether for unmixed or admixed populations. We have shown that the distances between the samples are biased and can be easily manipulated to create the illusion of closely or distantly related populations. Whereas the clustering of populations between other populations in the scatter plot has been regarded as “decisive proof” or “very strong evidence” of their admixture, we demonstrated that such patterns are artifacts of the sampling scheme and meaningless for any bio-historical purposes.

Even stronger:

Specifically, in analyzing real populations, we showed that PCA could be used to generate contradictory results and lead to absurd conclusions (reductio ad absurdum), that “correct” conclusions cannot be derived without a priori knowledge and that cherry-picking or circular reasoning are always needed to interpret PCA results. This means that the difference between the a posteriori knowledge obtained from PCA and a priori knowledge rests solely on belief.

As we did in our “interpretation” above.

And:

Overall, the notion that PCA can yield biologically or historically meaningful results is a misconception supported by a priori knowledge and post hoc reasoning. PCA “correct” results using some study designs are utterly indistinguishable from incorrect results constructed using other study designs, and neither design could be justified a priori to be the correct one.

And:

Some authors revealed the cards by proposing to use PCA for “exploration” purposes; however, the “exploration” protocol was never scripted, and neither was the method by which a posteriori knowledge can be garnered from this journey to the unknown. “Exploration” is thereby synonymous with cherry-picking specific PCA results deemed similar to those generated by other tools.

Yep.

And:

As a “black box” basking in bioinformatic glory free from any enforceable proper usage rules, PCA misappropriations, demonstrated here for the first time, are nearly impossible to spot.

Amen.

Fan Letter

This paper caused me to write a fan letter to the author. I said, in part:

My heart soared like a hawk when I read your paper on PCAs…

There is some strange magical belief in certain analysis techniques, like “machine learning”, neural nets, and these PCAs. People somehow believe they can extract everything that can be known about cause if only enough correlates are measured, and manipulated in just the right way. You can scarcely talk them out of it.

Which means I think your paper, as terrific as it is, will be largely ignored. Just as the many (huge, even) demonstrations that null hypothesis significance testing is absurd are ignored.

You’re attacking a technique that is positively fecund in generating “results”. And that is what science is these days: “results”.

Elhaik replied, giving a depressing, but familiar, anecdote about how the science press works. He was asked for a quote on a hot population genetics PCA paper, “proving” something or other. His cautions given to the reporter were ignored, and the journalist used him as a strawman for the paper’s authors to savage.

He knew his paper would likely be ignored. And he knows of the temptation to produce juicy research (my term), and tools like PCA are too fruitful in this chase to abandon. He said, “Take away PCA and what will you get? Will people actually have to think? What good would that do?”

Indeed.

Buy my new book and learn to argue against the regime: Everything You Believe Is Wrong.


Replies

  1. Briggs, thanks for this. This old guy doesn’t quite follow the math (I’ll need to spend more time on that) but sees where it’s going. A couple of years ago I did a PCA analysis of Wuhan flu incidence in PA by counties using state info on population density, number of nursing homes, population age distribution, average income level, etc.
    I used PCA analysis software available for free on the web. It turned out that population density and nursing home number were the two most important variables…which one would have thought to be the case beforehand. A couple of weeks of intense effort to show the obvious.

  2. From the paper:

    …whether PCA grants a posteriori knowledge independent of experience (a priori). Let us also agree that if the answer to any of those questions is negative, PCA is of no use to population geneticists.

    A posteriori knowledge, by definition, is knowledge derived from experience. So, how can it be independent of experience? The answer to the question is therefore negative. Am I missing something here?

  3. Oh my goodness, an average distance changes if you change the sample! Who would have guessed? (A correlation is an average scaled pairwise distance)

    What’s the QC procedure the author slides in there? and the magic SD1, SD2, SD3, … unexplained variables in the main code? Maybe he’s diddling with the distances again?

  4. Ahhhhh. Yes, of course, PCA is utter bollox. Laughably so. Always has been.

    Yet, and here’s the frightening part, PCA is THE statistical analysis method used in nearly every sigh-ance undiscipline today, from ecology to climatology to sociology to psychology.

    Brainless twits who want to be scienterrific are spoonfed crap PCA computer programs they don’t understand and can’t run themselves by commie professors who must publish or perish, and so the PCA brain fever epidemic has spread far and wide.

    The cure? I don’t know. A zippy paper by some clear thinking doctor will not suffice. Stern lectures and even public square canings won’t heal the infected. Gibberish breeds gibberish. There is no hope.

  5. Yep, more evidence that Everything You Believe Is Wrong. The older I get, the more I realize that wading through the BS is the primary task of a person trying to think clearly.

  6. Michael Mann
    Steve McIntyre

    See Andrew Montford’s magnificent “The Hockey Stick Illusion: Climategate and the Corruption of Science.”

  7. Something is amiss in the canned examples, say, in Figure 1.

    Why would anyone reduce the simple standard basis of (1,0,0), (0,1,0), and (0,0,1) to a more complicated basis via PCA as done in Figure 1? There might be a reason, I don’t know. What is the purpose of PCA?!

    I also wondered why one would run a PCA on indicator variables (R, G, & B) using the covariance matrix C. First, note the math in here: X^T X = 0, rendering a moot maximization. Second, it’s known that regular calculations of variance and covariance (hence correlation) for those binomial variables are not useful, if not nonsense. That is, the matrix B in Figure 1 makes no sense. Since R, G & B are multinomial, one might estimate the covariance matrix using the corresponding sample proportions.

    IMO, it’s not whether PCA is used as a confirmation tool or useful as an exploration tool, but rather whether the method is applied appropriately.

  8. JH,

    Your examples are off. It’s (1,0,0,0) etc. You forgot a 0. It doesn’t start as an orthonormal basis.

  9. JH,

    I see it. And?

    Now see matrix A, which is not an orthonormal basis. If he left off black, then it would be, of course.

  10. Graph B plots the four data points in R^3 with the basis (1,0,0), (0,1,0), and (0,0,1) in a simple way. Why rotate them by using PCA? Why? Furthermore, the covariance matrix is not meaningful.

  11. JH,

    Also the point (0,0,0). He’s working with the data matrix A. As if it were a data matrix. Of the kind of data people normally take.

    A is not orthonormal. A is known and understandable, and so PCA should represent what we know well. It does not, or not always. That’s his point.
