The alternate title to today’s post, suggested by reader Kip Hansen, is “Data scientists find connections between birth month and health”. Don’t scoff. We’re talking peer review and wee p-values, so you know the following must be true.
Columbia University scientists have developed a computational method to investigate the relationship between birth month and disease risk. The researchers used this algorithm to examine New York City medical databases and found 55 diseases that correlated with the season of birth. Overall, the study indicated people born in May had the lowest disease risk, and those born in October the highest. The study was published in the Journal of the American Medical Informatics Association.
The peer-reviewed paper is “Birth Month Affects Lifetime Disease Risk: A Phenome-Wide Method” by Mary Regina Boland, Zachary Shahn, David Madigan, George Hripcsak, and Nicholas P. Tatonetti. The abstract reads in part:
Our dataset includes 1 749 400 individuals with records at New York-Presbyterian/Columbia University Medical Center born between 1900 and 2000 inclusive. We modeled associations between birth month and 1688 diseases using logistic regression. Significance was tested using a chi-squared test with multiplicity correction.
So, nearly 2 million people of all ages thrown into regression models with diseases as outcomes. Wee p-values “confirmed” the “links”, which is to say, the ritual of classical statistics was used to infer birth month causes certain diseases. Which is to say that astrologers were right all along.
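To see how the ritual manufactures wee p-values, here is a minimal sketch, not the paper’s actual pipeline: invent data in which birth month and disease are, by construction, unrelated, then run the same kind of chi-squared screen across many diseases. The sample size, disease count, and 5% base rate below are made-up illustration numbers, far smaller than the paper’s 1,749,400 people and 1,688 diseases.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n_people, n_diseases, alpha = 20_000, 200, 0.05

# Pure noise: birth months uniform, diseases flipped independently of them.
birth_month = rng.integers(1, 13, size=n_people)
totals = np.bincount(birth_month, minlength=13)[1:]  # people per month

false_alarms = 0
for _ in range(n_diseases):
    disease = rng.random(n_people) < 0.05  # 5% base rate, no real link
    cases = np.bincount(birth_month[disease], minlength=13)[1:]
    table = np.column_stack([cases, totals - cases])  # 12 x 2 contingency table
    _, p, _, _ = chi2_contingency(table)
    if p < alpha:
        false_alarms += 1

print(f"'Significant' month-disease links found in pure noise: {false_alarms}")
```

Without multiplicity correction you expect roughly 5% of the tests to come up “significant” even though nothing is there; corrections shrink but do not abolish the problem, especially at the paper’s sample sizes, where trivially small effects also earn wee p-values.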
Funny that many of the astrological diseases were cardiovascular.
Looking at all 10 (9 novel) cardiovascular conditions revealed that individuals born in the autumn (September–December) were protected against cardiovascular conditions while those born in the winter (January–March) and spring (April–June) were associated with increased cardiovascular disease risk…
Now because probability models are silent on cause, but something must be causing these curious correlations, the authors of this (and similar) study have to launch into causal explanations.
The relationship between cardiovascular disease and birth month could be mediated through a developmental Vitamin D-related pathway. Serum 25-hydroxyvitamin D levels are lower and parathyroid hormone levels are higher during the winter when no supplementation is given. Even with maternal supplementation, seasonally dependent Vitamin D deficiency has been observed among breastfed infants and newborns…
So mothers having babies in off months might—might—lack vitamin D, and this deficiency is somehow transferred to their enwombed babies, and then said babies are somehow damaged by this lack until they become aged and have time to develop heart disease. Hey. It could be true.
Or it could be statistical nonsense. You pick.
Sums of sums equal some sums
From reader Ken Steele comes “this interesting video from a scientist/mathematician named Marvin Weinstein concerning how to find patterns/data in a huge dense dataset.”
The video has some interest relating to probability and cause, but I wish Weinstein would move along faster.
Singular value decomposition takes a matrix, which you can think of as the rows and columns of a spreadsheet, and forms weighted sums of the columns such that each sum is orthogonal (in the algebraic sense) to the other sums of columns. The number of sums of columns always equals the number of original columns.
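The two claims in that paragraph, as many sums as original columns, and mutual orthogonality, can be checked in a few lines. A toy sketch with a made-up random matrix standing in for the spreadsheet:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))  # 500 rows, 6 columns: a toy "spreadsheet"

# SVD factors X = U @ diag(s) @ Vt. The weighted sums of the original
# columns are X @ Vt.T (equivalently U scaled by the singular values).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
sums = X @ Vt.T

# As many sums as original columns...
print(sums.shape[1] == X.shape[1])

# ...and each sum is orthogonal to the others: the off-diagonal
# entries of the Gram matrix vanish (up to floating point).
gram = sums.T @ sums
off_diag = gram - np.diag(np.diag(gram))
print(np.max(np.abs(off_diag)))
```

The rows of Vt hold the column weights; the sums themselves are what the probability models would then be fed in place of the original columns.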
That means you can use the sums in probability models instead of the original data. Well, this is old news. What Weinstein is selling is the idea of using something akin to kernel density estimates to find “patterns” among the sums. The idea is to find points in the “hairballs”, i.e. the three-D plots of the sums, that are clustered together.
Points will, of course, cluster together. Something is causing those points to cluster because something caused every original data point. That points cluster is therefore not especially interesting, but it’s nice to have an automated method of picking these points out.
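The point that points always cluster is easy to demonstrate. Here is a crude sketch of an automated cluster finder, not Weinstein’s actual method: feed a kernel density estimate pure 3-D noise with no structure by construction, then flag the points sitting in the densest 5%. The 1,000 points and the 5% cutoff are arbitrary illustration choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Pure noise: 1000 points in three dimensions, no structure by construction.
points = rng.normal(size=(3, 1000))

# Kernel density estimate over the "hairball".
kde = gaussian_kde(points)
density = kde(points)

# A naive automated "cluster finder": flag points in the top 5% of
# estimated density. It always finds something to flag.
threshold = np.quantile(density, 0.95)
cluster = density >= threshold
print(f"Points flagged as 'clustered': {cluster.sum()} of 1000")
```

By design the finder flags about 50 points here, and it would flag about 50 no matter what data you handed it; whether the flagged points mean anything is a question the method cannot answer.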
But the automated method will only find clustering points that are susceptible to being found by the automated method. This isn’t quite tautological, because if you used a different automated method, you’d find different clustering points.
Now any statistical method might uncover a causal relationship that you hadn’t previously thought of. But it will only uncover them if they (the causes) are consonant with the method used. The problem is the age-old one: some of the “causes” uncovered will be spurious. The more data you cram in and the more “tuneable” the method, the more likely, experience shows, that any “cause” identified will be spurious.
One of the conceits of “big data” is that it can even measure all the correct things that are causative of some observation. This might be so for simple physical phenomena, but not for human behavior. Almost any act we measure across a large number of people will have oodles upon oodles of causes. There’s no hope we can capture everything.