# Statistical Challenge: Can You Rate How Good This Study Was?

I will lose many of you today. But for those up for an intellectual challenge, this post is for you. This will not be easy.

Let’s play a game. I’m going to give you the outline of a study, a real one. I want you to say whether it’s a good, bad, or in-between study. I want you to judge if the conclusion the researchers made can be justified, or if you see flaws they might have overlooked.

I’m not going to tell you what the study is about. The reason is that if I did give the frame some of you would be prejudiced one way or the other. From my experience with you, my friends, I’d say you will be split about fifty-fifty on this topic. I don’t want pre-formed judgments playing a role here.

We have to do this on the honor system. It’s too easy to search and discover what this study really was. Don’t do that. Or, if you do, keep silent in the comments. I want you to have an honest go at this, and see if you have learned the lessons I have been trying to teach these many years.

To make it as fair as possible, and to cut down cheating, I’m going to paraphrase everything, and I’ll describe, but not show, the relevant graphs. Reverse image searches are too good and you’d find it in a hurry. If I change any numbers, all these changes will be scrupulous and won’t change our conclusions, as you’ll discover next week when I reveal the answer.

You have one week.

The study is medical and looked at a certain kind of measured harm between two groups of kids, labeled U and N, and some alleged bad things in the blood. They also tracked the number of times the kids had a certain procedure.

The main question was this: is there less B in kids in U or N, and could smaller amounts of B be caused by bad things in the blood, moderated by the procedure?

The two groups were pretty well matched. The ages, weights, sizes, parents’ ages, and all that kind of thing were as close as you’re going to get between two groups. Again, the only named ostensible difference between the groups was the kids getting U or N.

U is unnatural, and N is natural. N is something that happens naturally, whereas the kids (and their parents) had to go out of their way to avoid N and get U, which was artificial and manufactured. The strange thing is the conclusion: the natural kids (N) had purportedly worse outcomes!

They also say that there were more bad things in the blood leading to lower levels of B in the natural group (N).

Here are the steps the researchers followed.

1. They looked only at healthy kids from both U and N. Any kid that had any kind of malady was excluded. The kids were reportedly raised normally in an advanced Western country.

2. The outcome was a certain measure of blood; call it B. B was a change from one time point to another. B itself is seen as a good thing, with high values better than low. But again, whatever values the kids had, no kid had any malady. This was a study only to measure the potential of a malady, which no kid reportedly had (but see below).

3. They recruited 19 Us and 79 Ns. Then they measured the blood for B and other things on all of these kids. And they also measured blood on their mothers.

4. The researchers noticed something odd in 35 of the kids. We don’t know the mix, U and N, just the total number. They re-measured the blood on these 35 kids, and not the other 63, because something appeared odd in their blood, but they don’t say much about what was odd. Only that it wasn’t B. Another 14 kids had an increased value in another blood measure that was not B. Again, we don’t know how many were U and how many N, or if the 14 were part of the 35. So they did fresh blood draws on these kids.

5. All of the blood measurements used standard protocols and procedures.

6. After a government report was issued on U and N and the alleged bad things in the blood, the researchers went in and gathered more blood.

7. Next was the stats. They fit a regression model to the changes in B and the timing of a certain procedure. All kids had the procedure at least once; some had it more than once. The procedure, call it P, was not related to U or N. With one exception, every analysis the researchers presented was based on these regressed values, and not the raw data; all downstream analyses used the modeled values.
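The two-stage approach in step 7, regress first, then hand the regressed values to every later analysis, can be sketched as follows. Everything here is simulated and hypothetical (the study’s actual variables are hidden); the point is only to show what gets discarded when you swap raw data for fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 98                                   # 19 U + 79 N, as in the study
timing = rng.uniform(1, 12, n)           # hypothetical procedure-timing variable
b_change = 0.3 * timing + rng.normal(0, 1.0, n)  # simulated change in B

# Stage 1: ordinary least squares of the change in B on procedure timing.
X = np.column_stack([np.ones(n), timing])
beta, *_ = np.linalg.lstsq(X, b_change, rcond=None)
fitted = X @ beta

# Downstream analyses in the study used values like `fitted`, not `b_change`.
# The residual scatter -- the part of the raw data the model throws away --
# is exactly the uncertainty that vanishes from all the later plots.
residual_sd = np.std(b_change - fitted)
print(residual_sd)
```

Handing `fitted` to the next model treats those values as if they were observed without error, which is why the later plots look tighter than the raw data could ever be.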

8. Because there were other measures in the blood besides B, that might influence or be influenced by B, i.e. those alleged bad things, they included these measures in the model with the procedure timing, using what is known as a stepwise regression to find the “best” model (with respect to a statistical criterion we can ignore). The group U and N were, of course, also in the model.
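Stepwise regression, named in step 8, works roughly like this. Below is a minimal forward-selection sketch using AIC as the criterion; the study’s actual criterion and variable names are unknown, so the “bad thing” columns here are pure invention for illustration.

```python
import numpy as np

def aic(y, X):
    """AIC for an OLS fit of y on X (Gaussian likelihood, up to a constant)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(y, candidates):
    """Greedily add whichever candidate variable lowers AIC the most."""
    n = len(y)
    chosen, X = [], np.ones((n, 1))
    best = aic(y, X)
    improved = True
    while improved:
        improved = False
        for name, col in candidates.items():
            if name in chosen:
                continue
            score = aic(y, np.column_stack([X, col]))
            if score < best:
                best, best_name, best_col = score, name, col
                improved = True
        if improved:
            chosen.append(best_name)
            X = np.column_stack([X, best_col])
    return chosen

rng = np.random.default_rng(1)
n = 98
signal = rng.normal(size=n)
cands = {"bad_thing_1": signal,              # truly related to y
         "bad_thing_2": rng.normal(size=n),  # noise
         "bad_thing_3": rng.normal(size=n)}  # noise
y = 2.0 * signal + rng.normal(size=n)
selected = forward_stepwise(y, cands)
print(selected)
```

The known vice of the procedure: because the “best” model is chosen by searching over the same data it is fit to, the reported fit is optimistic, and noise variables get in more often than the criterion’s nominal penalty suggests.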

9. To check this model, they used a smoother regression, cut the data into differently sized chunks, and the original model may or may not have been adjusted because of how B looked in the chunks. They are light on details here.

10. Now in 41 of the kids, there was observed some sneezing or coughing. But we don’t know how many were U or N. They then looked at a common marker for inflammation and discovered 22 kids had high levels. Again, we don’t know how many were U or N. Nor do we know the levels of alleged bad things in any of these kids.

11. Data comparing measurements of the blood between U and N were then presented. Not of B, but of other things in the blood. Things that are considered bad. The U (artificial) group had lower mean values of bad things for almost all bad things, and the N (natural) had higher. The ranges of the N group were wider, but this can be accounted for by the much larger sample size.

12. One school of thought said that the bad things should have been higher in the artificial group, but another said it should have been higher in the natural. This is the fifty-fifty split we discussed.

13. A curious plot was offered based on the model. Since the U group could not do what the N group did (since U is artificial), the researchers showed counterfactual predictions as if the U group had been natural by bad things in the blood. Again, there were more bad things in the blood in the N kids. We cannot logically see the raw data here, since the U group did not do N.

14. We next see tables and plots of B and the bad things in the blood, the main point of the research. We no longer see the groups U and N, just the modeled levels of B, or changes in B, plotted or averaged against the bad things.

15. The models now have a step in them. That is, they are linear in the bad things up to a point, at which point the slope is allowed to change. The change point was chosen by another model. We then see the scatterplot of logged modeled B against the bad things. The scatter is very wide, and the model fits are never that good. But there is a hint that higher values of the bad thing are associated, weakly, with lower levels of modeled B. And again we do not see the raw data and do not know which point belongs to U and which to N, nor do we see the number of procedures.
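A step (broken-stick) model like the one in step 15 is usually fit by trying candidate change points and keeping the best. The study says the change point came from “another model”; a grid search over residual sum of squares, sketched below on simulated data, is a common stand-in and is used here only for illustration.

```python
import numpy as np

def fit_breakpoint(x, y, grid):
    """Fit y = a + b*x + c*max(x - bp, 0) for each candidate break point bp;
    return the bp with the smallest residual sum of squares."""
    best_bp, best_rss = None, np.inf
    for bp in grid:
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - bp, 0.0)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        if rss < best_rss:
            best_bp, best_rss = bp, rss
    return best_bp

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
# Simulated "bad thing" vs. modeled B: flat up to 6, declining after.
y = 1.0 - 0.5 * np.maximum(x - 6.0, 0.0) + rng.normal(0, 0.8, 200)
bp = fit_breakpoint(x, y, grid=np.arange(1.0, 9.0, 0.25))
print(bp)
```

Note that the uncertainty in where the break sits is itself estimated from the data and is rarely carried into the final intervals, one more layer of modeling stacked on modeled values.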

Here is one of the plots, modified to make cheating less easy. Please do not search for this.

The numbers have all been substituted and I erased identifying information from the plot. The y-axis is modeled logged values of B (the numbers are only labels), and the x-axis is the level of a bad thing. This is not the raw data. The red line is the second model; the gray band is the confidence interval on the model, and not the predictive interval. The other plots are very similar to this one.
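The distinction between the gray confidence band and a predictive interval matters enormously. The sketch below, on simulated data resembling the plot, computes both standard errors for a simple regression; every number is made up, but the ratio is typical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 98
x = rng.uniform(0, 10, n)
y = 1.5 - 0.1 * x + rng.normal(0, 0.8, n)   # wide scatter, weak slope

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = np.sum(resid ** 2) / (n - 2)           # residual variance
XtX_inv = np.linalg.inv(X.T @ X)

x0 = np.array([1.0, 5.0])                   # a new point at x = 5
se_mean = np.sqrt(s2 * x0 @ XtX_inv @ x0)       # SE of the fitted line
se_pred = np.sqrt(s2 * (1 + x0 @ XtX_inv @ x0)) # SE of a new observation

# The gray band in the plot is about +/- 2 * se_mean; an honest band for
# actual kids would be about +/- 2 * se_pred, which is far wider.
print(se_pred / se_mean)
```

The confidence band shrinks toward zero as the sample grows; the predictive band never shrinks below the scatter of real kids. Showing only the former makes a weak signal look certain.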

16. The authors throw in some correlations with this and that which we can ignore. Most correlations are low (below an absolute value of 0.1), with a very few a bit higher (an absolute value of about 0.25).

17. From all this the authors conclude that there is a higher risk of malady from lower levels of B caused by the bad thing, regardless of number of procedures. And that being in group U is better than being in group N.

It turns out you might be able to regulate a few of the alleged bad things, though it would be costly. Would you? And that you could encourage U over N. Would you?

Do you think the conclusion that U is better than N is justified? Is the bad thing really bad? If you knew what the bad thing was would you avoid it? We’ll ignore the procedure here.

How confident are you the results are accurate, assuming no cheating?

Yes, there’s a lot you don’t know. I’ll answer questions, but I won’t reveal too much until next week. If anybody goes through with this.


1. Tillman Eddy

You have finally stumped me! I have no background nor education in statistics – I do not understand linear regression!

So, based on my limited knowledge, I opine that the number of independent variables alone casts serious doubt on the study.

I give the hackneyed example of Columbus:

“When he departed, he did not know where he was going…

When he arrived at his destination, he did not know where he was…

And on his return, he did not know where he had been.”

‘Thus ends the lesson.’

2. Nym Coy

In the end we still don’t know if high or low B is good or bad because it’s never measured, right? No kids were ill and if they were, they were excluded. So what were we even trying to figure out?

3. Jeez! If the data dredging wasn’t bad enough, the slapdash measurement protocols are completely Daffy Duck. I’d purely love to see the (predictably wide) prediction intervals for those Tinker Toy regression models, or even just a few adjusted R-square values (bet they’re small, no?). I say shenanigans on an observational study that had to be beaten into submission for a journal credit.

4. Robin

This isn’t just bad, it’s insanely bad.

First comment is that there are so many confounding and conditional influences that nothing can be concluded.

Second comment is that the data appears to have been massaged (understated) to fit models created with inherent bias. They appear to be searching it in a way that supports a preconceived conclusion, i.e. with prejudice.

Third, when statistical plots must use log scales, that’s a red flag that they are trying to hide or misrepresent the data.

Fourth, there appears to be a stratification Simpson’s Paradox effect emerging in the data.

Fifth, I would argue that the group sizes are so disparate that comparisons become invalid.

Sixth, I’m speculating that Fisher’s 1932 P value analysis was used to draw conclusions between good group bad group.

Seventh, no statement can be made on causation other than it is unknown.

Eighth, how can they draw any conclusions about any individual in order to say whether a single individual should be better off in one group than another. There is no predictive relationship between the group and the individuals within that group, apart from all the criticisms above.

5. Robin

Three more things: 1) there is no clear scientific basis of good and bad, 2) the plot is weird. The line should be within the 95% confidence area, but it diverges (escapes) these bounds. How can this happen? And 3) Step (inflection?) in the model? LOL.

6. Steve B

That’s quack science. I don’t care what real data you put in there. It sounds like it comes from a political agenda by a .1% fringe group. Funny too… a good read.

7. 1 – I generally agree with Robin, above.

2.1 No to regulation and no to encouraging U over N – insufficient evidence for either/both.

2.2 No; using models populated by data from models is grant baiting, not stats; and no, we do not know if B is really bad. (& maybe – insufficient info for an opinion)

2.3 I cannot assume good faith because they do not, at least in your summary, report fully on the confounding factors found – e.g. those 35 odd “results.” If I were to pretend to accept this, I’d still reject their conclusions because the conclusions are two (3?) models removed from the data and, in any case, the sample sizes are too small to be compelling (especially given the ratio of U/N).

8. Chris

Briggs,

Clarification on step #6 – was the data from the second batch of blood draws lumped in with the first, or was it treated differently in any way?

9. cdquarles

Horrible study. The main flaws have been noted by prior posters. No, do not trust anything other than “I don’t know”, and definitely oppose forcing any kind of treatment, based on the tiny numbers of kids studied (how representative are they, really?).

10. Cary D Cotterman

Starting with the wildly uneven sample sizes, it all looks sloppy, biased, and manipulated. I don’t have to be an Expert for my skepticism radar to be set off.

11. Briggs

All,

Tillman: Your instincts are in the right direction, though.

Nym: Whether B was lower because of U and N, and those bad things. That the kids weren’t ill is key.

Mike: They don’t give prediction intervals, as you’ll see.

Robin: You’re on the right track. Any mistakes in the figure are mine, though. I butchered it.

Paul: You too are on the right track.

Chris: I think, though it is not clear from the reading, it was kept separate. But other times it appears it was lumped in.

CD, Cary: Something like that.

12. Charles West

The study smells like rubbish. 1. Too many models. 2. Conclusions based on models that are smoothed. 3. Models are segmented based on other models.

Give me a break.

13. One would hope the researchers made a politically expedient conclusion, because on merits alone the study has no value. A negative value perhaps, considering it wasted everyone’s time.

1. Proxy X.
2. Proxy X with an ambiguous, obscure or otherwise controversial relationship to what it is meant to represent.
3. Data mulched through regressions, models and witchery until it no longer resembles anything that took place in reality. “We fixed the data!”
4. Model built from the data used to correct the data (???)
5. The final analysis forgets that it was supposed to be comparing two groups, must have got lost somewhere along the way. Mistakes happen I guess.
6. Confounding variables not measured between groups. I can tell this was probably one impetus for wanting to show us this study.
7. Sample size is lazily small. An average elementary school clinic would have a larger sample size in flu season. This is compounded by the fact that it’s taken for granted, by this tiny sample, that our critical X is unquestionably higher in one group than the other (we don’t even get a short story about how and why? merely an assertion?)
8. No control group? Not-U-not-N would have been helpful.
9. For all the rituals imposed on the data to transmogrify it into something viable for publishing, the scatter plot has no obvious trend. A random Formula 1 track cut in half and pasted on the chart would have a better r-squared.
10. Having sufficiently reified our better-than-life model and our magic lines, we can now rest assured that there’s an increased risk “associated” with the measure “associated” with the condition, a measure “associated” with one group and not the other. Because of the law of equivocal convertibility, we now “know” that there’s an increased risk with being in group N.

If I can say a good word about the heroic researchers behind this study, it’d be to say that they were paying attention in their undergrad statistics class. F for effort.

14. “It turns out you might be able to regulate a few of the alleged bad things, though it would be costly. Would you? And that you could encourage U over N. Would you?” What’s the challenge here, of course the answer is yes and yes. That’s the whole point of studies by scientismists, to produce justification for more government regulation and spending. This is clearly an excellent study that produced the correct results with bonus points for the scientismists because the government action required will be costly. Whoever they are these are scientismists at the top of their game.

15. Hagfish Bagpipe

Since Briggs is a pathologist of stupid studies this is obviously a stupid study: the “bad things” in the blood are actually good, there is a patent poison being pushed by big pharma as an alleged cure, the procedure involves chopping off your willie and putting a potato up your patootie, and the researchers are hand-rubbing white-coated trans-vampires using the ruse of a stupid study as an excuse to draw blood from children. All funded by fake usury ponzi bucks issued by Satan. Amirite?

16. In my experience, it all comes down to the research question. If you get it wrong, it doesn’t matter what you do afterward. If you translate the question into methodology wrongly, it doesn’t matter what happens afterward.

So the question. It’s actually a compound question made up of two independent questions. The first question is simple: “is there less B in kids in U or N”. I haven’t seen an answer to this in this retelling of the study. This is supposed to be an open-and-shut case: just measure B in U and N, make two curves (cut up all measured values of B into brackets and state how many kids, of U and N separately, are in which bracket) and be done with it. I’m fighting a very strong urge to make a p-test. Anyway, I don’t see the answer to this. If anybody does, please point me to it.

Although, I suppose they could use the other meaning of the word “moderated”, which is that it’s a necessary transmission mechanism to a process. I struggle to understand how to interpret the question with that meaning of the word. Perhaps they say “does procedure enable bad things to cause smaller amounts of B?” I think to answer that question you need to get down into medical biology and study molecular and other mechanisms. I doubt an epidemiological study can help you answer that question. You might try to get some handle on the truth by answering the question in the previous paragraph of my answer.

So I, overall, conclude the study is bogus. It’s especially jarring they imagined what would B be in certain circumstances.

I’m looking forward to the big reveal, both of the study and of the correct answer. 🙂

17. Oh man, can I take a stab in the dark what is the study about? Aaahh! Such urge! xD

18. Johnno

Briggs, you Fool! What is all this complicated… science… things…??? I don’t know what any of that means! I don’t know what’s good or bad! Just give me an Expurt, and have them dictate to me what to think about this or that study! Then we can get back to the business of simpler things that we are intellectually capable of discussing – who best to elect to run the world!

19. Chris

Briggs,

A little more clarification, please, if possible within the bounds of necessary obfuscation. First, can you clarify the stated purpose of the “procedure” in light of the main question? Were they monitoring the changes caused by the procedure because of presumed harm/benefit, or were they controlling for a confounding factor?

Also, can you clarify the timing of said procedure? All participants in both U and N had P, and since they’re looking at the change of B with respect to P, presumably the chronology was 1) kids got U or N, 2) researchers took a blood sample, 3) all kids got P, 4) researchers took a second blood draw. Is this correct?

20. Pk

Is the problem statement something like:

Is the P(malady | U, B, bad things in blood, P) less than the P(malady | B, bad things in blood, P)? Given the authors’ assumptions, sure maybe so. However, since no one had a malady, how can we know? It is kind of like speculating on the probability of a coin flip on your coin flip machine without flipping the coin.

Of course all the talk about the mean or bad-things-in-blood differing between U and notU adds lots of doubt too.

Having run many studies and correlations, I have found that once you get to four independent variables the reliability is nil. One or two can often work well.

Pk

21. Andrew

Pre-filtering the data sets to include only healthy children might be a serious design flaw if there is any reason to assume that U or N would bias towards unhealthy children

Example: if I select from men vs women who can run 100m in less than 13 seconds, the female group will contain athletes with a much higher investment in training

22. James Daniel

“19 Us and 79 Ns”

That’s not data. It’s an anecdote.

“Anecdata”, maybe?

23. C-Marie

“An excuse to draw blood from children” …. from Hagfish Bagpipe comment ….. that seems to be the point and reason for doing the “study” ….. to use their blood for what purposes? ….. else why such a study? ….. especially if associated in any way with pharmaceutical pursuits?

God bless, C-Marie

24. Scotty T

This plot is just amazing! Before we even get to the appropriateness of a log y-axis, is level of the bad thing predicting B, or vice versa?

What should the relationship between the lines drawn over the shotgun scatter-plot be? If you back up a little you notice there is essentially no relationship. Because it is so strongly suggested by the lines superimposed, maybe I’ll concede there is a slightly lower log-B associated with modelled points at the far right of the x-axis. But, look how few points there are at more than 5. It feels sparse to me, and my feelings are as valid as the poor methodologies here! Then the core blob of points (0-6 x-axis and 0.5-3.2 y-axis inclusive) is nearly a random dataset that no one would see any signal in.

All that to state, the plot seems like the old magic trick, where no matter what data/pseudo-data is underneath — people’s eyes will be drawn to a line drawn over the top. They will forget reality and be brought by sleight of hand into a house of mirrors. This is a powerful bait-and-switch. Conversely, questioning a plot can be a strong tool in the anti-pathological science / anti-cargo cult science toolbag.

25. Milton Hathaway

Hmm, I smell a rat. This study is too perfect, and by that I mean too perfectly bad. The flaws others have already pointed out tick all the boxes of our extensive training in these blog pages: a tiny sample size, mysterious unvalidated models reified to replace the meager measured data and create new metamuciled data as needed for convincing constipated data-torturing algorithms into excreting a publishable result, using regressions and smoothing and regressions of smoothers and smoothing of regressors, changing protocols mid-study (after noticing “something odd” in the data, and again in response to some sort of “government report”) which implies zero blinding, a proliferation of proxies, use of charts to mislead rather than illuminate, and the ever-popular “correlation must be causation because we can’t think of any other explanation”. Oddly, I don’t have a problem with the log axis in that plot, at least in principle.

So, no sir, I ain’t touching this stinker, that cheese looks poisoned, the ground under that low-hanging fruit looks sketchy, I can just make out the outline of a pitfall trap obscuring a bed of dangerous statistics sharpened to razor points poised to impale me on my sizable robust ignorance.

26. Tillman Eddy

“ I can just make out the outline of a pitfall trap obscuring a bed of dangerous statistics sharpened to razor points poised to impale me on my sizable robust ignorance.”
^^^^^^
Love it?!!

Tillman

27. McChuck

Let’s play a game. I’m going to give you the outline of a study, a real one. I want you to say whether it’s a good, bad, or in-between study. I want you to judge if the conclusion the researchers made can be justified, or if you see flaws they might have over-looked.

Not having read beyond this point, I can say with confidence that the study is deeply, deeply flawed. Especially if it was conducted any time in the previous 50 years of blatantly falsified “research”.

28. McChuck

The purpose of the test was to determine if some factor was predictive of future health. Since the test was not continued long enough, and was not tracked closely enough to obtain such future (from a given perspective) health information, the research is ultimately pointless. It did not track health, nor did it track any particular or even general group of maladies. Some of the children “had the sniffles” at some unspecified point, but this is not exactly uncommon in children, nor was it apparently tracked well (or possibly at all) in the data.

The research was also flawed in that it purported to track blood levels of things over time, but took multiple blood tests at different intervals from some but not others, without apparently tracking that crucial variable.

29. JH

My Dear Mr. Briggs,

This, in a way, is like asking people to rate Gretchen Whitmer’s achievements by watching Fox News for 15 minutes.

Some of the steps are incoherent. So, what am I rating? Your summaries or the researchers’ work?

I have to admit how to pass judgment based on this post is beyond me.

—-

I don’t want pre-formed judgments playing a role here.

This is funny. Because you didn’t hesitate to tell us about your judgments in this post.

Just for example, strange…purportedly. What are the bad things?

The strange thing is the conclusion: the natural kids (N) had purportedly worse outcomes!

They also say that there were more bad things in the blood leading to lower levels of B in the natural group (N).

30. JH

P.S. I wasn’t talking about local Fox News channels.

31. Darin Johnson

I’m going to think about this more, but at first blush it seems like they’re asking a little bit of data to support an awful lot of assumptions. Not just B and the bad thing, but P and the timing of P and whether there’s a change in B and then how much of a change plus they had a bunch of “controls,” which *should* require more data still.

I’m not sure what to make of the re-testing and extra-sampling. For instance, if they found that half those sample were pregnant they might figure there was something wrong with their procedures and re-test. On the other hand, if they found that half their samples had a rise in B and they resampled because of that, well, then they probably crossed some kind of line.

Whether the study is any good depends in the end on what it’s for. If you review their work and conclude it’s anything better than, “Total garbage,” the next question is a cost/benefit one: does applying the model make us better off on net? Not possible to answer that in the abstract.

32. Robin

Again on the plot; insufficient information at the extremity. Seems that the step is artificial and not supported by the data. Further, the regression on B will contain error. We do not get to see this. Then there is error in the modeled values against the level of a bad thing. Too much error and too much uncertainty.

“They did a regression model on the changes of B and the timing of a certain procedure. All kids had the procedure at least once; some had it more than once. The procedure, call it P, was not related to U or N. All the analysis the researchers presented, with one exception, was based on the regressed values and not the raw data. All downstream analyses used these regressed values, and not the raw data.”

If P is not related to U or N, then how can the researchers conclude: “…being in group U is better than being in group N”?

This should have never made it through the peer-review process. The reviewers should be sacked and maybe the Editor as well.

Reminds me of the study published by Oxford on the results of a trial of the Oxford/AstraZeneca vax; I believe around Nov 2020 in the Lancet, from memory, with about 40 or 50 co-authors. Another study that never should have made it through peer review. Remember that one? The vax that was going to save the planet? The vax that has since vanished from the face of the earth? We don’t hear squat about it anymore.

The lasting legacy of that now non-existent vax is VITT, or “Vaccine-Induced Thrombotic yada yada”. But VITT was extremely rare, you say. So rare that NICE had to develop a national protocol; so rare that a national conference was held on it. This pathology did not exist before the Oxford/AstraZeneca vaccine.

We are in the age of experts. But this is only temporary. Next will be the age of AI and experts won’t be needed anymore. We are being prepped for this next step.

Why is Elon Musk involved in CureVac? Anyone thought about this? Could it be his AI systems? We are headed to a future where all new vaccines will be mRNA, designed by AI within hours, and tested by AI models within days. There will be no more human or animal trials. Just AI. That’s why Musk is involved.

33. cdquarles

“Don’t make vast conclusions from half-vast data”, yes, that was an admonition drilled into young skulls full of mush back in the old days. Today, it seems, not so much.

34. Simon Platt

I would never encourage U over N. U is never better than N. Of that I am certain, as a matter of principle.

35. Simon Platt

You’re quite mistaken, by the way. It’s not at all easy to search and discover what this study really was.