Statistics

Randomized Trials Are Not Needed

I promised this article a long time ago: I hope these well known calculations are enough to give the gist.

Randomized controlled trials are supposed to be the gold standard of experimental research. This is true. But it would be just as true if we said “controlled trials” without the “randomized.”

“Randomness” is not needed. “Random” trials are even the opposite of what we desire. Which is information and evidence. Since “randomness” means “unknown”, adding randomness decreases information.

There is just one reason, which I’ll reveal later, why “randomness” is important.

Suppose you’re designing a marketing, finance, or drug trial: anything where you have “treatments” to compare. Since everybody understands drug trials, we’ll use that.

A controlled study splits a group of people in two; it gives one half drug A, the other half B. The results are measured, and if more people improve using A, then A is said to be better. How much better depends on what kind of probability model is used to measure the distance.

How to make the groups? If you need 100 folks, take the first 50 and give them A, the second 50 gets B. That sound suspicious to you?

Ordinarily, the groups would be split using “randomness”, usually computerized coin flips. “Randomness” is supposed to roughly equally distribute characteristics that might affect the drug. This being so, any difference in A and B is due to the drugs themselves and not to these unmeasured characteristics.

This view is false.

One measurable characteristic is sex. If men and women appear equally and uniformly (yet unpredictably) then the chance one group is filled with all men is 2^-n, where n is the number of people in your study. If n = 50, the chance is about 10^-16: that’s a decimal point followed by a lot of zeros. If n = 100, the chance is about 10^-30. Pretty low.

All men in one group is an extreme imbalance. But if men were, say, 70% of one group, this would still be cause for concern. The chance of a discrepancy at least this large for n = 50 is about 0.0013; for n = 100 it is 0.000016. Quite a difference! While a discrepancy is still unlikely, it no longer seems impossible.
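
The figures above can be checked with a short calculation. This is only a sketch under stated assumptions: each arrival is treated as an independent 50/50 coin flip, n counts the flips that fill a group, and "70% of one group" is read as "strictly more than 70% men". The function names are made up for illustration.

```python
from math import comb


def prob_all_men(n: int) -> float:
    """Chance that n independent 50/50 arrivals are all men: 2^-n."""
    return 0.5 ** n


def prob_share_exceeds(n: int, share_pct: int = 70, p: float = 0.5) -> float:
    """Exact binomial tail P(X > share_pct% of n) for X ~ Binomial(n, p)."""
    k_min = n * share_pct // 100 + 1  # smallest count strictly above the share
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


for n in (50, 100):
    # Prints n, the all-men chance, and the more-than-70% chance.
    print(n, prob_all_men(n), prob_share_exceeds(n))
```

Under these assumptions the tail comes to roughly 0.0013 for n = 50 and roughly 0.000016 for n = 100.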

Of course, men and women don’t always show up equally and uniformly. Any departure from these ideals only makes the chance for a discrepancy between the groups higher. How much higher depends on what is meant by “unequally” and “non-uniformly.” However, the effect will always be substantial.

Very well. If we just measure sex, then the chance of a discrepancy is low. But now consider fat and skinny (say, divided by some preset BMI). We now have two characteristics to balance with our coin flips. The chance that we have at least one discrepancy (in sex or weight) roughly doubles.

For n = 50 it is now 0.0026; and for n = 100 it is 0.000032.

OK, we measure sex and weight; add race (white or non-white). The chance of at least one discrepancy is higher still; about three to four times higher than with just one characteristic.
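
A minimal sketch of the arithmetic, assuming the characteristics are independent of one another and that each has the same single-characteristic chance p of a discrepancy (both assumptions are for illustration only): the chance of at least one discrepancy among k characteristics is 1 - (1 - p)^k, which is close to k times p when p is small.

```python
def at_least_one_discrepancy(p: float, k: int) -> float:
    """P(at least one of k independent characteristics is imbalanced)."""
    return 1.0 - (1.0 - p) ** k


# Single-characteristic chances taken from the text: n = 50 and n = 100.
for p in (0.0013, 0.000016):
    for k in (1, 2, 3):
        print(p, k, at_least_one_discrepancy(p, k))
```

With p = 0.0013 and k = 2 this gives about 0.0026, and with p = 0.000016 about 0.000032, matching the figures quoted for sex plus weight.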

How many measurable characteristics are there? Can we think of more? Height, blood pressure, various blood levels, ejection fraction, and on and on and on. At least three to four dozen—call it 50-100—characteristics might be important and are routinely measured in medical trials.

You can see where we’re going. Every person is, by definition, different from every other person. There are enormous differences between people at every level, from physiological to cellular to genetic.

Eventually, of course, it becomes all but certain that there will be a discrepancy between the two groups. Whether you measure that characteristic is irrelevant. It (actually many) will be there. And recall: not all these characteristics will present equally and uniformly; they’ll be all over the place. Discrepancies, imbalances between groups, are always there.
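
To get a feel for "eventually", one can ask how many characteristics it takes before at least one discrepancy becomes likely. The sketch below keeps the idealized assumptions (independent characteristics, each with the same small chance p of a discrepancy) and uses a hypothetical helper name. Real characteristics do not present equally and uniformly, so the idealized count is an upper figure and the real one is smaller.

```python
from math import ceil, log1p


def characteristics_needed(p: float, target: float) -> int:
    """Smallest k with 1 - (1 - p)**k >= target, for independent characteristics."""
    return ceil(log1p(-target) / log1p(-p))


# p = 0.0013 is the single-characteristic chance quoted for n = 50.
for target in (0.5, 0.9, 0.99):
    print(target, characteristics_needed(0.0013, target))
```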

“Randomly” splitting groups, therefore, does not, and cannot, “balance” groups.

But control can.

Control means taking those characteristics we think are important, and splitting the groups such that each receives an equal proportion. There will always be, even when we control, discrepancies between the groups we did not control, and one or some of them might be responsible for the results. But if we choose our groups carefully, then the chance of this is small.
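
One way such a controlled split might look in code is sketched below. It is an illustration, not a prescribed procedure: the subject records, the field names (`id`, `sex`, `bmi`) and the alternate-within-stratum rule are all assumptions made for the example.

```python
from collections import defaultdict


def controlled_split(subjects, keys):
    """Alternate A/B within each stratum defined by the chosen characteristics."""
    seen = defaultdict(int)  # how many subjects each stratum has received so far
    assignment = {}
    for subject in subjects:
        stratum = tuple(subject[k] for k in keys)
        assignment[subject["id"]] = "A" if seen[stratum] % 2 == 0 else "B"
        seen[stratum] += 1
    return assignment


# Hypothetical subjects: two per (sex, bmi) combination.
subjects = [
    {"id": i, "sex": sex, "bmi": bmi}
    for i, (sex, bmi) in enumerate(
        [("M", "high"), ("F", "low"), ("M", "high"), ("F", "high"),
         ("M", "low"), ("F", "low"), ("F", "high"), ("M", "low")]
    )
]
print(controlled_split(subjects, ["sex", "bmi"]))
```

Within every stratum the two groups differ by at most one subject, so the controlled characteristics are balanced by construction. Strata with an odd number of subjects give the extra person to A in this sketch, which a real design would need to handle.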

While “randomness” cannot live up to its promise, there is one good reason to use it in medical trials. The human animal cannot be trusted. He will cheat, steal, trick, and lie, even to himself—especially to himself.

You can’t trust a man to pick who receives what treatment for fear he will game the system, maybe even unawares. You have to remove the decision to a disinterested authority. Like a non-human computer.

Or a statistician.

23 replies »

  1. Briggs,

    Please comment on a study where each person (or dog or rat) is his own control. The variable of interest is measured for (say) six weeks with no treatment then for another six weeks with treatment. Assume that the variable of interest can be measured objectively (serum potassium for example) and the people drawing blood and running the machines are “blind.”

    This would be one way to make sure that the treated and untreated groups are identical.

  2. Speed,

    Sure, that’s fine. But then you aren’t claiming randomization.

    Something similar, “crossover” trials do make that claim.

    There are various ways to construct these; for example, in one we again have two groups, A and B. A gets the treatment, B nothing. Wait some time, then A gets nothing and B gets the treatment. The “nothing” could be another treatment. The splitting via “randomness” is then just as before.

    Everybody should realize that I am not claiming randomized trials are bad, just that you are fooling yourself if you believe they provide balance. Control does, randomization cannot. This is acknowledged frequently when “randomization” within blocks (like sex) is planned for, because the experimenter realizes the blocks are correlated with the outcome.

  3. It’s like hearing the truth I’ve been thinking which all my medical colleagues disagree with!

  4. Matt:
    Well stated. Designing powerful experiments requires extensive knowledge of one’s subject matter. At core is the notion that a good experiment requires that you control all known factors that may be of relevance except the ones that you deliberately manipulate. The last point that you make about the non-passive nature of human subjects is crucial. This is one reason why most psychological experiments are problematic and need to be replicated a large number of times with different pools of subjects in order to be marginally comfortable that something real is occurring – as opposed to repeatedly documenting the behavioral tendencies of college sophomores. Most psychological experiments treat the human subjects as black boxes – the experimenters seldom systematically debrief their subjects to obtain their internal explanations as to why they chose to do what they did. Following up with experimental subjects could reveal other relevant factors influencing the experimental outcomes. It is one of the fundamental methodological flaws in many psychological experiments. A parallel to the black box treatment of human subjects in psychological experiments is the analysis of failures in medical trials. Hand one group a pill and the second a placebo. 60% of the treatment group get measurably “better” as opposed to 20% of the control. If you are solely interested in selling your pill these are pretty good results. If you are interested in how your pill works and the design of the next experiment where 100% of those receiving the pill get measurably better, then your work has just started. Analysis of failures is very important for extending knowledge. (I am assuming this is one reason that in long lived medical trials/experiments subjects are asked to keep detailed diaries covering possibly relevant activities that may influence the outcome.)

    Speed:
    Your design makes one key assumption: that all other things are equal. It is an O1,O2,X,O3 design. Assume that we are looking at weight, O1 takes place at the beginning of December, O2 at the end of December, the pill is taken in January and O3 occurs at the end of January. A classic control design would essentially control for the timing effect – unless of course one group celebrates Christmas more than the other.
    Longitudinal designs also assume that testing has no effects. For example, A takes a weight loss pill for 3 weeks and then stops taking the weight loss pill for 3 weeks. There are not one but two actual treatments – the pill and the public measurement of one’s weight. Most weight reduction programs I suspect leverage the latter rather than solely rely on the magic pill.

  5. I strongly disagree with the implication that randomization is dispensable.

    My objection is this – even if balanced assignment on any number of believed-potentially-problematic characteristics is assured through stratification, random assignment within those strata is still *necessary* for a valid experiment. That’s simply because we don’t know what we don’t know; e.g. just because we don’t have any reason to believe that the order in which people come in the door is causally associated (broadly defined) with response to treatment doesn’t mean that it isn’t.

    I think the problem is crystallized in this passage from your post:

    “‘Randomness’ is supposed to roughly equally distribute characteristics that might affect the drug. This being so, any difference in A and B is due to the drugs themselves and not to these unmeasured characteristics.”

    The problem I see is with the phrase “roughly equally distribute”. Yes, all other factors that make a difference are, generally speaking, assumed to be “roughly equally distributed”, but that’s shorthand for “randomly distributed”. The fact that they may be extremely unequally distributed in some cases is perfectly consistent with the latter concept.

    What’s important, though, is that the experimenters (and their statistical enablers) believe this randomness allows them to define an applicable “error distribution” that in turn allows them to assert things about the likelihood that the true difference between the drugs is greater than some threshold (usually zero). So it’s not that “any difference… is due to the drugs themselves”, but that a) any other difference is random, and b) they think they can deal with random differences.

    I realize you have significant reservations about this line of thinking (along with the generally affable tone, those well-articulated reservations are the primary reason I’m a regular reader of your blog), but that’s a long way from being able to do away with randomization.

  6. Morgan:
    I think we depend on randomness as the readily available solution to a sub-optimal situation – there are things (a) we do not know and/or (b) cannot or choose not to explicitly control. Surely the ideal experiment is where there is zero variance in the outcome measure in the treatment group and zero variance in the control group and a large difference in the actual effect between the two groups. Randomly assigning individuals is an effort to equalize the variance but there is no guarantee that (a) it will be equal and (b) it will be sufficiently small to allow meaningful results.

  7. Morgan,

    Thank you.

    Like you, I also don’t want to “do away” with randomization. I cited cheating as an excellent reason to keep it.

    We have to be careful with language here. You say “random assignment within those strata” is necessary “because we don’t have any reason to believe that the order in which people come in the door is causally associated.”

    Let’s go slowly; this is important. If we did not have any information—and I mean no information—then there is no reason to suspect that the order makes any difference. By definition. If you have a suspicion the order makes a difference, then you have definite information. Then, of course, order might make a difference and we had better control for it.

    It’s easy to think of many examples where order does matter. Day of week effects, time of day effects, month of year effects, and on and on.

    So it’s right back to control and away from “randomness”, which merely means “unknown.” If we knew or suspected order is correlated with the response (where I, as always, use “correlated” in its plain English sense), it would be foolish not to control for it.

    Let’s be clear, too, what “randomly distributed” means. It means “unknown-ly distributed” (if I can abuse the language). And that means we have no information about the uncertainty of some thing. Specifically, it does not mean we have “some” or “a little” information; it means no information.

    Thus, unmeasured and un-thought of characteristics might come at us in any way. We don’t know how. By definition. We have no way of quantifying the uncertainty in their arrival because they are “unknown-ly” distributed.

    For others (Morgan knows this): if you dismiss that argument by saying, “Ah, most things are normally distributed; so I’ll assume that” you make two errors. The first is that most things are not normally distributed (the central limit theorem applies to sums/averages of things, not things). The second is that by saying you know the unmeasured characteristics are “vaguely” normal, you are saying you have positive, definite information.

  8. Bernie:

    Agreed with your characterization of the ideal experimental outcome, and I’d like to see pharmaceutical companies (and psychology researchers) do more to identify the determinants of the impact of the intervention. In the pharmaceutical case, there are currently all kinds of incentives not to do so – we can hope that the trends that tend to restore these incentives outpace those that tend to further reduce them, but I’m not holding my breath.

    I like your inclusion of “choice” within reason (b) for randomization, because it’s easy to overlook the practical fact that given an infinite number of possibly contaminating measurable factors, choices regarding those on which to stratify have to be made.

    I think there is no doubt that “control” is preferable to randomization, but the open question is really whether randomization is the *best* choice once we’ve exhausted our ability/willingness to stratify. I just don’t see a better alternative.

  9. Morgan:

    ‘…“roughly equally distributed”, but that’s shorthand for “randomly distributed”.’

    I think you’re making an unjustified leap here. It’s true that “roughly equally distributed” could be randomly distributed, and it’s true that “randomly distributed” could be roughly equally distributed. But there’s no requirement that either one be true.

    Further:
    ‘That’s simply because we don’t know what we don’t know; e.g. just because we don’t have any reason to believe that the order in which people come in the door is causally associated (broadly defined) with response to treatment doesn’t mean that it isn’t.’

    In particular, if you feel that the order in which subjects came in the door might be important, you should control for it. But otherwise, we might as well control for the first letter in a person’s name.

    I think that if you can’t give an actual reason for controlling for something (and it’s probably not too difficult to come up with one for the order of appearance) then you’re practicing cargo-cult science. Which seems to be a main point of this post.

  10. All,

    Perhaps I should have said that today’s arguments are standard, Bayesian ones. They are in no way new or original to me. Do a search for Don Berry, a leading Bayesian clinical trials guy and read some of his original papers.

  11. Matt:
    Can you help by defining an ideal experiment in Bayesian terms? I used a frequentist definition above.

  12. In a drug trial, the physicians administering the trials are one more set of variables to control. Randomization controls the physicians.

  13. Doug M,

    It does not. It merely removes, or lessens, the chance that they could cheat. Randomization itself does nothing. Well, it does this: it assures. If we assign docs by some “uncontrollable” mechanism (by which I mean a mechanism that people agree is difficult to control), then we all feel there is less chance for cheating.

    Bernie,

    Ideal? The more control the better. This is not glib. The closer you can get to estimating the mechanism of the treatment (the closer you can control everything with regard to the experiment), the more ideal it is.

  14. Briggs:

    Yes, I got lazy at the end there with the “do away with randomization” bit. Apologies!

    If I take you correctly, you’re saying that if we don’t have any reason to believe that a characteristic is associated with our outcome of interest (such that we don’t see a need to control for it), then we have no reason to make sure that our assignment process is unbiased with respect to that characteristic. Once you control for what matters, all methods of assignment that don’t mess with that control scheme are equal. So randomization to groups across equivalent cells is fine (and possibly preferable for reasons of “cheating”), but so would be a median split on the third letter of the first name, or order in the door.

    Which is what you said in the first paragraph of the post. Perhaps I should read these things more carefully, and I would have understood that the overriding point is that control trumps randomization, and theoretically renders it no better than any other method of assignment. I cheerfully agree.

    But there is still the question of whether it’s possible to control for everything that matters, and (if not), whether randomization is preferable to other methods of assignment in ideal, but real, circumstances – i.e. after controlling all we can. I’d say that there are cases where we can’t practically measure (let alone obtain a sample that is balanced with respect to) every possibly-problematic-characteristic.

    And given that possibility, even a seemingly innocuous assignment strategy like “third letter of first name” might turn out to cause bias with respect to one or more uncontrolled problematic characteristics. If third letter of the first name puts people with Eastern European names into one group and those with Western European names into another, we might have a problem.

    We could have controlled for this in our design. Perhaps we should have controlled for it. But are we supposed to control for East versus West, North versus South, or even finer levels of geographic detail simply because they “might be a problem”, even though we have no particular reason (other than the general understanding that genetics are associated with both geography and response to drugs) to believe that these distinctions will make a difference? Every characteristic about which we would say “I doubt it makes a difference, but obviously can’t say for sure” would require another split.

    So if we aren’t going to go there, it seems to me that we have to admit the possibility that we have missed (or chosen to ignore) problematic characteristics, and that any assignment method but “coin flip”-type random might produce systematic bias with respect to them. Then randomization is a kind of insurance against adding an unknown and inestimable bias to the already problematic error.

  15. Morgan:
    I think the Bayesians would say that if you don’t know, then you don’t know. I think that the use of random assignment is more frequent when the experimenters say we do know but we do not want to be bothered to control for all known factors so we will try to spread the “other sources of variance” evenly (and keep our fingers crossed). I think Matt is really calling experimenters on an inappropriate short-cut to a full specification of their models.

  16. Briggs,

    The drug trial you describe above starts with 100 subjects (I’ll assume that these are people with disease X who are used because they are currently patients and are known to the researchers — common in drug trials) and is concerned with how to assign individuals to one of two 50 person groups – one to receive drug A and the other drug B.

    Is this different from picking two 50 person groups (two samples) from the entire population of humans having disease X?

    As always, you’re making me think. Which is a good thing.

  17. Suppose you don’t have the grant money for more than 100 subjects. You start to select subjects and get 30 black men, 40 white men and 30 Asian women. Looks like it would be hard to balance the experiment unless you actually accept many more than 100 subjects and eliminate some to even out the controlled groups.

    Sounds to me like this could be really difficult. You really want the experiment to be valid for the larger population of people who need the treatment and I would think you would want to balance the groups based on knowledge of this population of potential customers. The local group of subjects that you actually get could be greatly biased compared to the total population. Perhaps like trying to pick subjects from the Hanford reactor site for a cancer study. There are many biases based on geographical area that can have an effect (and yes, people who live near there think the cancer rate is higher, and if it is, it might be due to some reason other than what your treatment is trying to correct).

  18. Morgan,

    Your example of the third letter is excellent; I will steal it in the future. But it just goes to show that there are always a near-infinity of characteristics that will go unmeasured but that might be correlated with the outcome. That being so, randomization is powerless.

    But you know what actually happens. In one study where I was the statistician (I came after the data was taken), there were over 7000 items measured per patient! The researchers sat down and said, “What about X? That might be correlated.” Another would answer, “Then we should measure Y, too.” They went on and on.

    And with that much data you can be sure that something will be correlated.

    Speed,

    I can’t see any difference; but maybe there’s something in your wording that I’m skipping over.

    James Gibbons,

    Situations of actual imbalance like you imagine happen continuously. But you can’t quit and admit defeat. Papers are still written!

  19. I hope the example comes in handy, and I give it freely so you don’t have to steal it.

    I have to admit that I still don’t quite understand why randomization is “powerless” (it still seems to me to be at least as good as – and probably better than – any other method for assignment across equivalent cells), but I’ll keep mulling it over and see if a new insight hits, or maybe you’ll take it up again at some point in the future. In the meantime I’ll still use it for assignment, but with a greater appreciation for the primacy of control, and a completely new understanding of why I think it’s worthwhile.

    Again, I appreciate the blog!

  20. Bernie,

    I guess the point was that for some conditions that we might want to study, such as cancer, we don’t really know the sum total of causes and controlling for all of these causes could be rather difficult.

    A researcher might not know that many dairy cows were dosed back in the 40’s around Hanford and thus may not properly control for this in a study of people who may have consumed the milk. These people may have moved to other cities and thus bias any study they take part in unless the researcher noted their place of childhood and controlled for it.

    But as Briggs says, publish or perish.

  21. James:
    If you are designing an experiment you should strive to explicitly control all variables that you believe are potentially relevant to the outcome – not to do so and to try to compensate by using random selection of subjects limits the confidence that one should have in the study and is poor science. Men and women for example differ sufficiently in their response to some classes of drugs that you need to control for gender in the studies. If you had no access to patients in one gender how good is your experiment? If you have male and female patients but you do not control for gender, how good is your experiment? More importantly by not controlling for a major potential factor, how much have you advanced your understanding of how the drug works?
    I think the experimental model is critical, but if used crudely and naively it does not really advance our knowledge: A measurable and significant effect is not the same as a proven causal explanation. That does not mean that we should not take advantage of serendipitous discoveries – but we should be modest in our assertions as to our understanding.
