# Why The Bootstrap Seems To Work—And Why It Produces Over-Certainty

Let’s drag the statistician’s hoary threadbare ball-filled bag out of the cupboard to make a point. In it are 3 white balls and 2 black.

Using the statistical syllogism, we deduce the probability of drawing out a white ball as 3/5.

We could in a similar manner deduce the probability of any proposition conditional on this evidence. Such as the mean of draws of size n (with balls replaced in the bag after each draw). Or variance, or whatever. The math to do this isn’t hard.

If you were in a hurry, or hated math, or just weren’t that good at it (and most of us aren’t), you could program a “simulation” that draws imaginary balls from imaginary bags, and then use these simulated results to estimate the probability of any proposition. It’s easy: count the number of times the proposition is true in each simulation and divide by the number of simulations.

What’s a “simulation”? Nothing but a proxy recreation of the causes that bring about effects. Here, we have to have a cause of picking out a ball, and a way of simulating this drawing. It’s so easy that I leave it as a homework problem.
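For anyone who would rather see the homework than do it, here is one minimal sketch in R. The names `bag`, `n_sims`, and `draws` are illustrative, the number of simulations is arbitrary, and the seed is fixed only so the run can be repeated; it adds nothing to the correctness of the answer.

```r
# A bag with 3 white balls and 2 black, as in the text.
bag <- c("white", "white", "white", "black", "black")

set.seed(1)        # fixed only for repeatability
n_sims <- 10000    # hypothetical number of simulated draws

# Draw one ball per simulation, putting it back each time.
draws <- sample(bag, size = n_sims, replace = TRUE)

# Count how often the proposition "the ball is white" is true,
# then divide by the number of simulations.
p_sim <- mean(draws == "white")

p_exact <- 3 / 5   # the deduced answer
print(c(simulated = p_sim, exact = p_exact))
```

The simulated figure will sit near 3/5 but will not equal it in general, which is the point: it is an approximation to an answer we already have.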

The simulation is only a crutch, an approximation to the analytic answer which is easily had. The “randomness” of the simulation does zero—zip, zilch, nada—toward proving the goodness of the simulation approximation. Read these two articles for why: Making Random Draws Is Nuts, and The Gremlins Of MCMC: Or, Computer Simulations Are Not What You Think.

That the “randomness”, which only means “unknownness”, is thought to confer correctness on bootstrapping is why I say in the title “seems to work”. The scrambled nature of the simulations is not why results are sometimes decent. That can be explained much more simply.

In the balls-in-bag situation, we have the entire population of possible events; or, that is, we can readily deduce them. Either a black or white ball is drawn! (This is a deduction based on the assumption that balls, and only balls, are necessarily drawn.) For any n draws, we know exactly what the population of observables would be, in a probabilistic sense (binomial).
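The binomial deduction can be written out with base R's `dbinom`, no simulation needed. This is only a sketch: the choice of n = 5 draws and the example proposition are assumptions made for illustration.

```r
n <- 5            # hypothetical number of draws, with replacement
p_white <- 3 / 5  # deduced from 3 white balls and 2 black

# Probability of each possible count of white balls, 0 through n.
k <- 0:n
pr_k <- dbinom(k, size = n, prob = p_white)
names(pr_k) <- k
print(round(pr_k, 4))

# Any proposition about the draws follows by adding the relevant terms,
# e.g. "more than half of the n draws are white".
pr_majority_white <- sum(pr_k[k / n > 0.5])
print(pr_majority_white)
```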

But we don’t always have the full population of events, or possible events, in hand. Consider the weight of each American citizen of at least 18 years old. There is a definite population, though if we wait too long some new people turn 18 and others die. So simplify this to those citizens alive on 4 July 2019.

If we could measure everybody, then we’d have the entire population again, in the probability sense. We could ask the probability of any proposition in reference to this population, and to get it we simply count. “What’s the probability a citizen weighs more than 200 lbs?” We count.

Getting populations can be costly, so we usually make do with a sample. Here we measure a fraction of citizen weights and ask the same question about probability over 200 lbs. We count again. This gives a correct probability conditional on our sample.

This works even if we measure only one person, such as Yours Truly. Given my weight, what’s the probability a citizen weighs more than 200 lbs? We count. It’s 1, and the sample size is 1, so the actual deduced correct probability is 1.
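Counting really is all there is to it. A small sketch in R, where every weight is invented for illustration and is not a measurement of anybody:

```r
# Hypothetical sample of weights in lbs (made up for illustration).
sample_weights <- c(142, 155, 168, 171, 183, 190, 204, 212, 236, 251)

# Pr(Citizen > 200 lbs | Sample): count, then divide by the sample size.
p_sample <- mean(sample_weights > 200)
print(p_sample)

# The same counting with a sample of size 1.
one_weight <- 215   # hypothetical single measurement over 200 lbs
p_one <- mean(one_weight > 200)
print(p_one)
```

Both numbers are correct conditional on the sample they came from; neither says anything yet about how close that sample is to the population.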

It is obvious the following two probabilities can be different:

(1) Pr(Citizen > 200 lbs | Population) = p_1

(2) Pr(Citizen > 200 lbs | Sample of size 1) = p_2

What people hope is that

(3) Pr(Citizen > 200 lbs | Sample, which is Close to Population) ~ p_2

Yet since nobody knows what the weights are, we don’t know if the sample is close. Of course, we all have lots of prior information about weight, so that we know, conditional on that information, that this sample of 1 is not close. But it should also be obvious that

(4) Pr(Citizen > 200 lbs | Sample, My Information) != Pr(Citizen > 200 lbs | Sample).

By which I mean that if you judge a probability conditional on whatever information, this is a different probability than one not conditional on that information. Many look at probability statements and say “They’re wrong”, when in fact the statements are correct. What they’re doing is changing, in their mind, the conditioning information and coming to different answers. This is fine, as long as it is kept in mind that changing any condition changes the probability—and that no proposition has a probability in any unconditional sense.

(5) Pr(Sample is Close to Population | My Prior Information) ~ 0.

We all know how to make the sample closer: increase its size. How much? Nobody knows. Not for sure, and this is because we don’t know what the population looks like. If we make up a guess of what the population looks like, we can use that guess to say how close a sample of a given size is to the population.

That is a different kind of bootstrap, in the plain-English use of that word. Statisticians call these “sample size calculations”, which are always “cheats” like this. Think: if we knew the population, we don’t need to sample. If we don’t know the population, and can’t get it, we can sample, but we’ll never know, not for certain, when to stop such that the sample is close to the population unless we guess what the population looks like.

What we can do instead, in a state of true ignorance, is to begin collecting samples and then, say, plot a histogram (here of weights) of the sample. This will have a certain shape (one is pictured above). If we collect a larger sample and the shape changes only a small amount, or not at all, then we can use this to guess that the sample is “close” to the population. It’s only a guess, though, conditional on the hope that our sample is “representative”, which is a circular way of saying “close.” Hence, we have a third kind of bootstrap.
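One way to put a number on “the shape changes only a small amount” is sketched below. Everything in it is an assumption made for illustration: the stand-in weights are generated from an invented lognormal, and the largest gap between two empirical distribution functions is just one convenient yardstick (the helper `shape_gap` is hypothetical, not a standard function).

```r
set.seed(2)   # fixed only for repeatability
# Hypothetical stand-in for weights as they arrive (lbs); the lognormal
# shape and its parameters are invented for illustration.
incoming <- rlnorm(4000, meanlog = log(170), sdlog = 0.2)

# Largest vertical gap between the empirical distributions of two samples.
shape_gap <- function(a, b) {
  grid <- sort(c(a, b))
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

# Grow the sample and see how much the picture moves at each step.
sizes <- c(25, 100, 400, 1600, 4000)
gaps <- sapply(seq_along(sizes)[-1], function(i) {
  shape_gap(incoming[1:sizes[i - 1]], incoming[1:sizes[i]])
})
names(gaps) <- paste(sizes[-length(sizes)], "->", sizes[-1])
print(round(gaps, 3))
```

A small gap says only that the bigger sample looks like the smaller one. It does not say that either looks like the population, which is the circularity noted above.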

Suppose we use this closeness trick. Then, after we have our sample, we can use the sample as in equation (3) to answer probability questions (by just counting). We’ll never know, not for certain, how far off we are, though. We can say how far wrong we are if we make guesses, or assume the guesses are true, about the population. Sometimes the math of these guesses is so elegant, some statisticians forget the equations are all built on guesses.

But if the guess is correct, and experience shows it can be, then we have good estimates of probability questions of the population. We still have these estimates even if the guess is wrong! We just assume, as we must if we insist on getting a numerical answer, that our guess is good. Insisting on getting a numerical answer is what accounts for much over-certainty.

Now what? Suppose besides just weight, we also measured on each person the presence of at least one Y chromosome. Then we’d have two pictures of weights, one for those with Ys and one for those without. The pictures themselves are not necessary, of course; we could just order each set of weights. The pictures are for visual inspection only.

Then we might ask, “What is the probability the weights of Ys are different than non-Ys?” This probability will be 1 if they are different, and 0 if not. They are different if the two pictures don’t match up exactly. This is a true statement for the sample, and it’s conditionally true for the population assuming our guess the sample is close to the population is good. You can call this a “test” if you like, but it’s just counting.

We could instead ask “What is the probability the mean of weights of Ys is different than that of non-Ys?” Same answer for the same reason. Just look. Any difference is a difference.

But that’s a different question than “What is the probability the mean weights of Ys are different than non-Ys in the population?”

We cannot answer that, not without resorting to our bootstrap of the second kind, which is to put some kind of number on what the population looks like. And if we knew that, we wouldn’t have to ask the question.

If we assume the sample is close to the population, then other samples will be close to this sample, and then we can compute, using the two pictures, the probability that statistics like “The mean of the Ys minus the mean of the non-Ys” are less than 0. Or whatever.

This is what is formally called “the bootstrap”. Procedures using it won’t just use the pictures as they stand, and just count. They’ll instead use that “random” idea and make simulations. The idea is that the pictures can be used to make new samples, in just the same way that drawing out new balls from our bag makes new samples. We pull off observations from the sample pictures, i.e. from the bags of Ys and not-Ys, of the same size as the original sample. (See the Homework below.)

There are all kinds of proofs that show that this “works”—in the limit. Which is to say, when all evidence at the end of time has been accumulated. But it only works in the small when it turns out our guess of the closeness of the sample to the population is correct.

Incidentally, every real-life situation is finite and thus has a population. No real series goes on actually to infinity; not that we can measure.

Well, if the samples really are close to the population, then we needn’t do any of this “randomization.” All we have to do is count. And if the sample isn’t close, then we wouldn’t know it. Not unless we knew what the population looked like. But if we knew that, then we wouldn’t have to sample.

In other words, there’s lots more uncertainty in these situations than is commonly heard of.

Homework

Here’s a data set to play with, which we’ll assume is our sample.

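library(ggplot2)
library(car)   # attaches carData, which holds the Davis data set

data(Davis)
x = Davis
x$weight = x$weight*2.205 # change to non-barbarian unit of lbs

# overlaid density plots of weight, one per sex
ggplot(x) + geom_density(aes(x = weight, fill = sex), alpha = 0.9)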

A density plot is one way of many to show the picture of values. If you do it, you’ll see the bump out to the right is a non-Y (here Y = M and non-Y = F). Anyway, the weights are obviously different. I leave it to you to compute the means and their difference.

Now a simple bootstrap, without any of the nice frequentist properties about which we do not care, is to compute two new simulated samples by “drawing out” values with replacement from the two groups, such that you have two new sample groups of the same size. Compute the difference in means for this new sample. Then repeat for, say, 1,000 times. You’ll have a sample of mean differences.
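A minimal sketch of that resampling, in base R. The two weight vectors are invented so the block runs on its own; to use the real sample, swap in `x$weight[x$sex == "M"]` and `x$weight[x$sex == "F"]`. The names `w_y`, `w_noty`, and `boot_diff` are illustrative, and the seed is fixed only for repeatability.

```r
# Hypothetical weights in lbs for the two groups (made up for illustration).
w_y    <- c(150, 163, 170, 174, 181, 188, 195, 207)
w_noty <- c(108, 117, 121, 126, 130, 134, 139, 148, 152)

set.seed(3)      # fixed only for repeatability
n_boot <- 1000

# One simulated difference in means: resample each group with replacement,
# keeping each group at its original size. Repeat n_boot times.
boot_diff <- replicate(n_boot, {
  mean(sample(w_y, replace = TRUE)) - mean(sample(w_noty, replace = TRUE))
})

# The difference in means in the sample as it stands.
observed_diff <- mean(w_y) - mean(w_noty)
print(observed_diff)

# Counting over the simulated differences, e.g. how often one falls below 0,
# and where the middle 95% of them lie.
print(mean(boot_diff < 0))
print(quantile(boot_diff, c(0.025, 0.975)))
```

Every number printed is conditional on the sample being close to the population; the resampling cannot supply that assumption, it can only lean on it.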

Explain just what you’d do with this creation.

Compute the raw-counting probability that a Y-citizen in the population is heavier than a non-Y citizen. Explain the difficulties of this.
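One possible reading of “raw counting” is to compare every Y weight with every non-Y weight and count how often the Y is heavier. The sketch below does that on invented weights (not the Davis data); treating ties as “not heavier” is an assumption of the sketch.

```r
# Hypothetical weights in lbs for the two groups (made up for illustration).
w_y    <- c(150, 163, 170, 174, 181, 188, 195, 207)
w_noty <- c(108, 117, 121, 126, 130, 134, 139, 148, 152)

# Logical matrix: entry [i, j] is TRUE when Y person i outweighs non-Y person j.
heavier <- outer(w_y, w_noty, ">")

# Proportion of all Y / non-Y pairs in which the Y is heavier.
p_heavier <- mean(heavier)
print(p_heavier)
```

This is a probability about pairs in the sample. Carrying it over to the population is where the difficulties start.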

1. Sheri

“Insisting on getting a numerical answer is what accounts for much over-certainty.” It’s like in computer form development. You create a form (you see them in medical records and doctors’ offices all the time now) and have people fill it out. You can put in “required” fields. This is done to assure the “accuracy” of the data and so people can’t just skip that which they want to. However, it’s also a great way to get garbage. Forcing people to answer when they don’t want to will get the blank filled in, absolutely. Whether it’s garbage data or not is another story. Requiring an answer, or a numeric value, must be considered carefully.

I learned what Briggs is saying here in “dummy statistics” (for those in the “soft sciences” who ran from calculus) years ago. You can’t necessarily extrapolate from a small population to a large one. You have to be careful what population you use as input and what population you extrapolate to. Interesting that apparently that’s no longer taught. I guess the worship of the god “Computer” has wiped out the thought processes here.

“Explain just what you’d do with this creation.” Substitute “climate data”, explain how I cleverly bootstrapped the computations and get published as long as I get warming as an answer. Never mention the original data could maybe, just maybe, bias the outcome. An unnecessary detail, right?

2. Ye Olde Statistician

In industrial statistics randomness is less prized than representativeness. These two tactics are intended to combat two sampling diseases: judgemental sampling [cherry-picking] and convenience sampling, resp. Judgemental sampling means you have seen the unit before putting it in the sample [or not]. But in that case, YOU are telling the SAMPLE what the population is like. In which case, as you say, “Why sample?” Convenience sampling means pulling in whatever units are most available. A bank once tried to estimate the annual number of accounting errors by verifying every voucher from the month of July [most recently available] and then multiplying by 12. It would be impossible to explain to management how this could ever possibly be right. Instead, one ought to take a portion of the sample from each month’s vouchers and probably spread across all five work-days and the individual clerks who entered the data. IOW, whatever factors one believes might be effective. [Your model].
In manufacturing, these factors can be summarized as: Materials, Machine, Man, Method, Measuring device, Milieu. Oh, and Time. In manufacturing, the key question is less often the central tendency of the process than whether there is a process at all. [Just how many different bean bags are in play?] The ‘static’ statistics of the 19th century are less useful than the dynamic methods of the 20th.

3. Ray

“You can’t necessarily extrapolate from a small population to a large one.”
I had a mathematics professor who used to warn us students that the only thing more dangerous than extrapolation was predicting the future. When you extrapolate you are actually predicting the future. About 50 years ago I had a course in numerical analysis and, of course, we studied numerical methods such as interpolation techniques. I used to do lots of Fortran programming and still have books on numerical methods. Some of these books have a chapter on interpolation, but no book on numerical methods that I have seen has a chapter on extrapolation techniques. Extrapolation is not a valid mathematical operation because you are actually predicting the future. That is prognostication, not mathematics. Only stock and bond salesmen can do that, not mathematicians.

4. MNRaider

@ Ye Olde Statistician

Oh I’ve been on the wrong side of Convenience Sampling. Hear Ye:

A few years ago I was renting a one-bedroom apartment in Dayton, Ohio. The unit only had windows on the North and South sides which were about one foot above ground level; it was in the middle of the larger building structure. The South windows were a short walk to a mature tree line so the unit never caught any direct sunlight. On top of that I was making a conscious effort to limit electricity consumption: air conditioner thermostat set to 72F or higher and everything but the fridge was unplugged overnight. Month after month after month my electric bill was under $10 so I became convinced these tactics were making a big difference! A few days before I moved away I received a letter from DP&L explaining that a faulty meter had misrepresented the electric usage during my tenancy. With it they included their estimation of my yearly consumption and sent me a new bill which I was to promptly pay accordingly or be subject to certain legal consequences. I was so surprised to see that the bill was over $1k that I wasted no time dialing forth a rebuttal. After a few angry phone calls they finally admitted how they determined my yearly usage: multiply the consumption from the most recent cycle (July 20 – August 19) by 12. You know… because every 30-day temperature cycle in Ohio averages over 90 degrees Fahrenheit, right?

5. “Think: if we knew the population, we don’t need to sample.”

Not sure I 100% agree with that. There are non-sampling errors that are often larger than sampling errors. You already mentioned the problem of cost of taking a whole population vs a sample. You could also have a sample or even entire population, and have some item nonresponse that would make things uncertain you’d need to account for.

Yes one can “guess” at the population, but that’s semantics, since the guess is not made blindly but is based on reasonable experience, subject matter expertise, assumptions, math (strong law of large numbers, central limit theorem), the experiment(s) that led to the data, and the sample data. For example, not knowing anything else about that data, eyeballing the graphs, the samples look roughly symmetrical to me (which anyone can disagree with). How about if we assume the underlying populations are t-distributions? If that assumption disturbs you, you might like that the t-test is robust to departures from normality, or you can just assume they are symmetrical distributions (of which t and normal are just a few examples from that larger family) and do something like a Wilcoxon test. Or we can be even more nonparametric and do bootstrap or permutation tests.

In this example there is not much overlap in the distributions – makes our job easier. In other cases with more overlap, “obviously different” judgments just don’t cut it, and it might not be so clear without a formal test.

t.test(weight ~ sex, data = x)
wilcox.test(weight ~ sex, data = x)
library(exactRankTests)
perm.test(weight ~ sex, data = x)

I’m not sure that by using something like these I am being over-certain; I think just the opposite, actually. I’d say I’m stating all my fairly reasonable assumptions (i.e. I’m not claiming the populations are literally, say, normal distributions, just using that as a reasonable wrong-but-useful model), putting them out in the open for discussion, as well as allowing the possibility of being wrong (alpha, beta, etc.).

Also, in this case, the results from all tests point in the same direction and to the same conclusion. In some cases, that might not happen. In which case, another method is to use an ensemble approach that performs many tests known to have good performance and takes the majority vote among them.

Justin

6. Ray Kidd

Line 12 of the Davis data “12 F 166 57 56 163” is an obvious transposition of wt and ht and should be “12 F 57 166 56 163”. Hence the large F bump at 366 lbs.