Pajamas Media (my new pals) conducted, or rather commissioned, their own poll in the Scott Brown-Martha Coakley race.
It found, after calling a larger pool but reporting on 946 “likely” voters, Brown ahead by 15.4%, a whopping lead. Other organizations had Brown up by no more than a point or two, or even down by the same amount. The huge (and it is huge) difference between PJM and its rivals needs explaining.
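For scale, here is a back-of-the-envelope check, assuming (what nobody claims literally holds) a textbook simple random sample of actual voters and the classical binomial formula:

```python
from math import sqrt

# 95% margin of error for a sample of 946, assuming a textbook
# simple random sample of actual voters (an assumption for scale only).
n = 946
p = 0.5                              # worst case for the standard error
moe = 1.96 * sqrt(p * (1 - p) / n)
print(f"margin on one candidate's share: +/- {moe:.1%}")   # ~ +/- 3.2%
# The margin on the lead (the difference of two shares summing to ~1) is
# about double, ~ +/- 6.4%. Pure sampling error cannot stretch a 1-2 point
# race into a 15.4-point lead; something else must explain the gap.
```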
Some said that PJM’s poll was not “random”; several insisted on it. Nobody questioned this truth—that polls have to be conducted randomly.
They do not have to be, and should not be. The word “random”, as we have discussed many times, does not mean what most people think it does.
Suppose what is not true: that PJM’s poll correctly identified 946 voters. The pollsters did so by asking this question:
Only a small percentage of all voters will cast a ballot in this Tuesdays [sic] special election for US Senate. How likely is it that you will actually vote in this election on January 19th? If you will definitely vote press 1. If you might or might not vote press 2. If you probably wont [sic][1] vote press 3.
The results were stated in terms of likely voters (those who pressed 1 or 2). This is common practice, but it is sumamente (that is, supremely; I’m in CA and practicing my Spanish) misleading. “Likely” does not mean “will vote”; it means “says he will definitely vote, or might or might not vote.”
There are many dangers betwixt wanting to vote and actually voting. Not everybody who said they will definitely vote will vote. Some who said they will not vote will. And who knows what fraction of those who said they might or might not vote will actually turn up.
This ambiguity is a large source of polling error. To estimate it would require re-calling the same people (all of them, not just the 946 who pressed 1 or 2) and asking whether or not they voted. And then hoping they tell us the truth.
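To see how much this ambiguity can matter, here is a toy simulation. Every number in it is invented for illustration; none comes from the PJM poll:

```python
import random

random.seed(1)

# Toy simulation of the "likely voter" ambiguity. All numbers are invented.
# Each group: (label, respondents, Brown share among them, chance they vote)
groups = [
    ("pressed 1 (definitely)",      700, 0.58, 0.85),
    ("pressed 2 (might/might not)", 246, 0.45, 0.40),
]

# What the poll reports: everybody who pressed 1 or 2, weighted equally.
reported = sum(n * share for _, n, share, _ in groups) / 946

# What the election counts: only those who actually show up.
brown = total = 0
for _, n, share, turnout in groups:
    for _ in range(n):
        if random.random() < turnout:            # did this respondent vote?
            total += 1
            brown += random.random() < share     # and did they pick Brown?

print(f"reported 'likely voter' Brown share: {reported:.1%}")
print(f"Brown share among actual voters:     {brown / total:.1%}")
# The two need not agree; the gap depends on turnout rates nobody knows
# on polling day.
```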
Thus, the second source of error: people lie. They lie like dogs, they fib, they prevaricate, they make statements at variance with the truth. If you call a lefty (righty) and he suspects that the polling organization is right (left), he is more likely to lie. People lie not just by saying they will vote for the other candidate; they also say they are undecided when they truly are not. They say they will not vote when they will, and vice versa.
Sometimes people misunderstand the questions and answer the opposite of what they truly mean. So bad questions and poor English, by either party, are a third source of error.
Of these three, the largest comes from reporting on “likely” voters instead of on actual voters.
Then there is the matter of who was sampled. Ideally, we would only sample people who will actually vote; since this is impossible, we are led to report on who said they would vote (or are “likely” to). But again, let us suppose we have polled only actual voters.
The idea is that if we have polled 1000 people, the fraction of these who said they would vote for Brown will match the fraction of people who actually vote for him. A fourth source of error arises when some who said they would vote for one candidate change their minds and vote for the other, not a rare occurrence. Ignore this error, too.
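How much can mind-changing move a result? A two-line calculation, with made-up numbers, shows the mechanics:

```python
# Toy arithmetic for mind-changers; the shares and switch rate are made up.
brown_said, coakley_said = 0.55, 0.45   # what respondents told the pollster
switch = 0.05                           # fraction of each camp that switches

brown_votes = brown_said * (1 - switch) + coakley_said * switch
print(f"polled lead: {brown_said - coakley_said:+.1%}")          # +10.0%
print(f"actual lead: {brown_votes - (1 - brown_votes):+.1%}")    # +9.0%
# Symmetric switching only shrinks the lead (by a factor of 1 - 2*switch);
# lopsided switching can move it by several points either way.
```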
Why does it feel wrong to just call up 1000 actual voters and report on the fractions found? Assuming we had a list of actual voters, why not randomly call 1000 of them? Randomly here would mean grabbing phone numbers with no set plan, no method. We could just call the first 1000 on the list, right?
Oh, yes, we could. If our total information is that there are two candidates and that everybody on the list is going to vote, we could just call the first 1000 and have an accurate poll (ignoring the other sources of error).
Why it feels wrong is because this is not our total information. We know more. We know that a small percentage of those who will actually vote will not have a phone (or, if they do, they will not all be at home when we call; though, given the first scenario of total information, non-answers are ignorable). We know that there are geographic clusters of kinds of voters, and that there are other clusters by sex, age, race, and so on.
That is, we have further positive information that we should use, in a non-random fashion, to ensure that our sample mimics the population that actually votes. We want the same percentage of cluster members in our sample as in the actual population.
In other words, we want anything but a random sample. We want a controlled sample. The only way to tell beforehand whether the PJM poll was well done is to check that the clusters it used match what we expect (via history) the actual clusters to be. After the fact, it will be easy to see what went wrong, or what went right.
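Here is a sketch of what a controlled, quota-style sample looks like in code. The clusters and their shares are invented, and `quota_sample` is a hypothetical helper, not anybody’s production method:

```python
import random
from collections import Counter

random.seed(2)

# Invented cluster shares, standing in for what history tells us about
# the electorate.
population_share = {"urban": 0.40, "suburban": 0.45, "rural": 0.15}

# A hypothetical phone list: (phone number, cluster) pairs.
clusters = list(population_share)
weights = list(population_share.values())
phone_list = [(i, random.choices(clusters, weights)[0]) for i in range(100_000)]

def quota_sample(phone_list, shares, n):
    """Walk the list, keeping names only until each cluster's quota fills."""
    quotas = {c: round(n * s) for c, s in shares.items()}
    sample = []
    for number, cluster in phone_list:
        if quotas.get(cluster, 0) > 0:
            sample.append((number, cluster))
            quotas[cluster] -= 1
    return sample

sample = quota_sample(phone_list, population_share, 1000)
print(Counter(c for _, c in sample))
# By construction the sample's cluster shares match the population's:
# that is the control. Nothing "random" guarantees this; the plan does.
```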
Random merely means unknown. There is nothing spooky or mystical about it. Grabbing a sample by rolling dice does not guarantee better results than by systematically picking voters.
If you’ve understood this, you should be able to explain why, if our total knowledge were solely “that we have two candidates” etc., calling the first 1000 names on the list is just the same as “randomly” picking names from that list.
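A hint, by simulation, with an invented electorate. When the list’s order tells us nothing about the vote, the front of the list is as good a sample as any:

```python
import random

random.seed(3)

# Invented electorate in which list order carries no information about
# the vote, which is exactly the "total knowledge" scenario above.
N, true_share = 100_000, 0.52
voters = [random.random() < true_share for _ in range(N)]  # True = Brown

first_1000 = sum(voters[:1000]) / 1000
random_1000 = sum(random.sample(voters, 1000)) / 1000

print(f"truth:       {true_share:.1%}")
print(f"first 1000:  {first_1000:.1%}")
print(f"random 1000: {random_1000:.1%}")
# Both estimates scatter around the truth identically. The dice add nothing:
# "random" merely labels our ignorance of the list's order.
```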
[1] I do this kind of thing so often that I had to indicate that these times weren’t (note the apostrophe) my fault.