The Scott Brown, Martha Coakley race is over. The results are neither surprising nor undesirable. This is because anybody who dares call another—close your eyes if you cannot stand foul language—a Yankee fan deserves to lose.
Incidentally, congratulations, Mr 41. Running for President, are we?
How did the pre-election polls stack up? Not too badly. They were useful and accurate on average. But what do “useful” and “accurate” mean?
Polls can be viewed two ways. The first is as a snapshot of opinion. That is, as a survey of voters’ thoughts given the information known at the time the poll was taken. It is understood that voters’ opinions might change if the information changes. For example, polls are often used to tell candidates what information to emphasize or to offer—or to withhold!
A poll is also a prediction. It is a guess of who will win—again, given the information available to the voters at the time the poll was taken. The information the voters know might, of course, be real or imagined.
This note explains how to use a poll’s, or multiple polls’, results to predict a winner. Before the election, of course. This is just a sketch, as a complete explanation could fill a book.
Start with an example. The PPP poll of 17 January showed a 51%/46% split for Brown/Coakley (the remaining 3% were undecided or for the third candidate). This did not mean that there was a 51% chance that Brown would win. It also did not mean that there was a 100% chance that Brown would win.
The real chance—given the poll results—of Brown winning was between 52% and 100%. How could we have guessed that chance?
First, each poll is accompanied by a “margin of error”, usually three to five percent. These numbers can largely be ignored if they are the result of the theoretical equations of finite population sampling. If they are instead the result of a model, they can be informative (see below).
The “margin of error” in any poll is meant to mean something like this: there is a 95% chance that the actual fraction of people who vote for candidate A will be 51% plus or minus 4%; that is, 47%–55%. The “95%” is never spoken, but it is implicit. It can be variable, say “90%”, but it is never meant to be “100%”—which would mean the poll is guaranteed to be accurate to within the margin of error.
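The sampling formula behind those theoretical margins is easy to sketch. Here is a minimal illustration in Python; the sample size of 500 respondents is an assumption for illustration, not a figure from any of the polls discussed:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Classical margin of error for a sampled proportion p from n
    respondents, at roughly 95% confidence (z = 1.96)."""
    return z * math.sqrt(p * (1 - p) / n)

# A hypothetical poll of 500 voters showing 51% support:
moe = margin_of_error(0.51, 500)
print(round(100 * moe, 1))  # about 4.4 percentage points
```

For fractions near 50% and samples of a few hundred to a thousand, this formula always lands in the familiar three-to-five-point range—which is why so many polls quote nearly identical margins.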
What we wanted to know was this: what is the probability that Brown will have an actual vote fraction of 50% or greater? (In races with three or more parties, this threshold changes; this also ignores special rules such as run-offs, etc.) Or: what is the probability Brown wins? This probability is always conditional on the information available to the voters at the time of the poll, and conditional on the mechanics of the poll itself, processes which we’ll ignore today (but see this link).
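Under the simplest (and crudest) reading, that probability can be approximated by treating the poll fraction as normally distributed around the truth. The sketch below rescales PPP’s 51/46 split to a two-party share; the sample size of 500 and the normal approximation are assumptions, and a real model would also account for undecideds, turnout, and non-sampling error:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def prob_win(p, n):
    """P(candidate's true two-party share exceeds 50%), treating the
    poll fraction p as normal with the binomial standard error."""
    se = math.sqrt(p * (1 - p) / n)
    return normal_cdf((p - 0.50) / se)

# Rescale the 51/46 split to a two-party share: 51/(51+46) ~ 0.526
print(round(prob_win(51 / 97, 500), 2))  # roughly 0.88
```

Notice the answer is neither 51% nor 100%: a five-point lead in one modest-sized poll leaves real room for the trailing candidate.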
The only way to produce a probability of winning from a poll is to create a statistical model. The easiest way to do this is to combine polls from multiple sources, like the website 538.com did.
They first used a (proprietary) regression model, which gave a 15-point advantage to Coakley. They also created an average of third-party polls (like PPP, Rasmussen, etc.). This average (or mean) is itself a model, albeit a simple one. They then combined the regression model with the mean model to produce a final mixed model, which was used to calculate a 74% chance of a Brown victory (on the day before the election).
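Mechanically, a mixed model of this sort can be as simple as a weighted average of the component estimates. The margins and weights below are invented purely for illustration and are not 538.com’s actual (proprietary) values:

```python
def combine(estimates, weights):
    """Weighted average of model estimates of a candidate's margin
    (positive = Brown ahead, in percentage points)."""
    total = sum(weights)
    return sum(e * w for e, w in zip(estimates, weights)) / total

# Hypothetical inputs: a regression model at -15 (Coakley +15) and a
# poll average at +5 (Brown +5), with the poll average trusted 3-to-1.
print(combine([-15.0, 5.0], [1.0, 3.0]))  # -> 0.0, a dead heat
```

The weights encode how much each component model is trusted; choosing them well is exactly the kind of art discussed below.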
The 538.com model also wisely used polls from the same sources (like Rasmussen) through time, which acknowledges that the information available to voters changes. The most recent polls were given the most weight.
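One common way to weight recency is exponential decay, where a poll’s weight halves every so many days. The half-life below is an assumed value for illustration, not 538.com’s actual scheme:

```python
def decay_weight(days_before_election, half_life=7.0):
    """Weight an older poll less: the weight halves every
    `half_life` days (an assumed value)."""
    return 0.5 ** (days_before_election / half_life)

# Polls taken 1, 8, and 21 days out get steadily smaller weights:
weights = [decay_weight(d) for d in (1, 8, 21)]
print([round(w, 3) for w in weights])
```

A poll three weeks old thus contributes only a fraction of what yesterday’s poll does, which is one way to acknowledge that the voters’ information has changed.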
More difficult to model are the results of just one poll. If, for example, Rasmussen had several polls through time, then this is similar to the situation of having polls from multiple sources. All the polls are fed into the model, which can be recalculated as each new poll comes in. Presumably, the model improves as time progresses.
But if the organization only has one poll, they must rely on their past performance on similar polls to create a prediction model. Or they must incorporate information external to the poll, such as the unemployment rate, the weather, or anything else deemed probative.
In either case—several polls through time or just one poll—previous performance on similar polls must be used to create a model. A predictive model must take as input prior poll numbers married to their actual outcomes, plus information on how long before the election each poll was taken, the geography, and so forth. But it is the past performance—the difference between the poll fractions and the actual fractions—that matters most. Incidentally, that past performance is what should supply the margins of error.
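To make that last point concrete: an empirically grounded margin of error can come from the spread of a firm’s own historical errors—poll minus outcome—on comparable races. All the numbers below are invented for illustration:

```python
import statistics

def empirical_margin(past_polls, past_results, z=1.96):
    """Margin of error earned from a firm's own track record: the
    spread of (poll minus actual) errors on comparable past races."""
    errors = [p - r for p, r in zip(past_polls, past_results)]
    return z * statistics.stdev(errors)

# Hypothetical history: final poll share vs. actual share, five races.
polls   = [0.52, 0.48, 0.55, 0.47, 0.51]
actuals = [0.50, 0.49, 0.53, 0.49, 0.50]
print(round(100 * empirical_margin(polls, actuals), 1))
```

On this fabricated history the empirical margin works out to roughly 3.6 points: comparable to the textbook three to five percent, but earned from performance rather than handed down by sampling theory.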
Finding the best information to create predictive models is an art, which is why most polling firms consider their processes proprietary. Given the importance of polling in modern politics, these processes can be extremely valuable.