Big Data, Big Predictions — Big Deal!

The title, you’ll be shocked to learn, is sarcastic. You’ll forgive the tone after reading NBC’s article
“Big Data’s Big Misses: 2016 Was a Bad Year For Predictions”, which opens

From the United States to the United Kingdom, the last 12 months will be remembered for missed calls, surprises and upsets that didn’t just beat the odds, but that shattered them – and not just in politics. On the most basic level, 2016 was a bad year for data.

Au contraire, mon Probabiliste: it was a terrific year for data. People from boardrooms to broadcast booths slavered over it. Articles that glowed harder than the Fukushima pile spilled forth, a cataract. Classes both near and far were conducted. Fealties were sworn. Big Data was—and is—in. Why, and why this is inevitable and bound to lead to disappointment, below. First:

Going into Election Day, most odds makers had Trump as long long shot, with a less than 30% chance of winning the presidency. The data mavens at, gave Trump a 28.6% chance of winning. The Upshot at the New York Times gave him a 15% chance of winning. Others had the numbers much lower, 2% or less.

Reminder: these probabilities differ because the evidence differs. All probability is conditional.

The article goes on to list major lapses in sports prognostication, then continues:

All those results made the numbers, and the people who created them, look silly. In other words, many of us may be focused on how pollsters “got it wrong” in the U.S. presidential race, but those data wranglers had plenty of company in Las Vegas and London. (And remember, as many pollsters will remind you, Hillary Clinton did win the popular vote.)

But the misses on Trump and Brexit were more complicated than the sports books. They were about misreading and mis-predicting the behavior of millions of people. And in those two cases, the data tools may be part of the problem.

Most of the traditional data measures on the 2016 election data were pretty consistent in showing a big Clinton win. It could be that 2016 is telling us the electorate now functions differently.

The data tools are part of the problem in two ways: (1) the models themselves, and (2) the false belief that everything can be quantified. We’ll go in inverse order.

Science oft leads to scientism, of which a symptom is the belief that everything can be quantified, or at least approximately quantified. Thus (partly) the phenomenon of Big Data. (The facet of Big Data that merely involves storage, processing, and computation of massive buckets of bits, while important, such as how individuals can be tracked, I ignore; I am interested here only in prediction.) Big Data is at least the hope that if enough things are measured, prediction will be easy. Cram every tidbit of information into some universal algorithm, ask that algorithm a question, and out will pop the answer.

Computers get better, faster, stronger. And, after all, do not computers now routinely beat humans in chess and go? They do, but those games have known rules which are of trivial complexity. What are the rules that account for human behavior? How do you program them? Nobody knows the answer to either question. The hope that as technology progresses we’ll learn these rules is bound to fail.

Some have the idea that, eventually, the brain and body will be mapped down to the atomic level. Imagine that this is so. A person’s behavior, these people say, can then be projected (and understood) within the uncertainties imposed by quantum mechanics (or whatever might be its replacement). This follows. If we really could, never mind how, know and computerize how every element in a person’s body interacts, then via true rules of elemental interaction, as in the equations of motion of a gas, we could predict that person’s behavior. We could predict his very thoughts!

Alas, the missing assumption is that man is entirely a physical creature, which is false. There isn’t a person alive (though there may be many dead) who understands the rules of spiritual interaction. There is no way to quantify the spiritual, and thus no way to quantify the spiritual-physical interaction. Thus complete persons will ever be closed off to Science.

Besides, we’re never interested in persons, but person-environments. Even if you could map every element in a person, and there was no spiritual dimension, you’d have to also map every element in his universe to capture the person-environment interactions. Meaning, of course, the computer you use would have to be larger than the universe mapped.

There is more hope that we can predict group behavior. People are quirky but mobs are predictable, it is said. Compile data into big enough a pile and we can show just how far ISIS will have spread fifty years hence. Which means also predicting how non-ISIS groups interact with ISIS. Which means we’re still in deep kimchee, accuracy-wise.

Much, much more to say on this topic. Today is only a teaser. About the election predictions themselves, before the outcome I wrote this, cautioning against the idea that we have correctly identified all the things that should—and can—be measured. Computer models, as is obvious, are biased towards that which can be measured. That which can’t be measured, even if it is that which is most important or influential, won’t be computed. This is a tautology with a twist. Because that which is computed is computed on a computer, and labeled inter alia “sophisticated”, the output will be accorded undue weight because of scientism.

About the ineptness of the models themselves, see this book, especially the latter chapters. One must admit (as I do in the book and elsewhere) that computer scientists are geniuses at naming their algorithms. They sing. They beguile. They imbue the same hopefulness in one’s breast as do those infomercials one sees upon waking after falling asleep in front of the tube about new ways to chop vegetables and cook and serve them with no mess. Yet when the algorithm arrives in the mail, and you unwrap it and try applying it to human behavior in real life, it works just as well as do those newfangled vegetable peelers.


  1. Michael Dowd

    Maybe the pollsters should have spent more time on State data, considering the Electoral College and all. But, of course, this would have run up the bill and made it much harder to provide a simple answer to the public. Providing simple answers to the public is what the news business is all about. Sex and simplicity sell.

  2. Mike86

    Yet what if a long-term program could be implemented whereby people could be exposed over more than a decade to controlled situations where their responses were monitored with feedback to correct those outside the desired range? Particularly if the people involved were not told what was happening and believed they were undergoing a necessary course. If successful, the program could overcome many of the issues with varied responses, making people largely predictable and, potentially, controllable.

  3. Bill_R

    “Imagine how much harder physics would be if electrons had feelings” – attributed to Richard Feynman

    If one’s models are based on urns and simple measurement error asymptotics, it is not surprising that they don’t work too well with things that have free will and can react to changes in their environment.

    In marketing, discrete choice models (a group model) work fine until someone changes the choice set….

  4. Ray

    “Because that which is computed is computed on a computer, and labeled inter alia ‘sophisticated’, the output will be accorded undue weight because of scientism.”
    Too true. Back when I used to do a lot of computer programming we used to joke, “garbage in, gospel out.” As long as the results came on a computer print out, people believed it.

  5. Ye Olde Statistician

    “I have been saying that modern science broke down the barriers that separated the heavens and the earth, and that it united and unified the universe. And that is true. But, as I have said, too, it did this by substituting for our world of quality and sense perception, the world in which we live, and love, and die, another world—the world of quantity, of reified geometry, a world in which, though there is place for everything, there is no place for man.”

    — Alexandre Koyré, Newtonian Studies

    Regarding the curse of Big Data:

  6. Milton Hathaway

    As an engineer reading this blog, much of the philosophical goes over my head. Big Data, I gather (I had to research that term a bit), appears to be another shiny new tool that can allow one to study intractable problems. When a new tool comes along, it’s easy to become enamored with it, like that man with only a hammer in his tool belt who saw all problems as nails.

    When engineering simulation tools first came into widespread use, it was common to see engineers wasting vast amounts of time, since using the tools felt productive. We learned through trial and error that there were trade-offs. First and foremost, there was a need to constantly validate a tool’s output against reality, and tweak the input until it matched. Second, we often had to give up on understanding some of the more complex details of how the design actually worked. But our productivity vastly improved, so we never looked back.

    Anyway, implying that Big Data is a bad tool comes across to me as blaming the hammer when it’s used to pound screws.

    Veering off topic, this blog has caused me to think a lot more about cause and effect. I am starting to form a theory about where many cause/effect investigations go astray. It’s pretty simple, and perhaps addressed in the “Uncertainty” book, but I haven’t made it to the latter chapters yet.

    Say you have an undesirable state of affairs – let’s call it “A”. Situation A might be, for example, parts failing on an assembly line. Your job is to find out what is causing A and implement remedial actions. Let’s say you don’t know what is causing A, but you can list candidate causes. Let’s call them B, C, D, etc. If there were control knobs labeled B, C, D, you could twist the knobs one at a time and watch what happens to A. Since there are no such knobs (or there actually are, but messing with them is way too expensive), another approach is needed.

    If you have past measured data for A, B, C, D, you could calculate correlations between A and each of B, C, and D. But think about that for a minute. What is a measured quantity going to do? Well, it could increase, it could decrease, or it could stay the same. Since A (failure rate) has increased, if B, C, or D have either increased or decreased, there will be a correlation, which could very possibly be spurious.

    Now think back to the knobs we wished we had. If there was a knob labeled B, you would turn it, and watch what happens to A. But would you stop there? Of course not! No matter if A moved in response or not, you would turn knob B back the other direction and watch what happened. If you think you see a response, you very well might (especially if you are like me) twist the knob back and forth continuously, while slowly adjusting the speed at which you are doing so. If A responds in a correlated way (i.e., you can make A ‘dance’ by twisting knob B), you will be almost certain that B at least causally contributes to changes in A.

    So now back to our past data (A,B,C,D). How do we replicate knobs?

    I took a signal processing class in college. The professor had an ongoing side job as an industry consultant, and mixed in practical advice with the theoretical course material. When analyzing data, the first thing he always did was to fit a trend line to the data and subtract it out. He did this with minimal explanation – I remember thinking it was done because of drifting measurement instrumentation or because it messed up the math, but I took it with a grain of salt. Data was data, right?

    So my theory is that trends in data are so often meaningless or misleading that the professor was correct in removing them before processing the data. If there is still anything left of evidentiary value, it will still yield a correlation. But if the only correlation was due to the trends, then you need more data.
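
A minimal sketch of the detrending idea in the comment above, not the professor's actual procedure. It assumes two invented series: a failure rate A that drifts upward and a candidate cause B that drifts downward, each with independent noise. The names `failure_rate`, `candidate_b`, `linear_fit`, `detrend`, and `correlation` are hypothetical, and the slopes and noise levels are made up for illustration.

```python
import random


def linear_fit(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept).
    """
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def detrend(ys):
    """Subtract the fitted trend line, leaving only the residuals."""
    slope, intercept = linear_fit(ys)
    return [y - (slope * x + intercept) for x, y in enumerate(ys)]


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 200

# Assumption: A and B share nothing except that each happens to drift over time.
failure_rate = [0.05 * t + rng.gauss(0, 1) for t in range(n)]
candidate_b = [-0.03 * t + rng.gauss(0, 1) for t in range(n)]

raw_r = correlation(failure_rate, candidate_b)
detrended_r = correlation(detrend(failure_rate), detrend(candidate_b))

print(f"correlation with trends left in: {raw_r:+.2f}")
print(f"correlation after detrending:    {detrended_r:+.2f}")
```

With the trends left in, the two series look strongly related even though their noise is independent; once the trend lines are subtracted, little correlation should remain. A straight-line fit is only one choice of trend, and real data may call for something else.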

  7. Michael Dowd

    Predictions of life under Trump from Democrat Joel Ross. Most of those who read this column should be very happy.

    This guy is a futurist and a Democrat with no love for Trump, but his predictions make a lot of sense. Worth the read.

    Joel Ross writes a subscription newsletter called The Ross Rant, from New York City, for real estate investors. He provides an interesting perspective on the Trump presidency.

    The black swans held a massive victory party last night. Not because Trump won, but because they showed the world that they are the real rulers.

    Being a NY real estate guy, I have always known Trump is a terrible person. My former partner who at one time was worldwide head of real estate at Citibank, hated him. She and he had bitter battles when he was bankrupt and she was seizing assets. I know other bankers, lawyers and contractors who have had dealings with him, and nobody had good things to say. However, now he is president and we need as a nation to be fully supportive or the world will come apart. As I feared, the press is already on the attack with their usual rhetoric. CNN today was its usual self – attacking him and having Chris Cuomo and Amanpour make stupid statements, still blaming the Russians. MSNBC actually had Al Sharpton on as a commentator. Shows you how out of touch the media is. Kellyanne told Cuomo to stop the negativity and he went on to try to claim “it is not us, it was the campaigns”. The press just does not get it. The NY Times was a loser for the last several years, and it will continue to decline at a faster pace. CNN has lost the battle to Fox and will likely have to change out the commentators who are now completely discredited. Many years ago Connie Chung made a luncheon speech I was at, and she said the press had devolved into a bunch of lazy unprofessional kids who just rush out a story without bothering to check veracity. She decried how the press had become rumor mongers and unprofessional. They sure have. They will probably not change much until they realize the world has moved on to social media and the press has minimal credibility now.

    If you have been reading The Rant for a long time you know I have been saying the world is changed and very high risk, and the black swans are circling. We just entered a major inflection point in history. I have been reporting that in Europe the right wing is ascendant. LePen is likely to win the French election next year. The EU is going to come apart once that happens and now with Trump in power that trend will accelerate. The EU will realign into blocs and there will be massive turmoil as things sort out over the next several years. The French will go back to the Franc. Germany will shift right as the refugees create social, crime and fiscal issues. As ISIS gets destroyed they will try to wage war in Europe thru more terror attacks. You do not want to invest in Europe. The world is rapidly shifting right and the changes will be generational. Brussels will be neutered. NATO countries will invest much more in defense and will be forced to build up their armies.

    Here is what I believe will happen in the US. Trump has two years to make massive changes and this is what I believe they may be. Obamacare is replaced with some type of more free market plan that Ryan already has. Corporate taxes will be reduced to maybe 15% or maybe a bit higher. Personal taxes will be reduced. All executive orders by Obama will be reversed. Most of the massive regulation Obama put in place will be cancelled. The Supreme Court will get a conservative justice right away and Ginsburg will try to hang on until she dies in office to try to deny him her seat. She will not last 4 years. Trump will get at least 2 and maybe 3 judge picks. The Supreme Court will decide by what strict constructionists think the Constitution says, not the left wing politics of Sotomayor. It will be much more pro-business. Antitrust cases will go away. Sanctuary cities will lose funding and San Francisco and Berkeley and Boulder will go nuts. The border will somehow be secured, and Mexico will not pay. Border Patrol will be materially increased. Gang members will be arrested and deported but everyone else will get to stay here. Ryan will stay as Speaker. Trump will do what any good NY real estate guy does, he will get up from the table until he gets a deal close to what he wants.
    That is key to a lot of what Trump will be able to do. If you listened carefully to what he said, it was I will redo NAFTA and will walk from the table if I do not get what we need.

    There will be a revised NAFTA but Mexico will suffer a lot because many US companies will not move plants there until they see what revised NAFTA says. They will also not defy Trump early on and risk his wrath. Mexico takes a big hit. The Pentagon and US defense contractors are big winners. Defense spending will ramp up by huge numbers. The military will add over 200,000 people over the next two years. Weapons spending will dramatically increase. This will add a lot of new jobs between the additional military and the added jobs in defense plants. Private equity will take a big hit with carried interest going away and this will make a small part payment for the tax cuts. Estate taxes will mostly go away. Cops will be respected again and racial strife will end as Trump tries new ideas to build charter schools, and rebuild the ghettos. There will be no more honoring the families of the thugs like Brown and Trayvon the way Obama and Hillary did. He will honor the cops. The downtrend in crime will get reinstated. Transgender anything will go away.

    The military will be told to go win wars and not be social experiments with transgender soldiers. Rules of engagement will be changed to kill the enemy instead of cater to political correctness. There will be an infusion of another 5,000 US soldiers into Iraq and more into Syria to back up the destruction of ISIS. The bombing campaign will be stepped up to what it should be. By March ISIS will have been defeated. They will try to carry out major terror attacks, but now the world will call Islamic terror what it is, and there will be a more aggressive fight and coordination. Putin and Trump are from the same mother and will get along. Putin is like all bullies – he will realize he cannot push Trump around like he does Obama, and he will work out a modus vivendi because he knows he has at least 4 more years to deal with a new US president. Bullies back off if they find they cannot intimidate the other guy. The Iran nuke deal will get torn up and Iran will find itself back under sanctions. The Germans will scream, but Merkel is now in a very weak position, so she will not be able to stop Trump from re-imposing them at least for US companies, and anyone wanting to do business in the US, especially banks. This will be world-changing. The Saudis won big on this, Israel won huge. Developers win because the EPA will be de-fanged. Climate change legislation is dead, and the Paris pact will be defunct.

    College campuses will no longer have the threat that unless they find a bunch of young guys to charge with sexual whatever they lose funds. PC on college campuses will be pressured to end although for quite a while there will be protests and other such things. Today, professors are telling students they do not have to take exams because they are so upset. Give me a break. This is exactly what is wrong with American colleges today. It is just telling kids boo hoo if you feel bad, you get excused from work.
    I still think Trump is nuts and a really terrible person, but he is president now, so we need to deal with reality of what next. Paul Ryan has already reached out to heal the rift, and they have already planned a quick special session to pass repeal of Obamacare and undo the regulations and do other things quickly. Ryan will remain as speaker. The SCOTUS vacancy will be filled immediately.

    Most important the entire world is about to change. We will see if for good or bad but change it will in massive ways. The tide of anti-socialist, anti PC, anti-diversity, anti-entitlement, anti-establishment of the past 70 years is washing across the world and Trump is simply the ultimate example of what had already been happening with Brexit and in Europe. As far as the stock market – it will now rise. Taxes will get cut, the Supreme Court will not be activist, anti-trust will end, some type of infrastructure program will be instituted, defense spending will jump, banks will be free to lend, regulations will be drastically reduced, and corporate profits will rise.

    Go all in now. You already see the market reaction is up after the shock. Wall Street elite misplaced their bets, Hollywood and the press way overplayed their hands, and college professors and administrators will have to get over it. Hillary and crew go to jail.

    I started Ross Rant in late 2007 when I told a tiny group of friends the market was going to crash and nobody believed me. I sold out of the market in May 2007. My broker thought I was nuts. In January 2008 I told people at the big hotel conference that the hotel industry was going to crash in 2008. They told me I was stupid. I began emailing my little group of friends with these kinds of thoughts and so the Rant was launched. They encouraged me to expand that, and so began the Rant. In August 2008 I told a few friends driving to a golf outing that Lehman would go bankrupt, and nobody believed me. So now I said Trump and the Republicans would win, and few believed me. In fact, one of my good friends of 50 years bet me $100. So how did I do that? Not smarts. I have no data base. Difference is, I spend a lot of time observing trends and traveling, and talking to real people on the ground all across the country at all walks of life. I do not read the NY Times, nor do I watch NBC or CBS or CNN. I talk to lots of real people who are doing whatever they do day to day and try to glean tidbits and connect the dots. Any of you can do the same and probably better than me. I just try to disconnect myself from what the talking heads are saying, and the latest fad of political thinking, and try to see it from a distance with some objectivity.

    To my Democrat friends, suck it up – the world just changed dramatically – we will see if for better or worse.

  8. Bill_R

    Thanks, Dave. That is a great site. Still has to be approved (see the FAQ).

  9. DAV

    Most of the traditional data measures on the 2016 election data were pretty consistent in showing a big Clinton win. It could be that 2016 is telling us the electorate now functions differently.

    The data tools are part of the problem in two ways: (1) the models themselves, …

    You really can’t blame the models used to predict the 2016 election. A model can only be as good as the data used in its creation. Low quality data leads to incorrect models. So, yes, having lots of data does not guarantee your models will have any validity. Still, other than eschewing models altogether, you still need to construct them.

    Polling data were skewed by two factors. The first was selection bias: polling in echo chambers. The second was assuming the popular vote (as gauged in the echo chambers) would be evenly distributed among the electoral votes.

    That the polls were indicating an outcome desired by many in the media led to their unquestioning acceptance. Not just by the media but by the Hillary camp as well. Team Trump apparently realized the problem with the polls and campaigned accordingly.

    … and (2) the false belief that everything can be quantified.

    I submit that everything outside of data consisting only of names or unconditional concepts can be quantified. Names include things like colors and zip codes. It’s also hard to quantify concepts (such as peace) without specifying conditions.

    Anytime you can say X is greater than Y, though, you are implicitly quantifying both X and Y. That doesn’t mean you can do so with much precision. Examples: how many trees does it take to make a forest (yet a forest has more trees than a group of ten trees) and at what temperature does cold turn into warm (yet we can still say it is warmer today than some other day)?

    The election was neither unquantifiable nor was it unpredictable — in theory anyway.

    Milton was right: you seem to be blaming the hammer for being a bad screwdriver.

  10. Ye Olde Statistician

    DAV is right in part. The only thing wrong with models is that they almost never pan out. But the data (and behind them the lurking shadow of sampling error) is only part of the problem. The model structure can also be problematical.

    1. Electoral vote. Any US election model that fails to take the electoral college into account is simply structured wrong. But (at least as reported) the polls treated the vote as if it were a national popular vote. (In 2000, Al Gore warned that even though the polls at that point put Bush ahead, what mattered was the electoral vote, which at that point he expected to win. Fate chuckled and flipped polarity on him and he was hoist on his own petard.)

    2. Propagation of variances. If we estimate the popular vote in each State, it comes with sampling variation (a/k/a sampling “error”). Since each State sample is smaller, the error of estimates will be higher. To obtain the national vote, the electoral votes must be summed and the errors of estimate will add in quadrature. Two States allocate electoral votes by Congressional district; the rest are winner-take-all. Hence, if the uncertainty of the estimate overlaps 50-50, some chunk of the vote may flip from one side to the other because the actual tally came up slightly on one side rather than the other.

    3. Likely to vote. It doesn’t matter what people report to a pollster; only whether they “pull the lever.” (And by all accounts, the Clinton campaign sorely neglected the ground game on election day.) How does the pollster know if Adam Apple is going to take a strong quaff of offyerass and head on out to the polling place? You can’t just ask Adam. He knows he’s supposed to vote and will tell the pollster so. In at least some polls, the “likely-to-vote” parameter was estimated by looking at party affiliation and the proportion of that party that showed up to vote in the previous election. Thus a respondent who claimed to be a Democrat (it was not actually verified against registration rolls) was considered more likely to vote because a higher percentage of Democrats voted in 2012. But Obama was a more inspiring candidate than Mrs. Clinton and many Democrats sat on their hands after Bernie was euchred out of the nomination. So that was another structural flaw in the model.

    4. Precision is not accuracy. That ±3% they (sometimes) tell you about is precision in estimating the parameter. In QC work, we often noted that one might obtain a very precise estimate around a very wrong answer. Accuracy cannot be calculated from the data, but only from comparison to a standard taken as True. A common source of inaccuracy (a/k/a “bias”) is Non-response. The models generally assume that “Undecided” or “Not Responding” will break down in the same proportions as those who state a clear preference. But this is not always the case.
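
A rough sketch of points 2 and 4 above, under loudly stated assumptions: three invented winner-take-all states, simple random samples, a normal approximation, and independent state errors. The names `standard_error`, `quadrature`, `flip_probability`, and the `states` table are hypothetical and do not come from any pollster's model. It captures sampling error only (precision); it says nothing about bias from non-response or a bad likely-voter screen (accuracy).

```python
import math


def standard_error(p, n):
    """Sampling standard error of a proportion p estimated from n respondents."""
    return math.sqrt(p * (1 - p) / n)


def quadrature(errors):
    """Combine independent errors: square root of the sum of squares."""
    return math.sqrt(sum(e * e for e in errors))


def flip_probability(p_hat, n):
    """Normal-approximation chance that the true share lies on the other
    side of 50% from the estimate p_hat."""
    z = abs(p_hat - 0.5) / standard_error(p_hat, n)
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical states: (name, estimated share for candidate X, sample size, electoral votes)
states = [("A", 0.52, 600, 20), ("B", 0.51, 500, 29), ("C", 0.47, 400, 10)]

expected_votes = 0.0
per_state_sd = []
for name, share, sample_size, votes in states:
    flip = flip_probability(share, sample_size)
    win = 1 - flip if share > 0.5 else flip  # chance X actually carries the state
    expected_votes += votes * win
    # Winner-take-all: the state's whole bloc of votes is all or nothing.
    per_state_sd.append(votes * math.sqrt(win * (1 - win)))
    print(f"{name}: SE = {standard_error(share, sample_size):.3f}, flip chance = {flip:.2f}")

total_sd = quadrature(per_state_sd)
print(f"expected electoral votes for X: {expected_votes:.1f} +/- {total_sd:.1f}")
```

The smaller state samples give larger standard errors than a single national sample of the same total size would, and because each state's bloc moves as a unit, a lead of a point or two leaves a wide spread in the electoral total.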

    I wrote about models a while back and a few people might find it interesting or amusing. Part I is here:

  11. Jonathan S

    @Dave Halliday:
    It does indeed look interesting, although my curiosity is tempered somewhat by seeing the egregious Messrs Cook and Lewandowski cited [under Syllabus/week 12].
