Part I of III
All probability (which is to say, statistical) models have a predictive sense; indeed, they are only really useful in that sense. We don’t need models to tell us what happened. Our eyes can do that. Formal hypothesis testing, i.e. chasing after statistical “significance”, leads to great nonsense and is the cause of many interpretational errors. We need models to quantify the uncertainty of what has not yet been measured or made known to us. Throughout this series I take models in that sense (as all should).
Which is this. A model—a set of premises—is used to make predictions about some observable Y, a proposition. For example, a climate model might predict what the (operationally defined) global mean surface temperature will be at some time, and Y is the proposition “The temperature was observed at the time to be y”. What I have to say applies to all probability models of observable events. But I’ll use temperature as a running example because of its familiarity.
If a model said “The temperature at the time will be x” but it was really y, then the model has been falsified. The model is not true. Something is wrong with the model. The model said x would occur but y did. The model is falsified because it implied x would happen with certainty. Now the model may have always hit at every time up to this point, and it may continue hitting forever after, but it missed this time, and all it takes is one mistake for a model to be falsified.
Incidentally, any falsified model must be tossed out. By which I mean that it must be replaced with something new. If any of the premises in a model are changed, even the smallest, least consequential one, then strictly the old model becomes a new one.
But nobody throws out models for small mistakes. If our model predicted accurately every time point but one we’d be thrilled. And we’d be happy if “most of the time” our forecasts weren’t “too far off.” What gives? Since we don’t reject models which fail a single time or are not “too far off”, there must be hidden or tacit premises to the model. What can these look like?
Fuzz. A blurring that takes crystalline predictions and adds uncertainty to them, so that when we hear “The temperature will be x” we do not take the words at their literal meaning, and instead replace them with “The temperature will be about x”, where “about” is happily left vague. And this is not a problem because not all probability is (or should be!) quantifiable. This fuzz, quantified or not, saves the model from being falsified. Indeed, no probability model can ever be falsified unless that model becomes (at some point) dogmatic and says “X cannot happen” and we subsequently observe X.
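Here is a minimal sketch of that last point, in Python. It assumes a model can be written as a table of probabilities over possible temperatures; the numbers, and the names `is_falsified`, `dogmatic`, and `fuzzy`, are invented purely for illustration.

```python
def is_falsified(prob_of_observed: float) -> bool:
    """A model is falsified only if it gave probability 0 to what was observed."""
    return prob_of_observed == 0.0

# Dogmatic model: "The temperature will be 15", all probability on one value.
dogmatic = {15: 1.0}

# Fuzzy model: "about 15", with hypothetical weights spread over nearby values.
fuzzy = {13: 0.1, 14: 0.2, 15: 0.4, 16: 0.2, 17: 0.1}

observed = 16

# Any temperature absent from a table is one the model said could not happen.
dogmatic_falsified = is_falsified(dogmatic.get(observed, 0.0))
fuzzy_falsified = is_falsified(fuzzy.get(observed, 0.0))

print(dogmatic_falsified, fuzzy_falsified)
```

The dogmatic table is falsified by an observation of 16 and the fuzzy one survives. Note that this particular fuzzy table would still be falsified by an observation of 20, because it, too, says some things cannot happen.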
Whether the fuzzy premises—yes, I know about fuzzy logic, the rediscovery and relabeling of classic probability, keeping all the old mistakes and adding in a few new ones—are put there by the model issuer or you is mostly irrelevant (unless you’re seeking whom to blame for model failure). The premises are there and keep the models from suffering fatal epistemological blows.
Since the models aren’t falsified, how do we judge how good they are? The best and most basic principle is how useful the models were to those who relied upon them. This means a good model to one person can be a poor one to another. A farmer may only care whether temperature predictions were accurate at distinguishing days below freezing, whereas the logistics manager of a factory cares about exact values for use in ordering heating oil. An environmentalist may only care that the forecast is one of doom while being utterly indifferent (or even hostile) to the actual outcome, so that he can sell his wares. The answer to “What makes a good model” is thus “it depends.”
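To make “it depends” concrete, here is a small sketch in which the same forecasts are scored by two different users. The temperatures (in degrees Celsius), the freezing threshold of 0, and both scoring functions are my own assumptions for illustration, not anybody's official verification method.

```python
# Hypothetical forecasts and what was later observed (degrees Celsius).
forecasts = [3.0, -1.0, 5.0, -4.0]
observed = [1.0, -6.0, 9.0, -2.0]

def farmer_score(fcst, obs, freezing=0.0):
    """Fraction of days on which forecast and observation agree about frost."""
    agree = [(f < freezing) == (o < freezing) for f, o in zip(fcst, obs)]
    return sum(agree) / len(agree)

def manager_score(fcst, obs):
    """Average absolute miss, which matters when ordering heating oil."""
    misses = [abs(f - o) for f, o in zip(fcst, obs)]
    return sum(misses) / len(misses)

farmer = farmer_score(forecasts, observed)
manager = manager_score(forecasts, observed)

print(farmer, manager)
```

With these made-up numbers the farmer sees a perfect record, since every frost day was called correctly, while the manager sees forecasts that miss by 3.25 degrees on average. Same model, same forecasts, two different verdicts.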
Of course, since many decisions fall into broad categories we can say a little more. But in so saying, we must always remember that goodness depends on actual use.
Consider the beautiful game of petanque, wherein manly steel balls are thrown towards a target. Distance to the target is the measure of success. The throw may be thought of as a model forecast of 0 (always 0) and the observation the distance to the target. Forecast goodness is taken as that distance. Linear distance, or its average over the course of many forecasts, is thus a common measure of goodness. But only for those whose decisions are a linear function of the forecast. This is not the farmer seeking frost protection. Mean error (difference between forecast and observation) probably isn’t generally useful. One forecast error of -100 and another of +100 average to 0, which is highly misleading—but only to those who didn’t use the forecasts!
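The arithmetic of that last example can be checked in a few lines. This sketch uses only the two errors mentioned above; the variable names are mine.

```python
# The two forecast errors from the example (forecast minus observation).
errors = [-100.0, 100.0]

# Mean error: the signs cancel, hiding two very bad forecasts.
mean_error = sum(errors) / len(errors)

# Mean absolute error: the petanque-style average distance from the target.
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(mean_error, mean_abs_error)
```

The mean error comes out to 0 and the mean absolute error to 100, which is why average distance, and not average signed error, is the measure that matches the petanque picture.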
You can easily imagine other functions of error as goodness measures. But since our mathematical imagination is fecund, and since there are an infinite number of functions, there will be no end to these analyses, a situation which at least provides us with an endless source of bickering. So it might be helpful to have other criteria to narrow our gaze. We also need ways to handle the fuzz, especially when it has been formally quantified. That’s to come.
Update: Due to various scheduling this-and-thats, Part II of this series will run on Friday. Part III will run either Monday or Tuesday.