Reminder: The Thursday Class is only for those interested in studying uncertainty. I don’t expect all want to read these posts. So please don’t feel like you must. Yet, I have nowhere else to put them besides here. Your support makes this Class possible for those who need it. Thank you.
What makes a mathematical score of model goodness proper?
Video
Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty
HOMEWORK: Read!
Lecture
We have our model statement in hand, Pr(Y|M), and indeed we might have a collection of these, Pr(Y_i|M). We learned the last two times the very best measure of model goodness is the judgment that you, and only you, make based on how worthy the model was to you. One and the same model may be wonderful to you, and lousy to another depending on the worthiness criteria. We also learned there is no general solution, unless the individual case involves a known necessary moral truth.
I warned us, vehemently, that judgment measures that are mathematical ought not be loved or trusted merely because they are mathematical. I warned us that even given this forceful admonition, which I am repeating here, you will fall prey to the allure of mathematics anyway. Let’s prove that.
A substantial class of Y in life are dichotomous: Yes/No, True/False, 1/0, White/Black, Male/Female, Greater than X/Less than or equal to X, AM/PM, Precipitation/Dry, and so on and so forth. Math comes easily to this class. Usually with an ‘indicator’ function, i.e. I(Y), which equals 1 when Y is true and equals 0 otherwise. In some notation the indicator function is left off but tacit, which is fine as long as you remember it is there.
We’re doing math, so we’re interested in scores, which in some cases might be called “loss functions”, quantifying the penalty you pay for being wrong. No penalty is paid for being right. But you might also gain by being right, so the broader class of math can be called gain-loss functions. Which is best is, again and again and again, that which you used, implicitly or plainly, in employing any model.
Let p = Pr(Y|M), assuming Y is dichotomous. Last time we used the score (I(Y)-p)^2, or, dropping-but-remembering the indicator function, (Y-p)^2. Why this form? Easy enough: if we imagine we have some loss function L(Y,p), but don’t quite know its mathematical form, a simple Taylor series expansion gives a constant + (Y-p)^2 + higher order terms. As in most Taylor series, the higher order terms are ignored, and the constant is not important.
The score (Y-p)^2 is called Brier in honor of the fellow who popularized it. As we said last time, this has “nice properties”, like always being positive, having a known interval, and so on. And it can handle local truths and local falsities. These are times in which M dictated p = 1 or p = 0.
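The Brier score is simple enough to sketch in a few lines of code. This is a minimal illustration under the indicator convention above, where Y is 1 (true) or 0 (false); the function name is mine, not standard.

```python
# Minimal sketch of the Brier score for a dichotomous Y,
# with Y coded as 1 (true) or 0 (false).
def brier(y: int, p: float) -> float:
    """Squared-error score (Y - p)^2; 0 is best."""
    return (y - p) ** 2

# Local truth: M dictated p = 1 and Y turned out true -> perfect score.
print(brier(1, 1.0))   # 0.0
print(brier(1, 0.7))   # ~0.09
print(brier(0, 0.7))   # ~0.49
```

Note it handles the local truths and falsities (p exactly 0 or 1) without complaint, which the log-loss below cannot.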
There are many other loss scores. Another popular one is the log-loss:
$$S(Y,p) = -Y\log(p) -(1-Y)\log(1-p).$$
(Some just have the first term.) If Y = 1 the closer p gets to 1, the closer the score gets to 0 (which you recall by convention is best). Sort of. And when Y = 0 the closer p gets to 0, the closer the score gets to 0. Sort of. Obviously, the log-loss score cannot handle local truths and local falsities. The (natural) log of 0 blows up to negative infinity.
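A sketch of the log-loss above makes the blow-up concrete; the function name is mine, and Y is again assumed to be 0 or 1.

```python
import math

# Sketch of the log-loss S(Y,p) = -Y log(p) - (1-Y) log(1-p).
# Assumes Y in {0, 1} and 0 < p < 1.
def log_loss(y: int, p: float) -> float:
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

print(log_loss(1, 0.99))  # close to 0: good
print(log_loss(1, 0.01))  # large penalty: you said ~No and Y was Yes
# log_loss(1, 0.0) raises a math domain error: the score cannot
# handle local truths and falsities, i.e. p exactly 0 or 1.
```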
This score is motivated by reasoning about entropy. It looks almost like entropy, too, which is usually written
$$H(p) = -p\log(p) -(1-p)\log(1-p),$$
which of course comes from (again, usually written)
$$H(x) = -\sum p(x) \log p(x),$$
and which shows how this score can be generalized where Y has more than two categories.
We’ll hold off (again) our discussion, our very important discussion, of entropy. Right off the bat, though, if I have trained you well, your alarm bells ought to be clanging madly.
If they are not clanging, pause and see if you can discover why they are not.
Question: Since there is no such thing as probability, how can there be such a thing as entropy?
Answer: There cannot.
Entropy, like probability, is always conditional. In other words, we ought to always write
$$H(x|E) = -\sum p(x|E) \log p(x|E),$$
for some evidence E (like a model). Therefore, just as nothing has a probability, nothing has an entropy. Entropy is a measure of knowledge, and not of things, just like probability.
Meaning you caught me in a laziness. We never have a score written S(Y,p), but we have S(Y,p | W). Which is also S(Y_reality, Pr(Y|M) | W). The M and W are always there, even when we forget to write them. That we often forget to write them is why we fall in love with the math, slipping into the idea that the math always applies because it is a property of the world. Which it isn’t.
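For the two-outcome case, the entropy formula above reduces to a one-liner. Here is a sketch, where p stands for Pr(Y|E) for some evidence E (the E is there even though the code, lazily, doesn’t write it either):

```python
import math

# Binary entropy H(p|E) = -p log(p) - (1-p) log(1-p),
# where p = Pr(Y|E) for some evidence E.
def binary_entropy(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0  # a local truth or falsity carries no uncertainty
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

print(binary_entropy(0.5))   # maximal uncertainty: log(2) ~ 0.693
print(binary_entropy(0.99))  # near-certain evidence: close to 0
```

Entropy is maximized when the evidence says nothing either way (p = 1/2), and vanishes at certainty, which is what a measure of knowledge ought to do.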
You’ll have noticed these are only loss scores, and say nothing about gain. These, and many others, are also symmetric: you score as badly for being wrong in either direction. Whereas in Reality, there is gain in using good models, where ‘good’ is relative, and where we saw a prime example with Paul Ehrlich, as well as loss in using bad ones. And the two, gain and loss, are not always symmetric.
A prime example is breast cancer screening. There is a real price to pay for a screening that turns up a false positive, but also much can be gained, perhaps far different from that loss, for catching an early cancer. There’s of course a lot more to this, and we’ll cover all that separately.
The point is that “scores”, or rather “utility functions” can (but not necessarily) exist for these. When predictions involve money, the natural utility is that money, but even then not all money means the same to all people. There is great temptation to quantify emotions as if they were the real utility. Not that emotions aren’t important, but putting arbitrary numbers to them can be ridiculous, especially when one is doing so because that’s what the software offers.
A utility function is nothing more than S(Y,p|W) in any general form. Which means there are an infinite number of such functions. We’ll come to individual ones when needed.
This brings us to the idea of a proper score.
This can be hard to read about because of weird-looking (but perfectly correct) notation often used. The idea is we want a score that can’t be “gamed”, or not easily. Recall your model statement is p = Pr(Y|M). Here’s an example of a score that is easily gamed:
$$S = 1 - I(p\in \{0,1\}) - I(p\in (0,1))\,p,$$
which is 1, minus the indicator of whether p is either 0 or 1, minus p times the indicator of whether p is in the open interval (0,1). In other words, the best score is when the p is extreme. Suppose your p = 0.7. If you reported that, you’d get S = 1 - 0 - p = 0.3. Yet if you set aside that p, which your M implied, and picked p = 0 instead, or p = 1, then your S = 0.
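A sketch of this gamed score makes the perverse incentive plain (the function name is mine; note the score does not even depend on Y):

```python
# Sketch of the easily gamed score:
#   S = 1 - I(p in {0,1}) - I(p in (0,1)) * p
def gamed_score(p: float) -> float:
    if p in (0.0, 1.0):
        return 1.0 - 1.0   # extreme reports always score a perfect 0
    return 1.0 - p         # honest interior reports are penalized

print(gamed_score(0.7))  # what you get for reporting your model's p
print(gamed_score(1.0))  # what you get by abandoning p and going extreme
```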
In other words, you are encouraged to go extreme and not report the p implied by your M. The cure for that encouragement comes in requiring that the score you “expect” be the best score you can get when you report your p = Pr(Y|M). That “expect” carries the statistical meaning of the term.
Pick any score with a probability you think you can game, i.e. some q and not your p = Pr(Y|M). Take the Brier: (Y-q)^2. That will equal either (1-q)^2 or q^2, depending on whether Y = 1 or Y = 0. According to your own model, you deduce the probability the score will be (1-q)^2 is p, and thus the probability the score will be q^2 is (1-p), because the probability Y = 1 is p (given M).
That makes your “expected” score:
$$p(1-q)^2 + (1-p)q^2.$$
The extremum of that “expected” score is found by taking the derivative with respect to q and setting it to zero (as you recall from calculus). If you do that (which I’ll spare you), you get
$$q = p.$$
In other words, the best “expected” score, conditional on M (and W), is the one in which q = p = Pr(Y|M). This works for any score of any dimension.
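You can check this numerically rather than with calculus: scan a grid of candidate reports q and see which minimizes the “expected” Brier score. The choice p = 0.7 is just for illustration.

```python
# Numerical check that the "expected" Brier score
#   p*(1-q)^2 + (1-p)*q^2
# is minimized at q = p, i.e. at honest reporting.
def expected_brier(p: float, q: float) -> float:
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

p = 0.7
qs = [i / 1000 for i in range(1001)]          # grid of candidate reports
best_q = min(qs, key=lambda q: expected_brier(p, q))
print(best_q)  # the minimizing report equals the honest p
```

Change p to anything in (0,1) and the minimizing q follows it, which is what makes the Brier score proper.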
It is not the best score, tout court, for that is still S = 0. In other words, there is still an encouragement to have p be extreme, i.e. either 0 or 1. But with a proper score you would hesitate to report the extreme, because your own model only gave you imperfect confidence, i.e., p somewhere in (0,1), but not the end points.
Most, but certainly not all, scores in common use are proper in this sense. When we come across ones that aren’t, I’ll let you know. But you can easily do the calculation yourself, too, now you know how.
I’m happy to report there is a resurgence of interest in scoring and verification, and, believe it or not, mostly because of AI. It all began in weather forecasting back in the 1950s (my advisor’s advisor, Allan Murphy, was one of the founders), and we can all be mighty glad computer scientists have picked it up, because that means they understand prediction is the true test of models.
This is a lesson classical statistics (in its frequentist or Bayesian form) has failed to learn. For them, it’s just testing-testing-testing and bizarre parameter-based statements.
Here are the various ways to support this work:
- Subscribe at Substack (paid or free)
- Cash App: $WilliamMBriggs
- Zelle: use email: matt@wmbriggs.com
- Buy me a coffee
- Paypal
- Other credit card subscription or single donations
- Hire me
- Subscribe at YouTube
- PASS POSTS ON TO OTHERS
