Reminder: The Thursday Class is only for those interested in studying uncertainty. I don’t expect all want to read these posts. So please don’t feel like you must. Yet, I have nowhere else to put them besides here. Your support makes this Class possible for those who need it. Thank you. Math alert!
Skill is the improvement of one model over another, according to your Judgement function. Today some examples. You must demand skill in all non-deduced models.
Video
Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty
HOMEWORK: Try the code!
Lecture
Please review the last class on defining skill. I won’t repeat that material here. I’m expecting you remember it.
Again, the most common of models (in all fields) are regressions, and of them most are normal models. So our running example will be an ordinary linear regression, recalling there is nothing special about these ad hoc creatures, and that the normal is always—as in always—an approximation to Reality. Skill applies to all models, even physics models.
We’re going to use logistic regression models first, however, which are a tad more complex than ordinary regression. We do this because we haven’t yet learned how to compute goodness scores except for binary/dichotomous outcome models.
Recall we want
$$\Pr(Y \in s | x_1 = a_1, x_2 = a_2, \cdots, x_p = a_p, M2),$$
(quoting myself) “for some set, or sets, of interest $s$” given measures x. Here, because we have a dichotomous outcome, there is only one set, which is “s=1” (or Yes, or On, or White, or whatever).
Our immediate comparison model (out of an infinity of models) is the one with no x:
$$\Pr(Y = 1 | M1).$$
We’re going to use a standard dataset, as is customary, so you can compare what’s done here with what others do. But I want you to try everything here on your own datasets. That is the homework.
We’re using the low birthweight dataset, predicting whether birth weight is less than 2.5 kg. They give us such things as the mother’s age, her weight and race, whether she smokes and the like. There are only 189 observations, so there is no point creating a validation dataset. We’ll REUSE the original data as if it were new. This leads to great exaggeration in the models’ abilities, as you now know. But we don’t care here, because we’re not going to try to sell this model to anybody. It’s only for practice.
The code is at the bottom. Follow along with your own data (notice I said “your own data”).
The Brier Score for the “full” model M2, i.e. the one using all the x provided by whoever came up with this data, is 0.18, and for the naive or simple model M1 (with just an intercept) it is 0.22. So M2 has skill over M1. Ta da.
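One way to make the comparison a single number is a skill score. This is only a sketch, with the two scores hard-coded from above (in the full code at bottom they are computed as `m2b` and `m1b`); the ratio convention here is one common choice, not necessarily the Class’s official one:

```r
# Brier scores reported above: full model M2 and intercept-only M1
bs_m2 <- 0.18
bs_m1 <- 0.22

# A common skill-score convention: 1 - BS(M2)/BS(M1).
# Positive means M2 improves on M1; zero or below means no skill.
skill <- 1 - bs_m2 / bs_m1
round(skill, 2)  # about 0.18
```

Values near 0 mean M2 barely beats M1; values near 1 mean near-perfect improvement.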
It’s really as simple as that. M2 is preferred over M1, conditional on our W (in our Judgement function), which says the Brier score is best. Is it best for you? We could do calibration and conformal prediction and all that, too, but it’s easy enough I can trust you to do it as homework.
Measure (Variable) Importance
As we discussed last time, we’re going to take out, in succession, each x_j, while leaving all the others in, and see how much better or worse the Score is. This allows us to rate the importance of each measure, compared to the full M2 model.
Here’s the result:

The dashed lines at top and bottom are M1’s and M2’s Brier scores. This plot reads upside down, as it were. The measure “ht” is an indicator of hypertension: it is the best, or most helpful, measure. Its BS of almost 0.19 indicates that taking this measure out increases the BS the most, which is the worst thing to do, because lower BSs are better. In this way, assuming all the other measures are in the model, ht is the most important measure, the one that gives the best predictive skill.
The worst measure is age: the BS only creeps up a small amount with it out of the model. And so on for the other measures.
Obviously, you can add and subtract whatever combination of measures you like. And you can make math of all this, too, giving relative importance in the obvious way. We’re interested only in relative changes, because we are contrasting all these “one-x-removed” models with M2, the full model. Is M2 the model you would like to use in real life? Yes? No? If not, then make your M2 whatever is interesting to you.
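Making “math of all this” could look like the following sketch. The numbers are illustrative, not the actual run; in the full code below the one-x-removed scores live in the vector `s` and the full-model score in `m2b`:

```r
m2b <- 0.18                                  # full-model Brier score (from above)
# illustrative one-x-removed scores (NOT the actual results)
s <- c(ht = 0.190, smoke = 0.186, age = 0.181)

# fractional increase in BS when each measure is removed:
# the bigger the increase, the more important the measure
rel <- (s - m2b) / m2b
sort(rel, decreasing = TRUE)
```

With the real `s` and `m2b` from the code below, this ranks the measures exactly as the plot does, only numerically.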
Really Naive
In any case, it’s perfectly clear that even the worst of these models with the measures (all but one each time) is far better than the simple naive model.
The simplest model may be M0. Only 31% of the babies in the dataset had low birth weight. M0 thus says, for all babies, “This baby does not have low birth weight.” M0 will be right 69% of the time. Not bad for no math except taking an average, right? The Brier score for M0 is 0.31.
Consider M0′, which is the logical contrary of M0. It says “This baby has low birth weight” for all babies. The BS of M0′ is 0.69. M0 has skill over M0′. Yet M0 does not have skill with respect to M1, the smoothed simple regression. The reason is that the regression more or less gives the probability 0.31 for each baby’s birth weight, which is of course the rate in the data.
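The arithmetic is easy to check on toy data with the same 31% rate (made-up outcomes here, not the real dataset):

```r
# toy outcomes with a 31% "low birth weight" rate
y <- c(rep(1, 31), rep(0, 69))

mean((y - 0)^2)     # M0:  p = 0 for everyone -> BS = 0.31
mean((y - 1)^2)     # M0': p = 1 for everyone -> BS = 0.69
mean((y - 0.31)^2)  # constant p = 0.31, like M1 -> BS about 0.21
```

The constant-probability model beats both all-or-nothing models on the Brier score, which is why M0 has no skill against M1.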
M0 is more akin to a decision than a probability (though it is both). Same with M0′. Which is a reminder the models M1, M2 and the like are not decisions. They only leave you with probabilities. Not decisions. Particularly not the decisions you made conditional on the model. If the model announced the probability “0.6” that a baby would be low birth weight, what would you do? What if the probability was, say, 0.4?
The decision layer is different. And thus ought to be judged differently. This brings us full circle to our slogan: the same model may be good for one man, and lousy for another. You must judge all models conditional on how useful they are to you. Not to anybody else (unless they share your W).
The BS for the decision “if the model probability > 0.5, act as if low birth weight” equals 0.25. This has skill as a decision rule over M0.
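Scoring a decision rule uses the same Brier function, only with 0/1 decisions in place of probabilities. The probabilities and outcomes below are illustrative, not the real model output:

```r
brier <- function(p, y) mean((y - p)^2)

p2 <- c(0.7, 0.4, 0.6, 0.2)   # illustrative model probabilities
y  <- c(1,   0,   0,   0)     # illustrative outcomes
d  <- as.numeric(p2 > 0.5)    # decide "low birth weight" when p > 0.5
brier(d, y)                   # score the decisions, not the probabilities
```

On these toy numbers the decision BS happens to be 0.25; with the real `p2` and `x$low` from the code below you get the 0.25 quoted above.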
Observation Importance
With a fixed set of x, we can do the same service we did for measures for observations. It’s a lot more computationally intensive, because we have to recalculate a model for each observation left out. Here’s the result.

Of course, this one is much harder to read, and perhaps ordinarily one would not make the plot exactly like this, relying on tables instead. But this display is the simplest way to demonstrate what is happening. I took off (did not include) the simple/naive model’s BS. Only the full-model BS is given, as the dashed line. Note carefully that it is in the middle.
As before, this plot is upside down. Observation 127 is the best, in the sense that taking it out and recomputing the model gives the worst BS. For reference, here is observation 127 (the “223” is some internal number not of interest to us):
You’ll have noticed (I hope) that many observations make the model worse. Taking out observation 94, for instance, really improves the BS. But that “really” is relative. In any case, observation 94 is the worst. Here it is (ignore the “188”):
Whether you tell yourself causal stories about the difference is up to you: it seems to be mostly about the mother’s very high weight (lwt = 170 pounds) versus very low weight (lwt = 95 pounds). The “ptl” is the number of previous premature labours.
Let’s look at the averages of those observations which improved the model (those above the dashed line), and which hurt the model (those below it), in that order (I left off race, being lazy):
There are some differences in the “x”, but not major ones. The real tell is that the observations which do worst have a much greater proportion of babies of low birth weight, 69% versus 17%. The model M2 is just too smooth. Here is the calibration plot (use the code from two classes ago) for the observations which improve the model (with a step of 0.05):
This isn’t the worst thing in the world, but it’s not good. Notice the model gives probabilities all the way up to 0.4 for babies, none of whom had low birth weight. But when the model gives a probability greater than or equal to 0.6, all had low birth weight. The model with just these observations is close to “50-50”.
And here’s the calibration for the “bad” observations, i.e. those which removing them improves the BS:
A switch has been flipped. The model is still sort of “50-50”, but it makes the wrong (almost opposite) decisions (if we use the 50-50 decision rule).
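For those without the earlier calibration code handy, here is a minimal sketch of my own (an assumed version, not necessarily the Class’s exact code) that bins probabilities in steps of 0.05 and compares each bin’s mean prediction with the observed rate:

```r
# Bin predicted probabilities p and compare each bin's mean prediction
# with the observed rate of the outcome y in that bin.
calibrate <- function(p, y, step = 0.05) {
  bins <- cut(p, breaks = seq(0, 1, by = step), include.lowest = TRUE)
  out <- data.frame(
    predicted = as.vector(tapply(p, bins, mean)),
    observed  = as.vector(tapply(y, bins, mean)),
    n         = as.vector(table(bins))
  )
  out[!is.na(out$predicted), ]   # drop empty bins
}
```

Run it as `calibrate(p2, x$low)` on the model’s probabilities from the code below; a well-calibrated model has predicted roughly equal to observed in every bin.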
Once again, this is because the logistic regression model, popular everywhere and used in all sciences to make causal claims, stinks. It over-smooths most things. Researchers never recognize this. Why? Because they are playing with the Ps and looking at parameter estimates. Completely useless activities.
Next time
Next time we’ll expand our scores, giving some for models which have more than a dichotomy.
Here are the various ways to support this work:
- Subscribe at Substack (paid or free)
- Cash App: $WilliamMBriggs
- Zelle: use email: matt@wmbriggs.com
- Buy me a coffee
- Paypal
- Other credit card subscription or single donations
- Hire me
- Subscribe at YouTube
- PASS POSTS ON TO OTHERS
# THIS CODE IS IN NO WAY OPTIMIZED. IT'S ONLY FOR ILLUSTRATION. DO NOT USE IN REAL LIFE
library(rstanarm)  # Bayesian regression via stan_glm
library(ggplot2)   # plots
library(MASS)      # birthwt dataset
x = MASS::birthwt
x$race = as.factor(x$race)
x = x[,-dim(x)[2]] # we don't want actual birth weight
brier <- function(p, y) {
  # mean squared difference between predicted probability and outcome
  mean((y - p)^2)
}
# all x model
fit2 <- stan_glm(
  low ~ .,
  data = x,
  family = binomial(link = "logit"),
  prior = normal(0, 2.5),          # weakly informative priors
  prior_intercept = normal(0, 5),
  chains = 4,                      # number of Markov chains
  iter = 2000,                     # total iterations per chain
  cores = 2                        # parallel processing
)
# only an intercept model
fit1 <- stan_glm(
  low ~ 1,
  data = x,
  family = binomial(link = "logit"),
  prior = normal(0, 2.5),          # weakly informative priors
  prior_intercept = normal(0, 5),
  chains = 4,                      # number of Markov chains
  iter = 2000,                     # total iterations per chain
  cores = 2                        # parallel processing
)
#summary(fit)
# "predictions" (of past data)
p2 = colMeans(posterior_predict(fit2))
p1 = colMeans(posterior_predict(fit1))
# scores
m2b = brier(p2,x$low)
m1b = brier(p1,x$low)
# Measure (Variable) Importance
nm = names(x)[-1]
s = numeric(length(nm))  # one-x-removed Brier scores
k = 0
for (i in 2:dim(x)[2]) {
  k = k + 1
  y = x[, -i]  # drop measure i, keep all the others
  fit <- stan_glm(low ~ ., data = y, family = binomial(link = "logit"),
                  prior = normal(0, 2.5), prior_intercept = normal(0, 5),
                  chains = 4, iter = 2000, cores = 2)
  p = colMeans(posterior_predict(fit))
  s[k] = brier(p, y$low)
}
df <- data.frame(score = s, name = nm)
g = ggplot(df, aes(x = 0, y = score)) +
  geom_segment(aes(xend = 0), color = "gray70", linewidth = 1) +  # vertical stem
  geom_point(size = 5, color = "steelblue") +
  geom_text(aes(label = name), hjust = -0.4, size = 5) +
  geom_hline(yintercept = m2b, linetype = "dashed") +
  geom_hline(yintercept = m1b, linetype = "dashed") +
  scale_x_continuous(limits = c(-0.5, 1)) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  labs(y = "Brier Score")
g
png('bw1.png',width=680,height=1200)
g
dev.off()
# Observation Importance
nm = 1:dim(x)[1]
s = numeric(dim(x)[1])  # one-observation-removed Brier scores
k = 0
for (i in 1:dim(x)[1]) {
  k = k + 1
  y = x[-i, ]  # drop observation i, keep all the others
  fit <- stan_glm(low ~ ., data = y, family = binomial(link = "logit"),
                  prior = normal(0, 2.5), prior_intercept = normal(0, 5),
                  chains = 4, iter = 2000, cores = 2)
  p = colMeans(posterior_predict(fit))
  s[k] = brier(p, y$low)
}
df <- data.frame(score = s, name = nm)
g = ggplot(df, aes(x = 0, y = score)) +
  geom_segment(aes(xend = 0), color = "gray70", linewidth = 1) +  # vertical stem
  geom_point(size = 5, color = "steelblue") +
  geom_text(aes(label = name), hjust = -0.4, size = 5) +
  geom_hline(yintercept = m2b, linetype = "dashed") +
  #geom_hline(yintercept = m1b, linetype = "dashed") +
  scale_x_continuous(limits = c(-0.5, 1)) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  labs(y = "Brier Score")
g
png('bw2.png',width=680,height=1200)
g
dev.off()
# let's look
i = which(s < m2b)  # removals that improved the BS: the "bad" observations
signif(colMeans(x[-i, -4]), 2)  # averages of observations that help the model
signif(colMeans(x[i, -4]), 2)   # averages of observations that hurt it