Class 87: Calibration & Conformal Prediction: Examples

Reminder: The Thursday Class is only for those interested in studying uncertainty. I don’t expect all want to read these posts. So please don’t feel like you must. Yet, I have nowhere else to put them besides here. Your support makes this Class possible for those who need it. Thank you. Math alert!

Calibration is an intuitive requirement and leads to a new kind of modeling called conformal prediction. Today, we give numerical examples and code.

Video

Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty

HOMEWORK: Try the code!

Lecture

I’m going to assume (yes, I know the joke) you have read and assimilated the Theory lecture from last time. I will not here do any review.

I’m going with a bad-good example. Bad, because the fine points will be difficult to see. Good, because it is so very typical. Most models are poor, according to most quantitative scores.

Download (if you want to follow along) the Bank Marketing data, which is known in “machine learning” and “AI” circles. Some 45 thousand observations. The outcome is “has the client subscribed a term deposit?” To help predict that, they have such things as client age, marital status, amount on deposit and the like. I’m going to use only a few of these “predictors.”

Our goal today is not to find the best model of subscription. All code is at the bottom.

Our goal is to take a model which has been given us (so to speak) and see how well it performed. That’s it. Another day, I’ll explore (again) the idea of the Great White Hope Of Models. The false hope that with the right magic formula—given by AI, of course—we can predict with perfection. This will not happen.

Passing by that, I’m going to look at two models:

  • Model 1: a simple logistic regression as a function of age + job + marital + education + balance + loan.
  • Model 2: a random forest with the same predictors.

To save model-fitting time, I dichotomized the bank balance of clients to whether they have a positive or negative balance (about 20% have negative balances!). The “loan” variable means other loans the client might have had. A sociologist would use these very terms, whereas somebody else might not. So the models we look at do have applicability to modern “research”.

I grabbed 10,000 observations and did not use them to fit these models in any way. Indeed, I picked these predictors because they were first in the list, and balance because I thought it would be interesting. But it was so big that my poor computer choked on running the model fitting, which is why I dichotomized it. The remaining 35+ thousand were used to fit the model.

In real life, all 45+ thousand would be used to fit the model. That model would then be used to make predictions of data never before seen or used in any way.

Of course, everybody cheats in real life, and uses the old data many times to find better fitting models. They will always be rewarded for their efforts. In the sense that they will indeed find better fitting models. But will these tight models predict better? Not too often.

Here we avoid all that by not caring about finding "the best" model. The best model would be one where we knew the full cause of every data point, which is impossible. No, not even with AI with trillions upon trillions upon etc. parameters.

We’re going to look at a proper score from predicting the non-used “hold out” or verification data. We’ll also check calibration, and build a crude conformal prediction model using the calibration results. We are not trying to build the best conformal prediction model here, either. We only want to see the ideas in action.

I wrote my own calibration function. It is entirely discrete. This is a strength because all measures, ever, are finite and discrete. But it is a weakness in that naive use of it doesn’t help much. There are any number of calibration functions out there (like Frank Harrell’s), and these are fine, but they end up smoothing. We want no smoothing except for simple averaging. Our goal is to understand the basics, not (again) to make the best model.

The same holds for the conformal prediction, which here only calibrates model output, making a new model (as you recall). The one here is meh. It works, but I can’t see anybody using it in real life. Nevertheless, it is trivial to understand. Better conformal models abound, but these grow complex fast and invoke smoothing.

Nothing wrong with smoothing. Remember our discussion of “normal” models? Right: they can be used knowing they are always, and in every instance, approximations to Reality. As long as we hold firmly in our minds that we have approximations, all goes well. Alas, with most smoothing people forget. The Deadly Sin of Reification sneaks up behind them and, well, the results aren’t pretty. Next thing you know victims are computing p-values.

In the training data, there are only 4,074 Y = 1, out of 35,211. Which is 12%. So this is a tough dataset to make good predictions on. Think. If we said p = 0 for ALL predictions, we’d be right in the training data 88% of the time. We have to beat that.

In the verification data, there are 1,215 Y = 1 out of 10,000, also 12%. We’re really, in effect, training to nail those Y = 1s.
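That base-rate arithmetic, checked in one line of R (numbers copied from above):

```r
# Accuracy of the lazy rule "always predict p = 0" on the training data
round(1 - 4074/35211, 2)  # 0.88
```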

Calibration

We break the predictions into buckets of size byp. I use 0.05 as a default. That gives 20 buckets: (0, 0.05], (0.05, 0.1], etc. Pick what you like, but the smaller the bucket, the more data you must have. Because next is to count the Y_Reality = 1 in each probability-prediction bucket. The smaller the byp, the more buckets, but the smaller the sample in each bucket. The proportion of Y = 1 (dropping the Reality subscript, but remember it is there) in each prediction bucket is plotted by the max of each bucket.

Calibration is achieved when the proportion of Y and the probability prediction in the bucket are equal.

Simple as that. Nothing more than counting and averaging. Instead of taking “0.05” as the probability for the bucket “(0, 0.05]” for plotting (or calibrating), one could pick the mean, or “0.025”. But that requires more effort, and I am lazy. I leave it as homework.

Like I said, more sophisticated models will smooth this, meaning they will create a NEW MODEL of p and Y. That adds uncertainty, which most instantly forget. So we’re not doing that. Anyway, for large data sets, this calibration is good. And perfect if in fact the cut points match the decisions you would make.

If you would make different decisions if the probability prediction is 0.025 rather than 0.075, then buckets of 0.05 are just the thing. If you don’t care between 0.05 and 0.15, then use buckets of size byp = 0.1. And so on. This calibration function is exact in this way.
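A minimal sketch of this counting-and-averaging in R. The data here are toys I generated so that calibration holds by construction; only the idea of byp carries over to the full code at the bottom.

```r
# Toy calibration check: proportion of Y = 1 inside each probability bucket.
set.seed(1)
p <- runif(2000)            # pretend probability predictions
y <- rbinom(2000, 1, p)     # pretend Y_Reality, generated to be calibrated
byp <- 0.05
s <- seq(0, 1, by = byp)    # bucket boundaries: (0,0.05], (0.05,0.1], ...
obs <- tapply(y, cut(p, s), mean)  # observed frequency of Y = 1 per bucket
# With calibration, obs tracks the bucket probabilities (plotted against s[-1])
round(cbind(bucket.max = s[-1], obs), 2)
```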

Conformal prediction

If in a bucket, the proportion of Y equals the max of the bucket, then the points are calibrated at the probability. And so do not need to be adjusted. Or, the adjustment is 0.

If the proportion of Y in the bucket is larger by some q, then we need to add q to all the probability predictions in that bucket to bring it to near calibration. And similarly if the proportion of Y is lower than the probability: we subtract q.

This is only near calibration and not exact because once you add or subtract a q to probabilities, they can change buckets if they were near the edge of the bucket. And because we are applying the correction q to every prediction in the bucket (some of which may have had Y = 1 and others Y = 0).

But this is close enough for us to see how it works.
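Here is that adjustment in miniature, with toy numbers of my own; the full conformal function below does the same thing via factor levels.

```r
# Toy conformal-style correction: shift each bucket's predictions by its miscalibration q.
byp <- 0.1
s <- seq(0, 1, by = byp)
p <- c(0.12, 0.14, 0.18, 0.55, 0.58)  # predictions, falling into two buckets
y <- c(1, 0, 1, 0, 1)                 # Y_Reality
b <- cut(p, s)
z <- tapply(y, b, mean)               # proportion of Y = 1 per bucket
q <- z[as.integer(b)] - s[-1][as.integer(b)]  # observed minus bucket probability
p.cal <- pmin(pmax(p + q, 1e-7), 1 - 1e-7)    # add q; keep inside (0, 1)
round(unname(p.cal), 3)
```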

Brier score

We’ll use this because it’s the most common. We’ll see another day why things like ROC curves are WRONG. They are the p-values of machine learning. You simply cannot talk people out of them.

Here we need to be careful. If we’re calibrating, we’re saying “These buckets correspond to the decisions I would make” and therefore that is how we must score. If there is no difference to you (ahead of time, when you are in ignorance of Y_Reality) in predictions of p = 0.83 and p = 0.84, then both predictions ought to have the same score applied to them (we suppose they are in the same bucket). In the Brier, the p = 0.84 is closer and would receive a better score. But that’s just math, and not how YOU USED THE MODEL.

Instead, if these values are equivalent to, for instance, p = 0.85 in the decisions you would have made, then both ought to be scored as if they were p = 0.85. I’ve been stressing, nauseatingly often, that the best score is in the decisions YOU make, and not anybody else, with the models. That goes for the calibration and the conformal prediction, because choices must be made in each.

There is no one-size-fits-all score, calibration, or conformal prediction.
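To see the difference scoring by bucket makes, a toy comparison (the two-argument brier matches the function in the code at the bottom):

```r
# Two predictions in the same decision bucket get the same score.
brier <- function(p, y) mean((y - p)^2)
byp <- 0.05
s <- seq(0, 1, by = byp)
p <- c(0.83, 0.84)  # both in bucket (0.80, 0.85], so equivalent decisions
y <- c(1, 0)
p.bucket <- s[-1][as.integer(cut(p, s))]  # both map to the bucket max, 0.85
c(raw = brier(p, y), bucketed = brier(p.bucket, y))
```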

Model 1 Logistic Regression

Logistic regressions smooth. Hard. They do not show cause. It’s hard to find a sociologist or economist or the like who, whatever they say, understands this. We discussed the vast limitations of regressions before, so I say no more. We’re looking at the predictive posteriors here using the default values in rstanarm. All seems well behaved by the standard measures people use for MCMC methods (not knowing how to do analytic integration).

Here is the calibration plot for the model, with buckets of 0.05:

Pretty good, for the wee range of probabilities the model spat out. The max is only 0.42. Meaning the smoothing is way too aggressive, as it usually is for regressions. Meaning even when Y = 1, the probability predictions were never above 0.5. Sad.

And here it is for the conformal prediction:

A slight improvement. But we didn’t have much room here to make improvements. We could make smaller buckets here, but what’s the point? The model is not going to change in any real way.

We could loosen the buckets, make them 0.1 (which I won’t show, but which you can do yourselves), and the conformal shows a nice and tight calibration, but we only get buckets of 0.1, 0.2, and 0.3. Maybe that’s useful for some decision maker.

The bucket Brier score for the original model was 0.104, and for the conformal model is 0.103. A minuscule difference, but still we get the promised improvement.

Model 2 Random Forest

I haven’t introduced these before. Roughly, they take a giant multi-dimensional cross-table matrix of predictors (all females age < 10, all females age > 10, all males < 10, all males > 10, etc.) and cross that with the outcome. Cells in this predictor-prediction table are then used to vote on which outcome is more likely. For instance, if “all males > 10” had 80% 1s, then the prediction is 80% for “all males > 10”, and the same idea for the other three buckets.

Of course, the whole thing is more complex than this, because making one huge crosstab would be too expensive and monstrous in size, with many (mostly) empty cells, so some kind of mixing-up search is iterated on subsets. Random forest models also do not show cause.
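The cell-voting idea can be mimicked with a plain crosstab. Toy data of my own invention; a real random forest grows many randomized trees over subsets instead.

```r
# Toy version of cell voting: cross the predictors, take proportion of 1s per cell.
set.seed(2)
sex   <- sample(c("M", "F"), 200, replace = TRUE)
age10 <- sample(c("<10", ">10"), 200, replace = TRUE)
y <- rbinom(200, 1, ifelse(sex == "M" & age10 == ">10", 0.8, 0.2))
cell  <- interaction(sex, age10)     # the four predictor cells
votes <- tapply(y, cell, mean)       # proportion of Y = 1 in each cell
# 'votes' plays the role of the per-cell predicted probability
round(votes, 2)
```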

This is a form of smoothing, too, but weak. Regression is a steamrolling smoother. Regressions are the best kind of model in the world if Reality itself is smooth in the same way. Worst kind of models when Reality isn’t smooth. Which it usually isn’t. Which is why machine learning guys are kicking statistician ass.

RF is better than LR, but not by a lot. It gives some predictions with probabilities greater than 0.5, which is good.

Here is the calibration plot for the model, with buckets of 0.05:

Not too bad, as far as these things go. Fairly well calibrated at the beginning, so we won’t gain much with the conformal prediction.

And here it is for the conformal prediction:

You can see it’s a tighter calibration. Success.

The bucket Brier score for the original model was 0.1014, and for the conformal model is 0.1007. Better, but so small that nobody would care.

If you decrease the bucket size to 0.02, you get a better idea of what’s going on with the calibration. Original:

That’s a lot more miscalibration, but the buckets are a lot smaller.

And conformal:

Better, but this has a lot of empty buckets.

Briers of 0.1009 and 0.1004.

But then, we’re really polishing the dried spoor of some rough beast. The model stinks, really. Very few predictions over 0.5. Only 27! For a sample of 10,000.

Of course, we ignored a lot of the data. We could tighten this. But we’re not trying to here. No power will get it much better, because all the causes aren’t in the data.

These are only correlational models.

# Example R code to teach calibration and conformal prediction. Written for clarity, not optimization.
# Download any packages you don't have using install.packages('packagename')
library(rstanarm)
library(rms)
library(randomForest)

# Download: https://archive.ics.uci.edu/dataset/222/bank+marketing
x=read.csv('bank-full.csv',sep=';', na.strings ='', stringsAsFactors=TRUE)
  x$balance = (x$balance>0)*1 # dichotomize balance
  x$y = as.numeric(x$y)-1  # make y/n into 1/0
  summary(x)

# The seed ensures you get the same results each time you run
set.seed(732)
i = sample(1:dim(x)[1],1e4) # split the sample
y = x[i,] # verification set
x = x[-i,] # training set

# Model 1: Logistic Regression. This converges fine.
fit1 <- stan_glm(
  y ~ age + job + marital + education + balance + loan,
  data = x,
  family = binomial(link = "logit"),
  prior = normal(0, 2.5),      # Weakly informative priors
  prior_intercept = normal(0, 5),
  chains = 4,                  # Number of Markov chains
  iter = 2000,                 # Total iterations per chain
  cores = 2                    # Parallel processing
)
#summary(fit1)
# Predictive posterior probability predictions (how's that for a mouthful) on y
p1 = colMeans(posterior_predict(fit1,newdata = y))

# Model 2: Random Forest
fit = randomForest(y ~ age + job + marital + education + balance + loan, data = x, ntree = 500, mtry = 2, importance = TRUE)
print(fit)
# Probability predictions on y
p = predict(fit, newdata = y)

# Brier score. Not much to it
brier <- function(p,y){
   # no checking/handling for missing values!
    m = length(p)
    answer<-sum((y-p)^2)/m
    return(answer)
}

# Calibration plots, scoring, conformal model building, and redoing calibration and scoring for conformal model
conformal.p <- function(p,y,byp  = 0.05){
   # No checks of any kind are made. No provision for NAs are made. If you use this in real life, you are crazy. This code is in no way optimized.
   # p are the probability predictions
   # y is Y_Reality, 0s and 1s.
   # byp is the probability bucket size

   # calibration plot
   s = seq(0,1,by=byp) # bucket boundaries
   p.cut = cut(p,s) # divides probabilities into buckets
   a = table(p.cut,y) # calibration table
   z = a[,2]/rowSums(a) # calibration counts
   plot(s[-1], z, xlim=c(0,1),ylim=c(0,1),xlab='Model Probability',ylab='Observed Frequency', main= 'Original Predictions Calibration Plot')
     abline(0,1)
     grid()

   # Conformal prediction corrections
   e = s[-1]-z
   j = which(is.na(e))
   m = length(j)
   if(m>0){
     e[j] = rnorm(m)*1e-6 # noise added because otherwise we get duplicated 0s, which collapse into one factor level; this ensures we keep the same number of factor levels as in p.cut
    }
   # add the corrections
   p.cal = p.cut # p.cal will be conformal predictions
   levels(p.cal) = e  # reuse the cuts which are now corrections
   p.cal = as.character(p.cal) # R quirk: as.numeric on a factor gives level indices, so go through character first
   p.cal = p - as.numeric(p.cal) # p.cal are now calibrated (roughly)

   i = which(p.cal <= 0) # in case
   if(length(i)>0 ) p.cal[i] = 1e-7
   i = which(p.cal >= 1) # in case
   if(length(i)>0 ) p.cal[i] = 1 - 1e-7


   # we want the score to match our decision buckets
   levels(p.cut) = s[-1]
   p.cut = as.character(p.cut)
   p.cut = as.numeric(p.cut)
   cat('Brier raw ',brier(p.cut,y),'\n')

   p.cal = cut(p.cal,s)
   a = table(p.cal,y)
   z = a[,2]/rowSums(a)

   dev.new() # open a second plot window; portable across OSes (quartz() exists only on macOS)
   # plots the conformal calibrations
   plot(s[-1], z, xlim=c(0,1),ylim=c(0,1),xlab='Model Probability',ylab='Observed Frequency' , main= 'Conformal Predictions Calibration Plot')
     abline(0,1)

   levels(p.cal) = s[-1]

   p.cal = as.character(p.cal)
   p.cal = as.numeric(p.cal)
   cat('Brier conformal ',brier(p.cal,y),'\n')

}

# Model 1: Logistic regression
conformal.p(p1, y$y, byp=.05)

# Model 2: Random Forest, with two different probability buckets
conformal.p(p, y$y, byp=.05)
conformal.p(p, y$y, byp=.02)

# This is Frank Harrell's version of the calibration plots. It smooths and has information we have not yet covered.
#library(rms)
# y = actual binary outcome (0/1), p = predicted probabilities
#val.prob(p, y$y)


