Class 67: The Philosophy Of Models (Regression): The Right Way


If you cite, enjoy, or create “research” or “studies”, this post is a must. I’ve eschewed all math (given next time in The Wrong Way) and focused entirely on the idea. I hope you will both read and pass it on. If you are enjoying these classes and get value out of them, consider supporting them using the link at bottom.

Video

https://youtu.be/3KvJ43pvwis

Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty

HOMEWORK: Given below; see end of lecture.

Lecture

We have, at long last, arrived at statistical models. We started from nothing–in a quite literal sense–and built that into our entire system of epistemology, of certainty and uncertainty, which is to say, logic and its completion probability.

I don’t want to continuously re-cover those topics, yet some of that will be inevitable. Because great swaths of it will be forgotten in the mad giddy happiness of modeling. My prophecies that you will have forgotten all warnings about models will be realized. Sadly.

Our goal, which you must never forget, but will, is to discover the full cause–formal, material, efficient, and final causes–of those things modeled, and the conditions under which those causes come about. Falling short of cause, which we almost always will, we come to correlation, which is not cause, but which can be used to make predictions.

Almost all models will be correlational, even if they have causal components. The biggest mistake made is to suppose My Correlation Is Causation, just because it is your correlation. The second, and a whopper, as we have discussed ad nauseam, is to say your wee P confirmed your correlation is a cause, a double-plus-bad fallacy.

All of you will make it.

We start with the most ubiquitous model: regression. It is simple. It is everywhere, and, with incredibly vanishingly rare exceptions, used wrongly. In error. Mistakenly. Badly. Horribly.

Yet today we will do it right. Next time we do it wrong. Which is to say, the way most do it.

Our example will be the First Year GPA for College Students as collected by this site. It’s only a couple of hundred observations of what the title says, including whether the person is White or not. Our first duty is to examine this data. Here is the histogram of GPAs by White or not.

And now the hardest question of all, impossible for anybody with formal statistical training to answer: Are Whites different than non-Whites?

Yes. Otherwise we never would have been able to track who in the data was White and not, except by fancy or fantasy. What about GPAs between the races? We can line up the numbers and stare, or we can look at pictures like this, the histogram of GPAs by White or not.

Again I ask our impossible question: Are Whites different than non-Whites with respect to GPA?

Yes. Just look. They are not the same. Therefore they are different. That is our “hypothesis test”, which required from us no math. I award you a Degree in Statistics.

If you got the question right and had no qualms.

Now the easy question: What caused the differences we see?

I don’t know, and neither do you. Not based on this data, anyway. All it shows us are two sets–two mighty small sets–of GPAs and some race markers. That’s it.

How accurate are the markers? I have no idea, and neither do you. Given how manic and perverted universities are about race, it could be anything. How accurate are the GPAs measured? I don’t know and neither do you.

How applicable are these observations to informing our uncertainty about new observations; or, rather, observations as yet unknown to us? Or known but not used? I don’t know and neither do you.

But we can assume. Yes, all models ass.u.me. Or guess. Or hope. All models are conditional, as all probability is conditional, as all logical arguments are conditional, on what we assume. So let’s assume.

If all we wanted to know about was these people, in this set, then we are done. We have gone as far as we can. We have no hope of answering any causal questions. We can guess, of course, using outside knowledge. We can, any of us, condition on any evidence we please, and come to any judgement of cause that we please. Have a ball. Your task is then to convince others the outside knowledge you dragged in is justified. Lots of luck.

Here are two causal explanations, both equally supported by what we see: (1) Whites are often more intelligent than non-Whites; (2) Whites used “racism” against non-Whites to change their GPAs. If you wanted to believe either, here is data which says you’re right.

Suppose instead we want to say something about new observations, folks who will look in some way like the old people we measured. That is, new folks who experience the same set of causes and conditions as the old, whatever those might be, even though we don’t precisely know what they are. So we can’t say in what proportion these causes and conditions will eventuate; all we can do is assume they’ll be the same.

That assumption made, we can turn our correlational model into a predictive model. And we shall use regression for this model.

We start by quantifying our uncertainty in GPA. The new GPAs, not the old. We know the old. Now in real life, any person’s GPA can only take one of a finite and discrete set of values. Add up the grade points and divide by the number of classes: that average will always be finite and drawn from a discrete set. Even if that set seems large to you, it is still finite and discrete. It cannot be that GPA exists on the continuum of an uncountable number of possibilities. It cannot even be in an infinite set, because GPAs won’t go to infinity.

We would do best by discovering what rules govern GPA at this place and deduce the finite discrete set of GPAs we might see for however many new people—call that number n—we want to predict. This n will always be less than infinity.
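
For concreteness, here is a minimal sketch in R of what deducing that set could look like, under made-up rules I am assuming only for illustration: every class carries equal credit and each grade is worth a whole number of points from 0 to 4. The real rules at any school will differ; the point is only that the resulting set is finite and discrete.

# A sketch only, under assumed (not real) grading rules:
# k equal-credit classes, each grade worth 0, 1, 2, 3, or 4 points.
k = 8                                  # hypothetical number of classes
totals = 0:(4*k)                       # every achievable grade-point total under these rules
possible_gpas = totals / k             # the finite, discrete set of possible GPAs
length(possible_gpas)                  # 33 values; nowhere near a continuum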

We won’t do that, because that’s work, and nobody likes work. Takes time from writing grants. Instead, we’ll just pretend GPAs live on the continuum. Why not? Everybody else does.

The idea is to suppose the uncertainty in any new GPA is quantified by a normal distribution. That takes two parameters. Something to specify the central point and another to specify the spread. Now we could just take the same two parameters for Whites and non-Whites, or we can give Whites their own two, and non-Whites their own two. Or we can do what everybody else does and give Whites their own central parameter and non-Whites their own (different) central parameter, and let them share the spread parameter. Sharing is caring. Sharing makes for much easier math, which is the real reason.
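
To make that bookkeeping concrete, here is a sketch of the everybody-else choice in R terms, assuming the packages and data from the code at the bottom of the post are loaded. Written as a regression formula, it has three parameters: an intercept (the non-White central parameter), a White coefficient (the shift that gives Whites their own center), and one shared spread.

# A sketch only: how the shared-spread choice looks as a regression.
# (Assumes library(Stat2Data); data(FirstYearGPA) from the appendix code.)
head(model.matrix(GPA ~ White, data = FirstYearGPA)) # intercept and White columns
# center for non-Whites = intercept
# center for Whites     = intercept + White coefficient
# spread                = one sigma shared by both groups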

Do you know what the parameter values are? No. I don’t either. How can we? They do not exist! They certainly do not exist in the observations we’ve already made. They do not exist in new data, either, because we’ve already deduced GPA is finite and discrete and normals give probabilities on the continuum.

So parameters are a useful fiction. Which means it makes no sense, no sense at all, speaking of “true values” of them. It makes as much sense as trying to estimate the true color of a unicorn’s horn.

We don’t care about any parameters anyway, since we’re interested in GPA!

We have already assumed a normal model, knowing it’s strictly wrong, but hoping it might be a useful model just the same. We then assumed the number of parameters. Now we have to assume the uncertainty in the parameters. Lots of ways to do it. Which is right?

None of them.

Bayesians call the initial uncertainty assumption on the non-existent parameters “the prior distribution”. Fine. Some like to assume an uncertainty using something called maximum entropy. Others use something called “conjugate” priors. Others fake the math entirely and say the parameters can equally likely take any value, which is false. These are called “flat” priors (for the obvious reason). They are impossible on the continuum. But there are ways of tricking out the math to use them anyway.

There are certain things to recommend some uncertainty approaches over others, the most important being the usefulness of the eventual model, but we’ll today ignore these. Because it doesn’t really make any difference to us. If you change your assumption, you’ll change your probability. So each assumption can give different probabilities. But we knew that. We hoped for that! We learned that is how every single probability and logic argument works.

And just think: if we picked another model, and not the normal, we’d again get different probabilities. As we should. None of these will be the “true” model, unless we really have nailed the causes and conditions of each and every GPA, which is a practical impossibility. Think how much goes into just one grade! We’re stuck at correlation.

So for now we’ll let the software we’ll use (R) pick whatever defaults it has (weakly informative normal priors, in rstanarm’s case), not caring much about that. These assumptions are drowned in the deluge of assuming the normal model in the first place anyway. The parameter assumptions matter, but not very much. Not here, anyway. We’ll later meet cases where they do matter.
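
If you want to see what was assumed, or test the change-the-assumption-change-the-probability claim yourself, here is a sketch, assuming the code at the bottom of the post has been run first so that fit and z exist; the normal(0, 1) prior is an arbitrary choice made purely for illustration.

# A sketch only: inspect the default parameter assumptions, then swap in
# different ones and watch the predictive probabilities move.
prior_summary(fit)                       # the defaults rstanarm actually used
fit2 = stan_glm(GPA ~ White, data=FirstYearGPA, prior = normal(0, 1), iter=1e5)
p2 = posterior_predict(fit2, z)          # new assumptions, (somewhat) new probabilities
sum(p2[,2] > p2[,1]) / nrow(p2)          # compare with the 0.69 quoted below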

Now there’s some rigmarole about the method of mathematics used in finding the answer, but we today again don’t care. We’ll do the math next time anyway. This uses something called MCMC, which we covered before. We recall the one big lesson: it’s just a technique for numerical integration, and nothing more. There is no mysticism about “drawing” “random” numbers to bless the results. That’s pagan mathematics, which we eschew. We want

Pr(GPA = g | data, model, assumptions, White)

and

Pr(GPA = g | data, model, assumptions, non-White)

for whatever values of g we like. That’s what the math gives us.

The answer is 0. Always 0, regardless of g. Because we picked a model that lives on the continuum, and the chance of seeing any single value out of an uncountable infinity of numbers is always 0.

Oops. Well, we knew the normal was going to be an approximation. Our duty is reduced to one: always remembering it is an approximation, and never, ever, never, never, never falling prey to the Deadly Sin of Reification and supposing the model is Reality.

There’s two ways of handling this 0 difficulty. And both are really the same. Turn the approximation into a discrete form. Like this:

Pr(g_low < GPA < g_high | data, model, assumptions, White)

Just pick g_low and g_high to be meaningful steps. Like, I don’t know, every 0.1 of a GPA. It depends on the decisions you’d make with the model, which might not be the same for you as for me, or anybody else. Your cuts are not my cuts. There is no universal correct method here, even though it seems like there is one. We meet that next week in The Wrong Way.
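
As a sketch of what those discretized probabilities look like in practice, assuming the code at the bottom of the post has been run so the predictive draws p exist (column 1 = non-White, column 2 = White), and picking 0.1-wide cuts purely for illustration:

# A sketch only: discrete-form probabilities from the predictive draws.
mean(p[,2] > 3.0 & p[,2] <= 3.1)   # Pr(3 < GPA <= 3.1 | data, model, assumptions, White)
mean(p[,1] > 3.0 & p[,1] <= 3.1)   # Pr(3 < GPA <= 3.1 | data, model, assumptions, non-White)
table(cut(p[,2], seq(0, 4.2, 0.1))) / nrow(p)   # the whole 0.1-step distribution for Whites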

Meanwhile, here’s a picture of new GPAs by race (White = 1). It is discrete. Maybe too fine a cut. Whatever. It’s bumpy because the approximation method of integration is rough. Which means you must add the integration method to the list of assumptions! Change it, and change the probability. Change any data point or assumption and you change the probability! That’s a feature, not a bug. There is no true probability because probability does not exist! There’s math we can do to make it smooth, depending on the parameter assumptions we make. But we don’t care today. We only want to interpret what we see.

The model predicts Whites will have on average a higher GPA than non-Whites. How much higher? Depends entirely on what question you ask. There is no universally applicable question all model users have. To think so is The Wrong Way.

How about this:

Pr(GPA_white > GPA_non-white | data, model, assumptions, race) = 0.69.

But consider if the answer was 0.5. That means knowing race is not informative here. It does not–as in NOT–mean race is not causal. Race could have been causal in some individual cases, or even all, in various ways and directions. All–as in ALL–we can say is knowing race tells us nothing here.

But is the question itself interesting? I don’t know. Depends entirely on the uses to which the model is put.

Here’s two more questions:

Pr(GPA > 4.15 | data, model, assumptions, White)?

and

Pr(GPA > 4.15 | data, model, assumptions, non-White)?

We know by the rules of GPA that the answer is 0 in both cases, because nobody can have one higher. It is not possible. The model answers are:

Pr(GPA > 4.15 | data, model, assumptions, White) = 0.002

and

Pr(GPA > 4.15 | data, model, assumptions, non-White) = 0.015.

Oops. The model is predicting real chances for impossible outcomes. I call this probability leakage. This also means the parameters, even if they could be right, are wrong.
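
You can see the leakage directly from the predictive draws. A sketch, again assuming the code at the bottom of the post has been run so p exists; the 4.15 cap is the impossible threshold from the question above, and the floor of 0 is my added assumption that no GPA can be negative:

# A sketch only: probability the model gives to impossible GPAs.
# (Column 1 = non-White, column 2 = White.)
apply(p, 2, function(x) mean(x > 4.15))   # leakage above the maximum possible GPA
apply(p, 2, function(x) mean(x < 0))      # leakage below zero, also impossible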

Well, the model is an approximation. Is the error here too big?

There is no universal yes or no. Depends entirely on the uses to which the model is put. In your application it might cost you. In mine, it might not.

Thing is this: probability leakage is rampant. Errors are often a lot larger than single digits. There’s a lot of bad models out there. And almost no one knows about it, because why? Because of The Wrong Way.

Well that’s it. That’s how all models work. As you can see, it’s just regular old probability. YOU CAN ASK ANY QUESTION YOU WANT AND ANSWER IT, IF IT CAN BE ANSWERED. What a terrific way to approach models.

No special interpretations or fanciness is needed. You already knew everything, really. Oh, sure, there’s lots of math to do a lot of this, but that’s just mechanics. It has interest, and we’ll do it, but not one bit of it changes the philosophy.

We have lots more to do. We have to learn what makes a good and bad model, for instance. And we have to learn how to do it wrong. Important for those seeking academic careers.

Subscribe or donate to support this site and its wholly independent host using credit card click here. Or use PayPal. Or use the paid subscription at Substack. Cash App: $WilliamMBriggs. For Zelle, use my email: matt@wmbriggs.com, and please include yours so I know who to thank. BUY ME A COFFEE.

Here’s the code, unexplained today, which assumes you downloaded the data at the link; if you don’t have R, it’s free, and you may have to install the packages called (shown how in the first line):

#install.packages(c("Stat2Data","rstanarm","ggplot2","scales"))
library(Stat2Data)
require(rstanarm)
require(ggplot2)
require(scales)
data(FirstYearGPA)

# the initial data histogram; saved in g in case you want to save it
g = ggplot( aes(x=GPA, fill=as.factor(White)), data=FirstYearGPA) +
    geom_histogram( alpha=0.6, position = 'identity') +
    theme_bw() +
    scale_fill_manual(values=c("#69b3a2", "#404080")) +
    labs(fill="")
g

# the normal model
fit = stan_glm (GPA ~ White, data=FirstYearGPA,iter=1e5)
z = data.frame(White = c(0,1)) # start of the new data predictions
p = posterior_predict(fit,z)   # predictive draws of new GPAs: column 1 = non-White, column 2 = White
n = dim(p)[1]                  # number of posterior draws
w = data.frame(GPA = matrix(p), White = c(rep(0,n),rep(1,n)) ) # stack the draws for plotting

# the final picture
g = ggplot( aes(x=GPA, fill=as.factor(White)), data=w) +
    geom_histogram( alpha=0.6, position = 'identity',bins = 200) +
    theme_bw() +
    scale_fill_manual(values=c("#69b3a2", "#404080")) +
    labs(fill="")
g

# Pr(GPA_white > GPA_non-white | data, model, assumptions, race)
sum(p[,2]>p[,1])/n

# Pr(GPA > 4.15 | data, model, assumptions, race); column 2 = White, column 1 = non-White.
f = function(x){sum(x>4.15)}
apply(p, 2, f)/n

