Free Data Science Class: Predictive Case Study 1, Part VIII

Free Data Science Class: Predictive Case Study 1, Part VIII


We’re continuing with the CGPA example. The data is on line, and of unknown origin, but good enough to use as an example.

We will build a correlational model, keeping ever in mind this model’s limitations. It can say nothing about cause, for instance.

As we discussed in earlier lessons, the model we will build is in reference to the decisions we will make. Our goal in this model is to make decisions regarding future students’ CGPAs given we have guesses or know their HGPA, SAT, and possibly hours spend studying. We judge at least the first two in the causal path of CGPA. Our initial decision cares about getting CGPA to the nearest point (if you can’t recall why this is most crucially important — review!).

It would be best if we extended our earlier measurement-deduced model, so that we have the predictive model from the get go (if you do not remember what this means — review!). But that’s hard, and we’re lazy. So we’ll do what everybody does and use an ad hoc parameterized model, recognizing that all parameterized models are always approximations to the measurement reality.

Because this is an ad hoc parameterized model, we have several choices. Every choice is in response to a premise we have formed. Given “I quite like multinomial logistic regression; and besides, I’ve seen it used before so I’m sure I can get it by an editor”, then the model is in our premises. All probability follows on our assumptions.

Now the multinomial logistic regression forms a parameter for every category—here we have 5, for CGPA = 0-5—and says those parameters are functions of parameterized measurements in a linear way. The math of all this is busy, but not too hard. Here is one source to examine the model in detail.

For instance, the parameter for CGPA = 0 is itself said to be a linear function of parameterized HGPA and SAT.

These parameters do not exist, give no causal information, and are of no practical interest (no matter how interesting they are mathematically). For instance, they do not appear in what we really want, which is this:

    (8) Pr(CGPA = i | guesses of new measures, grading rules, old obs, model), where i = 0,…,5.

We do not care about the parameters, which are only mathematical entities needed to get the model to work. But because we do not know the value of the parameters, the uncertainty in them, as it were, has to be specified. That is, a “prior” for them must be given. If we choose one prior, (8) will given one answer; if we choose a different prior, (8) will (likely) give a different answer. Same thing if we choose a different parameterized model: (8) will give different answers. This does not worry us because we remember all probability is conditional on the assumptions we make. CGPA does not “have” a probability! Indeed, the answers (8) gives using different models are usually much more varied than the answers given using the same model but different priors.

What prior should we use? Well, we’re lazy again. We’ll use whatever the software suggests, remembering other choices are possible.

Why not use the MNP R Package for “Fitting the Multinomial Probit Model”? But, wait. Probit is not the same as Logit. That’s true, so let’s update our ad hoc premise to say we really had in mind a multinomial probit model. If you do not have MNP installed, use this command, and follow the subsequent instructions about choosing a mirror.

install.packages('MNP', dependencies = TRUE)

There are other choices beside MNP, but unfortunately the software for multinomial regressions is not nearly as developed and as bullet proof as for ordinary regressions. MNP gives the predictive probabilities we want. But we’ll see that it can break. Beside that, our purpose is to understand the predictive philosophy and method, not to tout for a particular ad hoc model. What happens below goes for any model that can be put in the form of (8). This includes all machine learning, AI, etc.

The first thing is to ensure you have downloaded the data file cgpa.csv, and also the helper file briggs.class.R, which contains code we’ll use in this class. Warning: this file is updated frequently! For all the lawyers, I make no guarantee about this code. It might even destroy your computer, cause your wife to leave you, and encourage your children to become lawyers. Use at your own risk. Ensure Windows did not change name of cgpa.csv to cgpa.csv.txt.

Save the files in a directory you create for the class. We’ll store that directory in the variable path. Remember, # comments out the rest of what follows on a line.

path = 'C:/Users/yourname/yourplace/' # for Windows
#path = '/home/yourname/yourplace/' # for Apple, Linux
# find the path to your file by looking at its properties
# everything in this class is in the same directory

source(paste(path,'briggs.class.R',sep='')) # runs the class code
x = read.csv(paste(path,'cgpa.csv',sep='')) 
 x$cgpa.o = x$cgpa # keeps an original copy of CGPA
 x$cgpa = as.factor(roundTo(x$cgpa,1)) # rounds to nearest 1

You should see this:

>  summary(x)
 cgpa        hgpa            sat           recomm          cgpa.o     
 0: 4   Min.   :0.330   Min.   : 400   Min.   : 2.00   Min.   :0.050  
 1:17   1st Qu.:1.640   1st Qu.: 852   1st Qu.: 4.00   1st Qu.:1.562  
 2:59   Median :1.930   Median :1036   Median : 5.00   Median :1.985  
 3:16   Mean   :2.049   Mean   :1015   Mean   : 5.19   Mean   :1.980  
 4: 4   3rd Qu.:2.535   3rd Qu.:1168   3rd Qu.: 6.00   3rd Qu.:2.410  
        Max.   :4.250   Max.   :1500   Max.   :10.00   Max.   :4.010  
> table(x$cgpa)

 0  1  2  3  4 
 4 17 59 16  4 

The measurement recomm we’ll deal with later. Next, the model.

require(MNP) # loads the package

fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=2000)
#fit <- mnp(cgpa ~ sat + hgpa, data=x, burnin = 2000, n.draws=10000)

The model call is obvious enough, even if burnin = 2000, n.draws=2000 is opaque.

Depending on your system, the model fit might break. You might get an odd error message ("TruncNorm: lower bound is greater than upper bound") about inverting a matrix which you can investigate if you are inclined (the problem is in a handful of values in sat, and how the model starts up). This algorithm uses MCMC methods, and therefore cycles through a loop of size n.draws. All we need to know about this (for now) is that because this is a numerical approximation, larger numbers give less sloppy answers. Try n.draws=10000, or even five times that, if your system allows you to get away with it. The more you put, the longer it takes.

We can look at the output of the model like this (this is only a partial output):

> summary(fit)

mnp(formula = cgpa ~ sat + hgpa, data = x, n.draws = 50000, burnin = 2000)

                    mean       2.5%  97.5%
(Intercept):1 -1.189e+00  2.143e+00 -7.918e+00  0.810
(Intercept):2 -1.003e+00  1.709e+00 -5.911e+00  0.664
(Intercept):3 -8.270e+00  3.903e+00 -1.630e+01 -1.038
(Intercept):4 -2.297e+00  3.369e+00 -1.203e+01 -0.003
sat:1          9.548e-04  1.597e-03 -3.958e-04  0.006
sat:2          1.065e-03  1.488e-03 -7.126e-06  0.005
sat:3          4.223e-03  2.655e-03  2.239e-05  0.010
sat:4          1.469e-03  2.202e-03  1.704e-06  0.008
hgpa:1         9.052e-02  3.722e-01 -5.079e-01  0.953
hgpa:2         1.768e-01  3.518e-01 -2.332e-01  1.188
hgpa:3         1.213e+00  6.610e-01  1.064e-01  2.609
hgpa:4         3.403e-01  5.242e-01 -7.266e-04  1.893

The Coefficients are the parameters spoken of above. The mean etc. are the estimates of these unobservable, not-very-interesting entities. Just keep in mind that because a coefficient is large, does not mean its effect on the probability of CGPA = i is itself large.

We do care about the predictions. We want (8), so let's get it. Stare at (8). On the right hand side we need to guess values of SAT and HGPA for a future student. Let's do that for two students, one with a low SAT and HGPA, and another with high values. You shouldn't have to specify values of CGPA, since these are what we are predicting, but that's a limitation of this software.

y = data.frame(cgpa = c("4","4"), sat=c(400,1500), hgpa = c(1,4))
a=predict(fit, newdata = y, type='prob')$p

The syntax is decided by the creators of the MNP package. Anyway, here's what I got. You will NOT see the exact same numbers, since the answers are helter-skelter numerical approximations, but you'll be close.

> a
            0          1         2      3            4
[1,] 0.519000 0.24008333 0.2286875 0.0115 0.0007291667
[2,] 0.000125 0.04489583 0.1222917 0.6900 0.1426875000

There are two students, so two rows of predictions for each of the five categories. This says, for student (sat=400, hgpa=1), he'll most like see a CGPA = 0. And for (sat=1500, hgpa=4), the most likely is a CGPA = 3. You can easily play with other scenarios. But, and this should be obvious, if (8) was our goal, we are done!

Next time we'll build on the scenarios, explore this model in more depth, and compare our model with classical ones.

Homework Play with other scenarios. Advanced students can track down the objectionable values of sat that cause grief in the model fit (I wrote a script to do this, and known which ones they are). Or they can change the premises, by changing the starting values of the parameters. We didn't do that above, because most users will never do so, relying on the software to work "automatically".

The biggest homework is to think about the coefficients with respect to the prediction probabilities. Answer below!


  1. JH

    So the model estimates/predicts that it is impossible for the two students to have a CGPA of 3.5 or 3.75. Their CGPAs can only be a value of 0 or 1 or 2 or 3 or 4. Doesn’t the conclusioin give you a clue that that something is wrong?

    You mean you cannot deductively assign a probability model using observed data only, and you have to make assumptions. And the justification for the assumptions is

    Given “I quite like multinomial logistic regression; and besides, I’ve seen it used before so I’m sure I can get it by an editor”, then the model is in our premises.

    Because you like it. And because you’ve seen it used before. Therefore, it is appropriate and let’s build it in the premise.

    But, wait. Probit is not the same as Logit. That’s true, so let’s update our ad hoc premise to say we really had in mind a multinomial probit model.

    But, wait. What you really had in mind is a multinomial probit model. But, wait gain, what you really really had in mind is a ordinal multinomial probit model.

    No kidding!

    Well, I agree that all probability follows on our assumptions. The money is on why an assumption of a multinomial logistic or probit model or a certain model is (more) appropriate for the data. Statistical modeling it is.

  2. Per

    I’ve been following your course, but you lost me with the software output. I look forward to the next.

Leave a Reply

Your email address will not be published. Required fields are marked *