This class is neither frequentist nor Bayesian nor machine learning in theory. It is pure probability, and unlike any other class available. And for the right price!
Last time we completed this model:
(5) Pr(CGPA = 4 | grading rules, old observation, fixed math notions).
What we meant by “fixed math notions” gave us the multinomial posterior predictive, from which we made probabilistic predictions of new observables. Other ideas of “fixed math notions” would, of course, give us different models, and possibly different predictions. If we instead started from knowledge only of measurement, and grading rules, we could have deduced a model for new observables, too. This is done in Uncertainty. But the results won’t, in this very simple case for our good-sized n, be much different.
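As a reminder of what such a prediction looks like, here is a minimal sketch of a multinomial posterior predictive. It assumes a uniform Dirichlet prior (parameter 1 for each CGPA bucket), which may not match the exact “fixed math notions” used last time, and the counts in `old_counts` are made up for illustration; they are not the class data. The function name `posterior_predictive` is likewise hypothetical.

```python
from fractions import Fraction

def posterior_predictive(counts, prior=1):
    """Dirichlet-multinomial predictive: Pr(new = j | old obs) = (n_j + a) / (n + k*a).

    `prior` (a) must be an int or Fraction; a = 1 is an assumed uniform prior.
    """
    total = sum(counts.values()) + prior * len(counts)
    return {value: Fraction(n + prior, total) for value, n in counts.items()}

# Hypothetical counts of old CGPA observations in buckets 0-4 (NOT the class data).
old_counts = {0: 4, 1: 17, 2: 59, 3: 16, 4: 0}

predictive = posterior_predictive(old_counts)
for cgpa, p in predictive.items():
    print(f"Pr(CGPA = {cgpa} | grading rules, old obs, math) = {float(p):.3f}")
```

Note that a bucket with no old observations (CGPA = 4 here) still gets a non-zero predictive probability, because the grading rules say it is possible.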
We next want to add other measurements to the mix. Besides CGPA, we also measured High School GPA, SAT scores (I believe these are in some old format; the data, you will recall, is very old and from an unknown source), and hours spent studying for the week. We want to construct models like this:
(7) Pr(CGPA = 4 | grading rules, old observables, old correlates, math notions),
where “old observables” are measures of CGPA and “old correlates” are measures of things we think are “correlated” with the observable of interest.
This brings us to our next and most crucial questions. What is a “correlate” and why are we putting them in our models? Don’t we need to test the hypotheses, via wee p-values or Bayes factors, that these correlates are “significantly” “linked” to the observable? What about “chance”?
Here is the weakest point of classical statistics. Now we have no chance here of having a complete discussion of the meaning of, and answers to, these questions. We’ll have a go, but the depth will be unsatisfactory. All I can do is point to Uncertainty, and to other articles on the subject, and hope the introduction here is sufficient to progress.
What many are after can’t be had. The information about why a correlate is important is not in the data, i.e. the measurements of the correlate itself. Because of this, no mathematical function of the data can tell us about importance, either. Importance is outside the measured data, as we shall see. Usefulness is another matter.
Under strict probability, which is the method we are using, a “correlate” is any measure or bit of evidence you put on the right hand side. Here is where ML/AI techniques also excel. For instance, a correlate might be, “sock color of student worn on their third day of class.” With that, we can calculate (7).
Suppose we calculate these:
(7a) Pr(CGPA = 4 | grading rules, old obs, sock color, math) = 0.05,
(7b) Pr(CGPA = 4 | grading rules, old obs, math) = 0.05,
and the same for every value of CGPA (here we only have 5 possible values, 0-4, but what is said holds however we classify the observable). I mean, if the prediction is the same (exactly identical) probability whether or not we include sock color, then in this model, in this context, and given these old obs, sock color is irrelevant to the uncertainty in CGPA.
If we change anything on the right hand sides of (7a) or (7b) such that we get
(7a) Pr(CGPA = 4 | grading rules, even more old obs, sock color, math) = 0.05,
(7b) Pr(CGPA = 4 | grading rules, even more old obs, math) = 0.051,
then sock color is relevant to our uncertainty in CGPA. Relevance, then, is a conditional measure, just as probability is. If there is any difference (to within machine floating-point round off!) in the probabilities for any CGPA (with these givens), then sock color is relevant.
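The relevance check just described can be sketched in a few lines. This is only an illustration: the function name `is_relevant` and the tolerance `tol` are my own choices, and apart from the 0.05 and 0.051 for CGPA = 4, every probability below is invented so that each distribution sums to 1.

```python
import math

def is_relevant(with_correlate, without_correlate, tol=1e-12):
    """True if any predictive probability differs beyond round-off (tol is an assumption)."""
    return any(
        not math.isclose(with_correlate[v], without_correlate[v],
                         rel_tol=0.0, abs_tol=tol)
        for v in without_correlate
    )

# Hypothetical predictive distributions over CGPA 0-4.
# First pair: identical with and without sock color.
same_a = {0: 0.10, 1: 0.20, 2: 0.40, 3: 0.25, 4: 0.05}
same_b = dict(same_a)

# Second pair: the CGPA = 4 probability moves from 0.05 to 0.051.
diff_a = {0: 0.10, 1: 0.20, 2: 0.40, 3: 0.25, 4: 0.05}
diff_b = {0: 0.10, 1: 0.20, 2: 0.40, 3: 0.249, 4: 0.051}

print(is_relevant(same_a, same_b))  # False: sock color is irrelevant here
print(is_relevant(diff_a, diff_b))  # True: sock color is relevant here
```

The check says nothing about importance or usefulness; it only reports whether the evidence changed any probability, given everything else on the right hand side.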
Irrelevance is, as you can imagine, hard to come by. Even a cloud, made up of water and cloud condensation nuclei, can resemble a duck, even though the CCN have no artistic intentions. As for importance, that’s entirely different.
Would you, as Dean (recall we are a college dean), make any different decision given (7a) = 0.05 and (7b) = 0.051? (You also have to consider all the other values of CGPA you said were important; because the probabilities must sum to 1, at least one other value will differ, too.) If so, then sock color is useful. If not, then sock color is useless. Or of no use. Even though it is, strictly speaking, relevant.
Think about this decision. Think very hard. The decision you make might be different than the decision somebody else makes. The model (7a) may be useless to you and useful to somebody else.
And then you think to yourself, “You know, that 0.001 can make a big difference when I consider tens of thousands of students” (maybe this is a big state school). So (7a) becomes interesting.
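To put rough numbers on that thought, here is a back-of-the-envelope sketch. The gap between 0.05 and 0.051 is 0.001; multiplying it by the number of students gives the difference in the expected count of students with CGPA = 4 under the two models. The student counts are hypothetical, and `expected_count_difference` is just an illustrative helper.

```python
def expected_count_difference(p_with, p_without, n_students):
    """Difference in expected number of students at a CGPA value under two models."""
    return abs(p_with - p_without) * n_students

# Hypothetical enrollment sizes, from a small class to a big state school.
for n in (100, 10_000, 50_000):
    diff = expected_count_difference(0.05, 0.051, n)
    print(f"{n:>6} students: expected count differs by about {diff:.1f}")
```

Whether a difference of a tenth of a student, or of ten, or of fifty, changes any decision is up to the person making the decision.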
Well, how much would it cost to measure the sock color of every student on the third day of their class? It can be done. But would it be worth it? And you have to know sock color if you use (7a) instead of (7b). It’s a requirement. Besides, if students knew about the measurement, and they caught wind that, say, red colors have higher probabilities of large CGPA than any other color, wouldn’t they, being students and by definition ignorant, wear red on that important day? That would throw off the model. (We’ll answer why next time.)
Now if you dismiss this example as fanciful and thus not interesting, you have failed to understand the point. For it is the cost and consequences of the decisions you make that decide whether a relevant “variable” is useful. (Irrelevant “variables” are useless by definition.) We must always keep this in mind. The examples coming will make this concept sharper.
“But, Briggs, what could sock color have to do with CGPA?”
Sounds like you’re asking a question about cause. Let’s save that for next time.
It’s Christmas Break! Class resumes on 9 January 2018.