
How To Do Predictive Statistics: Part I New (Free) Software Introduction

Introduction

Here’s what we always want, but never get, using the old ways of doing statistics: The probability that some proposition Y is true given certain assumptions. Like this:

     (1) Pr(Y | assumptions).

What happens if we change the assumptions? We change the probability. This is not a bug, it is a feature. There is NO SUCH THING, not ever, as Pr(Y), i.e. Pr(Y|no assumptions). It is an impossibility. And that means frequentism as a theory of probability fails. As does any theory which takes for granted that things “have” a probability and it is our job to discover it. For proof of these claims, read this New York Times-eligible bestseller Uncertainty.

Just because you don’t write down assumptions, doesn’t mean they are not there. At the least there will be the assumptions about what Y means, about grammar and language, and maybe rules of math. There are always assumptions.

That means if we change the model, and since the model is an assumption, we change the probability. If we add or subtract “data”, and our model uses data, and since the model is an assumption, we change the probability. If we tweak the model, as in adjust it or modify its parts, like priors, and since the model is an assumption, we change the probability. Again, this is not a bug, but a necessary feature of probability. (Some people actually complain about how changing the model or prior changes the probability, but that’s because they believe in Pr(Y) which is impossible.)
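A toy illustration of the point: Pr(shows a 6 | a six-sided die with one face labeled 6) = 1/6, while Pr(shows a 6 | a six-sided die with two faces labeled 6) = 2/6. Same proposition, different assumptions on the right hand side, different probability.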

Most models are ad hoc. The vast majority of models make rather gross approximations of Y, such as assuming it lies on the continuum. (No real thing does, though, oddly, the continuum is called the “real” line.) Regression is in this class of ad hociness. It is, by far, the most used model class, judging by published “research” (sorry for scare quotes). Regression is, or rather should be, this:

     (2) Pr(Y | new X, old X & Y, Model & assumptions),

which is the probability of our proposition Y given a new assumed or measured proposition X, the old measured values of X & Y, and the ad hoc model with its assumptions. There is no notion of cause in this setup, despite how folks describe their results. All (2) says is “This is the probability of Y given these things.” That’s it, and nothing more.

For shorthand we can write

     (3) Pr(Y | XDM),

where D is the old “data”, and the others are obvious. Change anything—as in anything—on the right hand side, change the probability of Y. For the last time, this is not a bug, it is a feature.

I’ve skipped over lots of detail, because regular readers have seen it all before, and because it’s in Uncertainty and in the other posts of this on-line class (of which this post is a member).

Practicum

One of the big reasons predictive methods have not taken off is the lack of free and easy-to-use out-of-the-box software. There are tons upon metric tons of software for classical parameter- and hypothesis-testing methods. But those methods are dead to us. We do not care about parameters and we most certainly do not care for hypothesis tests. We care only about things like eqs. (2) and (3).

The best solution would be to build this software from scratch, which requires time and money, and Yours Truly has neither in any abundance (I am wholly independent; no students, no grants, no affiliations of any kind). The next best is to build upon what’s already out there. This has the advantages of speed and familiarity, but the disadvantage of being stuck in a pre-determined format with all its inherited limitations.

We’ve tried other things in the past, but I’ve settled—as a start—on extending MCMCpack, with full acknowledgement of the severe limitations of simulation methods. Simulations are not magic, nor do they “have” probability. Nothing “has” probability. There is no such thing as “drawing” from a “random” distribution. Read the links: simulations are only reasonable approximations to difficult numerical integrations. That’s it, and nothing more.
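To see the point in miniature, here is a minimal sketch (in R, with a made-up toy integral) of what a simulation actually does: approximate a numerical integration, here the area under a standard normal curve between 0 and 1.

set.seed(42)              # the "draws" are deterministic pseudo-numbers
z = rnorm(1e5)            # nothing here is "random"
mean(z > 0 & z < 1)       # the simulation approximation of the area
pnorm(1) - pnorm(0)       # the numerical integration it mimics, about 0.341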

MCMCpack model routines return approximations of posterior distributions of parameters of models. We have zero interest in parameters per se, since they are only part of model innards. Concentrating on parameters as if they were probabilities of observables (like Y) has led to vast, vast, incalculable over-certainties. All that is also in Uncertainty.

We can use the posteriors, though, together with knowledge of the model forms, to approximate (2). The idea is to take the approximated parameter posteriors and feed them into a second simulation-approximation to get (2). Simple as that.
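In rough outline, and not the package code itself (the data frame d, the new X, and the threshold below are assumed for illustration only), the two steps for ordinary regression look something like this:

library(MCMCpack)
fit   = MCMCregress(y ~ x, data = d)    # step 1: posterior draws of the parameters
draws = as.matrix(fit)                  # columns: (Intercept), x, sigma2
xnew  = c(1, 0.5)                       # an assumed new X (intercept and x = 0.5)
mu    = draws[, 1:2] %*% xnew           # central parameter, one value per draw
ynew  = rnorm(nrow(draws), mu, sqrt(draws[, "sigma2"]))   # step 2: simulate Y itself
mean(ynew > 0)                          # e.g. Pr(Y > 0 | new X, D, M), approximately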

MCMCpack allows all kinds of priors to be input, and provides lots of other options for forming models. Here is just one of many examples—recalling, of course, that I do not endorse their philosophical interpretation. Take MCMCregress. Look at everything you can set!


Usage:

MCMCregress(formula, data = NULL, burnin = 1000, mcmc = 10000, thin = 1,
            verbose = 0, seed = NA, beta.start = NA, b0 = 0, B0 = 0,
            c0 = 0.001, d0 = 0.001, sigma.mu = NA, sigma.var = NA,
            marginal.likelihood = c("none", "Laplace", "Chib95"), ...)

The formula is usually typed right into the method, but unfortunately, for many of its models, MCMCpack does not store the formula as part of its output. So sometimes we’ll have to create the formula outside the method so we can store it. That’s easy in R, e.g.


form = formula('y ~ x1 + x2 + x3 + x4 * x5')

That simple but unfortunate notation, incidentally, is why many believe probability models say something about cause. After all, does not the formula say y is x1 plus x2 etc.? When, of course, in regression, the real formula is “central parameter of approximate normal distribution of y = x1 + x2 etc.” Normal distributions are always approximations, as are all continuous parameterized models. As long as we keep in mind the sometimes severe limitations this imposes, this is not a problem. Many (most?) don’t keep it in mind, however, and slip into committing the Deadly Sin of Reification, which is to assume the model is reality.
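Written out, the model behind the formula above is roughly this (the b’s and sigma are the unobservable parameters; x4 * x5 in R expands to the two main effects plus their interaction):

     central parameter: mu = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5 + b45*(x4*x5)
     observable:        y ~ Normal(mu, sigma), an approximation and not reality.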

The strategy of extension is to take the output of MCMCmodel and feed it into a new method MCMCmodel.pred, for prediction. Because MCMCpack is not consistent in what it passes to its output, we sometimes have to track things like the model formula, the limits of y, and so on. This isn’t hard to do, but it makes the software less automatic. So ideally a wrapper would be put around each MCMCmodel that handles this. I have NOT yet written such wrappers. In fact, none of the prediction methods thus far has any error or bounds checking, nor sanity checks of any kind. That level of coding takes the most time (to my mind) and is the least fun. So you will have to be careful what you pass into the prediction methods. This, again, is a usage limitation, which can be corrected in time.
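In outline, then, the intended workflow looks something like this; the prediction method’s real signature comes in the posts that follow, so the call shown here is only a placeholder, and d is an assumed data frame:

form = formula('y ~ x1 + x2 + x3 + x4 * x5')   # create (and keep) the formula outside
fit  = MCMCregress(form, data = d)             # the usual MCMCpack fit
# pred = MCMCregress.pred(fit, form, newdata)  # hypothetical call: returns draws
#                                              # approximating Pr(Y | new X, D, M)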

What I’ll be introducing in further posts are prediction methods for, so far:

  1. MCMCregress, ordinary regression
  2. MCMClogit, ordinary logistic regression
  3. MCMCmnl, multinomial (bucket) regression
  4. MCMCpoisson, Poisson regression
  5. MCMCtobit, regular regression with censoring on y.

This represents a big chunk of routine procedures, and is therefore enough of a “package” for people to get started.

Regular readers will know that multinomial regression is shockingly underused. Why? Because, I say, each analysis should be related to the decisions that will be made with the model. If no decisions are going to be made, there is no reason to do the model! Two people can have three different (or more!) decisions to make with the same model structure, so it should come as no surprise that they, using the same data, can come to different probability conclusions and different decisions.

All decisions, like all probability, are conditional. Since all decisions are finite and discrete, so too should the models be. No y is continuous; each falls naturally into buckets delimited by a measurement process and the decisions to be made. And the multinomial does that for us (to reasonable approximation). Well, we’ll come to that when the multinomial method is introduced.
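As a minimal sketch of the bucketing idea (the cut points and names below are invented; in practice they come from the measurement process and the decision at hand):

ybucket = cut(d$y, breaks = c(-Inf, 0, 10, Inf),
              labels = c('low', 'mid', 'high'))    # discretize y into decision buckets
# ybucket can then go into a multinomial formula for MCMCmnl, e.g. ybucket ~ x1 + x2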

Consider this software to be version 0.1. It is strictly in the testing stage. At this point, it has very little versatility. It works, though; or, it works as near as I can tell. I am even more famous for typos in my code than in my writing, so that if there are bugs or misconceptions, do not be surprised.

Next time, we do MCMCregress.pred!

LEGAL DISCLAIMER: using this software might break your computer irrecoverably, cause black holes to form in your toilet when flushed, and encourage your children to become lawyers. Do not say you weren’t warned.

Update: Download the code: mcmc.pred.R, mcmc.pred.examples.R. Update 2: If you downloaded before, download again. This is version 0.2!

Replies

  1. Great. If I keep hanging out around here, I’m just going to have to learn a new computer language. Plus, you know, maybe some statistics.

  2. “I am even more famous for typos in my code than in my writing, so that if there are bugs or misconceptions, do not be surprised.”

    Suggest you set up a downloads page or use something like SourceForge so that updates can be easily found.

  3. Agree with “Matt” (writer of #comment-177214, above).

    Yes, before another official version is released, you would have to check the contributed code yourself for “bugs or misconceptions,” as you put it, but you would also have to do that if you had oodles of money and an army of RAs.

    And congratulations!

    Now all you have to do is gather your army of open source contributors and learn to swear like Linus Torvalds, and you’re famous.

  4. I am still trying to understand your point about there being no need for simulation. I understand that randomness is not a thing. But in the case where, say, the distribution is confined to a small but unknown region, sampling seems useful, as you are likely to draw samples in the most relevant parts and can ignore most of the state space.

  5. Chao Feng,

    It’s not that there’s no need for simulation, because there could be a need if no other solution presents itself. It’s that the simulation is not doing what some claim it is doing. There is no magic to it. “Randomness” is not essential. We are never “drawing from” “distributions” on the computer. We’re only following strict deterministic rules, some of which we turn a blind eye to, in an effort to pretend this guarantees “randomness”. It’s only doing the boring job of estimating the area under the curve when an analytic method is lacking.

    And those methods are lacking because, I think, people haven’t turned their minds to them. The simulations give an answer, they’re easy enough to do, so why bother investigating what seems to be a harder problem? I’m guessing, or hoping, we’ll have new interest in these problems. A very potent and fascinating area of research.

    Best time to get in is now! When others aren’t yet aware.
