The Third Way Of Probability & Statistics: Beyond Testing and Estimation To Importance, Relevance, and Skill. New Paper


New paper up at arXiv, which is a sort of précis of some key chapters in my book most relevant to probability and statistical modeling.

Abstract:

There is a third way of implementing and practicing probability models. This is to answer questions put in terms of observables. It eliminates frequentist hypothesis testing and Bayes factors, and it also eliminates parameter estimation. The Third Way is the logical probability approach, which is to make statements Pr(Y in y | X,D,M) about observables of interest Y taking values y, given probative data X, past observations (when present) D, and some model (possibly deduced) M. Significance and the false idea that probability models show causality are no more, and in their place are importance and relevance. Models are built keeping information that is relevant and important to a decision maker (and not a statistician). All models are stated in publicly verifiable fashion, as predictions. All models must undergo a verification process before any trust is put into them.

And here are some words from the second section (edited from the LaTeX):

Let past observables be labeled D = (Y,X)_old, where Y is the observable in which we want to quantify or explain our uncertainty, and X are the premises or observables assumed probative of Y (the dimensions of each will be obvious in context). Let the premises which lead to a probability model (if one is present) be labeled M. And let X = X_new be the premises or assumed values of new observables. The goal of all probability modeling is this:

     Pr(Y in y | X,D,M),

where y are values of the observable Y which are of interest to some decision maker. Models should be rare, because most probability is not quantifiable, and we must resist the temptation to force quantification by making up scientific-sounding numbers. But even if we do, the equation can be calculated, as long as we supply the premises which led to our creations.

Although it is obvious, the equation reads, “The probability Y takes the values y given the premises or assumptions X, the past data D, and the model M.” If the model is parameterized and a Bayesian philosophy is adopted, the equation is the posterior predictive distribution, and M incorporates those premises or assumptions from which the priors are deduced. If a frequentist philosophy is adopted, there are many difficulties and inconsistencies in interpretation, but I will not discuss them here; the meaning of the equation is plain enough. The key is that no parameters are explicit in the equation; the uncertainty in them has been “integrated out.” Only observables and plain assumptions remain. Logical probability would supply premises from which the model M is deduced (there would be no parameters, thus no priors).
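To make the notation concrete, here is a minimal sketch (mine, not the paper's) of how Pr(Y > y | X_new, D, M) could be computed for an ordinary normal regression under a flat prior, where the posterior predictive distribution is a Student-t. The data, the coefficients, and the predictive_prob_exceeds helper are hypothetical stand-ins, so the numbers will not match the paper's.

    # A rough sketch of Pr(Y > y | x_new, D, M): normal linear regression with
    # a flat (improper) prior, whose posterior predictive is a Student-t.
    # All data below are simulated stand-ins for the paper's GPA/SAT example.
    import numpy as np
    from scipy import stats

    def predictive_prob_exceeds(X, y, x_new, threshold):
        """Pr(Y > threshold | x_new, D=(y, X), M = flat-prior normal regression)."""
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y                  # posterior mean of the coefficients
        s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)
        loc = x_new @ beta_hat                        # predictive location
        scale = np.sqrt(s2 * (1.0 + x_new @ XtX_inv @ x_new))
        return stats.t.sf(threshold, df=n - p, loc=loc, scale=scale)

    # Hypothetical data: college GPA (Y), with intercept, HS GPA, and SAT as X.
    rng = np.random.default_rng(0)
    n = 100
    hs_gpa = rng.uniform(2.0, 4.0, n)
    sat = rng.uniform(800, 1600, n)
    gpa = np.clip(0.2 + 0.7 * hs_gpa + 0.001 * sat + rng.normal(0, 0.4, n), 0.0, 4.0)
    X = np.column_stack([np.ones(n), hs_gpa, sat])

    # Pr(Y > 3.8 | X_h = 3.5, X_s = 1160, D, M_(h,s)) under this made-up model.
    print(predictive_prob_exceeds(X, gpa, np.array([1.0, 3.5, 1160.0]), 3.8))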

I use a regression model as an example because of its familiarity. College grade point average is the observable (Y) of interest, with probative X high school GPA and SAT. You’ll have to read the paper for details:

[We want to calculate things like]

     Pr(Y > 3.8 | X_h = 3.5, X_s = 1160, D, M_(h,s)) = 0.038…

But suppose I am interested in the relevance of X_h. Its presence is an assumption, a premise, one that I felt important to make. There are several things that can be done. The first is to remove it. That leads to

     Pr(Y > 3.8 | X_s = 1160, D, M_s) = 0.0075.

Notice first that both equations are correct. The probability in the first is about 5 times larger than in the second. This is a measure of relevance and importance, given y = 3.8 and X_s = 1160. Importance and relevance, like probability itself, are always conditional on our assumptions. A second measure of importance is the change in probabilities when X_h is varied. That can be seen in the following figure.
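Continuing the hypothetical sketch above, the first measure of relevance can be read off by computing the same predictive probability with and without X_h and comparing the two, here as the ratio the excerpt describes. The reduced-model design matrix and all values below are again made up; only the recipe comes from the excerpt.

    # Relevance of X_h at this scenario: same probability under the full model
    # (HS GPA + SAT) and under the reduced model that drops HS GPA.
    # Reuses predictive_prob_exceeds and the simulated data from the sketch above.
    X_full = X                                    # intercept, HS GPA, SAT
    X_red = np.column_stack([np.ones(n), sat])    # intercept, SAT only

    p_full = predictive_prob_exceeds(X_full, gpa, np.array([1.0, 3.5, 1160.0]), 3.8)
    p_red = predictive_prob_exceeds(X_red, gpa, np.array([1.0, 1160.0]), 3.8)

    print(p_full, p_red, p_full / p_red)          # the ratio is one measure of relevance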

And:

The Third Way strategy is to create scenarios that are of direct interest to a decision maker, the person or persons who will use the model. Plots like those above can be made at the values of the probative observables in which the decision maker is interested. There is no one set of right or proper values, except in the trivial sense of excluding values that are, given exterior information, known to be impossible. For instance, given our knowledge of grade points, the value X_h = -17 is impossible. Assessing relevance and importance for large models will not be easy. But who insisted it should be? That classical statistical procedures now make analysis so simple is part of the problem we’re trying to correct.
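The second measure of importance, the change in the probability as X_h is varied, can be sketched the same way: fix the SAT value the decision maker cares about, sweep high school GPA over plausible values, and plot Pr(Y > 3.8 | X_h, X_s = 1160, D, M). This continues the hypothetical example above; a flat curve would indicate that X_h is not probative of this question.

    # Sweep HS GPA over plausible values at a fixed SAT score and plot the
    # predictive probability, continuing the simulated example above.
    import matplotlib.pyplot as plt

    hs_grid = np.linspace(2.0, 4.0, 41)
    probs = [predictive_prob_exceeds(X_full, gpa, np.array([1.0, h, 1160.0]), 3.8)
             for h in hs_grid]

    plt.plot(hs_grid, probs)
    plt.xlabel("High school GPA (X_h)")
    plt.ylabel("Pr(Y > 3.8 | X_h, X_s = 1160, D, M)")
    plt.show()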

Verification strategies are discussed, and the grand finale is:

Once importance or relevance are known, it is a mistake to say that X is linked to, or is associated with, or predicts Y, or, worse, some variant of “When X equals x, Y equals y”. These are versions of a colossal misunderstanding, which is to say that X causes Y. It is true that X determines the uncertainty we have in Y, but determines is analogical; it has an ontological and an epistemological sense. Probability is only concerned with the latter usage. The only function probability has is to say how our assumptions X determine epistemologically the uncertainty we have in Y. If we knew X was a cause of Y, we would have no need of probability.

Importance and relevance are replacements for testing and estimation, but not painless ones. The recipient of an analysis is asked to do much more work than is usual in statistics. However, this is the more honest approach. The benefit is that the equation (the first main equation above) answers questions our customers ask us in the form they expect. The probabilities are in plain English and painless to interpret. Everything is stated in terms of observables. Everything is verifiable. The conditions on which the model relies are made explicit, laid bare for all to see and to agree or disagree with. Gone is the idea that there is one “best” model which researchers have somehow discovered and which gives unambiguous results. Gone also is the belief that the statistical analysis has proved a causal relationship.

The model is made plain so that all can use it for themselves to verify predictions made with it. Everybody will be able to see for themselves just how useful the model really is.
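One way to carry out the verification step, sketched here in a form consistent with the CRPS-based scoring that the comments below describe for the paper's section 3, is to score each model's predictive distribution against new observations and prefer the model with the smaller score. The helper functions and the "new student" below are hypothetical; the CRPS is estimated from predictive draws.

    # Verification sketch: score each model's predictive distribution against a
    # new observation with the CRPS (smaller is better), continuing the
    # simulated example above.  CRPS is estimated as E|X - obs| - 0.5*E|X - X'|.
    def predictive_draws(X, y, x_new, size=2000, rng=rng):
        """Draws from the flat-prior Student-t posterior predictive."""
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        s2 = (y - X @ beta_hat) @ (y - X @ beta_hat) / (n - p)
        loc = x_new @ beta_hat
        scale = np.sqrt(s2 * (1.0 + x_new @ XtX_inv @ x_new))
        return loc + scale * rng.standard_t(n - p, size=size)

    def crps_from_draws(draws, obs):
        """Monte Carlo estimate of the CRPS for one observation."""
        draws = np.asarray(draws)
        term1 = np.mean(np.abs(draws - obs))
        term2 = 0.5 * np.mean(np.abs(draws[:, None] - draws[None, :]))
        return term1 - term2

    # A made-up "new student": HS GPA 3.5, SAT 1160, observed college GPA 3.4.
    y_new = 3.4
    print(crps_from_draws(predictive_draws(X_full, gpa, np.array([1.0, 3.5, 1160.0])), y_new))
    print(crps_from_draws(predictive_draws(X_red, gpa, np.array([1.0, 1160.0])), y_new))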

Obviously, this post is only a sketch. Read the paper, which is itself only a sketch of the book. I don’t, for instance, prove in the paper why logical probability is the only truly justifiable interpretation of probability. Et cetera.

29 Comments

  1. Dixon Duval

    I would remind you of the old adage about “the power of the memo”. In olden days we wrote memos rather than emails. Everyone knew that the rare memo had more power than the daily memo. Hence “did you get the memo”.

  2. Briggs

    Dix,

    Is that memo business why Hollywood never releases trailers to upcoming movies?

  3. Briggs: Let’s hope your “trailers” aren’t like movie trailers, where the best 3 minutes of the program are seen (out of order, of course, for effect). Then, one finds the best 3 minutes in the trailer was all that was good in the movie. Of course, those who do statistics may not be as emotionally swayed as a movie goer and I’m not sure how one would cut the most exciting 3 minutes out of a book, so we’re probably safe!

  4. Steve E

    Your enemies are clever doing a precis of the word precis to arrive at “preis.”

  5. Briggs

    Steve E,

    They are everywhere.

  6. JohnK

    LOVE LOVE LOVE “The Third Way” as what you’re talking about and arguing for. Wonderful. Keep that. Please.

    Sorry, I can only skim the paper. Way too much on my plate. Also, only got to the top of p. 5 of the paper before I wearied and quit trying to proof-read. Here are my proof-reading results to that point.

    Abstract:
    Models are built keeping on information that is relevant
    should be
    Models are built keeping information that is relevant

    Body:
    To maintain consistency the interpretation of philosophy is as logic is adopted.
    To maintain consistency the interpretation of philosophy as logic is adopted.

    can be calculated, as long as supply the premises which led to our
    can be calculated, as long as we supply the premises which led to our

    but I will not discuss them here; the meaning of equation is plain
    but I will not discuss them here; the meaning of the equation is plain

    ???There is no X probative beyond M and D can be absent or can be a record of previous flips (i.e. X and D are null or are assumed not probative).
    Possibly??:
    There is no X probative beyond itself. M and D can be absent or can be a record of previous flips (i.e. X and D are null or are assumed not probative).

    Model deduction can be accomplished if the measurement of observable are properly accounted for.
    Model deduction can be accomplished if the measurement of observables is properly accounted for.

    This implies a model may be useful in some decision contexts and of no use or even harmful in others.
    This implies that a model may be useful in some decision contexts and of no use or even harmful in others.

    Improvement? (here, I used [E] to substitute for ‘element of’, the unicode of which I couldn’t find quickly):
    The idea in the Third Way is, conditional on D and M, to vary X in the range of expected, decisionable, or important values to some decision maker and see how these change the probability of Y [E] y.
    Here is our strategy using the Third Way: conditional on D and M, to vary X in the range of expected, decisionable, or important values to some decision maker and see how these change the probability of Y [E] y.

    a particular X as it ranges along the values we choose do not change
    a particular X as it ranges along the values we choose does not change

    If the probability of Y [E] y changes in any
    If the probability of Y [E] y changes in any important way,

  7. Briggs

    JohnK,

    I can see my enemies have been very busy.

  8. DAV

    1) On p6, you say: Notice first that both (3) and (4) are correct. The probability in (3) is 5 times larger than in (4). This is a measure of relevance and importance, given y = 3.8 and Xs = 1160.

    Is there a difference between relevance and importance? What is it if there is?

    Why would a higher probability imply relevance and importance? Is 7 more important or relevant than 2 or 12 in a toss of dice because it has a higher probability of occurrence?

    2) If high school GPA was not probative of Y > 3.8 given these premises then the graph would be flat, indicating no change

    What graph are you talking about? I couldn’t find any mention of it prior to this point. Is it a graph of Pr(Y…) as GPA changes?

    If it is, then “Is this ‘departure’ from flatness important? There is no single answer to this question.” directly contradicts your assertion of a measure of relevance and importance. What’s the value of a measure if it can’t answer your question?

    I’ve run out of time at the moment. Perhaps these are answered later in the paper. I may find out when I pick this up again later.

    —-

    While the goal may ultimately be Pr(Y in y | X,E,M), the actual goal in (hopefully all) scientific endeavors is to come up with M, which makes Pr(M | X,E,Y) the next goal. Relevance and importance to M is the concern. Relevance and importance to Y matter only in so far as they aid in the search for M.

    It’s what all of the p-value hypothesis testing was meant to aid. Parameter significance to the model somehow replaced significance of the model with respect to Y, which is what your paper is emphasizing. However, so far (and after a quick scan) I don’t see how you’ve added anything other than perhaps ways to see if M is of any value.

    This isn’t a Third Way, it’s the original but frequently forgotten way.

  9. JH

    Logical probability would supply premises from which the model M is deduced (there would be no parameters thus no priors).

    No matter how you want to spin it, logical probability simply cannot go beyond some toy examples such as coin tossing, which is clearly stated in Williamson (2001) and references therein. If it could, it should’ve allowed you to justify the postulation of an ordinary regression model with parameters in section 3.

    In section 3, for a normal linear model with a uniform/flat prior on the coefficient parameters, you can simply present the analytic form of the posterior predictive distribution. Everyone, including some of your blog readers, then can easily see how the normal probability of an interval, say Y > 3.8, changes as the center of a normal curve is moved along the horizontal axis (varying x values) … if you just provide the ordinary least squares estimates of the parameters for the simulated data. It would also become evident that the probability depends crucially on the estimates of the coefficients.
    Your way does not eliminate parameter estimation.

    Ooh, a referee surely would demand that you report exactly how you generate the Y and X’s in your simulation. BTW, a truncated regression would have been more appropriate for your simulated data.

  10. JH

    The second part of section 3 – the well-known CRPS scoring (and therefore so-called skill scoring defined accordingly) provides a measure for the evaluation of probabilistic forecast, and is used to compare two models (full and reduced models) using the observed data, with a concluding sentence that

    They (Outsiders) can wait for new data and apply verification on them themselves.

    So, a new data point – a student, who has a high school GPA of 3 and a SAT score of 1600 and studies 1 hour per day, receives a freshman GPA of 3.5.

    How do I apply verification on a model using the posterior predictive probability? How does this new data point tell me if I have an appropriate model? Do I calculate P(Y=3.5) or P(Y>=3.5) or P(Y>3.8)? After calculating the probability, how do I conclude whether the model performs poorly?

    Note that correlation coefficient or p-value can be seen as a measure of relevance and importance too.

  11. JH

    (Must have misspelled “blockquote”.)

    The second part of section 3 – The well-known CRPS scoring (and therefore so-called skill scoring defined accordingly) provides a measure for the evaluation of probabilistic forecast, and is used to compare two models (full and reduced models) using the observed data, with a concluding sentence that

    They (Outsiders) can wait for new data and apply verification on them themselves.

    So, a new data point – a student, who has a high school GPA of 3 and a SAT score of 1600 and studies 1 hour per day, receives a freshman GPA of 3.5.

    How do I apply verification on a model using the posterior predictive probability? How does this new data point tell me if I have an appropriate model? Do I calculate P(Y=3.5X’s, Data, M) or P(Y>=3.5|X’s, Data, M) or P(Y>3.8|X’s, Data, M) or P(Y<3|Xs, Data, M)? After calculating the probability, how do I conclude whether the model performs poorly using my new data?

    Note that correlation coefficient or p-value can be seen as a measure of relevance and importance too.

    Ah, I see I forgot the conditional, P(Y=3.5|X's, Data, M) for the posterior predictive probability! A good dinner helps!

  12. JH

    The first half of Section 3 –

    What is the point of calculating P(Y > 2 | h,s,D,M)? It’s as if you are saying that the answer as to whether a variable is relevant and important depends on
    (1) over what range of Y (Y>3.8 or Y>2 or whatever) we are interested in making probability prediction,
    (2) at what given value of SAT score (SAT=1160) and
    (3) to qualify as a relevant or important variable, what is the desired change in the probabilities resulting from varying the values of the particular variable?

    So the answer is that there is no definite answer. I am not even sure if anyone needs to do those calculations to come up with such an answer.

    Well, a 16% change from 0.038 to 0.044 (page 6, line 5) indicates that time studying is of no importance, yet the fact that the line is not flat is proof of the relevance of the variable high school GPA. Inconsistency!

    BTW, when I suggest using a truncated regression model, it is not an act of will.

    DAV,

    The “line” is supposed to be made up by varying HS GPA values and their corresponding conditional probabilities of Y > 3.8; see Figures 1 and 2. But it is usually (if not always) not flat, for the same reason that a sample correlation coefficient is usually not 0.

    Looks like “the third way” tries to decide whether an explanatory variable is contributing significantly to the probability prediction of a response variable. Variable selection, that is. If this is the case and if it is applicable (it is a bit wishy-washy for my taste), this would be counted as the fifth or sixth (not sure) way, since one can find at least two ways in Gneiting and Raftery (2007).

  13. Jesse

    Two questions:

    (1) What is the advantage of this approach over prediction-based optimization methods like neural nets, SVMs, and regression without the probabilistic baggage?

    (2) If the worth of a model is assessed on its predictive ability, doesn’t the philosophy drop out altogether? Couldn’t any interpretation of probability be entertained if it produces models which predict well?

  14. Bumble

    I’m sympathetic to your approach. But I still have qualms over the logical concept of probability.

    In footnote 1, you invite the skeptical reader to try to discover the probability of a proposition that relies on no evidence. What is the probability of a tautology? Is it unity? Moreover, what does it mean to state the probability of some proposition conditional on a tautology? This is something that advocates of logical probability do quite frequently (David Stove, for example). To me this seems directly comparable to a bayesian stating an unconditional prior probability. Indeed the two things appear to amount to the same thing in practice, so what’s the difference?

    You give a “deduced” example of a model of a throw of a 6-sided die from which it follows that Pr(Y = 6 | X,D,M) = 1/6. It seems to me there is nothing deduced about this: it simply amounts to an assumption of the fairness of the die. A bayesian might say that such an assumption is reasonable in the absence of evidence to the contrary, or even that to assume anything else would expose one to some unfortunate decision-theoretic consequences. But there is a big difference between saying that I believe the probability of rolling a 6 is 1/6 because I have no evidence to the contrary and saying that I believe it is 1/6 because I have spent hours testing this die in a laboratory and have determined that it is perfectly balanced and symmetrical within engineering tolerances. To change the example a little, suppose I give you a 6-sided die and tell you that I have examined this die and it is biased (and I am a truthful person and you believe me). What is the probability of rolling a 6? How does one apply the symmetry considerations of logical probability when I have told you that the die is biased and therefore asymmetrical? A frequentist will presumably decline to answer in the absence of any data, which is unhelpful. A bayesian will answer that the probability is 1/6 because in the absence of other evidence, all possible biases can be assumed to be equally probable, consistently with maximising uncertainty. The bayesian answer is the helpful one: indeed it is the genius of bayesianism that it always delivers an answer, so it can be employed even in those “never mind that we don’t have enough information, just give it your best shot” situations.

    Also: “To maintain consistency the interpretation of philosophy is as logic is adopted.”
    – Do you mean that the interpretation of probability as logic is adopted? Interpreting philosophy as logic seems rather strange, though I’d be willing to bet there are some philosophers who see it so.

  15. Gareth

    That all seems very sensible, but is it really a “third way”? It looks to me like the second way done properly, i.e. Bayesian statistics as expounded by Jaynes, or “Probability Theory As Extended Logic”.

    Of course we seek a posterior probability distribution for whatever quantity is of interest. Of course we integrate out any nuisance variables we may have introduced in our model. A generation after the Bayes Wars it should not be necessary to labour these points.

    I think it is overstating the case to say we don’t need parameter estimation or Bayes factors. Sometimes parameters are themselves of interest, in which case we seek their joint PDF, or maybe the marginal PDFs. And there are real world cases where it is useful to make use of the fact that the posterior PDF depends upon the prior PDF and a Bayes factor, or likelihood ratio. Forensic genetics is a case in point, when the expert responsible for assessing the genetic evidence is not responsible for determining the prior probability of the defendant’s guilt.

  16. Briggs

    Gareth,

    The title is grandiose, it’s true. But there is more to it, particularly in model deduction.

    I wonder if you can think of an example where knowing the parameters is to be preferred to knowing the actual uncertainty. I doubt this is true, but I’m willing to be convinced.

  17. Joy

    Why not just design a way of forcing, by means of Force, modelers to render models for independent verification of usefulness?

    If JH is correct, which I suspect she is, about the limitation to toy examples, why or how does that make the “old ways” more valid? They are just more complex.
    Statisticians aren’t ready to admit that yet. Why should they? They would be giving up their magic cloaks of power.
