All Of Statistics: Part II

(A) No new data (cont.)

If we want to know how that data arose, and we are not satisfied by X itself, we need to propose a model M, which may be fully causal, fully probabilistic, or somewhere in between. This puts us in a jam because, for any X, there does not exist a unique model which explains X. That is, for any X, we can always create any number of M which explain X: we can always invent an explanation M (from fully to partially causal) for why X took the values it did. It matters not how fanciful M is compared to evidence not in X, that is, in relation to some E not used to infer M; it only matters that such M exist (you could always say M = “Venusians caused X”, which given many E is absurd).

Anyway, in classic (frequentist and Bayesian) statistics, an M is proposed. We now have a problem, because if our model is indexed by parameters, M = Mθ, we have to supply a guess for the θ (possibly multidimensional). We usually provide this guess by using the X itself; but this is not necessary and a guess can be supplied via external evidence or subjectively.

Frequentist theory often begins (and ends) with a “plug in” guess of the parameters. The truth of the model is assumed, and inference about X is made indirectly by discussing the parameter guesses as if the guesses were certain. More often, a subset of the parameters is set to a subjectively chosen predetermined value; usually at least one of the θ is set to 0, but any number besides 0 may be (subjectively) chosen. The theory then computes

     (2) Pr( T(X) | Mθ[0] ),

where Mθ[0] indicates the model with the predetermined value of the parameter(s) supplied and T() is any function of the data (T(X) is also a proposition). The function T() is subjectively chosen and is not unique; for any given Mθ and X, there are any number of T() that can be used, with each T() giving a different answer to (2). The quantity in (2) is called the “p-value”; thus p-values are not unique and are a function of the base model M, the values substituted into the parameters, and the “statistic” T().
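
To make the non-uniqueness concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption chosen for illustration and none of it comes from the discussion above: the data values, the model (independent Normal(θ, 1) observations with θ set to 0), and the two statistics (the sample mean and the count of positive values). It computes (2) for the same X and the same Mθ[0] under two different T().

```python
import math

def p_value_mean(x):
    """Two-sided p-value with T(X) = sample mean, under the assumed
    model M_theta[0]: X_i independent Normal(0, 1)."""
    n = len(x)
    z = abs(sum(x) / n) * math.sqrt(n)  # standardized sample mean
    # For standard normal Z, Pr(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

def p_value_sign(x):
    """Two-sided p-value with T(X) = number of positive values.
    Under the same model that count is Binomial(n, 1/2)."""
    n = len(x)
    k = sum(1 for v in x if v > 0)
    dist = abs(k - n / 2)
    # Add up every outcome at least as far from n/2 as the observed count.
    ways = sum(math.comb(n, j) for j in range(n + 1) if abs(j - n / 2) >= dist)
    return ways / 2 ** n

# Made-up data, for illustration only.
x = [0.9, 1.4, -0.3, 0.7, 2.1, -0.5, 0.2, 1.1]
p_mean = p_value_mean(x)
p_sign = p_value_sign(x)
print(f"T = mean: p = {p_mean:.3f}   T = sign count: p = {p_sign:.3f}")
```

With these made-up numbers the mean-based statistic gives a p-value of about 0.048 and the sign-based statistic about 0.289, so the same X and the same Mθ[0] land on opposite sides of the customary 0.05 line depending only on which T() was picked.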

Now, if (2) is (subjectively) thought “too large”, the guess of θ is “confirmed” and formally substituted into Mθ. Usually this means setting the relevant θ = 0 (but again, any number may be used). Surprisingly, setting parameters (in the fixed M) to their pre-chosen values is the end result or goal of frequentist analysis. The result of this operation is said to explain X; that is, the discussion focuses on whether the unobservable θ were set to 0 or not.

Bayesian statistics inverts (1) and computes

     (3) Pr(Θ | X & M),

where the M is taken to be fixed except for the value of the parameters, and Θ = “θ takes a specified value.” This is called the “posterior” and it may be derived in a formal way.

It is at this point that the typical Bayesian analysis matches frequentist procedure. That is, if in (3) some of the Pr( |θ| > c | X & M ) (where c = 0, typically) are “small”, then these θ are set to some (subjectively chosen) predefined level c (0 usually). Needless to say, what is “small” is subjectively chosen.
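
As a rough illustration of (3) and of the thresholding step just described, the sketch below assumes a deliberately simple setup that is not taken from the discussion above: independent Normal(θ, 1) observations, a Normal(0, prior_sd²) prior on θ, made-up data, and a small positive threshold c. Under those assumptions the posterior is itself normal and can be written in closed form.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def posterior_normal_mean(x, prior_sd):
    """Posterior mean and sd of theta given X, assuming
    X_i ~ Normal(theta, 1) and theta ~ Normal(0, prior_sd**2)."""
    n = len(x)
    precision = n + 1.0 / prior_sd ** 2  # data precision plus prior precision
    mean = sum(x) / precision
    sd = math.sqrt(1.0 / precision)
    return mean, sd

def prob_abs_theta_exceeds(c, mean, sd):
    """Pr(|theta| > c | X & M) for a Normal(mean, sd**2) posterior."""
    upper = 1.0 - normal_cdf((c - mean) / sd)  # Pr(theta > c)
    lower = normal_cdf((-c - mean) / sd)       # Pr(theta < -c)
    return upper + lower

# Made-up data and a hypothetical prior spread, for illustration only.
x = [0.9, 1.4, -0.3, 0.7, 2.1, -0.5, 0.2, 1.1]
post_mean, post_sd = posterior_normal_mean(x, prior_sd=10.0)
p_far = prob_abs_theta_exceeds(0.1, post_mean, post_sd)
print(f"posterior mean {post_mean:.3f}, sd {post_sd:.3f}, Pr(|theta| > 0.1) = {p_far:.3f}")
```

Whether the printed probability counts as “small” is exactly the subjective call described above; the code only supplies the number.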

Once again, the M is taken as fixed and the goal is to say which of the θ should be set to their predefined levels (usually 0). The Bayesian analysis enjoys two slight advantages: (one) it eliminates the arbitrary step of choosing a T(); and (two) it allows probability language in discussing the parameters. But, in practice, at least for common problems, the Bayesian and frequentist end result is the same or similar: an Mθ whittled down to some Mθ’ where cardinality(θ) > cardinality(θ’).

To clean up loose ends, both theories will sometimes “tack on” a guess of the remaining θ, but this is usually a half-hearted effort, probably because these guesses can never be checked (parameters cannot be observed). Anyway, X is said to be “explained” by Mθ’.

Recall that we are still in the case that we expect that no new X will obtain. We are using M to say how the only X we will have arose. We subjectively pick an M and then, if it is indexed by parameters, we go through a procedure to set some of these parameters to predefined levels, usually 0. We then announce to ourselves that our theory of how X arose is true or false depending on whether certain θ are set to 0 or not. Again, the Bayesian theory enjoys a slight advantage because it allows us to say with what probability these θ are near the predefined levels. Frequentist theory just states they are zero, period.

These analyses both assume the truth of M, which you might recall was what we wanted to know in the first place. Remember we already knew X and we were after the “best” M which explains X. But since there is no unique M that is “best”, we just have to (subjectively) pick some M, and we are left playing with its parameters. We picked an M and set some of its parameters to 0. The Mθ’ we are left with is said to be true. Since we will see no new data, we will never be able to confirm this.

Now this conclusion would be the same if we had started with a different model (necessarily with different parameters). This new model with a reduced set of parameters would also be claimed as the true explanation of X. There would be no way to check this claim, either.

We could go on ad infinitum, claiming each new model is the “true” explanation of X. Remember: we can’t use how well any M from this inexhaustible list explains X, because we can always find many M which explain X perfectly, or to any level of closeness we desire.
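
The claim that many M explain X perfectly can be sketched in a few lines. The example assumes, purely for illustration, that X is a short series observed at hypothetical times ts; the data values and function names are made up. It builds one polynomial through every data point and then a whole family of rivals that also pass through every point yet disagree everywhere else.

```python
def lagrange_model(ts, xs):
    """Return the Lagrange interpolating polynomial through the points
    (ts[i], xs[i]); it reproduces every observed value exactly."""
    def model(t):
        total = 0.0
        for i, (ti, xi) in enumerate(zip(ts, xs)):
            term = xi
            for j, tj in enumerate(ts):
                if j != i:
                    term *= (t - tj) / (ti - tj)
            total += term
        return total
    return model

def rival_model(model, ts, k):
    """Add k * prod(t - ts[i]) to a model. The added term is zero at
    every observed time, so the rival fits X just as perfectly."""
    def other(t):
        bump = 1.0
        for ti in ts:
            bump *= (t - ti)
        return model(t) + k * bump
    return other

# Made-up data, for illustration only.
ts = [0.0, 1.0, 2.0, 3.0]
xs = [2.0, 3.5, 1.0, 4.2]
m1 = lagrange_model(ts, xs)
m2 = rival_model(m1, ts, k=10.0)  # any value of k gives another perfect fit
for t, x_obs in zip(ts, xs):
    print(t, x_obs, m1(t), m2(t))   # both models match the data
print(m1(1.5), m2(1.5))             # but they disagree between the data points
```

Since k can be any number, this one construction already yields an unlimited supply of models with zero error on X, which is why fit to X alone cannot pick among them.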

So unless we are in a “jury trial”-type situation, where we have a strong E which delineates the set of rival models in advance, if we do not expect new data, there is no solution to finding “the” model which best explains X. Or, rather, the solution is to fix E (independently of X) so that the set of models is fixed in advance. But even then, unless we coalesce on one model which, given X, is true “beyond a reasonable doubt”, there will always exist, well, reasonable doubt about which model is true.

Next time: new data.


  1. Big Mike

    And they call “economics” the dismal science 😉

    I’m just new to your site, and think it’s very interesting and entertaining. Thank you for the effort you put into it. Your discussions are comprehensible even to a dilettante (in the “amateur” sense) statistician such as me.

    Over the past few days, I have learned much from you and the discussions on your site, and look forward to that continuing!

    Thanks again.



  2. Will

    This is fantastic stuff Mr. Briggs. Thank you!

    Hopefully this isn’t a stupid question, but:

    If it’s true that an infinite number of models could describe the data, then why not test a few million models, entirely at random? It’s not that hard to have a computer spit out a sequence of randomly generated functions (f=ma, f=m-a, f=a/m, etc…).

    In other words, rather than ‘reject the null’, why not ‘compare to alternatives’? Could it make more sense to accept the proposed hypothesis if 95% of the models with a MSE < n% embody the proposed hypothesis?


  3. …we can always invent an explanation M (from fully to partially causal)…

    Careful here. Causality is a complicated affair, one which statistics alone, of any stripe, are inadequate to determine. Correlation yes, causality no. Perhaps a good topic for a future post?
