Class 52: The Origin Of Parameters


Unobservable parameters abound in probability models. Why? Where do they come from? Are they needed? This is our first hardcore math lesson. You must read the written lesson today. WARNING for those reading the email version! The text below might appear to be gibberish. If so, it means the LaTeX did not render in the emails. I’m working on this. Meanwhile, please click on the headline and read the post on the site itself. Thank you.

Video

Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty

HOMEWORK: Given below; see end of lecture.

Lecture

This is an excerpt from Chapter 8 of Uncertainty.

There has been an inordinate and unfortunate fascination with unobservable parameters which are found inside most probability models. Parameters relate the X to Y, but are understood in an ad hoc fashion. Since models are often selected through custom or ignorance of alternatives (and recall we’re talking about actual and not ideal practice), the purposes of parameters are not well considered, to say the least. Most statistical practice, frequentist or Bayesian, revolves solely around parameters, which has led to the harmful misconception that the parameters are themselves the X, and the X causal. P-values, confidence intervals, posterior distributions, hypothesis tests, and other classic measures of model “fit” are abused with shocking frequency and with destructive force. Probability leakage is the least of these problems; mis-ascribed causality the worst. It’s time for it to stop. People want to know $\Pr(\mbox{Y} | \mbox{X})$: tell them that and not about some mysterious parameters.

Parameters arise from considering measurement. All measurement is finite and discrete, regardless of the way the universe might happen to be (I use universe in the philosophical sense of all that exists). Measurement drives X, which in turn are probative of Y. Parameters are not necessary when all is finite and discrete, but they may be used for mathematical convenience. But their origin must first be understood. Parameters-as-approximations arise from taking a finite discrete measurement process (which all measurement processes are) to the limit. The interpretation of parameters in this context then becomes natural. This area, as will soon be clear, is wide open for research. Below, I’ll show how parameters arise in a familiar setup; but how they come about in others is mostly an open question.

Where do parameters come from? Here is one example, which originates with Laplace, and which necessitates some mathematics, which, I remind us, are not our main purpose here. The parallels to Solomonoff’s approach (cited in Chapter 5) will be obvious to those familiar with algorithmic information theory. Begin with the premise E that before us is an urn which contains $N$ objects, objects which can take one of two states. From this language we infer $N$ is finite, which is absolutely no restriction, because $N$ can be very large indeed. Call them “success” and “failure”, or “1” and “0”, if you like. From this we deduce there can be anywhere from 0 to N successes. Given these premises—and no others—or rather this evidence E, we deduce the probability that there are $M = i, i=0,\dots,N$ successes is $1/(N+1)$. No number of successes (or failures) is more likely than any other.
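For readers who like to see numbers beside the symbols, here is a tiny sketch in Python of that deduced flat distribution; the value of $N$ is an arbitrary illustration, not anything from the text above.

```python
# The deduced "flat" distribution over the number of successes M in the urn.
# N is an arbitrary illustration; the deduction holds for any finite N.
N = 20
prior = [1 / (N + 1) for _ in range(N + 1)]  # Pr(M = i | E) = 1/(N+1), i = 0, ..., N
assert abs(sum(prior) - 1.0) < 1e-12         # the probabilities sum to one
```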

Now suppose we reach in and grab a sample of size $n$. In this sample there will be $n_1$ successes and $n_0$ failures, so that $n_1 + n_0 = n$. To say something about these observations, we want the probability of $j$ successes in $n$ draws, without replacement, where the urn has $i$ total successes. It will also be helpful to rewrite, or rather parameterize, this by considering $N\theta$, where $\theta = i/N$, which is the fraction of successes. Note that $\theta$ is observable. The probability is (with the obvious restrictions on $j$):
$$
\Pr(n_1 = j | n,\theta,N, \mbox{E}) = \frac{ {N\theta \choose j} {N-N\theta \choose n-j}}{ {N \choose n}},
$$
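Those who want to check the hypergeometric numerically can do so in a few lines. This is only a sketch: the values of $N$, $\theta$, and $n$ are arbitrary illustrations, and scipy is assumed to be available.

```python
from math import comb
from scipy.stats import hypergeom

# Illustrative values (not from the text): an urn of N objects, N*theta of them
# successes, and a sample of n drawn without replacement.
N, n = 100, 10
theta = 0.3                 # fraction of successes in the urn; N*theta must be a whole number
M = int(N * theta)          # number of successes in the urn

j = 4                       # probability of seeing j successes in the sample
p_formula = comb(M, j) * comb(N - M, n - j) / comb(N, n)
p_scipy = hypergeom.pmf(j, N, M, n)   # scipy's argument order: (k, total, successes, draws)
print(p_formula, p_scipy)             # the two agree
```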
which is a hypergeometric. We are still interested in the fraction $\theta$ (out of all $N$) of successes. Since we saw $n_1$ successes so far, $\theta$ must be at least as large as $n_1/N$, but it might be larger. We can use Bayes’s theorem to write (again, with the obvious restrictions on $j$)
$$
\Pr(\theta = j/N | n,n_1,N, \mbox{E}) \propto \Pr(n_1 | n,\theta=j/N,N, \mbox{E})\Pr(\theta = j/N | n,N, \mbox{E}).
$$
This is the posterior “parameter” distribution on $\theta$, which turns out to be
$$
\Pr(\theta = j/N | n,n_1,N, \mbox{E}) = {N-n \choose j-n_1}\frac{\beta(j+1,N-j+1)}{\beta(n_1+1,n_0+1)},
$$
where $\beta()$ denotes the beta function.
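The closed form can be checked the long way: compute the posterior directly from Bayes’s theorem (flat prior times hypergeometric likelihood, renormalized) and compare. A minimal sketch, with $N$, $n$, and $n_1$ chosen arbitrarily for illustration and scipy assumed available:

```python
from math import comb
from scipy.special import beta as beta_fn

# Illustrative values (not from the text).
N, n, n1 = 50, 10, 7
n0 = n - n1

# Direct Bayes: flat prior 1/(N+1) over the number of urn successes j,
# hypergeometric likelihood of the observed n1, then renormalize.
def likelihood(j):
    if j < n1 or N - j < n0:
        return 0.0
    return comb(j, n1) * comb(N - j, n0) / comb(N, n)

unnorm = [likelihood(j) / (N + 1) for j in range(N + 1)]
total = sum(unnorm)
post_bayes = [u / total for u in unnorm]

# The closed form quoted above.
def post_closed(j):
    if j < n1 or N - j < n0:
        return 0.0
    return comb(N - n, j - n1) * beta_fn(j + 1, N - j + 1) / beta_fn(n1 + 1, n0 + 1)

print(max(abs(post_bayes[j] - post_closed(j)) for j in range(N + 1)))  # effectively zero
```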

Here is where parameters arise. Following Ross (p. 180) in showing how the hypergeometric is related to the binomial for large samples, let $N\to\infty$ in the equation above. The hypergeometric converges to a binomial,
$$
\lim_{N\to\infty} \frac{ {N\theta \choose n_1} {N-N\theta \choose n-n_1}}{ {N \choose n}} = {n \choose n_1} \theta^{(n_1+1)-1}(1-\theta)^{ (n_0+1)-1},
$$
so the posterior on $\theta$, which is this likelihood multiplied by the flat prior $1/(N+1)$ and renormalized, converges to the standard beta distribution posterior on $\theta$: the posterior found when the prior on $\theta$ is “flat”, i.e. equal to a beta distribution with parameters $\alpha=\beta=1$.
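The limit itself can be watched happening: hold the sample fixed, let the urn grow, and the discrete posterior, scaled to density terms, marches toward the beta. Again only a sketch, with arbitrary illustrative numbers and numpy and scipy assumed available.

```python
import numpy as np
from math import comb
from scipy.stats import beta

# Illustrative values (not from the text): a fixed sample, an ever-larger urn.
n, n1 = 10, 7
n0 = n - n1

def discrete_posterior(N):
    """Exact posterior Pr(theta = j/N | n, n1, N, E) under the flat prior."""
    lik = np.array([comb(j, n1) * comb(N - j, n0) if (j >= n1 and N - j >= n0) else 0
                    for j in range(N + 1)], dtype=float)
    return lik / lik.sum()   # the flat prior and normalizing constants cancel on renormalization

for N in (50, 500, 5000):
    post = discrete_posterior(N)
    theta = np.arange(N + 1) / N
    density_approx = post * (N + 1)               # scale the pmf to density terms (N+1 grid points)
    density_beta = beta.pdf(theta, n1 + 1, n0 + 1)
    print(N, np.max(np.abs(density_approx - density_beta)))   # shrinks as N grows
```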

We started with hard-and-fast observable propositions and a finite number of successes and failures, expanded their number in a specific way towards infinity, and ended up with unobservable parameters. As Jack Aubrey would say, Ain’t you amazed? The key is that we don’t really need the infinite version of the model; the finite one worked just fine, albeit that it is harder to calculate for large $N$. But then there are no arguments over where the prior for $\theta$ came from. It arises naturally. This small demonstration is like de Finetti’s representation theorem (see below), only it also gives the prior instead of saying only that it exists.

What does the parameter $\theta$ mean? With a finite $N$—which will always be true of all real-world situations—it was the total fraction of successes (given the premises). This is sensible and measurable, at least in theory. Whether anybody ever measures all $N$ mentioned in the premises is another matter. $\theta$ is discrete: it can take only the values $0/N, 1/N, \dots, N/N$, and no value inside this set is impossible; at least, not on the evidence we have assumed. At the limit, $\theta$ is continuous and can take any value in the unit interval. Which is to say, it can take none of them, not empirically, because as Keynes said, in the long run we shall all be dead: infinity can never be reached. The parameter is no longer the fraction of successes, only something like it. But what? The mind should boggle at imagining the ratio of infinite successes in infinite chances; indeed, I cannot imagine it. I can only picture large finite approximations to it. This $\theta$ is not, as it is often called, “the probability of success.” We already deduced the probability of success given the premises. So what is it? An index with a complex definition involving limits, a definition so complex that its niceties are forgotten and people speak of it as if it were its finite cousin, that is, as if it were a probability.

Notice very carefully that the parameter-solution is an approximation. We don’t need it. Though calculating [the discrete-finite probability] may be cumbersome, we have the exact result. We don’t need to quarrel about the prior, impropriety, ignorance, non-informativity or anything else because everything has been deduced from the premises. This situation is also well behaved. Approaching the limit (in a certain specified way) produced a result which is familiar. The continuous-valued parameter ties nicely to a finite-sample result: it keeps roughly the same meaning. I have no idea whether this will be true for all stock distributions in our cookbook, but we have great reason to doubt it. In his book, Jaynes (Chapter 15) shows how the so-called marginalization paradox disappears when one very carefully tracks how one heads off to infinity. Buffon’s needle paradox is another well-known example where the path matters.
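One way to see that the finite answer stands on its own: the exact finite-urn probability that the next draw is a success, computed with no parameter and no limit, comes out to $(n_1+1)/(n+2)$, the same number the beta-posterior route gives (Laplace’s rule of succession). A sketch with arbitrary illustrative values:

```python
from math import comb

# Illustrative values (not from the text).
N, n, n1 = 1000, 10, 7
n0 = n - n1

# Posterior over j, the total number of successes in the urn, given the sample.
unnorm = [comb(j, n1) * comb(N - j, n0) if (j >= n1 and N - j >= n0) else 0 for j in range(N + 1)]
total = sum(unnorm)
post = [u / total for u in unnorm]

# Exact probability the (n+1)th draw is a success: the posterior-weighted
# fraction of successes remaining in the urn.
p_next = sum(p * (j - n1) / (N - n) for j, p in enumerate(post))
print(p_next, (n1 + 1) / (n + 2))   # the two match
```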


3 Comments

  1. Lou

    Hello,
    You requested some feedback;

    Please just keep going, I like the math

    Regards

  2. Briggs

    Thanks, Lou.

  3. JH

    Notation is crucial in math. I still remember how confused I was when my professors made writ-os in the equations shown on the blackboard.

    For example,
    $$
    \Pr(\theta = j/N | n,n_1,N, \mbox{E}) \propto \Pr(n_1 = j | n,\theta=j/N,N, \mbox{E})\Pr(\theta = j | n,N, \mbox{E}).
    $$

    $n_1$ value is not specified on the left-hand side of the equation.

And $\Pr(\theta = j | n,N, \mbox{E})$ is zero unless $j=0$ as $\theta$ is supposedly a fraction with values between 0 and 1. Perhaps, $\theta = i/N$? i or j?

    In the last equation (8.7), how does $n_1$ appear on the right-hand side when it is not present on the left-hand side of the equation? Moreover, the approximation does not hold with 1/(N+1) on the left-hand side, does it?

    The key is that we don’t really need the infinite version of the model; the finite one worked just fine, albeit that it is harder to calculate for large $N$.

    Yes, the finite one worked fine in this case. The purpose of the approximation is usually to approximate the value for large N under certain assumptions. Many approximations were introduced due to the limitations of computing power, and modern computing has rendered these approximations less valuable. Is this related to the rise of parameters? I’m not sure.
