This is where it starts to get weird. The first part of the chapter introduces the standard notation of “random” variables, and then works through a binomial example, which is simple enough.
Then come the so-called normals. However, they are anything but. For most people, this will probably be the first time they hear about the strange creatures called continuous numbers. It may be more surprising still to learn that not all mathematicians like these things or agree that they are necessary, particularly in problems like quantifying probability for real observable things.
I use the word “real” in its everyday, English sense of something that is tangible or that exists. This is because mathematicians have co-opted the word “real” to mean “continuous”, which in an infinite amount of cases means “not real” or “not tangible” or even “not observable or computable.” Why use these kinds of numbers? Strange as it might seem, using continuous numbers makes the math work out easier!
Again, what is below is a teaser for the book. The equations and pictures don’t come across well, and neither do the footnotes. For the complete treatment, download the actual Chapter.
Recall that random means unknown. Suppose x represents the number of times the Central Michigan University football team wins next year. Nobody knows what this number will be, though we can, of course, guess. Further suppose that the chance that CMU wins any individual game is 2 out of 3, and that (somewhat unrealistically), a win or loss in any one game is irrelevant to the chance that they win or lose any other game. We also know that there will be 12 games. Lastly, suppose that this is all we know. Label this evidence E. That is, we will ignore all information about who the future teams are, what the coach has leaked to the press, how often the band has practiced their pep songs, what students will fail their statistics course and will thus be booted from the team, and so on. What, then, can we say about x?
We know that x can equal 0, or 1, or any number up to 12. It’s unlikely that CMU will lose or win every game, but they’ll probably win, say, somewhere around two-thirds, or 6-10, of them. Again, the exact value of x is random, that is, unknown.
Now, if last chapter you weren’t distracted by text messages about how great this book is, this situation might feel a little familiar. If we let x (instead of k; remember these letters are placeholders, so whichever one we use does not matter) represent the number of classmates you drive home, where the chance that you take any of them is 10%, we know we can figure out the answer using the binomial formula. Our evidence then was E_B. And so it is here, too, when x represents the number of games won. We’ve already seen the binomial formula written in two ways, but yet another (and final) way to write it is this:
$$x \mid n, p, E_B \sim \text{Binomial}(n, p).$$
This (mathematical) sentence reads “Our uncertainty in x, the number of games the football team will win next year, is best represented by the Binomial formula, where we know n, p, and our information is E_B.” The “~” symbol has a technical definition: “is distributed as.” So another way to read this sentence is “Our uncertainty in x is distributed as Binomial where we know n, etc.” The “is distributed as” is longhand for “quantified.” Some people leave out the “Our uncertainty in”, which is OK if you remember it is there, but is bad news otherwise. This is because people have a habit of imbuing x itself with some mystical properties, as if “x” itself had a “random” life. Never forget, however, that it is just a placeholder for the statement X = “The team will win x games”, and that this statement may be true or false, and it’s up to us to quantify the probability of it being true.
In classic terms, x is called a “random variable”. To us, who do not need the vague mysticism associated with the word random, x is just an unknown number, though there is little harm in calling it a “variable,” because it can vary over a range of numbers. However, all classical, and even much Bayesian, statistical theory uses the term “random variable”, so we must learn to work with it.
Above, we guessed that the team would win about 6-10 games. Where do these numbers come from? Obviously, they are based on the knowledge that the chance of winning any game was 2/3 and there’d be twelve games. But let’s ask more specific questions. What is the probability of winning no games, or X = “The team will win x = 0 games”; that is, what is Pr(x = 0 | n, p, E_B)? That’s easy: from our binomial formula, this is (see the book) ≈ 2 in a million. We don’t need to calculate n choose 0 because we know it’s 1; likewise, we don’t need to worry about 0.67^0 because we know that’s 1, too. What is the chance the team wins all its games? Just Pr(x = 12 | n, p, E_B). From the binomial, this is (see the book) ≈ 0.008 (check this). Not very good!
Recall we know that x can take any value from zero to twelve. The most natural question is: what number of games is CMU most likely to win? Well, that’s the value of x that makes (see the book) the largest, i.e. the most probable. This is easy for a computer to do (you’ll learn how next Chapter). It turns out to be 8 games, which has about a one in four chance of happening. We could go on and calculate the rest of the probabilities, for each possible x, just as easily.
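For readers who want to check these numbers right away, here is a minimal sketch in Python. Python is used purely for illustration and is not necessarily the software the book teaches; the helper name `binom_pmf` is made up for this example. It assumes only the evidence stated above: n = 12 games and p = 2/3 for each game.

```python
from math import comb

def binom_pmf(x, n, p):
    """Pr(x | n, p, E_B): probability of exactly x wins in n games."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 12, 2 / 3  # evidence from the text: 12 games, 2/3 chance per game

# Probability of every single thing that can happen: x = 0, 1, ..., 12
probs = [binom_pmf(x, n, p) for x in range(n + 1)]

p_none = probs[0]   # the team wins no games
p_all = probs[n]    # the team wins every game
most_likely = max(range(n + 1), key=lambda x: probs[x])

print(f"Pr(x = 0)  = {p_none:.2e}")   # roughly 2 in a million
print(f"Pr(x = 12) = {p_all:.4f}")    # roughly 0.008
print(f"most likely x = {most_likely}, Pr = {probs[most_likely]:.3f}")
```

The list `probs` holds the whole distribution, so any other question about x (for instance, the chance of winning between 6 and 10 games) is just a sum over part of that list.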
What is the most likely number of games the team will win is the most natural question for us, but in pre-computer classical statistics, there turns out to be a different natural question, and this has something to do with creatures called expected values. That term turns out to be a terrible misnomer, because we often do not, and cannot, expect any of the values that the “expected value” calculations give us. The reason expected values are of interest has to do with some mathematics that are not of especial interest here; however, we will have to take a look at them because it is expected of one to do so.
Anyway, the expected value for any discrete distribution, like the binomial, is calculated like this:
$$E_x(x) = 0 \times \Pr(x = 0 \mid E) + 1 \times \Pr(x = 1 \mid E) + \cdots + n \times \Pr(x = n \mid E)$$
where discrete means that x can only take on measurable, actual values (there are other distributions that are called continuous which I’ll describe below). The expectation (another name for it) is the sum of every value that x can be times the probability that x takes that value. Think of it as a sort of probability-weighted average of the xs. The little sub x on the expected value means “calculate the expected value of the variable with respect to x”; that is, calculate E(x) with respect to the probability distribution of x. Incidentally, we can also calculate E_x(x^2) or E_x(g(x)), where g(x) is some function of x that might be of interest to us, and sometimes it can get confusing what we’re doing, hence placing the subscript as a reminder. As always, it is important to be precise.
Turns out that there is a shortcut for the binomial, which is E_x(x) = np. So, for the CMU team, E_x(x) = 12 × 2/3 = 8… which sounds like I’m complaining about nothing, because this is the same as the most likely number of games won! But what if the probability of winning individual games was 3/5 instead of 2/3? Then (a computer shows us) the most likely number of games won is 7, but the expected value is E_x(x) = 12 × 3/5 = 7.2. Now, according to the rules of football as I understand them, you can only win whole games; that is, winning the expected number of games is an impossibility.
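The comparison between the expected value and the most likely value can be checked directly. The sketch below, in Python and with helper names invented for the example, computes the expectation from its definition (the probability-weighted sum), compares it with the shortcut np, and finds the most likely number of wins for both p = 2/3 and p = 3/5.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x wins in n games, each won with chance p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def expected_value(n, p):
    """Sum of every value x can take times the probability of that value."""
    return sum(x * binom_pmf(x, n, p) for x in range(n + 1))

def most_likely(n, p):
    """The value of x with the largest probability."""
    return max(range(n + 1), key=lambda x: binom_pmf(x, n, p))

n = 12
for p in (2 / 3, 3 / 5):
    print(f"p = {p:.3f}: E(x) by definition = {expected_value(n, p):.2f}, "
          f"shortcut np = {n * p:.2f}, most likely x = {most_likely(n, p)}")
```

With p = 3/5 the expectation is 7.2 while the most likely count is 7, which is the mismatch described above.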
There is another quantity related to the expected value called the variance. It has a similar birth story and a precise mathematical definition, which for discrete distributions is
$$\begin{aligned} V_x(x) &= E_x\big((x - E_x(x))^2\big) \\ &= (0 - E_x(x))^2 \times \Pr(x = 0 \mid E) + \cdots + (n - E_x(x))^2 \times \Pr(x = n \mid E). \end{aligned}$$
Its purpose is to give some idea of the precision of the expected value. Look at the definition: it is a function of the value of x minus the “expected” value of x, for each possible value of x (that’s the outer expectation). High values of variance, relative to the expected value, imply that the expected value is imprecise; low values have the opposite implication. There is a binomial shortcut to the variance: V_x(x) = np(1 − p). For the CMU football example, V_x(x) = 12 × 0.67 × 0.33 ≈ 2.7.
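As a check on the shortcut, the variance can be computed straight from its definition and compared with np(1 − p). This is a small Python sketch with made-up helper names, using the CMU values n = 12 and p = 2/3.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x wins in n games, each won with chance p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def variance(n, p):
    """E((x - E(x))^2), computed term by term over x = 0, ..., n."""
    mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
    return sum((x - mean) ** 2 * binom_pmf(x, n, p) for x in range(n + 1))

n, p = 12, 2 / 3
v_definition = variance(n, p)
v_shortcut = n * p * (1 - p)

print(f"variance by definition = {v_definition:.3f}")
print(f"variance by shortcut   = {v_shortcut:.3f}")
```

Both routes give about 2.67, which rounds to the 2.7 quoted above (the text's figure uses the rounded values 0.67 and 0.33).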
Why talk about expected values and variances when they are not terribly informative? Well, let’s be generous and recall that these theoretical entities had great value in the days before computers. Nowadays, we can easily calculate the probability that x equals any number, but back in the technolithic days this could only have been done with great effort. Besides, the expected value is not too far from the most likely value, and is even the same sometimes. The variance gives an idea of the plus and minus range of the expected value, that is, the most likely values x could take. And you could do it all on the back of an envelope! But since expectations still fill pages of nearly every statistics book, you at least have to be aware of them. Next, we learn how to quantify uncertainty the modern way.
2. Probability Distributions
Remember what will be our mantra: if we do not know the truth of a thing, we will quantify our uncertainty in that thing using probability. Usually, we will use a probability distribution, like the binomial. A probability distribution gives us the probability for every single thing that can happen in a given situation.
You already know lots of probability distributions (they go by the technical name “probability mass functions” for discrete data), you just didn’t know they were called that. Here are two you certainly have memorized, shown in pictures:
(see the book)
The first is for a coin flip, where every single thing that can happen is an H(ead) or a T(ail). The information we are given is E_coin = “This is a coin with two sides labeled H and T, and a flip will show one of them.” Given this information and no other, we get the picture on the left, which shows the distribution of probability for every single thing that can happen. Easy, right? It’s just a spike at 0.5 for an H, and another at 0.5 for a T. The total probability is the sum of the spikes, or 1.
The second is for the roll of a die, where every single thing that can happen is a 1, or a 2, and so on up to 6. The information is E_dice = “This is a die with six sides labeled 1, 2,…,6, and a roll will show one of them.” Given just this information, we get the picture with a spike of 1/6 for every possible number. Again, the total probability is the sum of the spikes, which is still 1. It is always equal to 1 for any probability distribution.
We can also picture the binomial for the CMU football victories.
(see the book)
Here, it is drawn for three possible values of p: p = 1/5, p = 1/2, and p = 2/3. Every single thing that can happen is that CMU wins 0 games, 1 game, etc., up to 12 games. The information we are given is E_B = “The probability of winning any individual game is fixed at exactly p (=1/5, 1/2, or 2/3), there are n = 12 games, winning or losing any game gives no information about winning or losing any others, and we will use a binomial distribution to represent our uncertainty.” If p = 1/5, you can see that there is at least a reasonable chance, about 7%, that CMU wins no games, while winning all games is so improbable that it looks close to 0.
Wait a minute, though. It is not 0. It just looks like 0 on this picture. The total number of games won by CMU is contingent on certain facts of the universe being true (like the defense not being inept, the quarterback not being distracted by job proposals or cheerleaders, and so on). Remember that the probability of any contingent event is between 0 and 1; it is never exactly 0 or 1. So even though the picture shows that winning all games when p = 1/5 looks around 0, it is not, because that would mean that winning all 12 is impossible. To say something is impossible is to say it has probability 0, which we know cannot be so for a contingent event. Incidentally, using the computer shows that the probability of winning all 12 games is about 4e-09, which is a decimal point, followed by eight 0s, then a 4, or 0.000000004. Small, but greater than 0.
The most likely number of games won, with p = 1/5, is 2; there is about a 28% chance of this happening. What is the expected value? And variance?
Notice that when we switch to p = 1/2, the chance of winning games becomes symmetric around 6, the most likely number won. This means that the chance of winning all 12 is equal to the chance of winning none. Does it also mean the chances of winning 1 is the same as winning 11?
When p = 2/3, the most likely number of games won is again 8, but right behind that in probability is 9 games, which is actually more likely than winning 7, and so on.
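The figures read off the three pictures can be reproduced numerically. The following Python sketch (helper names are invented for illustration) builds the binomial distribution for each of the three values of p and prints the quantities discussed: the chance of no wins and of 12 wins at p = 1/5, the symmetry at p = 1/2, and the ordering of 7, 8 and 9 wins at p = 2/3.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x wins in n games, each won with chance p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def distribution(n, p):
    """List of Pr(x = 0), ..., Pr(x = n)."""
    return [binom_pmf(x, n, p) for x in range(n + 1)]

def mode(probs):
    """Index (number of wins) carrying the largest probability."""
    return max(range(len(probs)), key=lambda x: probs[x])

n = 12
fifth, half, two_thirds = (distribution(n, p) for p in (1 / 5, 1 / 2, 2 / 3))

print(f"p = 1/5: Pr(0) = {fifth[0]:.3f}, Pr(12) = {fifth[12]:.1e}, "
      f"most likely = {mode(fifth)} with Pr = {fifth[2]:.3f}")
print(f"p = 1/2: Pr(0) = {half[0]:.5f}, Pr(12) = {half[12]:.5f}, "
      f"Pr(1) = {half[1]:.5f}, Pr(11) = {half[11]:.5f}")
print(f"p = 2/3: Pr(7) = {two_thirds[7]:.3f}, Pr(8) = {two_thirds[8]:.3f}, "
      f"Pr(9) = {two_thirds[9]:.3f}")
```

Every list sums to 1, as any probability distribution must, and the expected value and variance asked about above follow from the shortcuts np and np(1 − p).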
The reason to show three pictures at different values of p is because we don’t know what the value of p is, but E_B requires that we specify a known value of p, else we cannot draw the picture. We learn how to guess the value of p later.
3. What is Normal?
What will be tomorrow’s high temperature? This value is, of course, unknown. But we can always guess. Suppose we guess x °C. Are we certain it will be x °C? No. It may be a little higher, it may be a little lower. It’s unlikely to be too high or too low, or too far from x °C. So, the question you’re undoubtedly asking yourself is: “Hasn’t some brilliant and intriguing statistician come up with a way that I can quantify my uncertainty in x?” Why, yes, of course (and aren’t all statisticians brilliant and intriguing?). It’s called the normal, sometimes a Gaussian, distribution.
This distribution is different than a binomial in many ways. With the binomial, we had a fixed number of chances, or trials, for successes to occur. With the normal, there is no such thing as a success, and no fixed number of chances, except for one: the outcome itself. The binomial was useful for discrete numbers, while the normal is used for…something else, to be discussed below.
Here is one of the ways we can write it:
(see the book)
m is called the central parameter and s^2 is called the variance parameter; sometimes s, the square root of s^2, is called the standard deviation parameter. Some books will, using sloppy language, call m the mean and s the standard deviation. You will never make this mistake! The e is equal to about 2.718, and π is about 3.142. Anyway, it is a sufficiently complicated formula that we’ll never calculate it by hand.
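The formula itself is one of the pieces that does not come across in this teaser. Assuming it is the usual normal density written with the central parameter m and the variance parameter s^2 named above (the book's exact layout and notation may differ), it would look like this:

```latex
p(x \mid m, s^2, E_N) = \frac{1}{\sqrt{2\pi s^2}} \exp\!\left( -\frac{(x - m)^2}{2 s^2} \right)
```

Here the e mentioned in the text appears as the exponential function, and π sits inside the square root.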
Let’s review this to make certain where we are. Just like using a binomial, the x is shorthand for X = “The value of the temperature will be x.” Certain information is given as known: the m and the s^2, plus E_N = “We use a normal distribution to quantify our uncertainty in X.” Looking at the formula (15) might not show it, but there are some screwy things going on with the normal. First recall that the probability distribution gives us the probability of every single thing that can happen. So just what is every single thing that can happen in a normal distribution?
Well (this is true for any situation that uses a normal and not just the temperature example), x can equal, say, 18.0000001, or 18.0000000001, or -19.124828184341, and on and on. Turns out, with the mathematical device used for creating normal distributions, an infinity of things can happen: every number between negative and positive infinity is possible with a normal. How many numbers is that? So many that they can’t be counted. Numbers like this are said to be continuous, that is, there is an unbroken continuity between any two numbers a and b. How many numbers are there between a = 17 and b = 82? An infinity. How many between a = 6.01 and b = 6.1? An infinity. The binomial, on the other hand, used discrete numbers, where there is a definite space between numbers (the outcome could only be certain, fixed numbers; there was no continuity).
Normal distributions are used to specify the uncertainty of a wide range of variables: from blood pressures, to real estate prices, to heat in chemical equations, to just about anything. But there is a huge problem with this. Recall our primary purpose in using probability distributions: they are meant to quantify our uncertainty in some thing about which we do not know the value. Normal distributions, though ubiquitous, never accurately capture our uncertainty in any real life X.
The uncertainty in any real-life thing cannot be exactly quantified by a normal, because no real thing has an infinite number of possible values. Also, no real thing, like temperature, has a maximum going toward positive infinity, nor a minimum going toward negative infinity. We can accurately measure outdoor temperature to maybe a tenth or even a hundredth of a degree (e.g. 18.11 °C). But we cannot measure it to infinite precision. And the temperature of any thing can never be less than absolute zero (given certain physical arguments), and certainly cannot be infinitely high.
All these complications mean that equation (15) isn’t a probability itself (it is called a density). We first have to make it into a probability (via some hidden mathematics). When we do, any normal distribution says that
$$\Pr(x \mid m, s, E_N) = 0$$
no matter what the value of x is, no matter what m is, and no matter what s is. The probability that x equals any number is always 0 (no continuous number can be measured to infinite precision). To help see this, imagine I pick a number out of an infinite number of choices. What are the chances that you guess this number correctly? Even worse, I cannot even really pick my number! Some (an infinite amount of) continuous numbers cannot even be measured, though we know how to compute them; that is, nobody can ever fully write one down, because that would require writing down an infinite number of digits. Worse still, for most (another, larger kind of infinity of) continuous numbers, we don’t even know how to calculate their digits! Incidentally, not all mathematicians are happy about using these kinds of numbers (the ones who are dissatisfied, like myself, are called constructivists, because we like to be able to actually construct what we use). After all, if you cannot actually write down or discover a number, it has little use in measuring real things. See the books by Chaitin (2005) and Kline (1980) for more information on this odd subject.
Continuous numbers are a major burden, which seems to put us at an impasse. Since we can’t answer questions about the truth of statements like X = “The value of tomorrow’s maximum temperature will be x”, we change the question and instead ask about intervals. For example, X = “The value of tomorrow’s maximum temperature will be less than x.” X no longer makes a statement about a single number, but a range of numbers, namely, all those less than x. Other examples of intervals: all the numbers between 0 and 1; all numbers smaller than 4; all numbers between 17 and 52; etc. Pick any two numbers, and as long as they are not the same, you have an interval. Then, for example,
$$\Pr(x < 4 \mid m, s, E_N) = a$$
(where a is some real number) can be answered. Again, to emphasize, we cannot ascertain the truth of statements like X = “The value of tomorrow’s maximum temperature will be 20 °C.” We can only quantify the uncertainty of statements about intervals like X = “The value of tomorrow’s maximum temperature will be less than or equal to 20 °C.”
If the normal can’t handle questions we need answered, like giving us the probability of single numbers, why is it used? The biggest reason is habit, another is ignorance of any alternative. But there’s more to it than that. Let’s go back to our temperature example to see why.
We know, say, in our situation that we can measure temperature to the nearest tenth of a degree. We can even suppose that temperature can only ever be at every tenth of a degree, so that the temperature can be 20 °C or 20.1 °C, but it cannot be, say, 20.06 °C or any other number that isn’t an even tenth of a degree. Using a normal distribution to represent our uncertainty will give probability to statements like Y = “Tomorrow’s temp will be between, and not including, 20 °C and 20.1 °C.” We know that this probability is 0, which is to say, the statement is false, which we know based on our knowledge that temperature can only be at tenths of a degree. But the normal will say something like Pr(20° < y < 20.1° | m, s, E_N) = 0.0001. Although this is a mistake, is it a big one? “Ah, so what,” you say to yourself, “this is so small a probability as not to be worthy of my attention. The normal distribution will give me an answer that is close enough.”
You might be right, too. In later Chapters, we’ll have to see if the normal makes a reasonable approximation to what we really need. Besides, if you don’t want to use a normal distribution, you still have to use something. What?
Using a normal distribution does allow you to bypass two very tricky problems. Remember that a normal distribution, regardless of the values of m or s, says something about all numbers going towards positive and negative infinity. If you eschew the normal in the temperature example, then you at least have to say what are the maximum and minimum possible temperatures. Do you know? I mean, do you know with certainty? (Think about this.) If not, then you have, in a sense, double uncertainty: the future value of the temperature, plus some uncertainty in the distribution you use in representing that uncertainty. This situation is already beyond most statistics books, even the tough ones, so for now, until we talk about the subject of modelling, we’ll ignore this question and say that the normal is “close enough.”
Whew. A lot of facts and we haven’t even thought about our example. So why bring them up? To show you now that people are too certain of themselves. Normal distributions are so often used by that great mass of people who compute their own statistics, that you might think there are no other distributions. Since we now know that normals can only be, at best, approximations, when we hear somebody authoritatively state a statistical result must be believed, and we learn they used a normal to quantify their uncertainty, we know they are too confident. We’ll meet a lot more of this kind of thing as we go along.
On to something more concrete. Let’s look at an abbreviated picture of a normal distribution and see what we can learn (it is abbreviated because we cannot picture the whole thing). The point m is the central point, and is the most likely value (not forgetting that no single value is actually possible; just weird, right?); m plus or minus s contains about 68% of all possible values of x, and m plus or minus about 2 times s contains about 95% of all possible values of x.
(see the book)
What does this mean? Specifically, that Pr(m − s < x < m + s | m, s, E_N) ≈ 0.68 and Pr(m − 2s < x < m + 2s | m, s, E_N) ≈ 0.95.
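Interval statements like these are easy to evaluate with the normal's cumulative distribution function. The Python sketch below uses the standard library's `statistics.NormalDist`; the parameter values m = 18 and s = 4 are hypothetical, picked only to have something to compute with, and `interval_prob` is a helper name made up for the example.

```python
from statistics import NormalDist

def interval_prob(dist, a, b):
    """Pr(a < x < b): the area under the density between a and b."""
    return dist.cdf(b) - dist.cdf(a)

# Hypothetical parameters for tomorrow's high temperature, in degrees C
m, s = 18.0, 4.0
temp = NormalDist(mu=m, sigma=s)

within_1s = interval_prob(temp, m - s, m + s)          # about 0.68
within_2s = interval_prob(temp, m - 2 * s, m + 2 * s)  # about 0.95

# The normal gives positive probability to the open gap between two
# adjacent tenths of a degree, where no measurable temperature can fall.
gap = interval_prob(temp, 20.0, 20.1)

print(f"Pr(m - s  < x < m + s)  = {within_1s:.4f}")
print(f"Pr(m - 2s < x < m + 2s) = {within_2s:.4f}")
print(f"Pr(20 < y < 20.1)       = {gap:.4f}")
```

The size of the gap probability depends entirely on the assumed m and s, so it will not match the 0.0001 used as an illustration in the text; the point is only that it is greater than 0.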
The normal is symmetric about the point m: meaning there is equal probability of being above m as being below it. Incidentally, the expected value of x is m, and the variance s^2, which is easy to remember. You might have noticed that there is no y-axis on the picture. This is to remind you that the curve itself does not represent probability. The (missing) y-axis is instead something called the density, which is the continuous-number equivalent of probability. It cannot picture the probability, because the probability of x equaling any number is 0. Because of this, these pictures are only useful for estimating probability by area under the curve. The entire area under the curve equals 1, just as it did with the coin flip and dice examples, because this picture shows the probability of every single thing that can happen, and every thing that can happen is any number between −∞ and +∞. Since the actual value of x will take place somewhere in this interval, the area must equal 1.
An example: what is Pr(x < m | m, s, E_N)? By symmetry, it is 0.5, because it is half the area under the curve, from the point m to −∞ (everything to the left of the thin, dotted curve starting at m). Another example, Pr(x > m + 2s | m, s, E_N), which we know is about 0.025, is that tiny area from the point m + 2s to +∞ (everything to the right of the thin, dotted curve starting at m + 2s). Why do we know it is about 0.025? Because the probability of numbers in the interval (m − 2s, m + 2s) is about 0.95, which leaves about 0.05 probability of numbers outside that interval. Then, since the normal is symmetric, this leaves about 0.025 probability for numbers less than m − 2s and about 0.025 for numbers greater than m + 2s.
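The two area examples can be verified numerically. This Python sketch again leans on `statistics.NormalDist` from the standard library; because the answers do not depend on the particular m and s, it uses m = 0 and s = 1 as a convenient (assumed) choice.

```python
from statistics import NormalDist

m, s = 0.0, 1.0  # any m and any positive s give the same areas
dist = NormalDist(mu=m, sigma=s)

below_m = dist.cdf(m)                      # area to the left of m
upper_tail = 1.0 - dist.cdf(m + 2 * s)     # area to the right of m + 2s
lower_tail = dist.cdf(m - 2 * s)           # area to the left of m - 2s

# Multiple of s that leaves exactly 0.025 in the upper tail
exact_multiple = NormalDist().inv_cdf(0.975)

print(f"Pr(x < m)      = {below_m:.3f}")
print(f"Pr(x > m + 2s) = {upper_tail:.4f}")
print(f"Pr(x < m - 2s) = {lower_tail:.4f}")
print(f"multiple of s for a 0.025 tail = {exact_multiple:.2f}")
```

The tail beyond exactly 2s comes out nearer 0.023 than 0.025; the 0.025 figure corresponds to roughly 1.96 times s, which is why the text says "about 2 times s" and "about 0.025".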
One more picture showing two normal distributions. The curve to the left has a smaller variance parameter and central parameter than the one to the right. The area under either curve is 1 (it is always equal to 1 for any distribution). Notice that the one with the larger variance parameter is wider, meaning you are less certain about the values of x. Obviously, then, smaller variance parameters mean you are more certain, in the sense that most of the probability is for a narrower range of xs.
(see the book)
If you quantified the uncertainty in x using the distribution to the left, and I used the one to the right, which of us thinks there is a higher probability of large values of x? For example, pick the point indicated by the dotted vertical line. Obviously, I am more certain that I will see values this large or larger because the area under the curve to the right of the dotted line is larger for me than for you. You can see that we can answer lots of questions like this by reference to pictures. Next Chapter, we’ll learn how to do this on a computer.
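As a preview of that kind of calculation, here is a rough Python sketch (not the book's own software) comparing the upper-tail areas of two normals. All of the numbers are hypothetical stand-ins: the picture's actual parameters and the position of the dotted line are not given in this teaser, so the values below were chosen only so that "yours" is the narrower curve on the left.

```python
from statistics import NormalDist

# Hypothetical parameters: "yours" has the smaller central and
# variance parameters, "mine" the larger ones.
yours = NormalDist(mu=10.0, sigma=2.0)
mine = NormalDist(mu=14.0, sigma=4.0)

cutoff = 16.0  # stands in for the dotted vertical line

def upper_tail(dist, value):
    """Pr(x > value): area under the curve to the right of value."""
    return 1.0 - dist.cdf(value)

p_yours = upper_tail(yours, cutoff)
p_mine = upper_tail(mine, cutoff)

print(f"your Pr(x > {cutoff}) = {p_yours:.4f}")
print(f"my   Pr(x > {cutoff}) = {p_mine:.4f}")
```

Whatever the real parameters are, the comparison works the same way: each of us computes the area to the right of the line under our own curve, and the larger area marks the person who gives large values of x more probability.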
Nonsense alert! You sometimes hear this, “Our observation was drawn from a normal distribution” (with given parameter values). When this person says “drawn” they do not mean they drew a picture like we just did. They instead intend that nature (the modern equivalent of a deity) “randomly generated” the observation using a normal. Somehow, through randomness, the value appeared, almost as if by magic. Well, dear reader, something caused that observation to take the value it did. If you knew the exact causal mechanism, including the initial conditions, the starting point that gives rise to a particular value, then you would have known in advance exactly what x would be. However, just because you do not know the cause does not mean it did not exist. If you want clarification of this, see the gorgeous Chapter 10 of Jaynes (2003) wherein he discusses the physics and probability of a coin flip.