Stats 101: Chapter 8
This is where it starts to get complicated; this is where old school and new school statistics start diverging. And I don’t even start on the new new school.
Parameters are defined and then heavily deemphasized. Nearly the entire purpose of old and new school statistics is devoted to unobservable parameters. This is very unfortunate, because people go away from a parameter analysis far, far too certain about what is of real interest, which is to say, observable data. New new school statistics acknowledges this, but not until Chap 9.
Confidence intervals are introduced and fully disparaged. Few people can remember that a confidence interval has no meaning, which is a polite way of saying it is meaningless. In finite samples of data, that is, which are the only samples I know about. The key bit of fun is summarized. You can only make one statement about your confidence interval, i.e. the interval you created using your observed data, and it is this: this interval either contains the true value of the parameter or it does not. Isn’t that exciting?
Some, or all, of the Greek letters below might not show up on your screen. Sorry about that. I haven’t the time to make the blog posting look as pretty as the PDF file. Consider this, as always, a teaser.
For more fun, read the chapter: Here is the link.
CHAPTER 8
Estimating
1. Background
Let’s go back to the petanque example, where we wanted to quantify our uncertainty in the distance x the boule landed from the cochonette. We approximated this using a normal distribution with parameters m = 0 cm and s = 10 cm. With these parameters in hand, we could easily quantify uncertainty in questions like X = “The boule will land at least 17 cm away” with the formula Pr(X|m = 0 cm, s = 10 cm, EN) = Pr(x > 17 cm|m = 0 cm, s = 10 cm, EN). R even gave us the number with 1-pnorm(17,0,10) (about 4.5%). But where did the values of m = 0 cm and s = 10 cm come from?
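Here is that calculation spelled out as a small sketch. The variable names m, s, p_upper and p_upper_direct are mine, chosen only for illustration; the numbers are the made-up parameter values from the text.

```r
# Parameters assumed known, as in the petanque example
m <- 0    # central parameter, in cm
s <- 10   # spread parameter, in cm

# Pr(x > 17 cm | m, s, EN): the upper tail of the normal distribution
p_upper <- 1 - pnorm(17, mean = m, sd = s)

# The same quantity, asking pnorm for the upper tail directly
p_upper_direct <- pnorm(17, mean = m, sd = s, lower.tail = FALSE)

print(round(p_upper, 3))   # about 0.045, i.e. roughly 4.5%
```

Both forms should agree; the second simply saves the subtraction.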
I made them up.
It was easy to compute the probability of statements like X when we knew the probability distribution quantifying its uncertainty and the value of that distribution’s parameters. In the petanque example, this meant knowing that EN was true and also knowing the values of m and s. Here, knowing means just what it says: knowing for certain. But most of the time we do not know EN is true, nor do we know the values of m and s. In this Chapter, we will assume we do in fact know EN is true. We won’t question that assumption until a few Chapters down the road. But, even given EN is true, we still have to discern the values of its parameters somehow.
So how do we learn what these values are? There are some situations where we are able to deduce either some or all of the parameters’ values, but these situations are shockingly few in number. Nearly all the time, we are forced to guess. Now, if we do guess (and there is nothing wrong with guessing when you do not know), it should be clear that we will not be certain that the values we guessed are the correct ones. That is to say, we will be uncertain, and when we are uncertain what do we do? We quantify our uncertainty using probability.
At least, that is what we do nowadays. But then-a-days, people did not quantify their uncertainty in the guesses they made. They just made the guesses, said some odd things, and then stopped. We will not stop. We will quantify our uncertainty in the parameters and then go back to what is of main interest, questions like what is the probability that X is true? X is called an observable, in the sense that it is a statement about an observable number x, in this case an actual, measurable distance. We do not care about the parameter values per se. We need to make a guess at them, yes, otherwise we could not get the probability of X. But the fact that a parameter has a particular value is usually not of great interest.
It isn’t of tremendous interest nowadays, but again, then-a-days, it was the only interest. Like I said, people developed a method to guess the parameter values, made the guess, then stopped. This has led people to be far too certain of themselves, because it’s easy to get confused about the values of the parameters and the values of the observables. And when I tell you that then-a-days was only as far away as yesterday, you might start to be concerned.
Nearly all of classical statistics, and most of Bayesian statistics, is concerned with parameters. The advantage the latter method has over the former is that Bayesian statistics acknowledges the uncertainty in the parameter guesses and quantifies that uncertainty using probability. Classical statistics (still the dominant method in use by non-statisticians) makes some bizarre statements in order to avoid directly mentioning uncertainty. Since classical statistics is ubiquitous, you will have to learn these methods so you can understand the claims people (attempt to) make.
So we start with making guesses about parameters in both the old and new ways. After we finish with that, we will return to reality and talk about observables.
2. Parameters and Observables
Here is the situation: you have never heard of petanque before and do not know a boule from a bowl from a hole in the ground. You know that you have to quantify x, which is some kind of distance. You are assuming that EN is true, and so you know you have to specify m and s before you can make a guess about any value of x.
Before we get too far, let’s set up the problem. When we know the values of the parameters, like we have so far, we write them in Latin letters, like m and s for the Normal, or p for the binomial. We always write unknown and unobservable parameters as Greek letters, usually μ and σ for the normal and θ for the binomial. Here is the normal distribution (density function) written with unknown parameters:
(see the book)
where μ is the central parameter, and σ² is the variance parameter, and where the equation is written as a function of the two unknowns, N(μ, σ). This emphasizes that we have a different uncertainty in x for every possible value of μ and σ (it makes no difference if we talk of σ or σ², one is just the square root of the other).
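To see what “a different uncertainty in x for every possible value” of the parameters means, here is a minimal sketch. The candidate values for mu and sigma are hypothetical, picked only for illustration, and the 17 cm question is borrowed from the Background section.

```r
# Hypothetical candidate values for the unknown parameters (illustration only)
candidates <- expand.grid(mu = c(0, 5), sigma = c(5, 10, 20))

# Pr(x > 17 cm | mu, sigma, EN) for each candidate pair of parameter values
candidates$prob_over_17 <- pnorm(17, mean = candidates$mu,
                                 sd = candidates$sigma, lower.tail = FALSE)

print(candidates)
```

Each row gives a different probability for the same observable statement, which is why not knowing the parameters matters.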
You may have wondered what was meant by that phrase “unobservable parameters” last paragraph (if not, you should have wondered). Here is a key fact that you must always remember: not you, not me, not anybody, can ever measure the value of a parameter (of a probability distribution). They simply cannot be seen. We cannot even see the parameters when we know their values. Parameters do not exist in nature as physical, measurable entities. If you like, you can think of them as guides for helping us understand the uncertainty of observables. We can, for example, observe the distance the boule lands from the cochonette. We cannot, however, observe the m even if we know its value, and we cannot observe μ either. Observables, the reason for creating the probability distributions in the first place, must always be of primary interest for this reason.
So how do we learn about the parameters if we cannot observe them? Usually, we have some past data, past values of x, that we can use to tell us something about that distribution’s parameters. The information we gather about the parameters then tells us something about data we have not yet seen, which is usually future data. For example, suppose we have gathered the results of hundreds, say 200, of past throws of boules. What can we say about this past data? We can calculate the arithmetic mean of it, the median, the various quantiles and so on. We can say this many throws were greater than 20 cm, this many less. We can calculate any function of the observed data we want (means and medians etc. are just functions of the data), and we can make all these calculations never knowing, or even needing to know, what the parameter values are. Let me be clear: we can make just about any statement we want about the past observed data and we never need to know the parameter values! What possible good are they if all we wanted to know was about the past data?
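A minimal sketch of those calculations follows. I have no real throws to hand, so x_old below is simulated and is only a stand-in for 200 past measurements; the name x_old and the simulation settings are assumptions for illustration. Notice that none of the summaries asks for a parameter value.

```r
set.seed(1)  # makes the simulated numbers reproducible

# Stand-in for 200 past throws, in cm. These are simulated, NOT real data.
x_old <- rnorm(200, mean = 0, sd = 10)

# Functions of the observed data: no parameter values needed for any of these
mean(x_old)
median(x_old)
quantile(x_old, c(0.25, 0.75))

# How many past throws were greater than 20 cm, and how many were not
n_over  <- sum(x_old > 20)
n_under <- sum(x_old <= 20)
print(c(over = n_over, under = n_under))
```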
There is only one reason to learn anything about the parameters. This is to make statements about future data (or to make statements about data that we have not yet seen, though that data may be old; we just haven’t seen it yet; say archaeological data; all that matters is that the data is unknown to you; and what does “unknown” mean?). That is it. Take your time to understand this. We have, in hand, a collection of data xold, and we know we can compute any function (mean etc.) we want of it, but we know we will, at some time, see new data xnew (data we have not yet seen), and we want to now say something about this xnew. We want to quantify our uncertainty in xnew, and to do that we need a probability distribution, and a probability distribution needs parameters.
The main point again: we use old data to make statements about data we have not yet seen.
Stats 101: Chapter 7
Update #2. I moronically uploaded a blank document. I have no idea how. It’s all better now.
Update. I idiotically forgot to put a link. Here it is.
Chapter 7 is Reality. This is usually Chapter 1 in most intro stats books. Those other books invariably start students with topics like “measures of central tendency” and “kinds of experiments” etc. Nothing necessarily wrong with any of this, but the student usually has no idea why he should care about “central tendency” in the first place. Why memorize formulas for means and (population or other) standard deviations? What use are these things in understanding how to quantify uncertainty?
So I put these topics off until the reader realizes that understanding uncertainty is paramount. The whole chapter is nuts and bolts about how to read data into R and do some elementary manipulations. Like Chapter 5, it’s not thrilling reading, but necessary. The homework for 7 asks readers to download a set of R functions at https://www.wmbriggs.com/book/Rcode.R, but it’s not there yet because I’m still polishing the code.
Some of the formatting is off in the LaTeX source, but I won’t fix that until I’m happy with the final text. No pictures are here; all are in the book.
CHAPTER 7
Reality
1. Kinds of data
Somewhere, sometime, somehow, somebody is going to ask you to create some kind of data set (that time is sooner than you think; see the homework). Here is an example of such a set, written as you might see it in a spreadsheet (a good, free open-source spreadsheet is Open Office, www.openoffice.org):
Q1    | … | Sex | Income | Nodules | Ridiculous
rust  | … | M   | 10     | 7       | Y
taupe | … | F   |        | 3       | N
…
ochre | … | F   | 12     | 2       | Y
This data is part of a survey asking people their favorite colors (Q1), while recording their sex, annual income, the number of sub-occipital nodules on their brain, and whether or not the interviewee thought the subject ridiculous or not. There is a lot we can learn from this simple fragment.
The first is to always use full, readable, English names for the variables. What about Q1, which was indeed the first question on the survey? Why not just call it “Q1”? “Q1” is a lot easier to type than “favorite color”. Believe me, two weeks after you store this data, you will not, no matter how much you swear you will, remember that Q1 was favorite color. Neither will anybody else. And nobody will be able to guess that Q1 means favorite color.
Can you suggest a better name? How about “favcol”, which has fewer letters than “favorite color”, and is therefore easier to type? What are you, lazy? You can’t type a few extra letters to save yourself a lot of grief later on?
How about just “favorite color”? Well, not so good either, because why? Because of that space between “favorite” and “color”; most software cannot handle spaces in names. Alternatives are to put an underscore or a period between the words: “favorite_color” or “favorite.color”. Some people like to cram the words together camel style, like “favoriteColor” (the occasional bump of capital letters is supposed to look like a camel: I didn’t name it). Whichever style you choose, be consistent! In any case, nobody will have any trouble understanding that “favoriteColor” means “favorite color”.
Notice, too, that the colors entered under “Q1” use the full English name for the color. Spaces are OK in the actual data, just not in variable names: for example, “burnt orange” is fine. Do not do what many sad people do and use a code for the colors. For example, 1=taupe, 2=envy green, 3=fuchsia, etc. What are you trying to do with a code anyway? Hide your work from Nazi spies? Never use codes.
That goes for variables like “Sex”, too. I cannot tell you how many times I have opened up a data set where I have seen Sex coded as “1” and “2”, or “0” and “1”. How can anybody remember which number was which sex? They cannot. And there is no reason to. With data like this, abbreviation is harmless. Nobody, except for the politically correct, will confuse the fact that “M” means male and “F” female. But if you are worried about it, then type out the whole thing.
Similarly for “Ridiculous”, where I have used the abbreviation “Y” for yes and “N” for no. Sometimes a “0” and “1” for “N” and “Y” are acceptable. For example, in the data set we’ll use in a moment, “Vomiting” is coded that way. And, after all, 0/1 is the binary no/yes of computer language, so this is OK. But if there is the least chance of ambiguity for a data value, type the whole answer out. Do not be lazy; you will be saving yourself time later.
It should be obvious, but store numbers as numbers. Height, weight, income, age, etc., etc. Do not use any symbols with the numbers. Store a weight as “213” and not “213 lbs”. If you are worried you will forget that weight is in pounds, name the variable Weight.LBS or something similar.
What if one of your interviewees refused to answer a question? This will often happen for questions like “Income”. How should you code that? Leave his answer blank! For God’s sake, whatever you do, do not think you are being clever and put in some mystery code that, to you, means “missing.” I have seen countless times where somebody thought that putting in a “99” or a “999” for a missing income was a good idea. The computer does not know that 999 means “missing”; it thinks it is just what it looks like: the number 999. So when you compute an average income, that 999 becomes part of the average. Also don’t use a period, the full stop. That’s a holdover from an ancient piece of software (that some people are still forced to use).
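A small sketch of what goes wrong, using the three incomes from the survey fragment above (10, blank, 12). The variable names are mine, and the tiny CSV text is a hypothetical stand-in for a real file.

```r
# The same three incomes, once with a mystery code and once with a true blank
income_coded <- c(10, 999, 12)   # 999 was meant as "missing"
income_blank <- c(10, NA, 12)    # NA is how R represents a missing value

mean_coded <- mean(income_coded)                 # the 999 is averaged in
mean_blank <- mean(income_blank, na.rm = TRUE)   # the missing value is skipped
print(c(coded = mean_coded, blank = mean_blank))

# A blank cell in a numeric column of a CSV file is read in as NA
survey <- read.csv(text = "Sex,Income\nM,10\nF,\nF,12")
print(is.na(survey$Income))
```

The coded version gives an average income of over 340; the blank version gives 11.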
There are times when an answer is purposely missing, and a blank should not be used. For example, if “Income” is less than 20000, then the interviewee gets an extra question that people who make more than 20000 do not get. Usually, this kind of rule can be handled trivially in the analysis, but if you want to show that somebody should not have answered and not that they did not answer, then use a code such as “PM” for “purposely missing”. Even better would be to write “purposely missing”, so that somebody who is looking at your data three months down the road doesn’t have to expend a great deal of energy on interpreting what “PM” means.
Try to use a real database to store your data, and keep away from spreadsheets if you can. A real database can be coded so that all possible responses for a variable like “Race” are pre-coded, eliminating the chance of typos, which are certain to occur in spreadsheets.
Here’s something you don’t often get from those other textbooks, but which is a great truth. You will spend from 80 to 90% of your time in any statistical analysis just getting the data into a form readable by you and your software. This may sound like the kind of thing you often hear from teachers, while you think to yourself, “Ho, ho, ho. He has to tell us things like that just to give us something to worry about. But it’s a ridiculous exaggeration. I’ll either (a) spend 10-15% of my time, or (b) have somebody do it for me.” I am here to tell you that the answers to these are (a) there is no known way in the universe for this to be true, and (b) Ha ha ha!
2. Databases
The absolute best thing to do is to store your data in a database. I often use the free and open source MySQL (.com, of course). Knowing how to design, set up, and use such a database is beyond what most people want to do on their own. So most, at least for simple studies, opt for spreadsheets. These can be fine, though they are prone to error, usually typos. For instance, the codings “Y” and “Y ” might look the same to you, but they are different inside a computer: one has a space, one doesn’t. The computer thinks these are as different as “Q” and “W”. This kind of typo is extraordinarily common because you cannot see blank spaces easily on a computer screen. To see if you have suffered from it, after you get your data into R type levels(my variable name) and each of the levels, like “Y” and “Y ”, will be displayed. If you see something like this, you’ll have to go back to your spreadsheet and locate the offending entries and correct them.
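Here is a minimal sketch of that check on a made-up variable (the name ridiculous and its values are hypothetical). One assumption worth flagging: in R 4.0 and later, read.csv keeps text columns as plain character strings by default, so levels() returns NULL for them; wrapping the column in factor(), or using unique(), shows the same thing.

```r
# Hypothetical answers, one of which has a stray trailing space
ridiculous <- factor(c("Y", "N", "Y ", "N", "Y"))

# Three levels show up where only two were intended
print(levels(ridiculous))

# One possible repair inside R, as an alternative to fixing the spreadsheet:
# strip leading and trailing spaces, then rebuild the factor
cleaned <- factor(trimws(as.character(ridiculous)))
print(levels(cleaned))
```

Fixing the spreadsheet itself, as described above, has the advantage that the stored data is corrected for everybody who uses it later.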
A lot of overhead is built into spreadsheets. Most of it has to do with prettifying the rows and columns: bold headings, colored backgrounds, and so on. Absolutely none of this does anything for the statistical analysis, so we have to simplify the spreadsheet a bit.
The most common way to do this is to save the spreadsheet as a CSV file. CSV stands for Comma Separated Values. It means exactly what it says. The values from the spreadsheet are saved to an ordinary text file (ASCII file), and each column is separated by a comma. An example from one row from the dataset we’ll be using is
0,0,0,0,39,"black","male","Y",17.1,80,102.4,0
Note the clever insertion of commas between each value.
What this means is that you cannot actually use commas in your data. For example, you cannot store an income value as “10,000”; instead, you should use “10000”. Also note that there is no dollar sign.
Now, in some countries, where the tendrils of modern society have not yet reached, people unfortunately routinely use commas in place of decimal points. Thus, “3.42” written here is “3,42” written there. You obviously cannot save the latter in a CSV file because the computer will think that comma in “3,42” is one of the commas that separates the values, which it is not. The way to overcome this without having to change the data is to change the delimiter to something other than a comma; perhaps a semicolon or a pound sign; any kind of symbol which you know won’t be in the regular data. For example, if you used an @ symbol, your CSV file would look like
0@0@0@0@39@"black"@"male"@"Y"@17.1@80@102.4@0
The only trick will be figuring out how to do this. In Open Office, it’s particularly easy: after opening up the spreadsheet and selecting “Save As”, select the box “Edit Filter settings” and choose your own symbol instead of the default comma. A common mistake is to type an entry into, say, an Opinion variable, where a person’s exact words are the answer. Guard against using a comma in these words else the computer will think you have extra variables: the computer thinks there is a variable between each comma.
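On the R side, reading such a file back is a matter of telling read.csv which delimiter was used. A minimal sketch, assuming the @-delimited row shown above and a made-up semicolon-delimited line with decimal commas; the object names are mine, and the text= argument is used only so the example needs no file on disk.

```r
# The @-delimited example row from the text, read with sep = "@"
line_at <- '0@0@0@0@39@"black"@"male"@"Y"@17.1@80@102.4@0'
row_at <- read.csv(text = line_at, sep = "@", header = FALSE)
print(ncol(row_at))   # one column per value

# Hypothetical data written with decimal commas and a semicolon delimiter
line_semi <- "3,42;1,5"
row_semi <- read.csv(text = line_semi, sep = ";", dec = ",", header = FALSE)
print(row_semi)
```

With a real file, the file path simply replaces the text= argument.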
3. Summaries
It’s finally time to play with real data. This is, in my experience, another panic point. But it need not be. Just take your time and follow each step. It is quite easy.
The first trick is to download the data onto your computer. Go to the book website and download the file appendicitis.csv and save it somewhere on your hard disk in a place you can remember. The place where it is is called the path. That is, your hard drive has a sort of hierarchy, a map where the files are stored. If you are on a Windows machine, this is usually the C:/ drive (yes, the slash is backwards on purpose, because R thinks like a Linux computer, or Apple, which has the slashes the other way). Create your own directory, say, mydata (do not put a space in the name of the folder), and put the appendicitis file there. So the path to the file is C:/mydata/appendicitis.csv. Easy, right? If you are on a Linux or Mac, it’s the same idea. The path on a Mac is usually something like /Users/YOURNAME/mydata/appendicitis.csv. On a Linux box it might be /home/YOURNAME/mydata/appendicitis.csv. Simple!
Open R. Then type this exact command:
x = read.csv(url("https://www.wmbriggs.com/book/appendicitis.csv"))
There is a lot going on here, so let’s go through it step by step. Ignore the x = bit for a moment and concentrate on the part that reads read.csv(...). This built-in R function reads a CSV file. Well, what else would you have expected from its name? Inside that function is another one called url(), whose argument is the same thing you type into any web browser. The thing you type is called the URL, the Uniform Resource Locator, or web address. What we are doing is telling R to read a CSV file directly off the web. Pretty neat!
If you had saved the file directly to your hard drive, you would have loaded it like this
x = read.csv("C:/mydata/appendicitis.csv")
where you have to substitute the correct path, but otherwise it is just as easy.
The last thing to know is that when the CSV file is read in it is stored in R’s memory in the object I called x. R calls these objects data frames. Why didn’t they call them data sets? I have no idea. How did I know to use an x, why did I choose that name to store my data? No reason at all except habit. You can call the dataset anything you want. Call it mydata if you want. It just doesn’t matter.
Now type just x and hit enter. You’ll see all the data scroll by. Too much to look at, so let’s summarize it:
summary(x)
This is data taken on patients admitted to an emergency room with right lower quadrant pain (in the area the appendix is located) in order to find a model to better predict appendicitis (Birkhahn et al., 2006). Each of the variables was thought to have some bearing on this question. We’ll talk more about this data later. Right now, we’re just playing around. When we run the command we get the summary statistics for each variable in x. What it shows is the mean, which is just the arithmetic average of the data, the median, which is the point at which 50% of the data values are larger and 50% smaller, the 1st Qu., which is the first quartile and is the point at which 25% of the data values are smaller, the 3rd Qu., which is the third quartile and is the point at which 75% of the data values are smaller (and 25% are larger, right?). Also given is the Min., which is the minimum value, and Max., which is the maximum. Last is NA’s, which is the number, if any, of missing values. These kinds of statistics only show for data coded as numbers, i.e. numerical data. For data that is textual, also called categorical or factorial data, the first few levels of categories are shown with a count of the number of rows (observations) that are in that category.
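To see those pieces on something small enough to check by hand, here is a sketch on made-up numbers (the vectors v and sex are hypothetical and have nothing to do with the appendicitis data).

```r
# Five made-up numerical values, one of them missing
v <- c(2, 4, 6, 8, NA)

# Min., 1st Qu., Median, Mean, 3rd Qu., Max. and the count of NA's
s <- summary(v)
print(s)

# For textual (categorical) data, the summary is a count per category
sex <- factor(c("male", "female", "male"))
print(summary(sex))
```

With the missing value set aside, the median and the mean of v both come out to 5.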
You will notice that variables like Pregnancy are not treated as categorical, but as numerical, which is why we see the statistics and not a category count. Pregnancy is a 0/1 variable and is technically categorical; however, like I said above, it is obvious that “0” means “not pregnant”, so there is no ambiguity. The advantage to storing data in this way is that the numerical mean is then the proportion of people having Pregnancy = 1 (think about this!).
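If thinking about it does not do the trick, a tiny sketch might. The eight values below are made up for illustration and are not from the appendicitis data.

```r
# Hypothetical 0/1 answers for eight patients (1 = pregnant, 0 = not)
pregnancy <- c(0, 1, 0, 0, 1, 0, 0, 0)

# The arithmetic mean of a 0/1 variable...
mean_pregnancy <- mean(pregnancy)

# ...is the number of 1s divided by the number of patients
prop_pregnancy <- sum(pregnancy == 1) / length(pregnancy)

print(c(mean = mean_pregnancy, proportion = prop_pregnancy))   # both 0.25
```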
Let’s just look at the variable Age for now. It turns out we can apply the summary function on individual variables, and not just on data frames. Inside the computer, the variable age is different than Age (why?). So try summary(Age). What happens? You get the error message Error in summary(Age) : object "Age" not found. But it’s certainly there!
You can read lots of different datasets into R at the same time, which is very convenient. I work on a lot of medical datasets and every one of them has the variable Age. How does R know which Age belongs to which dataset? By only recognizing one dataset at a time, through the mechanism of attaching the dataset directly to memory, to R’s internal search path. To attach a dataset, type
attach(x)
Yes, this is painful to remember, but necessary to keep different datasets separate. Anyway, try summary(Age) again (by using the up arrow on your keyboard to recall previously typed commands) and you’ll see it works.
Incidentally, summary is one of those functions that you can always try on anything in R. You can’t break anything, so there is no harm in giving it a go.
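For what it is worth, there are other ways to tell R which dataset a variable lives in without attaching anything. A minimal sketch, assuming a small made-up data frame called toy (not the appendicitis data); whether you prefer this to attach is a matter of taste.

```r
# A made-up data frame standing in for a real dataset
toy <- data.frame(Age = c(39, 25, 61),
                  Sex = c("male", "female", "male"))

# Name the data frame explicitly with the $ operator
age_summary_dollar <- summary(toy$Age)

# Or evaluate the expression "inside" the data frame with with()
age_summary_with <- with(toy, summary(Age))

print(age_summary_dollar)
```

Both forms give the same summary, and neither changes R’s search path.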