# The only two reasons for statistics

This short post is for reference. I will point back to it from time to time.

## Reason 1: to say something about the past

Examples: counting seasonal numbers of wins by the Detroit Tigers, or the number of Republican state senators, or how many people you had over last Christmas.

All are raw numbers, counts, tallies, collected to say something about a historical circumstance and for no other reason.

No probability models are needed here, or they are all trivial. For example: what is the probability the Tigers won more than 90 games in 2008? It is either 0 or 1, according to whether or not they won more than 90 games (they did not, so it is 0).

In order to say something about the past—about data we have already collected—we just need to look and count and nothing more.

Most sports statistics fit here, as do other areas of trivia. Any kind of record keeping counts.

## Reason 2: to say something about things not yet seen

If you have not yet seen a thing, you are uncertain about what state that thing will take.

If you are uncertain, you quantify that uncertainty using probability. All probability statements are conditional on some evidence.

Evidence usually consists of two things: (1) historical data and a probability model that accounts for that data plus (2) the probability model said to explain the thing we have not yet seen.

(1) and (2) are frequently the same; sometimes we do not need (1); we always need (2).

For example, given just the evidence that “This is a six-sided die, and just one side is labeled a 3” then the probability of the thing “We see a 3 when the die is tossed” is 1/6. No historical data was needed to make this statement.
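A minimal sketch of that deduction in Python. It assumes that, because the evidence does not distinguish one side from another, each side gets equal weight. The function name `probability_of_label` is made up for illustration.

```python
from fractions import Fraction

def probability_of_label(sides, label):
    """Probability of seeing `label`, deduced only from the side labels.

    Assumption: the evidence gives no reason to favor any side, so each
    side gets equal weight. No historical tosses are used.
    """
    matching = sum(1 for side in sides if side == label)
    return Fraction(matching, len(sides))

# Evidence: a six-sided die, and just one side is labeled 3.
die = [1, 2, 3, 4, 5, 6]
p_three = probability_of_label(die, 3)
print(p_three)  # 1/6
```

Change the evidence (say, two sides labeled 3) and the deduced probability changes with it, which is the sense in which the statement is conditional on the evidence.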

To quantify the probability of other unseen things, historical data is typically used. For example, the thing “The Detroit Tigers will win more than 90 games in 2009” is unknown as yet. To say what the probability of it is, we can collect historical data, assign a probability model to it, and then make a quantification.

More than one probability model can be assigned to the historical data and the thing. This leads to two consequences, both crucial to remember.

(a) If the evidence that implies which probability model to use is ambiguous, then the evidence that leads to the model you use should be made explicit; and

(b) The probability statements made by conflicting models are all correct (assuming no computational errors, of course).

If model A says the probability of a thing is x and model B says it has a probability of y, and x does not equal y, neither probability is wrong before we see the thing.

After we have seen the thing, we can compute the probability that model A or model B is correct.
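One way to carry out that computation is Bayes' rule. The sketch below is only an illustration under two assumptions not stated above: exactly one of the two models is correct, and before seeing the thing each model gets a prior weight of 0.5. The function name and the numbers are made up.

```python
def model_posteriors(prob_a, prob_b, happened, prior_a=0.5):
    """Posterior probability of each model after the thing is observed.

    prob_a, prob_b: probability each model gave to the thing happening.
    happened: True if the thing occurred, False otherwise.
    prior_a: prior weight on model A (assumed 0.5; model B gets the rest).
    Assumption: exactly one of the two models is correct.
    """
    # Likelihood of what was actually seen under each model.
    like_a = prob_a if happened else 1.0 - prob_a
    like_b = prob_b if happened else 1.0 - prob_b
    evidence = prior_a * like_a + (1.0 - prior_a) * like_b
    post_a = prior_a * like_a / evidence
    return post_a, 1.0 - post_a

# Illustrative numbers only: model A says x = 0.05, model B says y = 0.20,
# and the thing did not happen.
post_a, post_b = model_posteriors(0.05, 0.20, happened=False)
print(round(post_a, 3), round(post_b, 3))  # 0.543 0.457
```

A single observation moves the weights only a little here, because both models gave the observed outcome a high probability.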

All that is found in statistics books falls under this branch. Anytime a prediction, or forecast, or prognostication is made, it is this type of statistics.

To specify a probability model means specifying the value of certain parameters. In the die example, the value of the parameter was deduced. In models that use historical data, most or all parameters cannot be specified exactly and usually remain unknown to some extent.

Do not be fooled by the fact that most statistical procedures revolve around finding estimates of the parameters of probability models. These estimates are not necessary and are at best proxies for what is of interest: real, tangible, observable things.

Modern statistical methods are designed to make probability statements about observable things (like the number of Tigers wins) in such a way that the uncertainty in the parameters is accounted for.
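As a rough illustration of the difference, the sketch below compares a plug-in estimate with a predictive probability that carries the parameter uncertainty along. Everything specific in it is an assumption made for the example, not something taken from this post: games are treated as exchangeable win/loss trials with one unknown win rate, the prior on that rate is a flat Beta(1, 1), and the past record of 74 wins and 88 losses is an illustrative input.

```python
from math import exp, lgamma, log

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def predictive_pmf(k, n, a, b):
    """Beta-binomial: P(k wins in n games) when the win rate is Beta(a, b)."""
    return exp(log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b))

def plug_in_pmf(k, n, p):
    """Binomial: P(k wins in n games) with the win rate fixed at p (0 < p < 1)."""
    return exp(log_choose(n, k) + k * log(p) + (n - k) * log(1.0 - p))

# Illustrative inputs (assumptions, see the text above).
past_wins, past_losses = 74, 88
games = 162

# Flat Beta(1, 1) prior updated by the past record.
a, b = 1 + past_wins, 1 + past_losses

# P(more than 90 wins) with the parameter uncertainty accounted for ...
p_predictive = sum(predictive_pmf(k, games, a, b) for k in range(91, games + 1))

# ... and with the win rate replaced by a single point estimate.
p_hat = past_wins / (past_wins + past_losses)
p_plug_in = sum(plug_in_pmf(k, games, p_hat) for k in range(91, games + 1))

print(p_predictive, p_plug_in)
```

Both numbers are statements about an observable (the number of wins), not about the parameter. The plug-in version comes out smaller because it pretends the win rate is known exactly.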

## Example

Suppose you have observed global mean temperatures (suppose, too, this quantity is unambiguously and suitably defined) from 1900 up through 2009. What branch of statistics can answer the following:

(i) What is the probability the temperature increased from 1900?

(ii) What is the probability that the temperature in 2009 will be larger than that in 2008?

———————————-

If anything above is ambiguous, let me know and I’ll fix it. In a big hurry today.

Where does the present come into it?

2. Briggs

Assuming you’re not teasing me…

The second regime can also be stated: to quantify the value of things not yet known. This covers things that have happened in the past but whose outcome we do not yet know, things that might happen in the future, and things that are happening now but whose outcome we do not yet know. There is no explicit time constraint for the second regime. Only for the first.

3. George

Which present? (I’d like a present.)

What about analyzing past data to find anomalies – unexpected trends not satisfactorily explained by existing models? I guess this applies to things like data dredges, as well as generally working out how confident you are that a model works well based on existing observed data.

4. Briggs

George,

Excellent question. It is useless in the following sense.

There are an infinite number of models to explain any past set of data. Unless you can deduce which model best quantifies the uncertainty in that data, you have to pick one of them. Which one do you use to say “The probability of the observed data is low”?

The model you pick must then be able to skillfully predict future data (or data not yet known). If it cannot, then whatever you have said about the past data is not interesting.

Besides, you do not need to have a model for past data if all you want to say is it was “high” or “low” or whatever.

5. Doug M

I have a tough time with this statement:

If model A says the probability of a thing is x and model B says it has a probability of y, and x does not equal y, neither probability is wrong before we see the thing.

After we have seen the thing, we can compute the probability that model A or model B is correct.

Suppose my model says that the Tigers have a 5% chance of winning 90 or more games, and your model says that they have a 20% chance of winning 90+ games. If the Tigers win 77 games in 2009, is it impossible to say whose model was better? Even if they won 91 games, it would still be a stretch to say your model was in fact better than mine.

If we could repeat the 2009 season several times, in the parallel universes, we could say that one model was superior. Otherwise, I don’t understand how to validate a model that is forecasting a one-off event.

I read a lot of economic research for my job. Hundreds of economists forecast GDP and inflation. Some number get it right. Do they have better models, or are they just lucky? Is there any way to separate the good from the fortunate?

6. Rich

Of 1000 people, 200 are infected with the tingle worm, 80 of whom report scalp tingling. Of the 800 who are not infected 160 report scalp tingling.

Why is it useless to ask, “which is more likely: that the tingle worm increases the incidence of scalp tingling or that the results are just happenstance?”?

Note that the question is entirely about the 1000 who were investigated. Sure, the incidence is higher in the infected members of the population but I want to know the probability that it’s a sampling artefact. I don’t believe just counting will tell me.

I would think there is an uncountable number of models simply because some statistics may be on the R^n probability space. Or is the emphasis on “infinite” merely for emphasis' sake? Is it because we measure in discrete values?

I always wondered when to use “uncountable” and when to use “infinite”.

Thoughts?

8. Briggs

Countable means you can line whatever you’re counting up with the integers. The squares (1^2, 2^2, 3^2, …) are countable.

Uncountable means you can’t line whatever you’re counting up with the integers. Irrational numbers are not countable.

Countable sets are usually infinite, uncountable ones always are. You can distinguish, mathematically, between the cardinalities of the infinities, but there is no practical difference.

Doug,

Excellent question. After the predictions are realized, we can certainly say which model is better, like in your example. But not until after.

There are lots of ways to say how good a model has done. I’ll write about that soon—because I happen to be working on this now. Check back, maybe next week.

Rich,

But counting tells you everything. 80 out of 200 worm-infested people tingle. And 120 out of 800 clear people tingle. End of story.

What is the probability that more clear people tingled than worm-infested people? 1. Because 120 clear people tingled and only 80 infested people did.

What is the probability that a greater proportion of clear people tingled than worm-infested people? 0. Because 15% of clear people tingled and 40% of infested people tingled.

Models only become important, or interesting, if you want to extrapolate beyond the sample you collected.
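The counting can be written out as a few lines of Python, using the counts as stated in this reply (80 of 200 and 120 of 800). The variable names are made up for illustration. No model appears anywhere; every answer about the sample is 0 or 1.

```python
# Counts as stated in this reply.
infected_total, infected_tingle = 200, 80
clear_total, clear_tingle = 800, 120

# Proportions are just more counting.
infected_rate = infected_tingle / infected_total  # 0.40
clear_rate = clear_tingle / clear_total           # 0.15

# Questions about the observed sample have probability 0 or 1.
p_more_clear_count = 1 if clear_tingle > infected_tingle else 0
p_greater_clear_rate = 1 if clear_rate > infected_rate else 0

print(p_more_clear_count, p_greater_clear_rate)  # 1 0
```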

9. Rich

So are you going to say that any attempt to theorise over a cause-effect relationship is an implicit prediction about the future?

Henry is being kept awake at night by his tingling scalp. “Damn those tingle worms”, he says. “Not all,” replies his wife, “the supposed statistical relationship was simply a coincidence”. Henry was the 43rd tingler in the infected tinglers and the 97th carrier. We counted him. It hasn’t helped.

10. Briggs

Rich,

Nah. The cause-effect relationship isn’t something you’re predicting, it’s something you’re observing or assuming. You either have the worm or not and your scalp tingles or it doesn’t. If you tingle, something is causing it. If Henry tingles, then something caused that. If you have other candidates than the worm (other conditioning information) then you enter into probability. Here, you assume it’s the worm.

11. Schnoerkelman

Well, maybe there’s another reason for statistics…

http://xkcd.com/563/

The link that I pasted is xkcd.com number 563 (about five back from the current one).
This is a nerd’s comic that I find rather amusing from time to time. I hope others will enjoy it too.

bob

12. Schnoerkelman

And since that last one worked so well here is one for the Lukewarmers out there :-))

http://xkcd.com/563/