No, data do not have means. Nor do they have variances, autocorrelations, partial or otherwise, nor moments; nor do they have any other statistical characteristic you care to name. Data only have causes.
“What?! Briggs, you fool. Are you saying that I can’t calculate means? If you’d take the trouble to open any statistics book, you’d see that data do have means.”
No, they don’t.
“You’re being obstinate. Just like in that recent article about time series, in which you didn’t understand, because you couldn’t be bothered to open a book on time series, that time series need stationarity in order to be modeled in the usual way.”
Sorry. Data do not have stationarity, nor do they have the lack of it. What’s stationarity? From the government itself comes this definition:
A common assumption in many time series techniques is that the data are stationary. A stationary process has the property that the mean, variance and autocorrelation structure do not change over time.
This is a perfect instance of the Deadly Sin of Reification. Of any actual series of numbers, including a series of length 1, a mean may be calculated. But that series does not have or possess a mean. The data do not sense this fictive mean, nor are the data influenced in any way by this hobgoblin, because, simply, the mean does not exist; and since it doesn’t exist, it doesn’t have any causative power. And the same is true for any statistical characteristic of any data.
The calculation we also call a mean certainly can exist, if somebody troubles himself to flick the stones on his abacus. But for a series of size 1, a calculation for variance or autocorrelation cannot be done, since the usual formulas require at least two observations, yet the series still exists (the fallacy of limiting relative frequency lurks here; frequentists are obliged to think the impossible, that every datum is embedded in an infinite sequence). This, then, is the problem: the phrase to have a mean is equivocal. It can be used correctly (as it is on this blog), but it usually isn’t.
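To make the point concrete, here is a minimal sketch in Python (the language and the statistics module are my choices for illustration, not anything from the argument itself): the calculation called a mean goes through even for a series of length 1, while the usual sample variance does not.

```python
from statistics import StatisticsError, mean, variance

series = [7.2]  # a perfectly respectable series of length 1

# The calculation we call a "mean" can be performed on any series:
print(mean(series))  # 7.2

# But the usual sample variance (n - 1 in the denominator)
# cannot be computed from a single observation:
try:
    variance(series)
except StatisticsError as err:
    print(err)  # variance requires at least two data points
```

The series exists happily either way; what differs is only whether a given calculation can be carried out on it.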
The Deadly Sin of Reification is so rife in probability and statistics that to find it absent is a surprise. And this is so even though every statistician will say he agrees with the statement, “The data are not the model used to represent the uncertainty in that data.” He will say he agrees, but his actions will be contrary.
This is why you hear talk of data being “normally distributed” and the like. No data in the universe are normally distributed, or distributed in any way whatsoever by any probability. Probability has no power; probability is not a cause! The uncertainty in much data can, of course, be modeled using a normal distribution, at least to a first approximation. It’s proper to say, “Given some evidence which led me to this conclusion, the uncertainty in these data is represented by a normal.”
That means, with some light qualifications, any data can be modeled by any probability distribution (this follows from the fact that all probability is conditional). In particular, data lacking the criterion (lacking the calculation) for “stationarity” can be modeled by a distribution which requires it. The model may or may not be any good, naturally, but we judge a model’s goodness by its predictive ability.
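As a sketch of what judging by predictive ability might look like (the particulars here, numpy, the 90% interval, the train/test split, are my own assumptions, not a prescription): fit a normal to one chunk of a drifting series, a series which would fail the usual “stationarity” calculations, form a predictive interval, and count how often held-out points land inside it.

```python
import numpy as np

rng = np.random.default_rng(42)

# A drifting series, so it would fail the usual "stationarity"
# calculations; we make no pretense of knowing its causes.
x = np.cumsum(rng.normal(size=200)) * 0.1 + rng.normal(size=200)

train, test = x[:100], x[100:]

# Model the uncertainty with a normal fitted to the first chunk.
mu, sigma = train.mean(), train.std(ddof=1)

# A central 90% predictive interval under that model
# (1.645 is the normal 95th percentile).
lo, hi = mu - 1.645 * sigma, mu + 1.645 * sigma

# The model is judged by how its predictions fare on new data:
coverage = np.mean((test >= lo) & (test <= hi))
print(f"claimed 90%, observed {coverage:.0%}")
```

If the observed coverage sits far from the claimed 90%, the model is bad for these data; no appeal to what the data “really are” is needed to say so.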
Glance at any paper which describes simulation. The entire field reeks of the DSR. Great pains are taken, it is said, to ensure random numbers go into routines so that the resultant simulation has (possesses) the proper means, variances, autocorrelations and the like. Data are generated, they say, by this or that probability distribution, which, it is said, has these certain characteristics.
Now to generate means to cause, and probability isn’t a cause; and random only means unknown, yet everything in a simulation is known. So there are two central fallacies here. It is true that, just as a mean can be calculated for a data series, certain things in simulations can be calculated, but any resemblance to real things is in the minds of users and not in the simulated numbers themselves.
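That everything in a simulation is known is easy to demonstrate; here is a minimal sketch (the seed and the generator are arbitrary choices of mine). The “random” normal draws are completely determined by the seed and the algorithm, so nothing about them is unknown to anyone who cares to look.

```python
import random

# Two "random" streams started from the same seed.
a = random.Random(123)
b = random.Random(123)

print([round(a.gauss(0, 1), 6) for _ in range(3)])
print([round(b.gauss(0, 1), 6) for _ in range(3)])
# The two lines are identical: every number was caused by the seed
# and the algorithm, not "generated by a normal distribution."
```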
To say data have a mean or any other probabilistic characteristic is thus always a fallacy. Data always have a cause or causes, of course, and knowledge of these causes is always our real goal. Probability doesn’t get us to this goal. Knowledge of cause can.
Everything that was something else and is now this must have been caused to be this by something that actually exists. This cause or these causes do have certain characteristics and powers because, of course, to be responsible for a change is what being a cause means. But these causes won’t be means, autocorrelations or anything like that.
Again, our understanding of the uncertainty of data is informed by calculated means and so forth. That’s because probability is an epistemological and not an ontological concern. So I reassert the true proposition: data do not “have” means.
“I don’t understand a word you’re saying, Briggs. None of this is accepted by statisticians. It isn’t even in any books!”
That so? So much for the field of statistics, then. But you’re wrong: it is in one book. We’re still waiting to see who will publish it.
Addendum: Wiley sent comments from three reviewers: one unconditionally recommended publication and two recommended it conditionally, but the conditions are nothing much (a better title, ensuring certain literature is cited, etc.), so it’s good news.