Don’t smooth your data and then use that smoothed data as input to other analysis. You will fool yourself. You will make over-confident decisions. It is the wrong thing to do. It is a mistake. It is a guarantee of over-certainty. I don’t know how to put it more plainly. Lord knows I have tried. See below for a non-success story.
Smoothing means any kind of modeling, which includes running means, just-plain-means, filtering of any kind, regression, wavelets, Fourier analysis, ARIMA, GARCH; in short, any type of function where actual data comes in and something that is not data comes out.
Do not use the something-that-is-not-data as if it is data. This is a sin.
Don’t believe me. Try it yourself. The picture is from an upcoming paper some friends and I are writing.
It shows two simulated normal noise time series, with successively higher amounts of smoothing applied by a k-rolling mean. From top left clockwise: k = 1, 10, 20, 30; k = 1 corresponds to no smoothing. The original time series are shown faintly for comparison. The correlation between the two series is indicated in the title.
More smoothing equals higher correlations. Since there are no causes between these series, the correlation should be hovering around 0, which it is in the first panel. And that correlation stays near 0—for the original real not fake un-smoothed data. But if you calculate the correlation between the smoothed series…the sky’s the limit!
Now it is not true that smoothing will increase the correlation between two series in each and every instance. It might be that, for your one-time smoothing, the correlation (in absolute value) decreases or stays put. But it usually will increase, and usually by a lot.
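If you want to try it yourself, here is a minimal sketch of the experiment. It is not the code behind the picture. The series length (200), the number of repetitions (500), the seed, and the helper names `rolling_mean` and `mean_abs_corr` are assumptions made purely for illustration. It draws pairs of independent normal noise series, smooths both with a k-rolling mean, and averages the absolute correlation over many repetitions.

```python
import numpy as np


def rolling_mean(x, k):
    """k-point rolling mean; k = 1 returns the series unchanged."""
    return np.convolve(x, np.ones(k) / k, mode="valid")


def mean_abs_corr(k, n=200, trials=500, seed=0):
    """Average |Pearson correlation| between two independent, smoothed noise series.

    n, trials, and seed are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        # Two series with no connection whatsoever between them.
        a = rolling_mean(rng.standard_normal(n), k)
        b = rolling_mean(rng.standard_normal(n), k)
        total += abs(np.corrcoef(a, b)[0, 1])
    return total / trials


# Same window sizes as the four panels of the picture.
results = {k: mean_abs_corr(k) for k in (1, 10, 20, 30)}
for k, r in results.items():
    print(f"k = {k:2d}: mean |correlation| = {r:.3f}")
```

Any single pair of series can go either way, which is why the sketch averages over many pairs. The typical absolute correlation should climb as k grows, even though the raw series have nothing to do with each other.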
Why? Imagine any two straight lines with non-zero slopes. These two straight lines will have perfect Pearson correlation, either +1 or -1. Regression and other measures will also show perfect agreement. The proof of this is trivial, and I leave it as an exercise (don’t be lazy; try it). Smoothing makes time series data look more like straight lines, as the pictures show. Simple as that.
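The exercise is still yours to do, but a numerical check of the claim takes only a few lines. This is a sketch with made-up intercepts and slopes (the names `line_up`, `line_up2`, and `line_down` are hypothetical): any two lines with non-zero slopes over the same time index should give a Pearson correlation of +1 when the slopes share a sign and -1 when they do not.

```python
import numpy as np

# A shared time index; its length is an arbitrary choice.
t = np.arange(100, dtype=float)

# Three straight lines with arbitrary intercepts and non-zero slopes.
line_up = 3.0 + 0.5 * t
line_up2 = -7.0 + 12.0 * t
line_down = 10.0 - 2.0 * t

# Pearson correlation between pairs of lines.
r_same = np.corrcoef(line_up, line_up2)[0, 1]   # slopes share a sign
r_opp = np.corrcoef(line_up, line_down)[0, 1]   # slopes have opposite signs

print(f"same-sign slopes:     r = {r_same:+.6f}")
print(f"opposite-sign slopes: r = {r_opp:+.6f}")
```

The intercepts and the sizes of the slopes make no difference; only the signs matter.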
There are all manner of fine points I’m skipping, any of which would make a wonderful Master’s project. Just what kinds of data, what kinds of smoothing, and what statistical measures are affected, and by what magnitude? All these questions are quantifiable and will make for fun puzzles. My experience with actual data, actual smoothing, and typical measures shows that the magnitude is large.
Now, without betraying any confidences, let me tell you of the latest in a long and growing string of bad examples. Two companies: one internationally known for its quantitative prowess, the other even better known for its ability to make vast wads of money. Call them A (stats) and B (client). I did not work for either A or B, but know and advised certain parties.
B advertised and wondered how much of an effect this had on its measure of success. A said they could tell, using sophisticated Bayesian models incorporating social media data.
Wowzee! Tell people you have busted open the secrets of social media and they will dump buckets of cold cash on you. Hint: everybody who says they have it figured out is either exaggerating to themselves or to their clients. (Say, that’s a pretty bold statement.)
Anyway, smoothing occurred. And correlations greater than 0.95 were boasted of. I’m not kidding about this number. Company A really did brag of enormous “impacts” of its smoothed measures. And Company B believed them—because they wanted to believe. Sophisticated Bayesian models incorporating social media data! How could you go wrong?
The real correlations, using unsmoothed data, were near 0. Just as you’d expect them to be for such noisy data as “social media” predicting a company’s measure of success. Do you really think Twitter streams contain magic?
I told all involved. I explained pictures like those above. I was emphatic and clear. I stood neither to gain nor lose regardless of the decision. Only two people (at B) believed me, neither of whom was in a position to make decisions.
At least I am comforted that Reality is my friend here. The companies will eventually realize, but probably never admit, that their measures are spurious. Because they will realize but not admit, these measures will be quietly abandoned…
…As soon as the next computer self-programmed big data machine learning artificially intelligent smart-phone-data algorithm comes along and seduces them.