Class 56: The Best Model!

The best model is no model! Display your data instead. We bring cause to models, we do not extract cause from models. Models do not say what happened: what happened says what happened.

No class next week (the 3rd). Next class 10 July!

Video

Links: YouTube * Twitter – X * Rumble * Bitchute * Class Page * Jaynes Book * Uncertainty

HOMEWORK: Given below; see end of lecture.

Lecture

This is an excerpt from Chapter 9 of Uncertainty.

The Best Model
The best model happens to be the least used among professionals, but the most resorted to by civilians, who thus gain a decided edge. The best model is no model. The best model is to just look at the data, evidence, and premises gathered and ponder them. No model is ever needed to tell us what we observed, the first step in gaining an understanding of essence and nature. We know what we observed because we have observed what we have observed. That sounds a useless sentence, but it isn’t. It is emphasized because it is everywhere (in professional circles) doubted and perhaps even disbelieved. I have often put it to statisticians and the most positive response I have received was a blank stare.

You’re a doctor (your mother is proud) and have invented a new pill, profitizol, said to cure the screaming willies. You give this pill to 100 volunteer sufferers, and to another 100 you give an identical-looking placebo. Here are the facts, doc: 71 folks in the profitizol group got better, whereas only 60 in the placebo group did. Here is the question: in which group was there a greater proportion of recoverers?

Every statistician hearing that believes there’s a trick, and to solve it he will propose a model, say, a “z test”, or whatever. Yet the untrained—he must be untrained—civilian will say, “The drug group”, which is the right answer. Of course it is!

Question two: what caused the difference in observed recovery rates? I don’t know. But I do know that some thing or things caused each person in each group to get better or not. I also know that “chance” or “randomness” weren’t the causes. They can’t be, because they are measures of ignorance and not physical objects, as we have already seen. Lack of an allele of a certain gene can, say, cause non-recovery, or a diet of carrots in sufficient quantity can speed the recovery, but “chance” is without any power whatsoever. Results are never due to chance; they are due to real causes, which we may or may not know. If our goal is only to state which group got better at a greater rate, no model is needed. Why substitute a model for perfectly good reality? That is to commit the Deadly Sin of Reification, which is explored in greater detail next Chapter.
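
To make the “just look” answer concrete, here is a minimal sketch in Python. It assumes nothing beyond the counts already given in the example: it simply tallies and displays the observed recovery proportions, which is all the question asks for.

```python
# Observed counts from the profitizol example: no model, just the data.
groups = {
    "profitizol": {"recovered": 71, "total": 100},
    "placebo":    {"recovered": 60, "total": 100},
}

for name, g in groups.items():
    proportion = g["recovered"] / g["total"]
    print(f"{name:10s}: {g['recovered']}/{g['total']} recovered ({proportion:.0%})")

# Which group had the greater observed proportion of recoverers?
# The answer is read straight off the observations; nothing is inferred or tested.
best = max(groups, key=lambda name: groups[name]["recovered"] / groups[name]["total"])
print(f"Greater observed proportion: {best}")
```

Nothing in this display is uncertain; it is a description of what was observed.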

If our goal is to say something about patients not yet seen [or, rather, not yet considered], then a model is not only just the thing, it is required. How to build such a creature we next discover. It is without question that a model is needed, because we did not measure anything that could have caused the observed difference, and it is a certainty that more than one cause was at work. If there were only one cause at work, every result would always be the same, as we learned earlier; or if there were only two causes at play, then everybody would have the same outcome in each group, but each group would be different. It is also unlikely that our model will discover the pertinent causes; the best we’ll be able to do is to characterize the uncertainty we have in new observations, which is usually an adequate goal.
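
What such a model can deliver is a predictive probability for new patients. Purely as a hedged sketch, and not the construction the book builds later, here is one simple version: a beta-binomial model with a uniform prior (both assumptions of mine), which turns the observed counts into a probability that the next patient in each group recovers.

```python
# A sketch only: predictive probability that the NEXT patient recovers,
# under an assumed beta-binomial model with a uniform Beta(1, 1) prior.
# With that prior the answer reduces to Laplace's rule of succession:
# (recovered + 1) / (total + 2).
def prob_next_recovers(recovered: int, total: int) -> float:
    return (recovered + 1) / (total + 2)

observations = {"profitizol": (71, 100), "placebo": (60, 100)}

for name, (recovered, total) in observations.items():
    p = prob_next_recovers(recovered, total)
    print(f"Pr(next {name} patient recovers | data, model) = {p:.3f}")
```

The results (about 0.71 and 0.60) are conditional on the assumed model and prior; different premises would give different probabilities, which is the point made below about models not being unique.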

Perhaps my admonition now seems needlessly strong. After all, it is obvious which group had the higher proportion of recoverers, and, the objection will continue, the experiment was run for the purpose of making statements about yet-to-be-observed patients, so a model was required after all. There are two reasons why, if anything, my caution was not strong enough. One is due to the various uses to which data is put, which I answer next. And the second is the “modeling reflex”.

This reflex is so strong that it isn’t realized that there is no model deducible from the stated premises. All we know is that pills were given to 200 folks, and this many got better and this many didn’t. We know nothing about the pills or the people, not any of their demographic or biological conditions, nor how many folks in the future will be eligible for the pills, and so forth. There is surely positive information in our data (premises) that the drug is possibly doing something (but only sometimes) the placebo is not. But it is unlikely that, were we to repeat the experiment, we would be able to make measurements detailed enough to identify the precise causes of a cure in each individual. In order to model using the information provided, something additional has to be assumed. Perhaps this something will be right or approximately right, or again perhaps it will be wrong. There is a good case that in medical or other situations where measurement is careful, the assumptions necessary to make models are often reasonable. But that case cannot be made where measurements are sloppy or crudely conducted and where theories of causation are fanciful, as they often are in some fields (those which use questionnaires, mainly, as we shall see). Models have to be justified. They usually are not. That they were built by “competent” researchers is usually the only justification offered, and that is insufficient. All probability is conditional, and so all models are conditional on the premises assumed. Different premises lead to different models. The public (including politicians) are far too accepting of the justification of models used to rule and regulate their lives, and they are blissfully unaware that models are not unique. As we’ll see below, there are good ways to test the assumptions that create models.

Time series data, or rather observations that occur in time, are abused the most. One example will suffice, though I’ll change the names and details to protect the guilty. One author claimed that violent deaths were decreasing in time, and he offered a model and a reason why this was so. A second author claimed the first author was wrong: deaths were not decreasing, but he offered no reason beyond a model why this was so. The same, or similar, data produced two models with diametrically opposed conclusions. The reader will recognize many similar situations.

Both parties pointed to their models to say, “Deaths (are) (are not) decreasing”. It was the model in each case which was the arbiter of the truth. This is the Deadly Sin of Reification. No model is needed to say what happened. Either author had merely to look at the data and conclude, with perfect certainty, whether deaths were increasing, decreasing, or holding steady. Only one small additional premise is needed: a definition of “decreasing” (or its obverse “increasing”). As stated at the beginning of this book, the meanings of the words used in argument are themselves tacit premises. These tacit premises must be made to “come out into the open” when arguing formally. Next Chapter, when I discuss “trends” in time series, we’ll see the mistakes which are made when the tacit premises are left tacit. For now, only assume that a definition is had and is agreed to by all parties: “decrease” means this, “increase” that, and “hold steady” something else. I beg the reader will attend closely: with the definition settled, all we have to do is to look. Either the conditions of the definition will have been met, or they will not have been. And since, it is presumed, these three categories are exhaustive (about death trends), one author will be certainly right and the other certainly wrong. There is no need to argue! And with even greater force, there is no need, there is absolutely no need, for any kind of model.
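
To make the “settle the definition, then look” step concrete, here is a small Python sketch using made-up yearly counts (hypothetical numbers, not the disputed data). The definition checked here, every year no greater than the year before, is only one possible choice; the point is that once a definition is fixed, checking it is a matter of comparison, not modeling.

```python
# Hypothetical yearly counts of violent deaths (made-up numbers for illustration).
deaths_by_year = {2018: 412, 2019: 398, 2020: 401, 2021: 395, 2022: 389}
counts = list(deaths_by_year.values())

# One possible agreed-upon definition of "decreasing": every year is no greater
# than the year before. A different definition gives a different, but equally
# certain, answer.
decreasing = all(later <= earlier for earlier, later in zip(counts, counts[1:]))
increasing = all(later >= earlier for earlier, later in zip(counts, counts[1:]))

if decreasing:
    verdict = "decreasing"
elif increasing:
    verdict = "increasing"
else:
    verdict = "neither decreasing nor increasing under this definition"

print(f"Deaths were {verdict}.")
```

Either the condition in the definition is met by the observations or it is not; no model is asked to arbitrate.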

Models in these cases are replacements for reality. What possible reason could there be to replace a perfectly good reality with a fiction? I ask because it is done all the time. Each author was sure of his model and (of course) his cause. That’s what really caused the dispute. Love of models. Scientists these days are like modern-day Pygmalions, falling in love with their creations. Only they are not as blessed as Pygmalion; a scientist’s model is forever lifeless.

Now if it were true that the data in a given situation were measured with error and we wanted to quantify the uncertainty in what the truth might have been, then we would need a model, just as we would need a model if we wanted to quantify the uncertainty in what the future will be. Models are used to quantify the uncertainty in what we don’t know; when we are certain, they are not needed. Unknown is unknown. On the other hand, if the data is known, it is known. Don’t play with it. Unnecessary fiddling is rife. As I say next Chapter, an entire book could be written about the abuses spoken of in this small section, but that will have to wait for another day.

But aren’t models also needed for hypothesis tests, to see if the observed differences are “due to chance”? No. This is always a mistake: hypothesis testing should never be used. This is proved below.


1 Comment

  1. Uncle Mike

    “The best model is no model! Display your data instead.”

    Yes. Absolutely. However, easier said than done. There are lots of bad ways to show data.

    To rectify this, I highly recommend the classic text: The Visual Display of Quantitative Information by Edward R. Tufte. This book is great for practitioners but also for the Gen Pop. It’s a fun read and very informative. Beware the pitfalls of graphical deception. Mind the axes. Detect fakery in the data bakery. Data display is an art as well as a science. I heart that book!
