This is just a rough prototype meant to be easy to play with inside a post. READ the help and guidebook! Suggestions for new canned examples welcome—the hard part is deriving historical performance data.
- Read the Decision Calculator guidebook below!
- Fill in the Performance Table, or click on one of the predefined examples.
- Fill in the Cost Comparison Table, or click on one of the predefined examples. You do not need to calculate the total: that’s done automatically.
- Click Calculate (or Reset between examples).
- Accuracy comparison rates are given between the Expert System and the Naive Guess.
- Cost results are found in the Expected Cost Comparison Table.
- Finally, a solution saying which option you should choose is given. Skill should be > 0!
- Important: Use this software at your own risk. No warranties of any kind are given or implied. Always consult a competent medical professional.
This article provides an introduction and a step-by-step guide to making good decisions in particular situations. These techniques are invaluable whether you are an individual or a business.
These results hold for all manner of examples—from deciding whether to have a PSA test or mammography, to get a vaccine, to finding a good stock broker or movie reviewer, to situations that require intense statistical modeling, to financial forecasts, to lie detector usefulness. Any situation that has a dichotomous outcome can use these techniques.
Many people opt for precautionary medical tests—frequently because a television commercial or magazine article scares them into it. What people don’t realize is that these tests have hidden costs. These costs are there because tests are never 100% accurate. So how can you tell when you should take a test?
When is it worth it?
Under what circumstances is it best for you to receive a medical test? When you “Just want to be safe”? When you feel, “Why not? What’s the harm?”
In fact, these are not good reasons to undergo a medical test. You should only take a test if you know that it’s going to give you useful information. You want to know the test performs well and that it makes few mistakes, mistakes which could end up costing you emotionally, financially, and even physically.
Let’s illustrate this by taking the example of a healthy woman deciding whether or not to have a mammogram to screen for breast cancer. She read that all women over 40 should have this test “Just to be sure.” She has heard lots of horror stories about breast cancer. Testing almost seems like a duty. She doesn’t have any symptoms of breast cancer and is in good health. What should she do?
What can happen when she takes this (or any) medical test? One of four things:
- The test could correctly indicate that no cancer is present. This is good. The patient is assured.
- The test could correctly indicate that a true cancer is present. This is good in the sense that treatment options can be investigated immediately.
- The test could falsely indicate no cancer is present when it truly is. This error is called a false negative. This is bad because it could lead to false hope and could cause the patient to ignore symptoms because, “The test said I was fine.”
- The test could falsely indicate that cancer is present when it truly is not. This error is called a false positive. This is bad because it is distressing and could lead to unnecessary and even harmful treatment. The test itself, because it uses radiation, even increases the risk of true cancer because of the unnecessary exposure to x-rays.
This table shows all the possibilities in a test for the presence or absence of a thing (like breast cancer, prostate cancer, a lie, AIDS, and so on). For mammograms, “Present” means that cancer is actually there, and “Absent” means that no cancer is there. For a PSA test, “Present” means a prostate cancer is actually there, and “Absent” means that it is not. For a Movie Reviewer, “Present” means you liked a movie, and “Absent” means you did not.
| ||Present||Absent|
|Test +||Good: True Positive||Bad: False Positive|
|Test –||Bad: False Negative||Good: True Negative|
“Test +” means that the test indicates the thing (cancer) is present. “Test –” means that the test indicates the absence of the thing. For the Movie Reviewer example, “Test +” means the reviewer recommended a film.
There are two cells in this table that are labeled “Good,” meaning the test has performed correctly. The other two cells are labeled “Bad,” meaning the test has erred. Study this table to be sure you understand how to read it because it will be used throughout this article.
The main point is this: all tests and all measurements have some error. There is no such thing as a perfect test or perfect measurement! Mistakes always happen. This is an immutable law of the universe. Some tests are better than others, and tables like this are necessary to understand how to rate how well a particular test performs.
The same Table can be used to examine the costs of the test’s performance.
| ||Present||Absent|
|Test +||0||False Positive Cost|
|Test –||False Negative Cost||0|
When the test performs correctly (true positives and true negatives) there are no costs; except, perhaps, for minor monetary costs in setting up the tests; we’ll ignore these here, but in general, the mathematics can accommodate more complex costs. The table shows this by putting a 0 in these cells. There may also be, in the case of true positives, subsequent treatment (or other) costs—but this is not because of the test. It is assumed that you would want to pay these costs as the test was correct—there is no error cost of the test.
But when the test gives a False Positive or False Negative there is a definite cost: these costs are labeled “False Positive Cost” and “False Negative Cost”. They do not need strict dollar figures attached to them; for example, they may be emotional costs. In some cases it will be possible to specify exact dollar amounts. Examples will be given that will make this distinction clear. Meanwhile, costs are only part of judging the goodness of a test. Performance is another.
Our framework can now be used to examine actual test performance. In this table are the performance statistics from actual mammograms (from Gigerenzer, 2002).
This is called a historical performance table, and the cells have the same meaning as before except that the entries are the numbers from actual mammograms.
The data in this table are for an average 1000 women, ages 40 – 60, who have had “first screening” mammograms. Of these 1000 women, 922, or 92.2% did not have cancer, and the test correctly indicated this.
Seven women out of every 1000, or 0.7%, had their cancer correctly identified by the mammogram. One woman out of every 1000, or 0.1%, will have her breast cancer missed by the test. A full 70 out of 1000, or 7%, will show a false positive.
The first question to ask of any test that you are considering having is: how accurate is it? Accuracy is found by adding the True Positives and True Negatives and then dividing by the total number of tests. In the mammogram example, this is (922 + 7)/1000, or 92.9%.
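Here is a minimal sketch of that accuracy calculation in Python. The dictionary keys and the `accuracy` function are illustrative names chosen for this example, not anything taken from the decision calculator; the counts are the mammogram figures quoted above.

```python
# Counts per 1000 first-screening mammograms, as quoted in the text.
# The key names are illustrative labels, not the calculator's own.
mammogram = {
    "true_positive": 7,
    "false_positive": 70,
    "false_negative": 1,
    "true_negative": 922,
}


def accuracy(table):
    """Fraction of tests that were right: (TP + TN) / total tests."""
    correct = table["true_positive"] + table["true_negative"]
    return correct / sum(table.values())


mammogram_accuracy = accuracy(mammogram)
print(f"Mammogram accuracy: {mammogram_accuracy:.1%}")
```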
An accuracy of 92.9% sounds impressive, but is it the best that can be done? Obviously not. The best that can be done is 100%! We already know that this is impossible (all tests have error). But is there an even simpler test than a mammogram that is more accurate? A test that could be substituted for the mammogram for no cost? The answer, perhaps surprisingly, is yes.
Look at this performance table for what I’ll call a Naive-O-Gram, which is an exam that I perform and which simply says that every woman who comes to me for the test does not have cancer. Do you get it? No matter what, when a woman asks for my test, I always say “No cancer!”
It’s important to understand how we get the naive-o-gram results. We know from the Mammogram Performance Table that 70 + 922 = 992 women out of every 1000 do not have breast cancer. The naive-o-gram, which says “No Cancer” each time, would identify all of these 992 women correctly.
We also know that 7 + 1 = 8 women out of every 1000 do have cancer. The naive-o-gram will make a mistake for these 8 women (a False Negative). So we can fill in the Naive-o-gram Performance Table without having to do the test by only knowing the background rates of cancer in the population (more on this later).
The naive-o-gram will never have a false positive, nor will it have a true positive because it never labels a woman as having cancer. These top cells are always 0.
Here’s the crazy thing: the accuracy of the naive-o-gram is 99.2%, which is much more accurate than the real mammogram! So, considering only accuracy, which test would you rather have? The naive-o-gram or mammogram?
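The naive-o-gram table can be derived mechanically from the mammogram counts, since only the column totals (how many women truly do or do not have cancer) matter. The sketch below assumes the same illustrative key names as a plain dictionary; `naive_always_absent` is a hypothetical helper written for this example.

```python
# Mammogram counts per 1000 women, as quoted in the text.
mammogram = {
    "true_positive": 7,
    "false_positive": 70,
    "false_negative": 1,
    "true_negative": 922,
}


def naive_always_absent(table):
    """Table for a 'test' that always answers Absent (no cancer)."""
    present = table["true_positive"] + table["false_negative"]  # truly have it
    absent = table["false_positive"] + table["true_negative"]   # truly do not
    return {
        "true_positive": 0,         # never says Present, so the top row is 0
        "false_positive": 0,
        "false_negative": present,  # every real case is missed
        "true_negative": absent,    # every healthy woman is labeled correctly
    }


def accuracy(table):
    """Fraction of tests that were right: (TP + TN) / total tests."""
    correct = table["true_positive"] + table["true_negative"]
    return correct / sum(table.values())


naive_o_gram = naive_always_absent(mammogram)
naive_accuracy = accuracy(naive_o_gram)
print(naive_o_gram)
print(f"Naive-o-gram: {naive_accuracy:.1%}, mammogram: {accuracy(mammogram):.1%}")
```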
Of course, you don’t have to come to me to take the naive-o-gram; you can do it yourself. Just stand up and say, “I don’t have cancer” and you’re finished. Doing that is more accurate than the best scientific test. So why aren’t more doctors using the naive-o-gram?
The difficult part: calculating costs
Accuracy isn’t everything, and it could turn out that—for you—a less accurate test is better than a more accurate test. How could this be?
Describing how it can be first requires an understanding of the two costs mentioned above. The answer will depend on how the ratio of these two costs interacts with the predictive accuracy of the test. Just how will be shown below, but first let’s go through an example of how to calculate the costs for a mammogram. (Similar tables can be built for any predictive test: examples will be added in time for lie detectors, stock picks, movie reviewers, and so on.)
Shown in this Cost Comparison Table are examples for a mammogram. These are only examples: I am not a physician and am only estimating what I believe are the most likely costs. Your actual costs are best filled in by you and your doctor. The examples that I list are from Gigerenzer (2002).
This is very important! The only way to find and value these costs is to first imagine that the case that led to them is true: the costs are conditional on these states being true. For example, you have to imagine yourself in the case that you know that the mammogram has made a mistake (false positive or false negative). Then fill in the table. You must imagine all the bad scenarios that can happen if the test makes a mistake.
|False Positive Costs||Score||False Negative Costs||Score|
|Stress, worry, depression. These affect health and well being.|| ||Cancer allowed to develop to a potentially dangerous size.|| |
|Follow-up tests necessary: prolongs time of worry.|| ||Cancer symptoms can be ignored because “The test said I was fine.”|| |
|Possible biopsy required, with risk of infection or even, rarely, death.|| ||Treatment is delayed (although no treatment is guaranteed, there is evidence that earlier treatment is more effective at extending life).|| |
|Unnecessary and possibly harmful surgical procedures used (like mastectomies and lumpectomies).|| || || |
|The finding and unnecessary treatment of harmless growths.|| || || |
|Total:|| ||Total:|| |
Now the hard part. Each of these costs must be rated and assigned some sort of score. The good news is that there is no need to give a dollar figure to each cost. All that is needed is to assign a relative difference between the two. An example will clear up what I mean by that.
Say that you are desperately scared of breast cancer. The very thought of it fills you with a terrible dread. You don’t care about false positives, you don’t care if you have to take dozens of mammograms, suffer through biopsies, and possibly undergo unnecessary treatment, and suffer harmful x-ray exposure. Anything, to you, is better than not starting treatment should cancer be present. Likewise, the thought of missing the cancer in a mammogram is frightening. You want to know as soon as possible.
If you felt like this you would certainly rate False Negative Costs higher than False Positive Costs. Would you say False Negative Costs were twice as high as False Positive Costs, four times, ten? It’s up to you to pick a number after going through each list. Higher numbers reflect higher costs. Of course, if you have actual dollar figures, use these. Some situations, like stock picks, will have a natural interpretation (dollars won and lost, for example); others will not.
One way to do this is to go through each point of the lists and assign a score, a number which reflects your feeling. For example, you might assign the first item under False Positive Costs a “10.” The 10 is, of course, arbitrary and it only has meaning in relation to the other items in the list. The 10 could mean dollars or “stress units” or anything. It’s up to you. Your only goal is to be consistent across all items. Here is one possible table.
|False Positive Costs||Score||False Negative Costs||Score|
|Stress, worry, depression. These affect health and well being.||10||Cancer allowed to develop to a potentially dangerous size.||100|
|Follow-up tests necessary: prolongs time of worry.||15||Cancer symptoms can be ignored because “The test said I was fine.”||50|
|Possible biopsy required, with risk of infection.||15||Treatment is delayed (although no treatment is guaranteed, there is evidence that earlier treatment is more effective at extending life).||60|
|Unnecessary and possibly harmful surgical procedures used (like mastectomies and lumpectomies).||20|| || |
|The finding and unnecessary treatment of harmless growths.||10|| || |
|Total:||70||Total:||210|
I pretended that I was a woman and labeled the costs for each possible error. As you can see, I thought the items under False Negative Costs were much worse than those under False Positive Costs. You probably feel the same. And remember: you are not judging the likelihood of any error here. You are assuming the error is true, that it actually has happened to you, and then you’re scoring its cost. I’ll show you how to fit it in with test performance in a minute.
I thought that the total cost for False Negatives was 210, and for False Positives it was 70. It will become important to look at these numbers through their ratio. The decision calculator will do the math. The ratio will always be (Total Cost False Positives) / (Total Cost False Negatives) = 70 / 210 = 1/3. Thus, I thought that False Negatives were three times worse than False Positives.
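The totals and their ratio can be checked with a short Python sketch. The list names are my own labels; the scores are the ones from the example table. The last lines illustrate why the arbitrary units do not matter: multiplying every score by the same factor (7 here, chosen arbitrarily) leaves the ratio unchanged.

```python
from fractions import Fraction

# Scores from the example Cost Comparison Table (arbitrary units).
false_positive_scores = [10, 15, 15, 20, 10]
false_negative_scores = [100, 50, 60]

false_positive_cost = sum(false_positive_scores)
false_negative_cost = sum(false_negative_scores)

# Ratio of total False Positive cost to total False Negative cost.
cost_ratio = Fraction(false_positive_cost, false_negative_cost)
print(false_positive_cost, false_negative_cost, cost_ratio)

# Rescaling every score by the same factor does not change the ratio.
rescaled_ratio = Fraction(
    sum(7 * s for s in false_positive_scores),
    sum(7 * s for s in false_negative_scores),
)
print(rescaled_ratio == cost_ratio)
```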
Your costs are different from the doctor’s!
Is that it? Not quite. These are my costs; yours may be slightly different. But your costs are not the same as for the doctor (or advocacy group) who orders the test. This means that your goals are not the same as your doctor’s (or stock broker’s, or movie reviewer’s, or polygraph examiner’s, and so on). You may not be able to estimate their costs, but it is important for you to understand that these cost differentials can lead you and your doctor to reach different conclusions about whether to have the test. It may, then, make sense for your doctor to rationally say “Take the test” but be just as rational for you to say “No thanks!”
Here’s an example of a doctor’s cost. Again, I am making these up. These will be different for any particular physician. The important thing for you to understand is that these costs are almost always going to be different for you.
|False Positive Costs||Score||False Negative Costs||Score|
|Follow-up tests necessary.||10||Cancer allowed to develop to a potentially dangerous size.||500|
|Possible biopsy required, with risk of infection.||10||Cancer symptoms can be ignored||100|
|Unnecessary and possibly harmful surgical procedures used (like mastectomies and lumpectomies).||20||Treatment is delayed (although no treatment is guaranteed, there is evidence that earlier treatment is more effective at extending life).||200|
|The finding and unnecessary treatment of harmless growths.||10||Possible malpractice suit brought for missing cancer.||200|
|Total:||50||Total:||1000|
As you can see, not only are the costs different, but the items are not all the same. For example, as a doctor I have to worry about malpractice suits (or other disciplinary action) from missing cancers. These will cause my (already outrageously high) insurance rates to rise, and I could lose respect and business. The ratio is (Total Cost False Positives) / (Total Cost False Negatives) = 50 / 1000 = 1/20.
You cannot directly compare the scores from the doctor’s table to your scores. The units are arbitrary and meaningless, and only have relevance for one person. A 10 for me may be a 112.8 for you. So how can you compare tables if you can’t compare scores?
Skill is defined as the ability of an expert predictive system to perform better than a naive prediction system. In the mammogram example, a mammogram would have skill if it performed better than the naive-o-gram. We have already seen that the mammogram is worse than the naive-o-gram with regard to accuracy. The mammogram thus has no skill. But we have yet to see how the naive-o-gram compares to the mammogram with regard to cost.
You would never want to use a predictive system that does not have skill. But notice that part of the definition of skill requires us to supply a “naive” prediction system. The naive-o-gram came about because it turned out that the probability of any woman having breast cancer was so small. A natural naive prediction was to say “you do not have cancer” for each woman. Different systems, such as lie detectors and stock predictions, may have different naive systems.
First, here’s how we handle cost for the expert and naive prediction systems. The Expected Cost Comparison Table below has the results. Here’s how it works.
As we already know, all predictive systems have error. We learned how to rate the costs of these errors through filling in a Cost Comparison Table for the two kinds of errors, False Positive Cost and False Negative Cost. We also know how to look at a Performance Table (we don’t yet know how to get the numbers of a Historical Performance Table, which generally must be supplied by experiment). These data now give us enough to let us make a decision. It’s about time!
We now have to define the concept of expected cost. That is the error cost of the test that we would expect any random person to experience (given they had your values in the Cost Comparison Table). This is a statistical concept: it means the cost that the average person will experience. It does not imply that the expected cost is the cost that any given person will experience, just the average person.
Calculating this cost is easy, but it does require some work. Don’t worry about the math, because the decision calculator does the work for you. We first need to modify the Historical Performance Table into a Performance Probability Table. This is simple because all we need to do is to divide each cell by the total of all cells. The Mammogram example is given below.
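As a rough sketch of that division step, using the mammogram counts quoted earlier (the key names and the `to_probabilities` helper are illustrative, not the calculator's own):

```python
# Historical Performance Table: counts per 1000 women, as quoted in the text.
mammogram = {
    "true_positive": 7,
    "false_positive": 70,
    "false_negative": 1,
    "true_negative": 922,
}


def to_probabilities(table):
    """Divide each cell by the total of all cells."""
    total = sum(table.values())
    return {cell: count / total for cell, count in table.items()}


mammogram_probs = to_probabilities(mammogram)
for cell, probability in mammogram_probs.items():
    print(f"{cell}: {probability:.3f}")
```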
This was an easy case because the total was 1000. Now we have to multiply each error cell’s probability by your cost estimate and then calculate the total. That sounds complicated, but here’s an example.
|Test||Expected False Positive Cost||Expected False Negative Cost||Total|
|Woman: Mammogram||0.07 * 70 = 4.9||0.001 * 210 = 0.21||5.11|
|Woman: Naive-o-gram||0 * 70 = 0||0.008 * 210 = 1.68||1.68|
|Doctor: Mammogram||0.07 * 50 = 3.5||0.001 * 1000 = 1||4.5|
|Doctor: Naive-o-gram||0 * 50 = 0||0.008 * 1000 = 8||8|
Focus on the row where it says “Woman: Mammogram.” We know, from the Mammogram Cost Comparison Table, that the Cost + is 70, and we know that the probability of a False Positive is 0.07. Multiplying these numbers together gives 4.9. We further know that the Cost – is 210 and that the probability of a False Negative is 0.001. Multiplying these gives 0.21. We add these two together and get the expected error cost of the mammogram, which is 5.11.
We can now do the exact same calculations for the naive-o-gram. The costs remain the same, and all that changes are the probability estimates. There is no chance of a False Positive by definition, so the expected cost of a False Positive is 0. There is a higher chance of a False Negative, here 0.008, so the expected cost is 1.68.
And that’s it. To make the best decision all you need to do is to choose the test with the lowest expected cost. For this example, that choice is the naive-o-gram, which has an expected cost three times lower than that of the mammogram.
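The woman's two expected costs can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described above; `expected_cost` and the variable names are illustrative, and correct predictions are given zero cost, as in the cost table.

```python
def expected_cost(p_false_positive, p_false_negative, cost_fp, cost_fn):
    """Probability of each kind of error times its cost, summed."""
    return p_false_positive * cost_fp + p_false_negative * cost_fn


# The woman's totals from the example Cost Comparison Table.
COST_FP, COST_FN = 70, 210

# Error probabilities from the two performance tables.
mammogram_cost = expected_cost(0.07, 0.001, COST_FP, COST_FN)
naive_cost = expected_cost(0.0, 0.008, COST_FP, COST_FN)

# Choose the expert test only if it is strictly cheaper than the naive guess.
best = "mammogram" if mammogram_cost < naive_cost else "naive-o-gram"
print(round(mammogram_cost, 2), round(naive_cost, 2), best)
```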
What if there are competing versions of the expert test and we want to rate them? How can we tell which is best? We do that using a skill score. This is a score that lets us compare different expert predictors even though they have different underlying base rates. Calculation of the skill score is somewhat complicated, so I won’t give the details here (the calculator does it for you). All you need to remember is that the skill score must be positive for you to choose the expert prediction. If the score is zero or negative, then you should choose the naive guess.
mammography skill score = -2.042
The mammography skill score is negative, so we would choose the naive guess.
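The text does not give the calculator's skill score formula, so the following is an assumption, not the calculator's documented method: a common form of skill score is the relative improvement of the expert's expected cost over the naive guess's expected cost. With the woman's expected costs from the table (5.11 and 1.68), this assumed form does reproduce the value shown above.

```python
def skill_score(expert_cost, naive_cost):
    """ASSUMED form: (naive - expert) / naive.

    Positive when the expert has the lower expected cost, zero when they
    tie, negative when the naive guess is cheaper. Requires naive_cost != 0.
    """
    return (naive_cost - expert_cost) / naive_cost


# Expected costs for the woman, taken from the Expected Cost Comparison Table.
woman_skill = skill_score(expert_cost=5.11, naive_cost=1.68)
print(round(woman_skill, 3))
```

Under this assumed form, the rule "skill must be positive" says the same thing as "the expert's expected cost must be lower than the naive guess's."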
We can also do the same calculations for the doctor, which are given in the table above. As you can see, his best bet is to order the mammogram! Why? Because he weights the costs differently; he’s far more worried about False Negatives and this worry shows up in the expected cost. Again: you cannot compare the expected costs of the doctor with your expected costs, because the numbers have different meanings. These costs can only be compared between predictive systems for one person (the person who filled in the Cost Comparison Table).
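To see both decisions side by side, here is a small sketch that applies the same error probabilities to each person's own cost totals. The dictionary layout and the function names are illustrative. Note that the comparison is always made within one person's costs, never across people.

```python
# Error probabilities for each predictive system (from the performance tables).
ERROR_PROBS = {
    "mammogram": {"false_positive": 0.07, "false_negative": 0.001},
    "naive-o-gram": {"false_positive": 0.0, "false_negative": 0.008},
}

# Each person's own cost totals (arbitrary units, not comparable across people).
COSTS = {
    "woman": {"false_positive": 70, "false_negative": 210},
    "doctor": {"false_positive": 50, "false_negative": 1000},
}


def expected_cost(probs, costs):
    """Sum of (error probability * error cost) over the two kinds of error."""
    return sum(probs[k] * costs[k] for k in ("false_positive", "false_negative"))


def best_choice(costs):
    """Pick the expert only if it is strictly cheaper than the naive guess."""
    expert = expected_cost(ERROR_PROBS["mammogram"], costs)
    naive = expected_cost(ERROR_PROBS["naive-o-gram"], costs)
    return "mammogram" if expert < naive else "naive-o-gram"


choices = {person: best_choice(costs) for person, costs in COSTS.items()}
print(choices)
```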
The doctor’s best decision is to order the mammogram, your best decision is to not take it. Who wins in the end? Probably whoever has the stronger will (usually the doctor). But remember: It is always your decision to accept any medical procedure—so far, anyway. And you should never make these decisions lightly. And I can only hope that this guide helps to make this decision easier.
Is this all? Not quite. It can be that the expected loss for the expert prediction is less than that of the naive guess, but it may be so only because of chance. There is a statistical test based on the skill score that lets us tell. If you have questions about this, please email me.
What about lie detectors, stock picks, and other decision types? Ask me about examples. If you have information about historical performance statistics for any decision, please send them to me and I can help you fit them into the decision calculator. My email address is email@example.com.
Naturally, I have left out a lot of details, so questions are always welcome.
Some predictions are not yes or no, like the examples given here. One example is a high temperature forecast, which is a number like “82 degrees.” Can forecasts like this be fit into the decision calculator? The answer is yes, although the math becomes more complicated. It also becomes more useful.
If you click from one example to the next without hitting reset first, the calculator doesn’t properly clear out all of the cells. For example, the lie detector example still includes “treat harmless growths” as a false positive cost.
It’s not a big deal (and maybe it works correctly in IE – I use Firefox), but someone could get confused if they aren’t paying attention.
thank you very much! I was wondering if I should go to work today or stay in bed. Your Decision Calculator solved this vexing problem for me.
Interesting to note that for doctors, false positive “costs” may actually be negative (i.e. not costs at all on net, as they are potentially revenue producing). Talk about skill!
Yep, it’s a limitation of the DHTML. The place where it causes a problem is the cost table; the other tables are automatically replaced.
That’s my solution every day.
It looks like your tests don’t have any benefits. There’s no benefit to discovering that you have cancer using the test, only a lack of cost.
Why would I take any of these tests? They only create costs.
You need to include a benefit for a correct detection. The table under “The Costs” section needs numbers on the diagonal and probably a large negative number in the upper left. The table as it is represents the cost matrix for a free test for a disease that has no treatment.
Sure, it’s possible to add in “negative” costs—that is, benefits—for correct forecasts. But then you have to get into what are those costs (actual positive costs) of correct forecasts. For example, it’s a good thing to be correctly told you have cancer, which has some utility value, then you have to factor in the costs of having that cancer, which can of course, be substantial. If you don’t mind the math, click over to my “resume” page and you’ll see some papers in this line.
And, anyway, the whole point is not the costs of correct forecasts after the forecast is made (you have to live with what is, after all) but to decide whether a system is making useful predictions.
Excellent summary. This is similar to, though better than, the summary in J.A. Paulos’s book, ‘Innumeracy.’
I bet you could compile a collection of topics similar to this bit and publish at least as good a book as Paulos’, certainly a more entertaining one (with a bit of a Freakonomics twist as well), and even include a CD with it for people to use as games or other exercises.
In a land where all women of a certain age are screened every five years routinely for breast cancer, the decision part is taken away from patients. This could be a good thing as more cancers will be picked up than missed. Using your figures, seven would be picked up where they otherwise wouldn’t and one would be missed. A different way of looking at the screening question. Maybe, in fact just a different question as the post explains that tests are not completely accurate to people who would otherwise assume over confidence in its result. This decision calculator is applicable in a case where an individual is trying to make a decision for themselves about whether or not to take a test as you have indicated. It is likely that the cancers that are missed by the mammogramme are at too early a stage for detection, the site or tissue consistency has interfered, or human error has occurred.
A long time ago I commented on a local issue in the Epping area about mammography. More has come to light since then.
One of the things that went wrong was that women had to be recalled and results re-reviewed from older results due to a ‘backlog’. This meant that a much higher incidence appeared to occur where, in fact, there were just more positive results due to a bunching together of a larger sample into what seemed like a smaller timeframe. I don’t know the technical way to describe that apart from a Clustering?
Here is the answer to part? of the Epping Forest breast cancer hot spot.
Ken, stop it, or I’ll blush.
Though I have been thinking of investing in a wig, perhaps a Halloween-style fright wig. Then I could hit the lecture circuit. Nobody disbelieves a scientist with crazy hair like Paulos’s.
I looked at a paper and you also assume no possible benefit (in the language of the paper k_11= k_00 = 0). I think this is a deficiency. Firstly, from a pragmatic point of view I would never participate in a test without any possible chance of a benefit. More specifically, the test should have an average benefit (not cost) otherwise the test is effectively cheating me. The other reason for setting k11 != 0 is that with a benefit, the decision about whether or not to use the test changes.
For example, in your mammogram example, you compute that the test as presented has no skill. However if the benefit for a correct positive were greater than 2.042/.007 ~ 292 in magnitude (i.e., cost <= -292), you would conclude that the test does have skill. It’s hard to say if that’s big or not but the point is that it would change the answer. The benefit that would give the test zero cost is -5.11/.007 = -730 (< -292) which would change the skill outcome.
I'm not sure I agree with your point about the positive costs of testing positive correctly. The test isn't giving you cancer (Schrodinger's cat aside). It's informing you of an existing state. Presumably you wouldn't take the test if there was no treatment or you didn't care about the outcome.
I'll admit I just scanned [Briggs, WM, 2005. A general method of incorporating forecast cost and loss in value scores. Monthly Weather Review, 133(11), 3393-3397] so I may have missed something.
Thanks. I think you’re confusing the costs of the forecast with the utility of the event. Maybe it helps to think that if you have the disease you will die with 100% certainty in 6 months, and that if you don’t you will happen upon 1 billion dollars with certainty in that same time. Now, those “costs” (one negative, one positive) are the same and will happen regardless if they are forecast or not. You either have the disease or not no matter how well you predict whether you have it or not. But if you want to decide on a system that makes predictions, you want one with the smallest expected cost of the system.