How To Do Predictive Statistics: Part X: Survival Analysis

Query: now that I’m a video master, would people like videos of these lessons? I don’t see how they’d help much, but who knows. Let me know below or via email.

Next up is survival analysis, a.k.a. time-to-event analysis. We could treat times to events as regular numbers, and use regression, or even tobit regression, or the like, except for a twist. And that twist is called censoring.

Suppose we’re studying when people hand in their dinner pails for the final time after shooting them up with some new Ask-Your-Doctor-About drug. Some will during our study gladden the hearts of undertakers, yet others will have frustratingly remained above ground. Yet we know they have an unbreakable appointment with the Lord. We know the keeling over times of the dead, but only the so-far times of the living. These so-far times are said to be censored.
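A toy illustration in R (made-up numbers, nothing to do with the data below): suppose the study ends at day 100.

```r
# Made-up event times; the study stops watching at day 100
true_day = c(20, 55, 140, 90, 210)      # true (unknowable to us) event times
time     = pmin(true_day, 100)          # what we actually record
censored = as.numeric(true_day > 100)   # 1 = still above ground at study end
cbind(time, censored)
```

The censored rows tell us only that the event comes sometime after day 100.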

There are other kinds of censoring, but today all we want to do is this so-called “right” censoring. (Wrong censoring happens in your more mature Western democracies.)

The first problem is finding useful software. Predictive methods are not yet so common that every package contains them. And the rare packages that have them rely on MCMC-type methods, about which more below. But if you don’t recall why these creatures are not what most think, then you must review: this and this at the least, and this too for good measure.

We’ve been using rstanarm, and it has a method that sorta kinda doesn’t not work, called stan_jm, for joint longitudinal-survival models. Longitudinal models measure things over time, like time series. There is a prediction method for this model, but it only produces predictions for the longitudinal part. There are some laborious workarounds, but our point here is not software per se, but understanding the philosophy of models. So we’ll leave it behind.

There is also spBayesSurv, which works, and which allows spatial processes to join the fun. The difficulty with it is that you have to work directly with design matrices, which aren’t especially hard to grasp, but again the code requirements will become a distraction for us.

So we’re going to use brms. The weakness here is resources. It is a memory hog (which is why I’ve been avoiding it up to now). It externally compiles models before running them. First time I tried this for the model below, I had several other instances of R running, along with a video editor, and it locked up my system. Close extraneous programs before beginning.

Obligatory anti-MCMC (mini) rant. Since you reviewed, or you remembered the cautions, you recall MCMC doesn’t do what it says, not exactly. The problem with it is not that useful answers can’t be extracted from MCMC methods—of course they can. The problem is that people find it so useful I fear not enough thinking is going into finding better analytical approximations to complex integrals, solutions which would bypass all these fictional “random” simulations. For one, we could learn to embrace discrete finite models, which are exact, and not approximations as all continuous models are (and which everybody forgets because of the Deadly Sin of Reification). Bonus: discrete finite models don’t need integrals, thus don’t need MCMC. End of rant.

The Code

install.packages('brms', dependencies=TRUE)


We’ll use the built-in kidney data. Type ?kidney to learn about it. It has a time to event (infection), a censoring indicator, age, sex, and disease type. The package authors already wrote the model code for us, to which I make only one change: assigning the data to x (for consistency).

library(brms)

# several packages have a dataset named kidney; make sure we load brms's
data("kidney", package = "brms")
x = kidney

fit = brm(time | cens(censored) ~ age + sex + disease, data = x,
          family = weibull, inits = "0")

The code form is mostly familiar from other models, except for the addition of time | cens(censored) to indicate this is time given censoring. The “weibull” is to characterize uncertainty in the time. ?brmsfamily shows the other options. When run, this will first show “Compiling the C++ model”. It could take considerable time here, too; minutes, maybe, depending on your resources. Then the MCMC bits begin.

As before, we could take time to examine all the MCMC diagnostics which give information about the parameters. This should be done. Do plot(fit). Look for “convergence”. I won’t do that here, because this example works fine.
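A minimal diagnostic pass, assuming the usual brms methods, looks like this:

```r
# Sketch of the usual checks; not a substitute for thinking
summary(fit)   # Rhat near 1 for each parameter suggests "convergence"
plot(fit)      # trace plots should look like fuzzy caterpillars
```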

In the end, we do not give a rat’s kidney about the parameters. They do not exist. They are mathematical struts that make the model work. That and nothing more. As always, we care about this:

     Pr( time in t | New age, sex, disease, D, M) (1),

where t is some time of interest, where we make guesses of new values of the measures, where D is the old data, and M the model. And where we remember that the priors and the MCMC suppositions all form the model M. Change the model—change any part of the model—change the probability! This is not a bug, it’s a feature. Change anything on the right hand side, change the probability. All probability is conditional.

We want predictions. Which is to say, we want equation (1). Which means we must supply guesses of age, sex, and disease type. If we didn’t want to specify guesses of age, sex, and disease type, we shouldn’t have put them into the model. Luckily, we have a ready supply of such guesses: the old data. The probabilities produced by (1) will not be for these old patients, though (unlike the supposition of classical hypothesis testing). We already know all about them. They will be for new patients who are “like” the old ones, where “like” is defined by us: an implicit right-hand-side assumption. It is there even though it doesn’t appear visibly.

p = predict(fit, probs = c(0.10, 0.90))

The probs = c(0.10, 0.90) is not the default, which instead is the old familiar number. But why on earth do we want 95% prediction intervals? That’s a mighty and harsh requirement for time predictions. It produces great uncertainty; why shouldn’t it? Such an interval says “Conditional on the model and so on, there is 95% chance this patient will have an event in this time interval.” Considering a 100% chance would be the interval (0, infinity), you can see a 95% interval would be wide, too. The default is there only because old habits die hard.
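To see how fast the intervals widen with coverage, compare intervals at different levels (same model, same rows):

```r
# Nothing magic about any coverage level; pick what your decision needs
p50 = predict(fit, probs = c(0.25, 0.75))    # 50% prediction intervals
p95 = predict(fit, probs = c(0.025, 0.975))  # the old familiar 95%
summary(p95[,4] - p95[,3])   # interval widths: much bigger than...
summary(p50[,4] - p50[,3])
```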

First few rows of p:

> p
       Estimate  Est.Error       Q10        Q90
 [1,]  44.91669   54.83210  3.636403  105.41077
 [2,] 197.94483  223.35119 17.389400  478.37263
 [3,]  42.76444   51.70631  3.689502   98.58186
 [4,] 211.71675  233.77522 18.804205  494.05413
 [5,]  46.70474   58.45127  3.842455  111.61812
 [6,] 221.68699  263.71488 19.417692  519.37713
 [7,]  40.13264   52.69928  3.114666   95.80451
 [8,] 196.17201  237.87493 16.869121  470.98349
 [9,] 123.36390  146.84011 10.551705  294.04778

And the first few rows of x (which are matched to these p):

> x
   time censored patient recur age    sex disease
1     8        0       1     1  28   male   other
2    23        0       2     1  48 female      GN
3    22        0       3     1  32   male   other
4   447        0       4     1  31 female   other
5    30        0       5     1  10   male   other
6    24        0       6     1  16 female   other
7     7        0       7     1  51   male      GN
8   511        0       8     1  55 female      GN
9    53        0       9     1  69 female      AN

Doesn’t look so hot, this model. We have to be careful how we interpret its performance, though, because of the censoring (none of the first nine were censored, meaning all had the event).

Let’s look at the empirical cumulative distribution functions for the data, and for the point predictions, busted out by censoring.

plot(ecdf(x$time[x$censored==0]), main = 'ECDF', xlab='Time')
   lines(ecdf(x$time[x$censored==1]), col=2)
   lines(ecdf(p[x$censored==0,1]), col=3)
   lines(ecdf(p[x$censored==1,1]), col=4)
   legend('bottomright', c('X not-cens', 'X cens', 'P not-cens', 'P cens'), col=1:4, lwd=3, bty='n')

There is a clear difference in distributions of times for censored and uncensored data. What is assumed is that the times for the censored patients will be larger than what is seen (obviously). There is no censoring in the predictions, of course; the breaking out by censoring is only to show the matching points with the data. That is, the model is not predicting whether a new patient will be censored, for that concept has no place in guessing a person’s eventual time to event—which may be “infinite”, i.e. they never get an infection. Censoring only happens in limited-time studies.

This all means the predicted times must be larger than what was seen. The censored points “push out” the ECDFs to higher numbers. Thus (at least with a Weibull) the model tends to over predict in a way.

And what all that means is that we can’t really compare the model’s predictions with the observed data. The only way to verify this model is to test it on new times.

Wait. Build a model, make predictions, then test how well the model performs in real life? Where have we heard that before? You know it, baby. Rightched here. And in this gorgeous award-eligible book.

So let’s examine the predictions themselves, knowing (as we knew for all our past efforts) that we’re limited to making statements only about the predictions and not their quality.

Here we run back into the screwiness of MCMC. Ideally, we’d specify a new age, sex, disease and compute (1), which would produce the same number (same prediction) for every duplicate entry of age, sex, and disease. Not so with MCMC, which produces hazy numbers. Run this:

i = order(x[,5], x[,6],x[,7]) # order by age, sex, disease
x = x[i,]
p = p[i,]

Then look at the first few rows of x:

> x
   time censored patient recur age    sex disease
5    30        0       5     1  10   male   other
36   12        0       5     1  10   male   other
24  132        0      27     1  10 female      GN
50  156        0      27     2  10 female      GN
6    24        0       6     1  16 female   other
14  536        0      15     1  17 female   other
37  245        0       6     1  17 female   other
68   25        1      15     2  17 female   other
31  119        0      35     1  22 female   other
57    8        0      35     2  22 female   other
1     8        0       1     1  28   male   other
33   16        0       1     1  28   male   other
20  402        0      22     1  30 female   other
72   24        1      22     2  30 female   other

The first two patients are the same! As are many of the others. At this point somebody will chirp up “But those data are correlated! This is the wrong model!” Which, I have to tell you, is empty of meaning, or ambiguous. Correlated to us means that, when conditioned on, a probability changes. It does not mean cause. Since probability is conditional on the sort of model we choose, and on everything else on the right hand side, it is not clear how multiple measures on patients would change the probability. Other models are easy to explore; the package authors even thought of some. ?kidney will show them to you (scroll to the bottom).

We’re going to ignore the multiple measures aspect (we’re not in this lesson trying to build the world’s most predictive model of kidney infections). Instead we’ll suppose, as happens, we have some rows of data that are the same. The first two rows of data are identical, as far as (1) goes. They have these predictions:

> p
       Estimate  Est.Error       Q10        Q90
 [1,]  46.70474   58.45127  3.842455  111.61812
 [2,]  46.26424   56.88700  3.454595  111.20644

Again, if these were analytical results, or non “simulated” results, these rows would be identical. They are not. They’re close, and whether “close” is close enough depends on the decisions that would be made—and on nothing else. I’m going to say close enough. But you might not. If not, you have to find a way to merge them, either by some kind of averaging, say, or by working through the parent code and hacking the simulation to group like rows, or whatever. Far-apartness would then be an indication the model did not “converge”.
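One crude way to do that merging, assuming plain averaging over identical measurement rows is good enough for your decision (a sketch, not the only way):

```r
# Average the predictions over rows with identical (age, sex, disease)
key = interaction(x$age, x$sex, x$disease, drop = TRUE)
p_merged = aggregate(as.data.frame(p), by = list(measures = key), FUN = mean)
head(p_merged)
```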


Okay! So hypothesis testing is out. How do we test measures? Class? Class?

If you said relevance, you’re right! Good memory. Let’s first look at all the predictions in some useful way. (The reordering of x and p won’t matter.) We could do something like this.

jit = rnorm(nrow(p), 0, .2)
plot(x$age+jit, p[,1], pch=as.numeric(x$sex), col=as.numeric(x$disease),
    cex=2, ylab='Times', xlab='Age', ylim=c(0, max(p)))
for(i in 1:nrow(p)){
  lines(c(x$age[i]+jit[i], x$age[i]+jit[i]), c(p[i,3], p[i,4]), col=as.numeric(x$disease[i]))
}
legend('topleft', c('male', 'female', 'other', 'GN', 'AN', 'PKD'), col=c('#999999','#999999',1:4),
    pch=c(1,2,NA,NA,NA,NA), lty=c(NA,NA,1,1,1,1), lwd=3, bty='n')

The jit adds a bit of jitter (which needs to be saved, so the interval lines land on the same jittered points) to separate the points. Remember: we’re looking for differences in probability and not just point predictions.

The changes in probabilities for sex are obvious, as they are for diseases AN and PKD versus the other two. The changes in probabilities are not so great for age, except for two females with PKD (it’s the same two patients measured twice each). These are the only females with PKD, and the suspicion is age doesn’t matter too much, but the combination of female and PKD does. I’m not a kidneyologist, so I don’t know what this means.

We can test these easily.

y = data.frame(age = seq(10,70,10), sex=rep('female',7), disease=rep('PKD',7))
p = predict(fit, newdata=y, probs = c(0.10, 0.90))

> cbind(round(p),y)
  Estimate Est.Error Q10  Q90 age    sex disease
1     1084      1884  56 2509  10 female     PKD
2     1022      1539  63 2481  20 female     PKD
3      951      1375  67 2353  30 female     PKD
4      914      1340  64 2176  40 female     PKD
5      942      1468  63 2279  50 female     PKD
6      876      1279  68 2066  60 female     PKD
7      901      1309  58 2140  70 female     PKD

So age does change the probability (and in a way that makes sense to naive readers like me). Is this change enough to make a difference? I have no idea, and unless you are a kidney guy, neither do you. I don’t know what kind of decisions are pertinent. These kinds of decisions are not up to the statistician to make. There are no magic numbers in the predictive way. P-values presume to give probabilities and make decisions simultaneously. We won’t make that enormous mistake.


You can repeat the same thing but for sex and disease. If you know something about kidneys, let us know below.

Then fit the second model, where it says (from ?kidney) “adding random intercepts over patients”. That’s a misnomer. Everything not known in a Bayesian analysis is “random”, which is nothing but a synonym for unknown.
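The second model looks something like this (a sketch; check ?kidney for the authors’ exact formula, which may differ in its interactions):

```r
# An unknown ("random") intercept per patient; formula follows the
# pattern of the examples in ?kidney, which you should check yourself
fit2 = brm(time | cens(censored) ~ age + sex + disease + (1 | patient),
    data = x, family = weibull, inits = "0")
p2 = predict(fit2, probs = c(0.10, 0.90))
```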

Compare directly the predictions (don’t forget you sorted p above) from both. We cannot say which of these models is better in a predictive sense per se: not until we get new data in. But what can you say?

Advanced readers should try this. brms is limited, unlike rstanarm, because its prediction method only spits out a point and prediction bounds. In rstanarm you get the whole distribution.

First pick a combination of the measures, and then a time you think is interesting. Suppose it’s 300. Now find the probability of exceeding that time with your given combination. This is trivial in rstanarm. Here you need to optimize. Next take a systematic series of measures (age, sex, disease) and plot the exceedance probabilities for this sequence. Different times will lead, of course, to different curves. The differences in those curves may be big or small depending on the decisions to be made conditional on the model. Proving, yet again, that the same model may be useful to one man, and useless for the next.
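One way to do the optimizing, a sketch assuming predict’s quantile columns: search for the level q at which the predictive q-quantile equals 300; then Pr(time > 300 | measures, D, M) is about 1 - q. The combination of measures here is made up, and the MCMC haziness makes the answer approximate (each predict call re-simulates).

```r
y = data.frame(age = 40, sex = 'female', disease = 'PKD')  # made-up guesses
# f(q) = (predictive q-quantile for y) - 300; find the root in q
f = function(q) predict(fit, newdata = y, probs = q)[1, 3] - 300
q300 = uniroot(f, interval = c(0.01, 0.99))$root
1 - q300   # approximate Pr(time > 300 | new measures, D, M)
```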


  1. Mike

    Thank you for the nice example! Does brms allow plotting something like adjusted Kaplan-Meier curves for each time point?

  2. a

    There are multiple datasets named kidney, I think the following one is needed:

    data("kidney", package = "brms")
