Survival curves

Peter Ralph

Advanced Biological Statistics

Overview

“Survival” analysis:

We are interested in how long until some particular thing happens, and how this depends on some covariates.

Example: how years until death depends on cancer grade at diagnosis and drug treatment.

Example: how day of first budburst depends on species and local temperature.

Surivival data

Ex: For each patient, date of diagnosis and cancer grade; date of death or last follow-up.

Ex: For each plant, species, date of first budburst or last survey.

Both examples are right censored: for some subjects, we don’t know how the actual time, only a lower bound on it.

Key assumption: any censoring is noninformative,

i.e., our data collection does not depend on the status of the subjects.

Example of informative censoring: patient dropout due to worsening symptoms.

What do we want to know?

The survival curve:

\[\begin{aligned} S(t) &= \P\{\text{still 'alive' after $t$ time units}\} , \end{aligned}\]

Note: this is always decreasing.

and the hazard rate:

\[\begin{aligned} h(t) &= \text{(mean number of 'deaths' per still-alive subject,} \\ &\qquad \text{per unit time at $t$)} , \end{aligned}\]

which is

\[\begin{aligned} h(t) = - \frac{d}{dt} \log S(t) . \end{aligned}\]

The hazard rate,

the rate the event of interest happens at per unit time,

is the slope of the survival curve on a log scale.

Nonparametric versus parametric?

We’ll look at some methods across the nonparametric-to-parametric continuum.

Nonparametric: fewer assumptions.
Parametric: fits a full, generative probability model.

Replacing “death” with “your next donut”, which of these curves would you rather describe the distribution of time until your next donut? plot of chunk r exs

More information:

A series of nice reviews, by Clark, Bradburn, Love, and Altman.
A tutorial by Emily Zabor.

Rightarrow the CRAN task view

Datasets:

NCCTG Lung Cancer Data

library(survival)
data(lung)
## help(lung)

NCCTG Lung Cancer Data

Description:

     Survival in patients with advanced lung cancer from the North
     Central Cancer Treatment Group.  Performance scores rate how well
     the patient can perform usual daily activities.

Usage:

     lung
     cancer
     
Format:

       inst:       Institution code                                                    
       time:       Survival time in days                                               
       status:     censoring status 1=censored, 2=dead                                 
       age:        Age in years                                                        
       sex:        Male=1 Female=2                                                     
       ph.ecog:    ECOG performance score as rated by the physician. 
                   0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed 
                   <50% of the day, 3= in bed > 50% of the day but not bedbound, 4 = 
                   bedbound 
       ph.karno:   Karnofsky performance score (bad=0-good=100) rated by physician     
       pat.karno:  Karnofsky performance score as rated by patient                     
       meal.cal:   Calories consumed at meals                                          
       wt.loss:    Weight loss in last six months                                      
      
Note:

     The use of 1/2 for alive/dead instead of the usual 0/1 is a
     historical footnote.  For data contained on punch cards, IBM 360
     Fortran treated blank as a zero, which led to a policy within the
     section of Biostatistics to never use "0" as a data value since
     one could not distinguish it from a missing value.  The policy
     became a habit, as is often the case; and the 1/2 coding endured
     long beyond the demise of punch cards and Fortran.

Source:

     Terry Therneau

References:

     Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ.  Kugler
     JW. Bartel J. Law M. Bateman M. Klatt NE. et al.  Prospective
     evaluation of prognostic variables from patient-completed
     questionnaires. North Central Cancer Treatment Group.  Journal of
     Clinical Oncology. 12(3):601-7, 1994.

Simulation: Constant hazard rate

Time until arrival of a high-energy neutrino in each of many detectors:

nrate <- 1 / 365
study_time <- 2 * 365
nobs <- 228
neutrinos <- data.frame(
        detector_id = 1:nobs,
        time = round(rexp(nobs, rate=nrate), 1))
neutrinos$status <- (neutrinos$time < study_time)
neutrinos$time <- pmin(study_time, neutrinos$time)
head(neutrinos)

##   detector_id  time status
## 1           1  72.4   TRUE
## 2           2 241.2   TRUE
## 3           3 103.5   TRUE
## 4           4  13.9   TRUE
## 5           5 172.7   TRUE
## 6           6 534.2   TRUE

plot of chunk r plot_neutrs

Simulation: Increasing hazard rate

Time until failure of lightbulbs, that wear out as time goes on:

lmean <- 2 * 365
bulbs <- data.frame(
        bulb_id = 1:nobs,
        time = abs(rnorm(nobs, sd=lmean)))
bulbs$status <- (bulbs$time < study_time)
bulbs$time <- pmin(study_time, bulbs$time)
head(bulbs)

##   bulb_id     time status
## 1       1 511.5466   TRUE
## 2       2 404.5635   TRUE
## 3       3 610.5038   TRUE
## 4       4 730.0000  FALSE
## 5       5 149.6198   TRUE
## 6       6 251.9142   TRUE

plot of chunk r plot_lightbulbs

Kaplan-Meier curves

Estimating probabilities

Suppose that:

we start with 100 chickens
in the first year, 25 of them die
in the second year, 25 of them die
at the end of the second year, we give 25 of the survivors away
in the third year, 15 of them die

Questions:

What percent survived each year? (of those that began that year alive)
Find the probability of surviving for 1, 2, and 3 years,
and draw the survival curve.

The Kaplan-Meier survival curve is a purely empirical, nonparametric estimate of survival probability:

one interval per unique event time
estimated probability of surviving an interval is the proportion that did
and so the probability of surviving until $t$ is the product of all the survival probabilities before $t$

plot(Surv(time=rep(1:3, times=c(25, 50, 25)),
          event=c(rep(2, 25), rep(2, 25), rep(1, 25), rep(2, 15), rep(1, 10))), xlab="years", ylab="probability of survival", ylim=c(0,1))

plot of chunk r chickens

Neutrinos:

neutrino_sf <- survfit(Surv(time, status) ~ 1, data=neutrinos)
plot(neutrino_sf, xlab='number of days', ylab='prob of detection', main="Neutrino detection")
lines(1:study_time, pexp(1:study_time, rate=nrate, lower.tail=FALSE), col='red')

plot of chunk r plot_neutrinos

Light bulbs:

bulb_sf <- survfit(Surv(time, status) ~ 1, data=bulbs)
plot(bulb_sf, xlab='number of days', ylab='prob of detection', main="Lightbulb failure")
lines(1:study_time, 2 * pnorm(1:study_time, sd=lmean, lower.tail=FALSE), col='red')

plot of chunk r plot_lbs

Hazard rate is slope of survival on a log scale:

layout(t(1:2))
plot(neutrino_sf, xlab='number of days', ylab='prob of detection', main="Neutrino detection", log='y')
plot(bulb_sf, xlab='number of days', main="Lightbulb failure", log='y')

plot of chunk r plot_both

Lung cancer survival:

lung_sf <- survfit(Surv(time, status) ~ 1, data=lung)
plot(lung_sf, conf.int=TRUE, xlab='number of days', ylab='prob of survival', main="Lung cancer survival")

plot of chunk r plot_km

Hazard rate is slope of survival on a log scale:

plot of chunk r plot_km1

Survival, by sex:

lung_sf_bysex <- survfit(Surv(time, status) ~ sex, data=lung)
plot(lung_sf_bysex, col=1:2, conf.int=TRUE, xlab='number of days', ylab='prob of survival', main="Lung cancer survival")
legend("topright", lty=c(1,1,2), col=c(1,2,1), legend=c("Male", "Female", "95% conf int"))

plot of chunk r plot_km2

Interpretation interlude:

Interpret this:

a retrospective analysis of 2,733 patients with confirmed COVID-19 admitted at five New York City hospitals within the Mount Sinai system between 3/14 and 4/11

What is the “number at risk”?
The blue curve hits 50% at 20 days. What does that mean?
“Patients treated with anticoagulation were generally sicker” - so, what do we conclude?

From Paranjpe et al via emcrit.org/pulmcrit/sinai