\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\MVN}{MVN} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Survival curves

Peter Ralph

Advanced Biological Statistics

Overview

“Survival” analysis:

We are interested in how long until some particular thing happens, and how this depends on some covariates.

Example: how years until death depends on cancer grade at diagnosis and drug treatment.

Example: how day of first budburst depends on species and local temperature.

Surivival data

Ex: For each patient, date of diagnosis and cancer grade; date of death or last follow-up.

Ex: For each plant, species, date of first budburst or last survey.

Both examples are right censored: for some subjects, we don’t know how the actual time, only a lower bound on it.

Key assumption: any censoring is noninformative,

i.e., our data collection does not depend on the status of the subjects.

Example of informative censoring: patient dropout due to worsening symptoms.

What do we want to know?

The survival curve:

\[\begin{aligned} S(t) &= \P\{\text{still 'alive' after $t$ time units}\} , \end{aligned}\]

Note: this is always decreasing.

and the hazard rate:

\[\begin{aligned} h(t) &= \text{(mean number of 'deaths' per still-alive subject,} \\ &\qquad \text{per unit time at $t$)} , \end{aligned}\]

which is

\[\begin{aligned} h(t) = - \frac{d}{dt} \log S(t) . \end{aligned}\]

The hazard rate,

the rate the event of interest happens at per unit time,

is the slope of the survival curve on a log scale.

Nonparametric versus parametric?

We’ll look at some methods across the nonparametric-to-parametric continuum.

  • Nonparametric: fewer assumptions.
  • Parametric: fits a full, generative probability model.

Replacing “death” with “your next donut”, which of these curves would you rather describe the distribution of time until your next donut? plot of chunk r exs

More information:

Datasets:

NCCTG Lung Cancer Data

library(survival)
data(lung)
## help(lung)
NCCTG Lung Cancer Data

Description:

     Survival in patients with advanced lung cancer from the North
     Central Cancer Treatment Group.  Performance scores rate how well
     the patient can perform usual daily activities.

Usage:

     lung
     cancer
     
Format:

       inst:       Institution code                                                    
       time:       Survival time in days                                               
       status:     censoring status 1=censored, 2=dead                                 
       age:        Age in years                                                        
       sex:        Male=1 Female=2                                                     
       ph.ecog:    ECOG performance score as rated by the physician. 
                   0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed 
                   <50% of the day, 3= in bed > 50% of the day but not bedbound, 4 = 
                   bedbound 
       ph.karno:   Karnofsky performance score (bad=0-good=100) rated by physician     
       pat.karno:  Karnofsky performance score as rated by patient                     
       meal.cal:   Calories consumed at meals                                          
       wt.loss:    Weight loss in last six months                                      
      
Note:

     The use of 1/2 for alive/dead instead of the usual 0/1 is a
     historical footnote.  For data contained on punch cards, IBM 360
     Fortran treated blank as a zero, which led to a policy within the
     section of Biostatistics to never use "0" as a data value since
     one could not distinguish it from a missing value.  The policy
     became a habit, as is often the case; and the 1/2 coding endured
     long beyond the demise of punch cards and Fortran.

Source:

     Terry Therneau

References:

     Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ.  Kugler
     JW. Bartel J. Law M. Bateman M. Klatt NE. et al.  Prospective
     evaluation of prognostic variables from patient-completed
     questionnaires. North Central Cancer Treatment Group.  Journal of
     Clinical Oncology. 12(3):601-7, 1994.

Simulation: Constant hazard rate

Time until arrival of a high-energy neutrino in each of many detectors:

nrate <- 1 / 365
study_time <- 2 * 365
nobs <- 228
neutrinos <- data.frame(
        detector_id = 1:nobs,
        time = round(rexp(nobs, rate=nrate), 1))
neutrinos$status <- (neutrinos$time < study_time)
neutrinos$time <- pmin(study_time, neutrinos$time)
head(neutrinos)
##   detector_id  time status
## 1           1  72.4   TRUE
## 2           2 241.2   TRUE
## 3           3 103.5   TRUE
## 4           4  13.9   TRUE
## 5           5 172.7   TRUE
## 6           6 534.2   TRUE

plot of chunk r plot_neutrs

Simulation: Increasing hazard rate

Time until failure of lightbulbs, that wear out as time goes on:

lmean <- 2 * 365
bulbs <- data.frame(
        bulb_id = 1:nobs,
        time = abs(rnorm(nobs, sd=lmean)))
bulbs$status <- (bulbs$time < study_time)
bulbs$time <- pmin(study_time, bulbs$time)
head(bulbs)
##   bulb_id     time status
## 1       1 511.5466   TRUE
## 2       2 404.5635   TRUE
## 3       3 610.5038   TRUE
## 4       4 730.0000  FALSE
## 5       5 149.6198   TRUE
## 6       6 251.9142   TRUE

plot of chunk r plot_lightbulbs

Kaplan-Meier curves

Estimating probabilities

Suppose that:

  • we start with 100 chickens
  • in the first year, 25 of them die
  • in the second year, 25 of them die
  • at the end of the second year, we give 25 of the survivors away
  • in the third year, 15 of them die

Questions:

  • What percent survived each year? (of those that began that year alive)
  • Find the probability of surviving for 1, 2, and 3 years,
  • and draw the survival curve.

The Kaplan-Meier survival curve is a purely empirical, nonparametric estimate of survival probability:

  • one interval per unique event time
  • estimated probability of surviving an interval is the proportion that did
  • and so the probability of surviving until \(t\) is the product of all the survival probabilities before \(t\)
plot(Surv(time=rep(1:3, times=c(25, 50, 25)),
          event=c(rep(2, 25), rep(2, 25), rep(1, 25), rep(2, 15), rep(1, 10))), xlab="years", ylab="probability of survival", ylim=c(0,1))

plot of chunk r chickens

Neutrinos:

neutrino_sf <- survfit(Surv(time, status) ~ 1, data=neutrinos)
plot(neutrino_sf, xlab='number of days', ylab='prob of detection', main="Neutrino detection")
lines(1:study_time, pexp(1:study_time, rate=nrate, lower.tail=FALSE), col='red')

plot of chunk r plot_neutrinos

Light bulbs:

bulb_sf <- survfit(Surv(time, status) ~ 1, data=bulbs)
plot(bulb_sf, xlab='number of days', ylab='prob of detection', main="Lightbulb failure")
lines(1:study_time, 2 * pnorm(1:study_time, sd=lmean, lower.tail=FALSE), col='red')

plot of chunk r plot_lbs

Hazard rate is slope of survival on a log scale:

layout(t(1:2))
plot(neutrino_sf, xlab='number of days', ylab='prob of detection', main="Neutrino detection", log='y')
plot(bulb_sf, xlab='number of days', main="Lightbulb failure", log='y')

plot of chunk r plot_both

Lung cancer survival:

lung_sf <- survfit(Surv(time, status) ~ 1, data=lung)
plot(lung_sf, conf.int=TRUE, xlab='number of days', ylab='prob of survival', main="Lung cancer survival")

plot of chunk r plot_km

Hazard rate is slope of survival on a log scale:

plot of chunk r plot_km1

Survival, by sex:

lung_sf_bysex <- survfit(Surv(time, status) ~ sex, data=lung)
plot(lung_sf_bysex, col=1:2, conf.int=TRUE, xlab='number of days', ylab='prob of survival', main="Lung cancer survival")
legend("topright", lty=c(1,1,2), col=c(1,2,1), legend=c("Male", "Female", "95% conf int"))

plot of chunk r plot_km2

Interpretation interlude:

Interpret this:

a retrospective analysis of 2,733 patients with confirmed COVID-19 admitted at five New York City hospitals within the Mount Sinai system between 3/14 and 4/11

  • What is the “number at risk”?
  • The blue curve hits 50% at 20 days. What does that mean?
  • “Patients treated with anticoagulation were generally sicker” - so, what do we conclude?
Survival curves
// reveal.js plugins