\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\MVN}{MVN} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Parametric Survival Analysis

Peter Ralph

Advanced Biological Statistics

Fully parametric models

Modeling

The idea: choose a (parametric) distribution for the time-until-event.

Some choices:

  • Exponential (constant rate)
  • Weibull
  • Gompertz
  • logNormal

Model:

  • lifetimes are Weibull distributed
  • mean lifetime depends on covariates

Model:

  • lifetimes are Weibull distributed
  • mean lifetime depends on covariates

\[\begin{aligned} T_i &\sim \Weibull(\text{shape}=k, \\ &\qquad \text{mean}=\mu_i) \\ \mu_i &= \exp(y_i) \\ y_i &= \beta_0 + \sum_j \beta_j x_{ij} \end{aligned}\]

Three ways to do parametric survival analysis:

  • survival::survreg : maximum likelihood
  • brms::brm : Baysian, with Stan
  • MC Stan

Maximum likelihood: survreg

weifit <- survreg(Surv(time, status) ~ sex + ph.ecog + ph.karno,
                  data = lung, dist='weibull')

Bayesian: brms

lung$censored <- ifelse(lung$status == 1, 'right', 'none')

brmfit <- brm(time | cens(censored) ~ sex + ph.ecog + ph.karno,
              data=lung, family='weibull')
## Compiling Stan program...
## Start sampling


Call:
survreg(formula = Surv(time, status) ~ sex + ph.ecog + ph.karno, 
    data = lung, dist = "weibull")
               Value Std. Error     z       p
(Intercept)  6.66434    0.65785 10.13 < 2e-16
sex          0.41141    0.12287  3.35 0.00081
ph.ecog     -0.47729    0.12618 -3.78 0.00016
ph.karno    -0.00906    0.00675 -1.34 0.17983
Log(scale)  -0.32452    0.06207 -5.23 1.7e-07

Scale= 0.723 

Weibull distribution
Loglik(model)= -1126.3   Loglik(intercept only)= -1141.1
    Chisq= 29.49 on 3 degrees of freedom, p= 1.8e-06 
Number of Newton-Raphson Iterations: 5 
n=226 (2 observations deleted due to missingness)
 Family: weibull 
  Links: mu = log; shape = identity 
Formula: time | cens(censored) ~ sex + ph.ecog + ph.karno 
   Data: lung (Number of observations: 226) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     6.61      0.65     5.34     7.89 1.00     2792     2974
sex           0.42      0.13     0.17     0.67 1.00     3328     2444
ph.ecog      -0.49      0.13    -0.73    -0.24 1.00     2675     2826
ph.karno     -0.01      0.01    -0.02     0.00 1.00     2699     2859

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     1.36      0.09     1.20     1.53 1.00     3453     2961

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Random effects, with brms

Recall there are 19 different institutions:

table( lung$inst <- factor(lung$inst) )
## 
##  1  2  3  4  5  6  7 10 11 12 13 15 16 21 22 26 32 33 
## 36  5 19  4  9 14  8  4 18 23 20  6 16 13 17  6  7  2
brmfit2 <- brm(time | cens(censored) ~ sex + ph.ecog + ph.karno + (1|inst),
               data=subset(lung, !is.na(inst)), family='weibull')
## Compiling Stan program...
## Start sampling

summary(brmfit2)
##  Family: weibull 
##   Links: mu = log; shape = identity 
## Formula: time | cens(censored) ~ sex + ph.ecog + ph.karno + (1 | inst) 
##    Data: subset(lung, !is.na(inst)) (Number of observations: 225) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Group-Level Effects: 
## ~inst (Number of levels: 18) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.12      0.08     0.01     0.32 1.00     1409     1540
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     6.62      0.68     5.26     7.97 1.00     2918     2918
## sex           0.42      0.13     0.18     0.68 1.00     4429     3074
## ph.ecog      -0.51      0.13    -0.78    -0.26 1.00     2800     2653
## ph.karno     -0.01      0.01    -0.02     0.00 1.00     2962     3149
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape     1.36      0.09     1.20     1.53 1.00     4190     3144
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The conditional effect of sex on mean survival time (with mean values of other variables):

conditional_effects(brmfit2, effects='sex')

plot of chunk r insts_sex

The conditional effect of ECOG on mean survival time:

conditional_effects(brmfit2, effects='ph.ecog')

plot of chunk r insts_ecog

The conditional effect of Karnofsky score on mean survival time:

conditional_effects(brmfit2, effects='ph.karno')

plot of chunk r insts_karno

A posterior predictive check: data are dark points, posterior medians are large points, and lines give 50% and 90% posterior ranges:

pp_check(brmfit2, type='intervals')

plot of chunk r ppcheck

Consulting vignette(“brms_families”), the survival curve is \(\exp(-(t / \lambda)^k)\) where \(\lambda = \mu / \Gamma(1 + 1/k)\). Here are 100 samples from the posterior: plot of chunk r surv_curvs

Here’s the code for that last one:

post_means <- posterior_epred(brmfit2, newdata=data.frame(age=60, sex=2, ph.ecog=0:3, ph.karno=50), re_formula=NA, dpar="mu")[1:100,]
post_shape <- posterior_epred(brmfit2, newdata=data.frame(age=60, sex=2, ph.ecog=0:3, ph.karno=50), re_formula=NA, dpar="shape")[1:100,]
tvals <- seq(0, 2000, length.out=101)
Svals <- array(NA, dim=c(length(tvals), nrow(post_means), ncol(post_means)))
for (k in 1:nrow(post_means)) {
    lambda <- post_means[k,] / gamma(1 + 1/post_shape[k,])
    for (j in 1:ncol(post_means)) {
        Svals[,k,j] <- exp(- (tvals / lambda[j])^post_shape[k,j])
    }
}
layout(matrix(1:4, nrow=2, byrow=TRUE))
for (j in 1:dim(Svals)[3]) {
    plot(tvals, rowMeans(Svals[,,j]), lty=1, ylim=c(0,1), type='l', xlab='time (t)', ylab='S(t)', col=j, main=paste("ECOG =", j-1))
    matlines(tvals, Svals[,,j], type='l', col=adjustcolor(j, 0.25), lty=1)
    abline(v=1000, lty=3)
}

Application

Fit (and select) a proportional hazards model to the veteran dataset. What are the most important predictors of survival?

data(cancer, package='survival')
## help(veteran)
Veterans' Administration Lung Cancer study

Description:

     Randomised trial of two treatment regimens for lung cancer.  This
     is a standard survival analysis data set.

Usage:

     veteran
     
Format:

       trt:       1=standard 2=test                            
       celltype:  1=squamous,  2=smallcell,  3=adeno,  4=large 
       time:      survival time                                
       status:    censoring status                             
       karno:     Karnofsky performance score (100=good)       
       diagtime:  months from diagnosis to randomisation       
       age:       in years                                     
       prior:     prior therapy 0=no, 10=yes      

Review

Nonparametric:

  • simple
  • does not depend on model fit

Parametric:

  • makes more sense: real hazard rates are not piecewise constant
  • better power, if the model fits
  • can simulate

The Kaplan-Meier curve, by hand:

f1 <- survfit(Surv(time, status) ~ 1, data = lung)

y <- data.frame(time=sort(unique(lung$time[lung$status == 2])))
y$k <- sapply(y$time, function (t) sum(lung$status == 2 & lung$time == t))
y$n <- sapply(y$time, function (t) sum(lung$time > t | (lung$time == t & lung$status == 2)))
y$p <- cumprod(1 - y$k/y$n)

plot(f1)
points(p ~ time, data=y, col='red', pch=20, cex=0.5)

plot of chunk r byhand

// reveal.js plugins