Peter Ralph
7 January 2020 – Advanced Biological Statistics
We are interested in how long until some particular thing happens, and how this depends on some covariates.
Example: how years until death depends on cancer grade at diagnosis and drug treatment.
Example: how day of first budburst depends on species and local temperature.
Ex: For each patient, date of diagnosis and cancer grade; date of death or last follow-up.
Ex: For each plant, species, date of first budburst or last survey.
Both examples are right censored: for some subjects, we don’t know how the actual time, only a lower bound on it.
Key assumption: any censoring is noninformative,
i.e., our data collection does not depend on the status of the subjects.
Examples of informative censoring: patient dropout due to worsening symptoms.
The survival curve:
\[\begin{aligned} S(t) &= \P\{\text{still 'alive' after $t$ time units}\} , \end{aligned}\]
Note: this is always decreasing.
and the hazard rate:
\[\begin{aligned} h(t) &= \text{(mean number of 'deaths' per still-alive subject,} \\ &\qquad \text{per unit time at $t$)} , \end{aligned}\]
which is
\[\begin{aligned} h(t) = - \frac{d}{dt} \log S(t) . \end{aligned}\]
We’ll look at some methods across the nonparametric-to-parametric continuum.
Interpret the differences between these survival curves:
the CRAN task view
NCCTG Lung Cancer Data
Description:
Survival in patients with advanced lung cancer from the North
Central Cancer Treatment Group. Performance scores rate how well
the patient can perform usual daily activities.
Usage:
lung
cancer
Format:
inst: Institution code
time: Survival time in days
status: censoring status 1=censored, 2=dead
age: Age in years
sex: Male=1 Female=2
ph.ecog: ECOG performance score as rated by the physician.
0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed
<50% of the day, 3= in bed > 50% of the day but not bedbound, 4 =
bedbound
ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: Karnofsky performance score as rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months
Note:
The use of 1/2 for alive/dead instead of the usual 0/1 is a
historical footnote. For data contained on punch cards, IBM 360
Fortran treated blank as a zero, which led to a policy within the
section of Biostatistics to never use "0" as a data value since
one could not distinguish it from a missing value. The policy
became a habit, as is often the case; and the 1/2 coding endured
long beyond the demise of punch cards and Fortran.
Source:
Terry Therneau
References:
Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler
JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective
evaluation of prognostic variables from patient-completed
questionnaires. North Central Cancer Treatment Group. Journal of
Clinical Oncology. 12(3):601-7, 1994.
Time until arrival of a high-energy neutrino in each of many detectors:
Time until failure of lightbulbs, that wear out as time goes on:
Suppose that:
Questions:
The Kaplan-Meier survival curve is a purely empirical, nonparametric estimate of survival probability:
Survival analysis is used to estimate the chance of surviving a given type of cancer for a given number of years.
Question: How accurate is that?
Goal: understand how covariates affect survival time.
Idea: modify a nonparametric hazard rate by a linear predictor: \[ h(t) = h_0(t) e^{\beta x} .\]
How does survival probability depend on the covariates?
age: Age in years
sex: Male=1 Female=2
ph.ecog: ECOG performance score as rated by the physician.
(good=0-bad=4)
ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: Karnofsky performance score as rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months
How does survival probability depend on the covariates?
age: Age in years
sex: Male=1 Female=2
ph.ecog: ECOG performance score as rated by the physician.
(good=0-bad=4)
ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: Karnofsky performance score as rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months
\[\begin{aligned} h(t) &= \text{(baseline hazard rate at $t$)} \\ y_i &= \beta_0 + \sum_j \beta_j x_{ij} \\ h_i(t) &= \text{(hazard rate for $i$ at $t$)} \\ &= h(t) \times \exp\left( y_i \right) \\ S_i(t) &= \P\left\{\text{survival of $i$ until $t$}\right\} \\ &= \exp\left( - \int_0^t h_i(s) ds \right) . \end{aligned}\]
fullcox <- coxph(Surv(time, status)
~ age + sex + ph.ecog + ph.karno + pat.karno
+ meal.cal + wt.loss,
data = lung)
summary(fullcox)
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + ph.ecog + ph.karno +
## pat.karno + meal.cal + wt.loss, data = lung)
##
## n= 168, number of events= 121
## (60 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 1.065e-02 1.011e+00 1.161e-02 0.917 0.35906
## sex -5.509e-01 5.765e-01 2.008e-01 -2.743 0.00609 **
## ph.ecog 7.342e-01 2.084e+00 2.233e-01 3.288 0.00101 **
## ph.karno 2.246e-02 1.023e+00 1.124e-02 1.998 0.04574 *
## pat.karno -1.242e-02 9.877e-01 8.054e-03 -1.542 0.12316
## meal.cal 3.329e-05 1.000e+00 2.595e-04 0.128 0.89791
## wt.loss -1.433e-02 9.858e-01 7.771e-03 -1.844 0.06518 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0107 0.9894 0.9880 1.0340
## sex 0.5765 1.7347 0.3889 0.8545
## ph.ecog 2.0838 0.4799 1.3452 3.2277
## ph.karno 1.0227 0.9778 1.0004 1.0455
## pat.karno 0.9877 1.0125 0.9722 1.0034
## meal.cal 1.0000 1.0000 0.9995 1.0005
## wt.loss 0.9858 1.0144 0.9709 1.0009
##
## Concordance= 0.651 (se = 0.029 )
## Likelihood ratio test= 28.33 on 7 df, p=2e-04
## Wald test = 27.58 on 7 df, p=3e-04
## Score (logrank) test = 28.41 on 7 df, p=2e-04
subcox <- coxph(Surv(time, status) ~ sex + ph.ecog, data = lung, subset=rowSums(is.na(lung[,2:10]))==0)
anova(fullcox, subcox)
## Analysis of Deviance Table
## Cox model: response is Surv(time, status)
## Model 1: ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss
## Model 2: ~ sex + ph.ecog
## loglik Chisq Df P(>|Chi|)
## 1 -498.75
## 2 -503.14 8.7792 5 0.1182
If \(T\) has a Weibull distribution, with scale \(\lambda\) and shape \(k\), then
It is mostly used in survival analysis, because its hazard rate is: \[\begin{aligned} h(t) = k \frac{1}{\lambda} \left(\frac{t}{\lambda}\right)^{k-1} . \end{aligned}\] which allows rates to go down (\(k<1\)), up (\(k>1\)), or stay constant (\(k=1\)) over time.
kvals <- c(0.5, 0.75, 1, 1.5, 2)
xx <- seq(0, 1.5, length.out=101)
x <- sapply(kvals, function (k) dweibull(xx, shape=k)/pweibull(xx, shape=k, lower.tail=FALSE))
matplot(xx, x, type='l', lty=1, xlab='time', ylab='hazard rate', lwd=2)
legend("top", lty=1, col=1:5, legend=sprintf("shape=%0.2f", kvals))
The idea: choose a (parametric) distribution for the time-until-event.
Some choices:
Model:
Model:
\[\begin{aligned} T_i &\sim \Weibull(\text{shape}=k, \text{mean}=\mu_i) \\ \mu_i &= \exp(y_i) \\ y_i &= \beta_0 + \sum_j \beta_j x_{ij} \end{aligned}\]
Three ways to do parametric survival analysis:
survival::survreg
: maximum likelihoodbrms::brm
: Baysian, with Stanlung$censored <- ifelse(lung$status == 1, 'right', 'none')
brmfit <- brm(time | cens(censored) ~ sex + ph.ecog + ph.karno,
data=lung, family='weibull')
## Compiling the C++ model
## Start sampling
Call: survreg(formula = Surv(time, status) ~ sex + ph.ecog + ph.karno, data = lung, dist = "weibull") Value Std. Error z p (Intercept) 6.66434 0.65785 10.13 < 2e-16 sex 0.41141 0.12287 3.35 0.00081 ph.ecog -0.47729 0.12618 -3.78 0.00016 ph.karno -0.00906 0.00675 -1.34 0.17983 Log(scale) -0.32452 0.06207 -5.23 1.7e-07 Scale= 0.723 Weibull distribution Loglik(model)= -1126.3 Loglik(intercept only)= -1141.1 Chisq= 29.49 on 3 degrees of freedom, p= 1.8e-06 Number of Newton-Raphson Iterations: 5 n=226 (2 observations deleted due to missingness)
Family: weibull Links: mu = log; shape = identity Formula: time | cens(censored) ~ sex + ph.ecog + ph.karno Data: lung (Number of observations: 226) Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1; total post-warmup samples = 4000 Population-Level Effects: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS Intercept 6.61 0.67 5.36 7.99 1.00 2739 2666 sex 0.42 0.12 0.18 0.67 1.00 3502 2945 ph.ecog -0.49 0.13 -0.74 -0.24 1.00 2531 2818 ph.karno -0.01 0.01 -0.02 0.00 1.00 2644 2799 Family Specific Parameters: Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS shape 1.36 0.08 1.20 1.53 1.00 3708 3039 Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).
brmfit2 <- brm(time | cens(censored) ~ sex + ph.ecog + ph.karno + (1|inst),
data=lung, family='weibull')
## Compiling the C++ model
## Start sampling
## Family: weibull
## Links: mu = log; shape = identity
## Formula: time | cens(censored) ~ sex + ph.ecog + ph.karno + (1 | inst)
## Data: lung (Number of observations: 226)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~inst (Number of levels: 19)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.12 0.08 0.01 0.31 1.00 1639 1858
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 6.63 0.67 5.36 7.95 1.00 3514 3133
## sex 0.42 0.13 0.18 0.67 1.00 5396 2613
## ph.ecog -0.50 0.13 -0.76 -0.25 1.00 3342 3061
## ph.karno -0.01 0.01 -0.02 0.00 1.00 3398 3148
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape 1.37 0.09 1.20 1.54 1.00 4942 2869
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
## $inst
## , , Intercept
##
## Estimate Est.Error Q2.5 Q97.5
## 1 -0.077081047 0.1121902 -0.34453070 0.1052658
## 2 -0.051925091 0.1448818 -0.40060675 0.1929411
## 3 -0.003140926 0.1052136 -0.22609347 0.2210849
## 4 0.017833583 0.1226034 -0.22299699 0.3069706
## 5 -0.038681462 0.1353150 -0.35932986 0.2186037
## 6 -0.042325779 0.1144701 -0.31423089 0.1645995
## 7 0.025663604 0.1249884 -0.21553215 0.3238278
## 10 -0.046230231 0.1427945 -0.37326527 0.2203388
## 11 0.039240593 0.1136011 -0.17209673 0.3056447
## 12 -0.053260802 0.1130651 -0.31811782 0.1354898
## 13 0.047404908 0.1136277 -0.15599395 0.3150589
## 15 0.007206870 0.1268800 -0.25446182 0.2833138
## 16 0.061820871 0.1195758 -0.13391314 0.3491895
## 21 -0.066473590 0.1286672 -0.37756584 0.1363838
## 22 0.083172155 0.1223941 -0.09947608 0.3754763
## 26 0.039314096 0.1300442 -0.19835246 0.3580281
## 32 0.007893967 0.1329952 -0.27541795 0.3088573
## 33 0.006010918 0.1419936 -0.30783839 0.3107077
## 100 0.014532219 0.1370704 -0.26147744 0.3371779
Fit (and select) a proportional hazards model to the veteran
dataset. What are the most important predictors of survival?
Veterans' Administration Lung Cancer study
Description:
Randomised trial of two treatment regimens for lung cancer. This
is a standard survival analysis data set.
Usage:
veteran
Format:
trt: 1=standard 2=test
celltype: 1=squamous, 2=smallcell, 3=adeno, 4=large
time: survival time
status: censoring status
karno: Karnofsky performance score (100=good)
diagtime: months from diagnosis to randomisation
age: in years
prior: prior therapy 0=no, 10=yes
brm(formula = z ~ x + y + (1 + y|f), data = xy,
family = gaussian(link='identity'))
formula
: almost just like lme4
data
: must contain all the variablesfamily
: distribution of responselink
: connects mean to linear predictorThere are several classes of parameter in a brms model:
b
: the population-level (or, fixed) effectssd
: the standard deviations of group-level (or, random) effectssigma
for the GaussianExamples:
b_x
: the slope of x
: class="b", coef="x"
sd_f
: the SD of effects for levels of f
: class="sd", coef="f"
Pass a vector of “priors”, specified by
set_prior(prior, class="b", ...)
where prior
is some valid Stan code.
brm(formula = z ~ x + y + (1 + y|f), data = xy,
family = gaussian(link='identity'),
prior=c(set_prior("normal(0, 5)", class="b"),
set_prior("cauchy(0, 1)", class="sd", coef="f")))
Default:
## prior class coef group resp dpar nlpar bound
## 1 b
## 2 b x
## 3 b y
## 4 lkj(1) cor
## 5 cor f
## 6 student_t(3, 1, 10) Intercept
## 7 student_t(3, 0, 10) sd
## 8 sd f
## 9 sd Intercept f
## 10 sd y f
## 11 student_t(3, 0, 10) sigma
Choose modifications:
## Compiling the C++ model
## Start sampling
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: z ~ x + y + (1 + y | f)
## Data: xy (Number of observations: 100)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~f (Number of levels: 3)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.07 0.19 0.00 0.43 1.00 978 1165
## sd(y) 1.44 0.66 0.59 3.11 1.01 803 1031
## cor(Intercept,y) -0.12 0.56 -0.95 0.90 1.00 1051 1277
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -0.02 0.10 -0.14 0.08 1.00 1842 1471
## x 1.00 0.01 0.98 1.01 1.00 2405 2051
## y 1.93 0.87 0.05 3.64 1.01 599 455
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.09 0.01 0.08 0.10 1.00 2735 2410
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Or…
launch_shinystan(the_fit)
Summaries of, or samples from, the posteriors of:
fixef( )
: “fixed” effectsranef( )
: “random” effectsfitted( )
: mean of responsespredict( )
: actual responseshypothesis( )
: functions of various parameters (e.g., difference between two classes)marginal_effects( )
: effect of one predictor conditioned on values of othershelp(brmsformula)
help(brmsfamily)
(but note can use those in help(family)
also)help(set_prior)
and also check what can have a prior with get_prior( )
stancode(the_fit)
(and standata(the_fit)
)loo( )
Nonparametric:
Parametric:
f1 <- survfit(Surv(time, status) ~ 1, data = lung)
y <- data.frame(time=sort(unique(lung$time[lung$status == 2])))
y$k <- sapply(y$time, function (t) sum(lung$status == 2 & lung$time == t))
y$n <- sapply(y$time, function (t) sum(lung$time > t | (lung$time == t & lung$status == 2)))
y$p <- cumprod(1 - y$k/y$n)
plot(f1)
points(p ~ time, data=y, col='red', pch=20, cex=0.5)