Peter Ralph
Advanced Biological Statistics
We are interested in how long until some particular thing happens, and how this depends on some covariates.
Example: how the number of years until death depends on cancer grade at diagnosis and drug treatment.
Example: how day of first budburst depends on species and local temperature.
Ex: For each patient, date of diagnosis and cancer grade; date of death or last follow-up.
Ex: For each plant, species, date of first budburst or last survey.
Both examples are right censored: for some subjects, we don’t know the actual time, only a lower bound on it.
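As a minimal sketch of how right-censored observations are usually recorded, here is a toy example in base R. The vectors `true_time` and `follow_up` are made up for illustration; in real data the true times of censored subjects are of course unknown.

```r
# Hypothetical example: true event times and per-subject follow-up lengths (days).
true_time <- c(120, 400, 35, 910, 610)
follow_up <- c(365, 365, 365, 730, 730)

# We observe the smaller of the two, plus a flag for whether the event was seen.
obs <- data.frame(
    time = pmin(true_time, follow_up),
    status = (true_time <= follow_up)  # TRUE = event observed, FALSE = right censored
)
obs
```

For the rows with `status` equal to `FALSE`, the recorded `time` is only a lower bound on the true time.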
Key assumption: any censoring is noninformative,
i.e., our data collection does not depend on the status of the subjects.
Example of informative censoring: patient dropout due to worsening symptoms.
The survival curve, which can never increase:
\[\begin{aligned} S(t) &= \P\{\text{still 'alive' after $t$ time units}\} , \end{aligned}\]
and the hazard rate:
\[\begin{aligned} h(t) &= \text{(mean number of 'deaths' per still-alive subject,} \\ &\qquad \text{per unit time at $t$)} , \end{aligned}\]
which is
\[\begin{aligned} h(t) = - \frac{d}{dt} \log S(t) . \end{aligned}\]
The hazard rate,
the rate at which the event of interest happens per unit time,
is minus the slope of the survival curve on a log scale.
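One way to see the link between the hazard rate and the survival curve, as a sketch: write $f(t) = -\frac{d}{dt} S(t)$ for the probability density of the event time (notation introduced here, not used elsewhere in these slides) and assume $S(0) = 1$. Then

\[\begin{aligned} h(t) &= \frac{f(t)}{S(t)} = - \frac{1}{S(t)} \frac{d}{dt} S(t) = - \frac{d}{dt} \log S(t) , \\ \text{so} \qquad S(t) &= \exp\left( - \int_0^t h(u) \, du \right) . \end{aligned}\]

For instance, a constant hazard $h(t) = \lambda$ gives $S(t) = e^{-\lambda t}$, the Exponential distribution.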
We’ll look at some methods across the nonparametric-to-parametric continuum.
Replacing “death” with “your next donut”, which of these curves would you prefer as the distribution of time until your next donut?
the CRAN task view
NCCTG Lung Cancer Data
Description:
Survival in patients with advanced lung cancer from the North
Central Cancer Treatment Group. Performance scores rate how well
the patient can perform usual daily activities.
Usage:
lung
cancer
Format:
inst: Institution code
time: Survival time in days
status: censoring status 1=censored, 2=dead
age: Age in years
sex: Male=1 Female=2
ph.ecog: ECOG performance score as rated by the physician.
0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed
<50% of the day, 3= in bed > 50% of the day but not bedbound, 4 =
bedbound
ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: Karnofsky performance score as rated by patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months
Note:
The use of 1/2 for alive/dead instead of the usual 0/1 is a
historical footnote. For data contained on punch cards, IBM 360
Fortran treated blank as a zero, which led to a policy within the
section of Biostatistics to never use "0" as a data value since
one could not distinguish it from a missing value. The policy
became a habit, as is often the case; and the 1/2 coding endured
long beyond the demise of punch cards and Fortran.
Source:
Terry Therneau
References:
Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler
JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective
evaluation of prognostic variables from patient-completed
questionnaires. North Central Cancer Treatment Group. Journal of
Clinical Oncology. 12(3):601-7, 1994.
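A short sketch of how this data set might be prepared for analysis, assuming the `survival` package is installed. The names `lung_df` and `dead` are introduced here for illustration only.

```r
library(survival)  # provides the `lung` data set and Surv()

# Work on a copy so the packaged data set is left untouched.
lung_df <- survival::lung

# Recode the 1/2 status as a logical "death observed" indicator.
lung_df$dead <- (lung_df$status == 2)
table(lung_df$dead)

# Surv() pairs each time with its event indicator;
# censored times are printed with a trailing "+".
head(with(lung_df, Surv(time, dead)))
```

`Surv()` also accepts the original 1/2 coding directly, so the recoding step is mainly for readability.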
Time until arrival of a high-energy neutrino in each of many detectors:
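nrate <- 1 / 365        # arrival rate per day: one per year on average
study_time <- 2 * 365   # each detector is watched for two years
nobs <- 228
neutrinos <- data.frame(
    detector_id = 1:nobs,
    time = round(rexp(nobs, rate=nrate), 1))
# status is TRUE if the arrival was observed, FALSE if censored at study_time
neutrinos$status <- (neutrinos$time < study_time)
neutrinos$time <- pmin(study_time, neutrinos$time)
head(neutrinos)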
## detector_id time status
## 1 1 72.4 TRUE
## 2 2 241.2 TRUE
## 3 3 103.5 TRUE
## 4 4 13.9 TRUE
## 5 5 172.7 TRUE
## 6 6 534.2 TRUE
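Time until failure of lightbulbs that wear out as time goes on:
lmean <- 2 * 365   # sd of the underlying Normal, in days (not the mean lifetime)
bulbs <- data.frame(
    bulb_id = 1:nobs,
    time = abs(rnorm(nobs, sd=lmean)))
# status is TRUE if the failure was observed, FALSE if censored at study_time
bulbs$status <- (bulbs$time < study_time)
bulbs$time <- pmin(study_time, bulbs$time)
head(bulbs)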
## bulb_id time status
## 1 1 511.5466 TRUE
## 2 2 404.5635 TRUE
## 3 3 610.5038 TRUE
## 4 4 730.0000 FALSE
## 5 5 149.6198 TRUE
## 6 6 251.9142 TRUE
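To check how the two simulated examples differ, here is a sketch that computes each hazard rate exactly as density divided by survival probability. It redefines `nrate` and `lmean` with the values used in the simulations so that it runs on its own; `tt`, `h_neutrino`, and `h_bulb` are names introduced here.

```r
nrate <- 1 / 365
lmean <- 2 * 365
tt <- seq(0, 2 * 365, length.out = 5)  # a few time points, in days

# Hazard = density / survival probability.
h_neutrino <- dexp(tt, rate = nrate) / pexp(tt, rate = nrate, lower.tail = FALSE)

# |Normal(0, sd)| has density 2 * dnorm() and survival 2 * pnorm(upper tail),
# so the factors of 2 cancel in the ratio.
h_bulb <- dnorm(tt, sd = lmean) / pnorm(tt, sd = lmean, lower.tail = FALSE)

data.frame(t = tt, h_neutrino, h_bulb)
```

The Exponential hazard is constant at `nrate`, while the half-Normal hazard increases with time, which is the sense in which the bulbs "wear out".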
Suppose that:
Questions:
The Kaplan-Meier survival curve is a purely empirical, nonparametric estimate of survival probability:
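As a minimal sketch of what the estimate computes, here it is by hand in base R on a small made-up data set (the `time` and `status` vectors are hypothetical). At each observed event time, the current estimate is multiplied by the fraction of subjects still at risk who did not have the event.

```r
# Hypothetical toy data: times in days; status TRUE = event observed, FALSE = censored.
time   <- c(5, 8, 8, 12, 15, 20, 20, 25)
status <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)

event_times <- sort(unique(time[status]))
# Number still at risk just before each event time, and number of events at that time.
n_risk  <- sapply(event_times, function(t) sum(time >= t))
n_event <- sapply(event_times, function(t) sum(time == t & status))

# Kaplan-Meier: running product of the fraction surviving each event time.
km <- data.frame(time = event_times, n_risk, n_event,
                 surv = cumprod(1 - n_event / n_risk))
km
```

Censored subjects count in `n_risk` up to their censoring time and then drop out without causing a step in the curve. If the `survival` package is available, `summary(survfit(Surv(time, status) ~ 1))` should report the same survival values.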
a retrospective analysis of 2,733 patients with confirmed COVID-19 admitted at five New York City hospitals within the Mount Sinai system between 3/14 and 4/11