\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \newcommand{\given}{\;\vert\;} \]

Crossvalidation and prediction

Peter Ralph

3 December 2019 – Advanced Biological Statistics

Count data

A hypothetical situation:

  1. We have counts of transcript numbers,

  2. from some individuals of different ages and past exposures to solar irradiation,

  3. of two genotypes.

The model

\[\begin{aligned} Z_i &\sim \Poisson(y_i) \\ y_i &= \exp(a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i ) \end{aligned}\]

The result

How’d we do?

Here are posterior distributions of the parameters, with the true values in red. plot of chunk true_fit_1

100 posterior predictive datasets

True data are overdispersed relative to the data

plot of chunk plot_post_sims1

True data are overdispersed relative to Poisson

Recall that if \(X \sim \Poisson(\lambda)\) then \[ \E[X] = \var[X] = \lambda, \] and so a “\(z\)-score” is \[\begin{aligned} \E\left(\frac{X - \lambda}{\sqrt{\lambda}}\right) = 0, \qquad \qquad \text{SD}\left(\frac{X - \lambda}{\sqrt{\lambda}}\right) = 1. \end{aligned}\]

plot of chunk plot_overdisp

Add overdispersion

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &\sim \log\Normal(y_i, \sigma) \\ y_i &= a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i \end{aligned}\]

overdispersed_model <- stan_model(model_code="
data {
    int N; // number of data points
    vector[N] age;
    vector[N] exposure;
    int counts[N];
    int genotype[N];
    int ngenotypes;
}
parameters {
    vector[ngenotypes] a; // intercepts
    real b; // slope for age
    vector[ngenotypes] c; // slopes for exposure
    real<lower=0> sigma; // SD on lognormal
    vector[N] mu; // mean of the poissons
}
model {
    vector[N] y; // mean of the lognormals
    y = a[genotype] + b * age + c[genotype] .* exposure;
    mu ~ lognormal(y, sigma);
    counts ~ poisson(mu);
    a ~ normal(0, 100);
    b ~ normal(0, 10);
    c ~ normal(0, 20);
    sigma ~ normal(0, 10);
}")

How’d we do now?

Here are posterior distributions from the full model, with the true values in red. plot of chunk true_fit

Posterior predictive simulations, again

Now we cover the true data

plot of chunk plot_post_sims2

Now we cover the true data

plot of chunk plot_post_sims3

Exercise:

Explain the model, in words.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &\sim \log\Normal(y_i, \sigma) \\ y_i &= a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i \end{aligned}\]

overdispersed_model <- stan_model(model_code="
data {
    int N; // number of data points
    vector[N] age;
    vector[N] exposure;
    int counts[N];
    int genotype[N];
    int ngenotypes;
}
parameters {
    vector[ngenotypes] a; // intercepts
    real b; // slope for age
    vector[ngenotypes] c; // slopes for exposure
    real<lower=0> sigma; // SD on lognormal
    vector[N] mu; // mean of the poissons
}
model {
    vector[N] y; // mean of the lognormals
    y = a[genotype] + b * age + c[genotype] .* exposure;
    mu ~ lognormal(y, sigma);
    counts ~ poisson(mu);
    a ~ normal(0, 100);
    b ~ normal(0, 10);
    c ~ normal(0, 20);
    sigma ~ normal(0, 10);
}")

Exercise 2:

Compare the result of fitting a quasipoisson GLM to the data.

Prediction

is it Christmas? No. (99.73% accurate)
is it Christmas? No. (99.73% accurate)

https://xkcd.com/2236/

Out-of-sample prediction

To test predictive ability (and diagnose overfitting!):

  1. Split the data into test and training pieces.
  2. Fit the model using the training data.
  3. See how well it predicts the test data.

If you do this a lot of times, it’s called crossvalidation.

Prediction in Stan

Posterior predictive samples

plot of chunk plot_pred

Posterior predictive samples

plot of chunk plot_pred2

Application: parasite counts

The data

Tavalire et al 2016, Genotypic variation in host response to infection affects parasite reproductive rate, International Journal for Parasitology, has:

  • Strain: inbred line ID
  • Inf_Dose: number of parasites exposed to: 0 (control), 1, and 10
  • Tray: housing for each individual
  • Infected: whether or not was infected
  • Initial_Size_mm: initial size of the snail
  • Weeks_to_first_shed: time until the snail first produced parasites
  • Total_weeks_shed: number of weeks that parasites were produced; right censored at 30
  • Total_Parasites: number of parasites produced in the first 30 weeks
  • Avg_Parasites_per_Shed: Total parasites/weeks shed
  • Week_Died: Survival time, right censored at 30 weeks

A question

How much do genotype and parasite dose affect parasite production?

Here’s the data

##   Snail_ID Strain Inf_Dose Tray Infected Initial_Size_mm Weeks_to_first_shed Total_weeks_shed Total_Parasites Avg_Parasites_per_Shed Week_Died
## 1       60     36        0    k        0            6.67                  NA               NA              NA                     NA         2
## 2       51     36        0    o        0            5.44                  NA               NA              NA                     NA         3
## 3      431     36        0    g        0            4.89                  NA               NA              NA                     NA         3
## 4      432     36        0    e        0            3.97                  NA               NA              NA                     NA         3
## 5       56     36        0    h        0            7.97                  NA               NA              NA                     NA         6
## 6       49     36        0    c        0            7.86                  NA               NA              NA                     NA        10

in class

mod1 <- glm(Total_Parasites ~ Strain * Total_weeks_shed + Strain : dose, data=snails, family=poisson("log"))
summary(mod1)

# mean predicted load at 0 weeks for default strain:
exp(9.6)

# default strain is the first one
levels(snails$Strain)

# predicted load of second strain relative to first one
exp(-.66)

# plot data and predicted means
plot(Total_Parasites ~ Total_weeks_shed, data=snails, col=Strain, pch=20)
newdata <- expand.grid(Strain=levels(snails$Strain), Total_weeks_shed=0:30, dose=TRUE)
for (st in levels(snails$Strain)) {
  lines(0:30, exp(preds[newdata$Strain == st]), col=match(st, levels(snails$Strain)))
}

hm, intercept doesn’t look great: try some other models?

This one fails to fit because of negative values

> mod2 <- glm(Total_Parasites ~ Strain * Total_weeks_shed + Strain : dose, data=snails, family=poisson("identity"))
Error: no valid set of coefficients has been found: please supply starting values
In addition: Warning message:
In log(y/mu) : NaNs produced

Using Stan doesn’t help:

> library(brms)
> mod2 <- brm(Total_Parasites ~ Strain * Total_weeks_shed + Strain : dose, data=snails, family=poisson("identity"))
Compiling the C++ model

some other models

mod3 <- glm(Total_Parasites ~ 0 + Strain : Total_weeks_shed + Strain : dose, data=snails, family=poisson("identity"))
mod4 <- glm(Total_Parasites ~ 0 + Strain : Total_weeks_shed + Strain : dose, data=snails, family=poisson("log"))
preds3 <- predict(mod3, newdata=newdata)
preds4 <- predict(mod4, newdata=newdata)
plot(Total_Parasites ~ Total_weeks_shed, data=snails, col=Strain, pch=20)
for (st in levels(snails$Strain)) {
    lines(0:30, exp(preds[newdata$Strain == st]), col=match(st, levels(snails$Strain)))
    lines(0:30, exp(preds3[newdata$Strain == st]), col=match(st, levels(snails$Strain)), lty=3)
    lines(0:30, exp(preds4[newdata$Strain == st]), col=match(st, levels(snails$Strain)), lty=4)
}

conclusions:

  • poor model fit
  • try a different link function (sqrt?)
  • perhaps adding overdispersion? need to look at noise in more detail.

Review

Concepts

  • statistic versus parameter
  • quantifying uncertainty
  • experiment vs observation
  • controls
  • statistical power/sensitivity
  • tidy data
  • Markov chain Monte Carlo
  • permutation test
  • multiple comparisons
  • shrinkage and sharing power
  • probability models
  • simulation
  • \(p\)-values
  • hypothesis testing
  • confidence and credible intervals
  • linear models
  • random effects
  • prior, likelihood, and posterior
  • goodness-of-fit

Distributions:

  • Central Limit Theorem
  • Gaussian/Normal
  • Student’s \(t\)
  • Binomial
  • Beta
  • Beta-Binomial
  • Exponential
  • Cauchy
  • Poisson

Visualization:

  • center, spread, outliers
  • histograms
  • scatter plots
  • boxplots
  • maximize information per unit of ink

Statistical models:

  • ANOVA, partition of variance
  • least-squares fitting \(\sim\) Gaussian
  • Beta-Binomial
  • logistic regression
  • robust regression
  • General Linear (Mixed) Models