Overdiserpersion

Peter Ralph

28 January 2021 – Advanced Biological Statistics

Count data

A hypothetical situation:

We have counts of transcript numbers,
from some individuals of different ages and past exposures to solar irradiation,
of two genotypes.

The data:

count_data <- read.table("data/poisson_counts_data.tsv", header=TRUE)

Poisson linear model

\[\begin{aligned} Z_i &\sim \Poisson(y_i) \\ y_i &= \exp(a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i ) \end{aligned}\]

The result

fit1 <- brm(counts ~ genotype + age + exposure * genotype,
             data=count_data, family=poisson(link='log'),
            prior=prior("normal(0, 5)", class="b"))

## Warning: partial match of 'dpar' to 'dpars'

## Compiling Stan program...

## Start sampling

Conditional effects

plot of chunk r condeffs

Posterior predictive ECDF: uh-oh

pp_check(fit1, type='ecdf_overlay', nsamples=40) + labs(x="order", y="quantile")

plot of chunk r pp_check

True data are overdispersed relative to posterior predictive sims

plot of chunk r plot_post_sims1

True data are overdispersed relative to Poisson

Recall that if \(X \sim \Poisson(\lambda)\) then \[ \E[X] = \var[X] = \lambda, \] and so a “\(z\)-score” is \[\begin{aligned} \E\left(\frac{X - \lambda}{\sqrt{\lambda}}\right) = 0, \qquad \qquad \text{SD}\left(\frac{X - \lambda}{\sqrt{\lambda}}\right) = 1. \end{aligned}\]

plot of chunk r plot_overdisp

summary(fit1)

##  Family: poisson 
##   Links: mu = log 
## Formula: counts ~ genotype + age + exposure * genotype 
##    Data: count_data (Number of observations: 500) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept              0.31      0.07     0.17     0.46 1.00     2693     2731
## genotypeB              1.91      0.05     1.81     2.01 1.00     1957     2800
## age                    0.05      0.01     0.04     0.06 1.00     4317     2909
## exposure               0.09      0.00     0.08     0.10 1.00     2348     2852
## genotypeB:exposure    -0.21      0.01    -0.22    -0.19 1.00     2227     2365
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Adding overdispersion

Add overdispersion

\[\begin{aligned} Z_i &\sim \Poisson(\exp(\mu_i)) \\ \mu_i &\sim \Normal(y_i, \sigma) \\ y_i &= a_{g_i} + b \times \text{age}_i + c_{g_i} \times \text{exposure}_i \end{aligned}\]

is equivalent to

\[\begin{aligned} Z_i &\sim \Poisson(\exp(y_i)) \\ y_i &= a_{g_i} + b \times \text{age}_i + c_{g_i} \times \text{exposure}_i + \epsilon_i \\ \epsilon_i &\sim \Normal(0, \sigma) \end{aligned}\]

Add overdispersion

\[\begin{aligned} Z_i &\sim \Poisson(\exp(y_i)) \\ y_i &= a_{g_i} + b \times \text{age}_i + c_{g_i} \times \text{exposure}_i + \epsilon_i \\ \epsilon_i &\sim \Normal(0, \sigma) \end{aligned}\]

count_data$id <- 1:nrow(count_data)
fit2 <- brm(counts ~ genotype + age + exposure * genotype + (1|id),
            data=count_data, family=poisson(link='log'),
            prior=prior("normal(0, 5)", class="b"))

## Warning: partial match of 'dpar' to 'dpars'

## Compiling Stan program...

## Start sampling

conditional effects

plot of chunk r condeffs2

Posterior predictive ECDF: much better!

pp_check(fit2, type='ecdf_overlay', nsamples=40) + labs(x="order", y="quantile")

plot of chunk r pp_check2

Now it’s underdispersed?

plot of chunk r plot_post_sims2

summary(fit2)

##  Family: poisson 
##   Links: mu = log 
## Formula: counts ~ genotype + age + exposure * genotype + (1 | id) 
##    Data: count_data (Number of observations: 500) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~id (Number of levels: 500) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.72      0.04     0.65     0.79 1.01      881     1486
## 
## Population-Level Effects: 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept              0.01      0.16    -0.30     0.31 1.00     1374     2284
## genotypeB              1.91      0.11     1.70     2.13 1.00     1233     1693
## age                    0.05      0.01     0.03     0.08 1.00     1256     1929
## exposure               0.10      0.01     0.08     0.11 1.00      813     1239
## genotypeB:exposure    -0.20      0.01    -0.23    -0.17 1.00     1206     1582
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

Conclusions?

Comparing the models

The overdispersed model fits better, but is it better?

The predictions are different:

pred1 <- predict(fit1)
pred2 <- predict(fit2)
plot(pred1[,"Estimate"], pred2[,"Estimate"], xlab='predictions, first model', ylab='predictions, second model', asp=1)

plot of chunk r xval

Crossvalidation

brms_kfold <- function (K, models, xy) {
    stopifnot(!is.null(names(models)))
    Kfold <- sample(rep(1:K, nrow(xy)/K))
    results <- data.frame(rep=1:K)
    for (j in seq_along(models)) {
        train <- test <- rep(NA, K)
        for (k in 1:K) {
            new_fit <- update(models[[j]], newdata=subset(xy, Kfold != k))
            train[k] <- sqrt(mean(resid(new_fit)[,"Estimate"]^2))
            test_y <- xy$y[Kfold == k]
            test[k] <- sqrt(mean(
                   (test_y - predict(new_fit, newdata=subset(xy, Kfold==k), re_formula=NA)[,"Estimate"])^2 ))
        }
        results[[paste0(names(models)[j], "_train")]] <- train
        results[[paste0(names(models)[j], "_test")]] <- test
    }
    return(results)
}

xval_results <- brms_kfold(10, list(poisson=fit1, overdispersed=fit2), count_data)

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

Which model would you rather use?

plot of chunk r show_brms_xvals

Crossvalidation, fixed

brms_kfold2 <- function (K, models, xy) {
    stopifnot(!is.null(names(models)))
    Kfold <- sample(rep(1:K, nrow(xy)/K))
    results <- data.frame(rep=1:K)
    for (j in seq_along(models)) {
        train <- test <- rep(NA, K)
        for (k in 1:K) {
            new_fit <- update(models[[j]], newdata=subset(xy, Kfold != k))
            train[k] <- sqrt(mean(resid(new_fit, re_formula=NA)[,"Estimate"]^2))
            test_y <- xy$y[Kfold == k]
            test[k] <- sqrt(mean(
                   (test_y - predict(new_fit, newdata=subset(xy, Kfold==k), re_formula=NA)[,"Estimate"])^2 ))
        }
        results[[paste0(names(models)[j], "_train")]] <- train
        results[[paste0(names(models)[j], "_test")]] <- test
    }
    return(results)
}

xval_results2 <- brms_kfold2(10, list(poisson=fit1, overdispersed=fit2), count_data)

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

## Warning: partial match of 'dpar' to 'dpars'

## Start sampling

Which model would you rather use NOW?

plot of chunk r show_brms_xvals2