Poisson linear models

Peter Ralph

Advanced Biological Statistics

Count data

A hypothetical situation:

We have counts of mutations,
from some individuals of different ages and past exposures to solar irradiation,
of two genotypes.

Model:

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \end{aligned}\]

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= a + b \times \text{age}_i + c \times \text{exposure}_i \end{aligned}\]

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= a_{g_i} + b \times \text{age}_i + c_{g_i} \times \text{exposure}_i \end{aligned}\]

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= \exp\left( a_{g_i} + b \times \text{age}_i \right. \\ &\qquad \left. {} + c_{g_i} \times \text{exposure}_i \right) \end{aligned}\]

Poisson modeling, in practice

The data

plot of chunk r plot_counts

Here are the data:

head( countdata <- read.table("data/poisson_counts.tsv", header=TRUE, stringsAsFactors=TRUE) )

##   genotype       age   exposure counts
## 1        A 11.961887 24.7839999     17
## 2        B  8.699680  2.4985780     12
## 3        B 13.044583  0.6582441      5
## 4        A 16.193124  2.5269076      3
## 5        B  5.438608 23.3953875      1
## 6        A 14.261364  8.2820704      2

A GLM

bf <- brmsformula(counts ~ 0 + genotype + age + genotype:exposure)
get_prior(bf, data=countdata)

##                 prior class               coef group resp dpar nlpar bound       source
##                (flat)     b                                                     default
##                (flat)     b                age                             (vectorized)
##                (flat)     b          genotypeA                             (vectorized)
##                (flat)     b genotypeA:exposure                             (vectorized)
##                (flat)     b          genotypeB                             (vectorized)
##                (flat)     b genotypeB:exposure                             (vectorized)
##  student_t(3, 0, 4.4) sigma                                                     default

fit1 <- brm(
   counts ~ 0 + genotype + age + genotype:exposure,
   family=poisson(link='log'),
   prior=set_prior("normal(0, 1)", class="b"),
   data=countdata
)

## Compiling Stan program...

## Start sampling

print(fit1)

##  Family: poisson 
##   Links: mu = log 
## Formula: counts ~ 0 + genotype + age + genotype:exposure 
##    Data: countdata (Number of observations: 500) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## genotypeA              0.30      0.07     0.16     0.44 1.00     1301     1626
## genotypeB              2.21      0.06     2.09     2.34 1.00     1358     1765
## age                    0.05      0.01     0.04     0.06 1.00     1334     1829
## genotypeA:exposure     0.09      0.00     0.09     0.10 1.00     2802     2534
## genotypeB:exposure    -0.12      0.01    -0.13    -0.10 1.00     2685     2507
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

mcmc_trace(fit1, regex_pars=c("b_.*"))

plot of chunk r not_warmup

Posterior distributions of the parameters: plot of chunk r true_fit_1

Goodness of fit

Posterior predictive simulations

Idea: Let’s simulate up data under this model and see if it looks like the real data.

A bunch of times, we’ll

pick a set of parameters from the posterior
compute the vector of means (mu)
simulate a vector of Poisson counts with mean mu

… with the pp_check function!

The data are overdispersed relative to our simulations

pp_check(fit1, type='intervals', x="order")

## Using all posterior draws for ppc type 'intervals' by default.

plot of chunk r plot_post_sims1

One solution:

Use the negbinomial family

… which introduces more randomness:

\[\begin{aligned} Z_i &\sim \Poisson(M_i) \\ M_i &\sim \Gam(\text{mean}=\mu_i, \text{shape}=r) \\ \mu_i &= \exp(y_i) \\ y_i &= a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i \end{aligned}\]

fit2 <- brm(
   counts ~ 0 + genotype + age + genotype:exposure,
   family=negbinomial(link='log'),
   prior=set_prior("normal(0, 1)", class="b"),
   data=countdata
)

## Compiling Stan program...

## Start sampling

print(fit2)

##  Family: negbinomial 
##   Links: mu = log; shape = identity 
## Formula: counts ~ 0 + genotype + age + genotype:exposure 
##    Data: countdata (Number of observations: 500) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## genotypeA              0.17      0.15    -0.14     0.47 1.00     1719     1659
## genotypeB              2.05      0.15     1.76     2.33 1.00     1847     1603
## age                    0.06      0.01     0.04     0.09 1.00     1753     1756
## genotypeA:exposure     0.09      0.01     0.08     0.11 1.00     3105     2326
## genotypeB:exposure    -0.10      0.01    -0.13    -0.08 1.00     3703     2711
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape     1.95      0.17     1.63     2.31 1.00     3243     2506
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

mcmc_trace(fit2, regex_pars=c("b_.*"))

plot of chunk r not_warmup2

Posterior distributions of the parameters: plot of chunk r true_fit_2

pp_check(fit2, type='intervals', x="order")

## Using all posterior draws for ppc type 'intervals' by default.

plot of chunk r plot_post_sims2