\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\MVN}{MVN} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Poisson linear models

Peter Ralph

Advanced Biological Statistics

Count data

A hypothetical situation:

  1. We have counts of mutations,

  2. from some individuals of different ages and past exposures to solar irradiation,

  3. of two genotypes.

Model:

  • Counts are Poisson,

  • with mean that depends on age and exposure,

  • but effect of exposure depends on genotype.

  1. Counts are Poisson,

  2. with mean that depends on age and exposure,

  3. but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \end{aligned}\]

  1. Counts are Poisson,

  2. with mean that depends on age and exposure,

  3. but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= a + b \times \text{age}_i + c \times \text{exposure}_i \end{aligned}\]

  1. Counts are Poisson,

  2. with mean that depends on age and exposure,

  3. but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= a_{g_i} + b \times \text{age}_i + c_{g_i} \times \text{exposure}_i \end{aligned}\]

  1. Counts are Poisson,

  2. with mean that depends on age and exposure,

  3. but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= \exp\left( a_{g_i} + b \times \text{age}_i \right. \\ &\qquad \left. {} + c_{g_i} \times \text{exposure}_i \right) \end{aligned}\]

Poisson modeling, in practice

The data

plot of chunk r plot_counts

Here are the data:

head( countdata <- read.table("data/poisson_counts.tsv", header=TRUE, stringsAsFactors=TRUE) )
##   genotype       age   exposure counts
## 1        A 11.961887 24.7839999     17
## 2        B  8.699680  2.4985780     12
## 3        B 13.044583  0.6582441      5
## 4        A 16.193124  2.5269076      3
## 5        B  5.438608 23.3953875      1
## 6        A 14.261364  8.2820704      2

A GLM

bf <- brmsformula(counts ~ 0 + genotype + age + genotype:exposure)
get_prior(bf, data=countdata)
##                 prior class               coef group resp dpar nlpar bound       source
##                (flat)     b                                                     default
##                (flat)     b                age                             (vectorized)
##                (flat)     b          genotypeA                             (vectorized)
##                (flat)     b genotypeA:exposure                             (vectorized)
##                (flat)     b          genotypeB                             (vectorized)
##                (flat)     b genotypeB:exposure                             (vectorized)
##  student_t(3, 0, 4.4) sigma                                                     default
fit1 <- brm(
   counts ~ 0 + genotype + age + genotype:exposure,
   family=poisson(link='log'),
   prior=set_prior("normal(0, 1)", class="b"),
   data=countdata
)
## Compiling Stan program...
## Start sampling

print(fit1)
##  Family: poisson 
##   Links: mu = log 
## Formula: counts ~ 0 + genotype + age + genotype:exposure 
##    Data: countdata (Number of observations: 500) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## genotypeA              0.30      0.07     0.16     0.44 1.00     1301     1626
## genotypeB              2.21      0.06     2.09     2.34 1.00     1358     1765
## age                    0.05      0.01     0.04     0.06 1.00     1334     1829
## genotypeA:exposure     0.09      0.00     0.09     0.10 1.00     2802     2534
## genotypeB:exposure    -0.12      0.01    -0.13    -0.10 1.00     2685     2507
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

mcmc_trace(fit1, regex_pars=c("b_.*"))

plot of chunk r not_warmup

Posterior distributions of the parameters: plot of chunk r true_fit_1

Goodness of fit

Posterior predictive simulations

Idea: Let’s simulate up data under this model and see if it looks like the real data.

A bunch of times, we’ll

  1. pick a set of parameters from the posterior
  2. compute the vector of means (mu)
  3. simulate a vector of Poisson counts with mean mu

… with the pp_check function!

The data are overdispersed relative to our simulations

pp_check(fit1, type='intervals', x="order")
## Using all posterior draws for ppc type 'intervals' by default.

plot of chunk r plot_post_sims1

One solution:

Use the negbinomial family

… which introduces more randomness:

\[\begin{aligned} Z_i &\sim \Poisson(M_i) \\ M_i &\sim \Gam(\text{mean}=\mu_i, \text{shape}=r) \\ \mu_i &= \exp(y_i) \\ y_i &= a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i \end{aligned}\]

fit2 <- brm(
   counts ~ 0 + genotype + age + genotype:exposure,
   family=negbinomial(link='log'),
   prior=set_prior("normal(0, 1)", class="b"),
   data=countdata
)
## Compiling Stan program...
## Start sampling

print(fit2)
##  Family: negbinomial 
##   Links: mu = log; shape = identity 
## Formula: counts ~ 0 + genotype + age + genotype:exposure 
##    Data: countdata (Number of observations: 500) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## genotypeA              0.17      0.15    -0.14     0.47 1.00     1719     1659
## genotypeB              2.05      0.15     1.76     2.33 1.00     1847     1603
## age                    0.06      0.01     0.04     0.09 1.00     1753     1756
## genotypeA:exposure     0.09      0.01     0.08     0.11 1.00     3105     2326
## genotypeB:exposure    -0.10      0.01    -0.13    -0.08 1.00     3703     2711
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## shape     1.95      0.17     1.63     2.31 1.00     3243     2506
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

mcmc_trace(fit2, regex_pars=c("b_.*"))

plot of chunk r not_warmup2

Posterior distributions of the parameters: plot of chunk r true_fit_2

pp_check(fit2, type='intervals', x="order")
## Using all posterior draws for ppc type 'intervals' by default.

plot of chunk r plot_post_sims2