Poisson linear models

Peter Ralph

1 December 2020 – Advanced Biological Statistics

Count data

A hypothetical situation:

We have counts of mutations,
from some individuals of different ages and past exposures to solar irradiation,
of two genotypes.

Model:

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \end{aligned}\]

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= a + b \times \text{age}_i + c \times \text{exposure}_i \end{aligned}\]

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= a_{g_i} + b \times \text{age}_i + c_{g_i} \times \text{exposure}_i \end{aligned}\]

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= \exp\left( a_{g_i} + b \times \text{age}_i \right. \\ &\qquad \left. {} + c_{g_i} \times \text{exposure}_i \right) \end{aligned}\]

Poisson modeling, in practice

The data

plot of chunk r plot_counts

Let’s do it, with `glm( )`

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= \exp\left( a_{g_i} + b \times \text{age}_i + c_{g_i} \times \text{exposure}_i \right) \end{aligned}\]

Here are the data:

head( countdata <- read.table("data/poisson_counts.tsv", header=TRUE, stringsAsFactors=TRUE) )

##   genotype       age   exposure counts
## 1        A 11.961887 24.7839999     17
## 2        B  8.699680  2.4985780     12
## 3        B 13.044583  0.6582441      5
## 4        A 16.193124  2.5269076      3
## 5        B  5.438608 23.3953875      1
## 6        A 14.261364  8.2820704      2

Write a Stan block

Counts are Poisson,
with mean that depends on age and exposure,
but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(y_i) \\ y_i &= \exp(a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i ) \end{aligned}\]

data {
    int N;
    int counts[N];
    vector[N] age;
    vector[N] exposure;
    int genotype[N];
    int ngeno;
}
parameters {
    real a[ngeno]; // intercepts
    real b; // slope for age
    real c[ngeno]; // slopes for exposure
}
model {
    vector[N] y; // means
    y = exp(a[genotype] + b .* age 
            + c[genotype] .* exposure);
    counts ~ poisson(y);
    // implicitly flat priors
}

The result

Note: scaling the data helps Stan find the right scale to move on.

fit1 <- stan(file="data/simple_poisson.stan",
             data=with(data, list(N=length(counts),
                               counts=counts,
                               age=age,
                               expo=exposure,
                               geno=as.numeric(genotype))),
             control=list(max_treedepth=12),
             iter=1000, chains=3)

post1 <- extract(fit1)
print(fit1)

## Inference for Stan model: simple_poisson.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
## a[1]    0.31    0.00 0.08    0.16    0.26    0.32    0.36    0.46   369 1.01
## a[2]    2.23    0.00 0.07    2.09    2.18    2.23    2.27    2.36   363 1.01
## b       0.05    0.00 0.01    0.04    0.04    0.05    0.05    0.06   346 1.01
## c[1]    0.09    0.00 0.00    0.09    0.09    0.09    0.09    0.10  1133 1.00
## c[2]   -0.12    0.00 0.01   -0.13   -0.12   -0.12   -0.11   -0.10   894 1.00
## lp__ 3962.45    0.06 1.56 3958.60 3961.58 3962.71 3963.64 3964.59   673 1.00
## 
## Samples were drawn using NUTS(diag_e) at Thu Dec  3 09:16:03 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

stan_trace(fit1, pars=c("a","b","c","lp__"), inc_warmup=FALSE)

plot of chunk r not_warmup

Posterior distributions of the parameters: plot of chunk r true_fit_1

Goodness of fit

Posterior predictive simulations

Let’s simulate up data under this model to check for goodness of fit.

One hundred times, we’ll

pick a set of parameters from the posterior
compute the vector of means (mu)
simulate a vector of Poisson counts with mean mu

do_sim <- function () {
    k <- sample.int(nrow(post1$a), 1)
    params1 <- list(a=post1$a[k,],
                    b=post1$b[k],
                    c=post1$c[k,])
    mu1 <- with(data,
                    exp(params1$a[as.numeric(genotype)] 
                        + params1$b * age
                        + params1$c[as.numeric(genotype)] * exposure))
    rpois(length(mu1), mu1)
}
# 100 datasets:
sim1 <- replicate(100, do_sim())

model {
    vector[N] mu;
    mu = exp(a[geno] 
             + b * age 
             + c[geno] 
               .* expo);
    counts ~ poisson(mu);

True data are overdispersed relative to our simulations

plot of chunk r plot_post_sims1

Possible solutions

use the quasipoisson family in glm( )

introduce more randomness:

\[\begin{aligned} Z_i &\sim \Poisson(y_i) \\ y_i &= \exp(\mu_i) \\ \mu_i &\sim \Normal(a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i, \sigma) \end{aligned}\]