\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Poisson linear models

Peter Ralph

1 December 2020 – Advanced Biological Statistics

Count data

A hypothetical situation:

  1. We have counts of mutations,

  2. from some individuals of different ages and past exposures to solar irradiation,

  3. of two genotypes.

Model:

  • Counts are Poisson,

  • with mean that depends on age and exposure,

  • but effect of exposure depends on genotype.

  1. Counts are Poisson,

  2. with mean that depends on age and exposure,

  3. but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \end{aligned}\]

  1. Counts are Poisson,

  2. with mean that depends on age and exposure,

  3. but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= a + b \times \text{age}_i + c \times \text{exposure}_i \end{aligned}\]

  1. Counts are Poisson,

  2. with mean that depends on age and exposure,

  3. but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= a_{g_i} + b \times \text{age}_i + c_{g_i} \times \text{exposure}_i \end{aligned}\]

  1. Counts are Poisson,

  2. with mean that depends on age and exposure,

  3. but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= \exp\left( a_{g_i} + b \times \text{age}_i \right. \\ &\qquad \left. {} + c_{g_i} \times \text{exposure}_i \right) \end{aligned}\]

Poisson modeling, in practice

The data

plot of chunk r plot_counts

Let’s do it, with glm( )

\[\begin{aligned} Z_i &\sim \Poisson(\mu_i) \\ \mu_i &= \exp\left( a_{g_i} + b \times \text{age}_i + c_{g_i} \times \text{exposure}_i \right) \end{aligned}\]

Here are the data:

head( countdata <- read.table("data/poisson_counts.tsv", header=TRUE, stringsAsFactors=TRUE) )
##   genotype       age   exposure counts
## 1        A 11.961887 24.7839999     17
## 2        B  8.699680  2.4985780     12
## 3        B 13.044583  0.6582441      5
## 4        A 16.193124  2.5269076      3
## 5        B  5.438608 23.3953875      1
## 6        A 14.261364  8.2820704      2

Write a Stan block

  1. Counts are Poisson,

  2. with mean that depends on age and exposure,

  3. but effect of exposure depends on genotype.

\[\begin{aligned} Z_i &\sim \Poisson(y_i) \\ y_i &= \exp(a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i ) \end{aligned}\]

data {
    int N;
    int counts[N];
    vector[N] age;
    vector[N] exposure;
    int genotype[N];
    int ngeno;
}
parameters {
    real a[ngeno]; // intercepts
    real b; // slope for age
    real c[ngeno]; // slopes for exposure
}
model {
    vector[N] y; // means
    y = exp(a[genotype] + b .* age 
            + c[genotype] .* exposure);
    counts ~ poisson(y);
    // implicitly flat priors
}

The result

Note: scaling the data helps Stan find the right scale to move on.

fit1 <- stan(file="data/simple_poisson.stan",
             data=with(data, list(N=length(counts),
                               counts=counts,
                               age=age,
                               expo=exposure,
                               geno=as.numeric(genotype))),
             control=list(max_treedepth=12),
             iter=1000, chains=3)
post1 <- extract(fit1)
print(fit1)
## Inference for Stan model: simple_poisson.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
## a[1]    0.31    0.00 0.08    0.16    0.26    0.32    0.36    0.46   369 1.01
## a[2]    2.23    0.00 0.07    2.09    2.18    2.23    2.27    2.36   363 1.01
## b       0.05    0.00 0.01    0.04    0.04    0.05    0.05    0.06   346 1.01
## c[1]    0.09    0.00 0.00    0.09    0.09    0.09    0.09    0.10  1133 1.00
## c[2]   -0.12    0.00 0.01   -0.13   -0.12   -0.12   -0.11   -0.10   894 1.00
## lp__ 3962.45    0.06 1.56 3958.60 3961.58 3962.71 3963.64 3964.59   673 1.00
## 
## Samples were drawn using NUTS(diag_e) at Thu Dec  3 09:16:03 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

stan_trace(fit1, pars=c("a","b","c","lp__"), inc_warmup=FALSE)

plot of chunk r not_warmup

Posterior distributions of the parameters: plot of chunk r true_fit_1

Goodness of fit

Posterior predictive simulations

Let’s simulate up data under this model to check for goodness of fit.

One hundred times, we’ll

  1. pick a set of parameters from the posterior
  2. compute the vector of means (mu)
  3. simulate a vector of Poisson counts with mean mu

do_sim <- function () {
    k <- sample.int(nrow(post1$a), 1)
    params1 <- list(a=post1$a[k,],
                    b=post1$b[k],
                    c=post1$c[k,])
    mu1 <- with(data,
                    exp(params1$a[as.numeric(genotype)] 
                        + params1$b * age
                        + params1$c[as.numeric(genotype)] * exposure))
    rpois(length(mu1), mu1)
}
# 100 datasets:
sim1 <- replicate(100, do_sim())
model {
    vector[N] mu;
    mu = exp(a[geno] 
             + b * age 
             + c[geno] 
               .* expo);
    counts ~ poisson(mu);

True data are overdispersed relative to our simulations

plot of chunk r plot_post_sims1

Possible solutions

  • use the quasipoisson family in glm( )
  • introduce more randomness:

\[\begin{aligned} Z_i &\sim \Poisson(y_i) \\ y_i &= \exp(\mu_i) \\ \mu_i &\sim \Normal(a_{g_i} + b \times \text{age}_i \\ &\qquad {} + c_{g_i} \times \text{exposure}_i, \sigma) \end{aligned}\]

// reveal.js plugins