\[
%%
% Add your macros here; they'll be included in pdf and html output.
%%
\newcommand{\R}{\mathbb{R}} % reals
\newcommand{\E}{\mathbb{E}} % expectation
\renewcommand{\P}{\mathbb{P}} % probability
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\logistic}{logistic}
\DeclareMathOperator{\SE}{SE}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\cor}{cor}
\DeclareMathOperator{\Normal}{Normal}
\DeclareMathOperator{\LogNormal}{logNormal}
\DeclareMathOperator{\Poisson}{Poisson}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Binom}{Binomial}
\DeclareMathOperator{\Gam}{Gamma}
\DeclareMathOperator{\Exp}{Exponential}
\DeclareMathOperator{\Cauchy}{Cauchy}
\DeclareMathOperator{\Unif}{Unif}
\DeclareMathOperator{\Dirichlet}{Dirichlet}
\DeclareMathOperator{\Wishart}{Wishart}
\DeclareMathOperator{\StudentsT}{StudentsT}
\DeclareMathOperator{\Weibull}{Weibull}
\newcommand{\given}{\;\vert\;}
\]

Generalized Linear Models

Peter Ralph

1 December 2020 – Advanced Biological Statistics

Notation note

Matrix multiplication

If

  • \(b\) is a \(k\)-vector
  • \(X\) is an \(n \times k\) matrix

then “\(X b\)” is shorthand for the \(n\)-vector \[ (Xb)_i = \sum_{j=1}^k X_{ij} b_j . \]

  • In R, we use %*%.
  • In Stan, we use *.
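For instance, with a small made-up \(X\) and \(b\):

```r
# A 3 x 2 matrix times a 2-vector gives a 3-vector
X <- matrix(1:6, nrow = 3, ncol = 2)
b <- c(10, 1)
X %*% b   # (Xb)_i = sum_j X[i,j] * b[j], i.e., 14, 25, 36
# the same thing, written out row by row:
apply(X, 1, function(row) sum(row * b))
```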

Example

Let’s say we want to predict apple yield using leaf_area, tree_age, and fertilizer: \[\begin{aligned} \text{yield}_i &= \beta_0 + \beta_1 \times \text{leaf_area}_i \\ &\qquad {} + \beta_2 \times \text{tree_age}_i + \beta_3 \times \text{fertilizer}_i + \epsilon_i. \end{aligned}\]

If we set the columns of \(X\) to be

  1. all 1s
  2. leaf_area
  3. tree_age
  4. fertilizer

and \(\beta = (\beta_0, \beta_1, \beta_2, \beta_3),\) then this is \[ \text{yield}_i = (X \beta)_i + \epsilon_i . \]

Or, \[ \text{yield} = X \beta + \epsilon . \]

The model.matrix() function builds an appropriate \(X\) matrix from a formula, handling factors as well.
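A quick illustration, with a made-up data frame (here fertilizer is a factor, which gets expanded into indicator columns):

```r
# model.matrix() adds the column of 1s for the intercept and
# turns the factor into indicator (0/1) columns
df <- data.frame(leaf_area = c(1.2, 0.8, 2.1),
                 fertilizer = factor(c("none", "low", "high")))
model.matrix(~ leaf_area + fertilizer, data = df)
```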

Generalized Linear Models

Ingredients of a GLM:

  1. A probability distribution, \(Y\).

(the “family”; describes the output)

  2. A linear predictor, \(X \beta\).

(connects input to output)

  3. A link function (usually, \(h(\E[Y]) = X\beta\)).

(connects linear predictor to probability distribution)

Common choices:

  • Linear: \[ Y \sim \Normal(X \beta, \sigma) .\]

  • Binomial (“Logistic”): \[ Y \sim \Binom(n, \logistic(X\beta)) .\]

  • Poisson: \[ Y \sim \Poisson(\exp(X\beta)) .\]

  • Gamma: \[ Y \sim \Gam(\text{scale}=\exp(X\beta)/k, \text{shape}=k) .\]
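A quick sanity check of the Gamma parameterization above: with scale \(= \exp(X\beta)/k\) and shape \(= k\), the mean is shape \(\times\) scale \(= \exp(X\beta)\), whatever \(k\) is. (The values of \(xb\) and \(k\) here are made up.)

```r
# Mean of Gamma(shape = k, scale = exp(xb)/k) is exp(xb), for any k
xb <- 1.5; k <- 3
mean(rgamma(1e6, shape = k, scale = exp(xb) / k))  # should be close to exp(1.5)
exp(xb)
```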

Logistic GLM, in R

glm(y ~ x, family=binomial)
family {stats}

Family objects provide a convenient way to specify the details of the models
used by functions such as glm.

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")

Arguments

link: a specification for the model link function. This can be a
name/expression, a literal character string, a length-one character vector, or
an object of class "link-glm" (such as generated by make.link) provided it is
not specified via one of the standard names given next.

The gaussian family accepts the links (as names) identity, log and inverse; the
binomial family the links logit, probit, cauchit, (corresponding to logistic,
normal and Cauchy CDFs respectively) log and cloglog (complementary log-log);
the Gamma family the links inverse, identity and log; the poisson family the
links log, identity, and sqrt; and the inverse.gaussian family the links
1/mu^2, inverse, identity and log.

Note that the link function is the inverse of the function we’d write down in the model: for instance, if the mean of the response is the exponential of the linear predictor, i.e., if \[ \E[Y] = \exp(X\beta), \] then we have

link = "log"

Other inverses:

  • logistic has inverse logit
  • identity has inverse identity
  • x^2 (“squared”) has inverse sqrt
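You can check these pairings directly: every family object carries its link function as linkfun and the inverse as linkinv.

```r
# The family object stores both directions of the link
fam <- binomial(link = "logit")
fam$linkfun(0.5)   # logit(0.5) = 0
fam$linkinv(0)     # logistic(0) = 0.5
make.link("log")$linkinv(2)  # exp(2)
```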

Unpacking the logistic GLM

Simulate data

100 trials, where probability of success depends on \(x\):

alpha <- -7; beta <- 1.2
x <- runif(100, 0, 10)
y <- alpha + beta * x
p <- 1 / (1 + exp(-y))
z <- rbinom(100, size=1, prob=p)
plot(z ~ x)
curve(1/(1+exp(-(-7 + 1.2 *x))), 0, 10, col='red', add=TRUE)

[Figure: simulated outcomes \(z\) against \(x\), with the true logistic curve in red]

glm()

zz <- cbind(z, 1-z)
summary(glm_fit <- glm(zz ~ x, family='binomial'))
## 
## Call:
## glm(formula = zz ~ x, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79584  -0.14355   0.01418   0.07181   2.58115  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.7512     2.4701  -3.948 7.89e-05 ***
## x             1.9531     0.4849   4.028 5.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.269  on 99  degrees of freedom
## Residual deviance:  28.807  on 98  degrees of freedom
## AIC: 32.807
## 
## Number of Fisher Scoring iterations: 7
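One way to see how well glm() did is to overlay its fitted probabilities on the simulated data (this reuses x, z, and glm_fit from above):

```r
# Fitted probabilities from the glm fit, compared to the truth
phat <- predict(glm_fit, type = "response")
plot(z ~ x)
points(x[order(x)], phat[order(x)], col = 'blue', pch = 20)  # glm fit
curve(1/(1 + exp(-(-7 + 1.2 * x))), 0, 10, col = 'red', add = TRUE)  # truth
```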

Stan

\[\begin{aligned} Z &\sim \Binom(1, P) \\ P &= \logistic(\alpha + \beta X) \end{aligned}\]

logistic_block <- "
data {
    int N;
    vector[N] X;
    int<lower=0> Z[N];
}
parameters {
    real alpha;
    real beta;
}
model {
    vector[N] p;
    p = inv_logit(alpha + beta * X);
    Z ~ binomial(1, p);
}"

Stan: fit the model

logistic_model <- stan_model(model_code=logistic_block)
fit <- sampling(logistic_model, chains=3, iter=1000,
            data=list(N=100, X=x, Z=z))
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
rstan::summary(fit)$summary
##             mean    se_mean       sd       2.5%        25%        50%        75%      97.5%    n_eff     Rhat
## alpha -11.231396 0.19863109 2.917856 -18.153892 -12.818953 -10.877573  -9.235022  -6.703278 215.7910 1.017132
## beta    2.245831 0.03897684 0.568362   1.364083   1.848582   2.187879   2.551442   3.576015 212.6360 1.017964
## lp__  -15.467009 0.06580101 1.124115 -18.564140 -15.838721 -15.127840 -14.697168 -14.435249 291.8477 1.007922

Stan: posterior distributions

samples <- extract(fit)
layout(t(1:2))
hist(samples$alpha, main=expression(alpha))
abline(v=alpha, col='red'); abline(v=coef(glm_fit)[1], col='green')
hist(samples$beta, main=expression(beta))
abline(v=beta, col='red'); abline(v=coef(glm_fit)[2], col='green')
legend("topright", lty=1, col=c('red', 'green'), legend=c('truth', 'glm'))

[Figure: posterior histograms of \(\alpha\) and \(\beta\), with the true values (red) and the glm estimates (green)]

Identify the GLM

data {
  int N;
  vector[N] X;
  vector[N] Y;
  vector<lower=0>[N] Z;
}
parameters {
  real beta[2];
}
model {
  Z ~ gamma(1, exp(- beta[1] * X - beta[2] * Y));
}

What is

  1. the probability distribution? (describes the output)

  2. the linear predictor? (connects input to output)

  3. the link function? (connects linear predictor to probability distribution)

Then, simulate from it.
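One possible simulation sketch, with made-up sample size and coefficients (recall that Stan’s gamma(a, b) takes shape a and rate b):

```r
# Simulate from the model above: Z ~ Gamma(shape = 1, rate = exp(-b1 X - b2 Y))
N <- 100
X <- runif(N); Y <- runif(N)
beta <- c(1.0, -0.5)   # hypothetical true values
Z <- rgamma(N, shape = 1, rate = exp(- beta[1] * X - beta[2] * Y))
```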
