Generalized Linear Models

Peter Ralph

1 December 2020 – Advanced Biological Statistics

Notation note

Matrix multiplication

\(b\) is a \(k\)-vector
\(X\) is an \(n \times k\) matrix

then “\(X b\)” is shorthand for the \(n\)-vector \[ (Xb)_i = \sum_{j=1}^k X_{ij} b_j . \]

In R, we use %*%.
In Stan, we use *.

Example

Let’s say we want to predict apple yield using leaf_area, tree_age, and fertilizer: \[\begin{aligned} \text{yield}_i &= \beta_0 + \beta_1 \times \text{leaf_area}_i \\ &\qquad {} + \beta_2 \times \text{tree_age}_i + \beta_3 \times \text{fertilizer}_i + \epsilon_i. \end{aligned}\]

If we set the columns of \(X\) to be

all 1s
leaf_area
tree_age
fertilizer

and \(\beta = (\beta_0, \beta_1, \beta_2, \beta_3),\) then this is \[ \text{yield}_i = (X \beta)_i + \epsilon_i . \]

Or, \[ \text{yield} = X \beta + \epsilon . \]

The model.matrix( ) function gives you back an X matrix appropriate for factors as well.

Generalized Linear Models

Ingredients of a GLM:

A probability distribution, \(Y\).

(the “family”; describes the output)

A linear predictor, \(X \beta\).

(connects input to output)

A link function (usually, \(h(\E[Y]) = X\beta\)).

(connects linear predictor to probability distribution)

Common choices:

Linear: \[ Y \sim \Normal(X \beta, \sigma) .\]
Binomial (“Logistic”): \[ Y \sim \Binom(n, \logistic(X\beta)) .\]
Poisson: \[ Y \sim \Poisson(\exp(X\beta)) .\]
Gamma: \[ Y \sim \Gam(\text{scale}=\exp(X\beta)/k, \text{shape}=k) .\]

Logistic GLM, in R

glm(y ~ x, family=binomial)

family {stats}

Family objects provide a convenient way to specify the details of the models
used by functions such as glm.

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")

Arguments

link: a specification for the model link function. This can be a
name/expression, a literal character string, a length-one character vector, or
an object of class "link-glm" (such as generated by make.link) provided it is
not specified via one of the standard names given next.

The gaussian family accepts the links (as names) identity, log and inverse; the
binomial family the links logit, probit, cauchit, (corresponding to logistic,
normal and Cauchy CDFs respectively) log and cloglog (complementary log-log);
the Gamma family the links inverse, identity and log; the poisson family the
links log, identity, and sqrt; and the inverse.gaussian family the links
1/mu^2, inverse, identity and log.

Note that the link function is the inverse of what we’d write down in the model: for instance, if the mean of the response is the exponential of the linear predictor, i.e., if \[ \E[Y] = \exp(X\beta) . \] then we have

link = "log"

Other inverses:

logistic has inverse logit
identity has inverse identity
x^2 (“squared”) has inverse sqrt

Unpacking the logistic GLM

Simulate data

100 trials, where probability of success depends on \(x\):

alpha <- -7; beta <- 1.2
x <- runif(100, 0, 10)
y <- alpha + beta * x
p <- 1 / (1 + exp(-y))
z <- rbinom(100, size=1, prob=p)
plot(z ~ x)
curve(1/(1+exp(-(-7 + 1.2 *x))), 0, 10, col='red', add=TRUE)

plot of chunk r sim_logistic

`glm()`

zz <- cbind(z, 1-z)
summary(glm_fit <- glm(zz ~ x, family='binomial'))

## 
## Call:
## glm(formula = zz ~ x, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79584  -0.14355   0.01418   0.07181   2.58115  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.7512     2.4701  -3.948 7.89e-05 ***
## x             1.9531     0.4849   4.028 5.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.269  on 99  degrees of freedom
## Residual deviance:  28.807  on 98  degrees of freedom
## AIC: 32.807
## 
## Number of Fisher Scoring iterations: 7

Stan

\[\begin{aligned} Z &\sim \Binom(1, P) \\ P &= \logistic(\alpha + \beta X) \end{aligned}\]

logistic_block <- "
data {
    int N;
    vector[N] X;
    int<lower=0> Z[N];
}
parameters {
    real alpha;
    real beta;
}
model {
    vector[N] p;
    p = inv_logit(alpha + beta * X);
    Z ~ binomial(1, p);
}"

Stan: fit the model

logistic_model <- stan_model(model_code=logistic_block)
fit <- sampling(logistic_model, chains=3, iter=1000,
            data=list(N=100, X=x, Z=z))

## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess

## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess

rstan::summary(fit)$summary

##             mean    se_mean       sd       2.5%        25%        50%        75%      97.5%    n_eff     Rhat
## alpha -11.231396 0.19863109 2.917856 -18.153892 -12.818953 -10.877573  -9.235022  -6.703278 215.7910 1.017132
## beta    2.245831 0.03897684 0.568362   1.364083   1.848582   2.187879   2.551442   3.576015 212.6360 1.017964
## lp__  -15.467009 0.06580101 1.124115 -18.564140 -15.838721 -15.127840 -14.697168 -14.435249 291.8477 1.007922

Stan: posterior distributions

samples <- extract(fit)
layout(t(1:2))
hist(samples$alpha, main=expression(alpha))
abline(v=alpha, col='red'); abline(v=coef(glm_fit)[1], col='green')
hist(samples$beta, main=expression(beta))
abline(v=beta, col='red'); abline(v=coef(glm_fit)[2], col='green')
legend("topright", lty=1, col=c('red', 'green'), legend=c('truth', 'glm'))

plot of chunk r plot_stan_logistic

Identify the GLM

data {
  int N;
  vector[N] X;
  vector[N] Y;
  vector<lower=0> Z[N];
}
parameters {
  real beta[2];
}
model {
  Z ~ gamma(1, exp(- beta[1] * X - beta[2] * Y));
}

What is

the probability distribution? (describes the output)
the linear predictor? (connects input to output)
the link function? (connects linear predictor to probability distribution)

Then, simulate from it.