\[
%%
% Add your macros here; they'll be included in pdf and html output.
%%
\newcommand{\R}{\mathbb{R}} % reals
\newcommand{\E}{\mathbb{E}} % expectation
\renewcommand{\P}{\mathbb{P}} % probability
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\logistic}{logistic}
\DeclareMathOperator{\SE}{SE}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\cor}{cor}
\DeclareMathOperator{\Normal}{Normal}
\DeclareMathOperator{\MVN}{MVN}
\DeclareMathOperator{\LogNormal}{logNormal}
\DeclareMathOperator{\Poisson}{Poisson}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Binom}{Binomial}
\DeclareMathOperator{\Gam}{Gamma}
\DeclareMathOperator{\Exp}{Exponential}
\DeclareMathOperator{\Cauchy}{Cauchy}
\DeclareMathOperator{\Unif}{Unif}
\DeclareMathOperator{\Dirichlet}{Dirichlet}
\DeclareMathOperator{\Wishart}{Wishart}
\DeclareMathOperator{\StudentsT}{StudentsT}
\DeclareMathOperator{\Weibull}{Weibull}
\newcommand{\given}{\;\vert\;}
\]

Generalized Linear Models

Peter Ralph

Advanced Biological Statistics

Notation note

Matrix multiplication

If

  • \(b\) is a \(k\)-vector
  • \(X\) is an \(n \times k\) matrix

then “\(X b\)” is shorthand for the \(n\)-vector \[ (Xb)_i = \sum_{j=1}^k X_{ij} b_j . \]

  • In R, we use %*%.
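For example (a small sketch; the numbers are arbitrary):

X <- matrix(1:12, nrow=4)   # an n x k matrix, with n = 4 and k = 3
b <- c(2, -1, 0.5)          # a k-vector
X %*% b                     # an n-vector: (X b)_i = sum_j X[i,j] * b[j]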

Example

Let’s say we want to predict apple yield using leaf_area, tree_age, and fertilizer: \[\begin{aligned} \text{yield}_i &= \beta_0 + \beta_1 \times \text{leaf\_area}_i \\ &\qquad {} + \beta_2 \times \text{tree\_age}_i + \beta_3 \times \text{fertilizer}_i + \epsilon_i. \end{aligned}\]

If we set the columns of \(X\) to be

  1. all 1s
  2. leaf_area
  3. tree_age
  4. fertilizer

and \(\beta = (\beta_0, \beta_1, \beta_2, \beta_3),\) then this is \[ \text{yield}_i = (X \beta)_i + \epsilon_i . \]

Or, \[ \text{yield} = X \beta + \epsilon . \]

Recall that the model.matrix() function builds an appropriate \(X\) matrix for you, handling factors correctly as well.
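For instance (with a small made-up data frame, just to illustrate):

d <- data.frame(leaf_area = c(1.2, 0.8, 2.1),
                fertilizer = factor(c("none", "low", "high")))
model.matrix(~ leaf_area + fertilizer, data=d)   # intercept column, leaf_area, and dummy columns for the factor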

Generalized Linear Models

Ingredients of a GLM:

  1. A probability distribution, \(Y\).

(the “family”; describes the output)

  2. A linear predictor, \(X \beta\).

(connects input to output)

  3. A link function (usually, \(h(\E[Y]) = X\beta\)).

(connects linear predictor to probability distribution)

Common choices:

  • Linear: \[ Y \sim \Normal(X \beta, \sigma) .\]

  • Binomial (“Logistic”): \[ Y \sim \Binom(n, \logistic(X\beta)) .\]

  • Poisson: \[ Y \sim \Poisson(\exp(X\beta)) .\]

  • Gamma: \[ Y \sim \Gam(\text{scale}=\exp(X\beta)/k, \text{shape}=k) .\]
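As a quick illustration of how the link connects the pieces, here's a sketch simulating from the Poisson model above (parameter values are made up):

n <- 50
X <- cbind(1, runif(n, 0, 2))            # model matrix: intercept plus one predictor
beta <- c(0.5, 1.0)                      # made-up coefficients
y <- rpois(n, lambda = exp(X %*% beta))  # the mean is the exponential of the linear predictor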

Logistic GLM, in R

glm(y ~ x, family=binomial)
From the R help page (?family):

family {stats}

Family objects provide a convenient way to specify the details of the models
used by functions such as glm.

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")

Arguments

link: a specification for the model link function. This can be a
name/expression, a literal character string, a length-one character vector, or
an object of class "link-glm" (such as generated by make.link) provided it is
not specified via one of the standard names given next.

The gaussian family accepts the links (as names) identity, log and inverse; the
binomial family the links logit, probit, cauchit, (corresponding to logistic,
normal and Cauchy CDFs respectively) log and cloglog (complementary log-log);
the Gamma family the links inverse, identity and log; the poisson family the
links log, identity, and sqrt; and the inverse.gaussian family the links
1/mu^2, inverse, identity and log.

Note that the link function is the inverse of the function we’d write down in the model: for instance, if the mean of the response is the exponential of the linear predictor, i.e., if \[ \E[Y] = \exp(X\beta) , \] then we have

link = "log"

Other inverses:

  • logistic has inverse logit
  • identity has inverse identity
  • x^2 (“squared”) has inverse sqrt
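You can check a link/inverse pair directly with make.link() (in the stats package):

l <- make.link("log")
l$linkfun(10)              # the link function: log(10)
l$linkinv(l$linkfun(10))   # its inverse undoes it: back to 10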

Unpacking the logistic GLM

Simulate data

100 trials, where probability of success depends on \(x\):

alpha <- -7; beta <- 1.2                    # true intercept and slope
df <- data.frame( x = runif(100, 0, 10) )
df$y <- alpha + beta * df$x                 # linear predictor
df$p <- 1 / (1 + exp(-df$y))                # inverse logit: probability of success
df$z <- rbinom(100, size=1, prob=df$p)      # binary response
plot(z ~ x, data=df)
curve(1/(1+exp(-(alpha + beta * x))), 0, 10, col='red', add=TRUE)   # the true curve

(Figure: simulated binary responses against \(x\), with the true logistic curve in red.)

glm()

zz <- cbind(df$z, 1-df$z)   # two-column response: (successes, failures)
summary(
    glm_fit <- glm(zz ~ df$x, family='binomial')
)
## 
## Call:
## glm(formula = zz ~ df$x, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79584  -0.14355   0.01418   0.07181   2.58115  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.7512     2.4701  -3.948 7.89e-05 ***
## df$x          1.9531     0.4849   4.028 5.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.269  on 99  degrees of freedom
## Residual deviance:  28.807  on 98  degrees of freedom
## AIC: 32.807
## 
## Number of Fisher Scoring iterations: 7
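The estimates are on the logit scale; to get back to probabilities, ask predict() for the response scale (a quick check, adding the fitted curve to the earlier plot):

coef(glm_fit)                                 # estimates, on the logit scale
phat <- predict(glm_fit, type="response")     # fitted success probabilities
lines(sort(df$x), phat[order(df$x)], lty=2)   # fitted curve, added to the earlier plot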

brms::brm()

( brm_fit <- brm(z | trials(1) ~ x, family='binomial', data=df) )
## Only 2 levels detected so that family 'bernoulli' might be a more efficient choice.
## Compiling Stan program...
## Start sampling
##  Family: binomial 
##   Links: mu = logit 
## Formula: z | trials(1) ~ x 
##    Data: df (Number of observations: 100) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   -11.07      2.81   -17.54    -6.67 1.00     2275     2103
## x             2.21      0.55     1.35     3.45 1.00     2053     1833
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

brms: posterior distributions

# mcmc_hist() is from the bayesplot package; geom_vline() and aes() are from ggplot2
mcmc_hist(brm_fit, pars=c("b_Intercept", "b_x")) +
    geom_vline(data=data.frame(Parameter=c("b_Intercept", "b_x"), x=c(alpha, beta)),
               aes(xintercept = x), colour='red')   # true values in red

(Figure: posterior histograms of b_Intercept and b_x, with the true values marked in red.)
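To work with the posterior directly, you can pull out the draws themselves (as_draws_df() is from the posterior package, re-exported by recent versions of brms):

draws <- as_draws_df(brm_fit)
mean(draws$b_x)        # posterior mean of the slope
mean(draws$b_x > 0)    # posterior probability that the slope is positive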

Identify the GLM

\[\begin{aligned} Z &\sim \Gam(\text{shape}=1, \text{scale}=\mu) \\ \mu &=\exp(-\beta_1 X - \beta_2 Y) \end{aligned}\]

What is

  1. the probability distribution? (describes the output)

  2. the linear predictor? (connects input to output)

  3. the link function? (connects linear predictor to probability distribution)

Then, simulate from it.

IN CLASS

n <- 100
df <- data.frame(
    x = runif(n, 0, 10),
    y = rbinom(n, size=1, prob=0.5)   # a binary covariate
)
beta1 <- 2.5
beta2 <- -0.5
df$mu <- exp(-beta1 * df$x - beta2 * df$y)   # the mean, via the inverse (exp) link
df$z <- rgamma(n, shape=1, scale=df$mu)      # Gamma response with mean mu (since shape=1)
plot(z ~ x, data=df, col=ifelse(y, 'red', 'black'), pch=20)

(Figure: simulated Gamma responses against \(x\), colored by \(y\).)

gfit <- glm(z ~ x + y + 0, family=Gamma(link="log"), data=df)
summary(gfit)
## 
## Call:
## glm(formula = z ~ x + y + 0, family = Gamma(link = "log"), data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8205  -0.9298  -0.3615   0.3247   2.0758  
## 
## Coefficients:
##   Estimate Std. Error  t value Pr(>|t|)    
## x -2.46608    0.02082 -118.467  < 2e-16 ***
## y  0.48068    0.17177    2.798  0.00618 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.9292648)
## 
##     Null deviance: 2297.80  on 100  degrees of freedom
## Residual deviance:  102.01  on  98  degrees of freedom
## AIC: -2171.4
## 
## Number of Fisher Scoring iterations: 6
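As a sanity check, the estimated coefficients should be close to the negatives of the simulation parameters, since \(\log \E[Z] = -\beta_1 x - \beta_2 y\):

coef(gfit)                   # fitted coefficients, on the log scale
c(x = -beta1, y = -beta2)    # true values: -2.5 and 0.5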