\[
%%
% Add your macros here; they'll be included in pdf and html output.
%%
\newcommand{\R}{\mathbb{R}} % reals
\newcommand{\E}{\mathbb{E}} % expectation
\renewcommand{\P}{\mathbb{P}} % probability
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\logistic}{logistic}
\DeclareMathOperator{\SE}{SE}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\cor}{cor}
\DeclareMathOperator{\Normal}{Normal}
\DeclareMathOperator{\MVN}{MVN}
\DeclareMathOperator{\LogNormal}{logNormal}
\DeclareMathOperator{\Poisson}{Poisson}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Binom}{Binomial}
\DeclareMathOperator{\Gam}{Gamma}
\DeclareMathOperator{\Exp}{Exponential}
\DeclareMathOperator{\Cauchy}{Cauchy}
\DeclareMathOperator{\Unif}{Unif}
\DeclareMathOperator{\Dirichlet}{Dirichlet}
\DeclareMathOperator{\Wishart}{Wishart}
\DeclareMathOperator{\StudentsT}{StudentsT}
\DeclareMathOperator{\Weibull}{Weibull}
\newcommand{\given}{\;\vert\;}
\]

Generalized Linear Models

Peter Ralph

Advanced Biological Statistics

Notation note

Matrix multiplication

If

  • \(b\) is a \(k\)-vector
  • \(X\) is an \(n \times k\) matrix

then “\(X b\)” is shorthand for the \(n\)-vector \[ (Xb)_i = \sum_{j=1}^k X_{ij} b_j . \]

  • In R, we use %*%.
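For example (a small sketch; the numbers are arbitrary):

X <- matrix(1:12, nrow=4)   # an n x k matrix, with n = 4 and k = 3
b <- c(2, -1, 0.5)          # a k-vector
X %*% b                     # an n-vector: (X b)_i = sum_j X[i,j] * b[j]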

Example

Let’s say we want to predict apple yield using leaf_area, tree_age, and fertilizer: \[\begin{aligned} \text{yield}_i &= \beta_0 + \beta_1 \times \text{leaf\_area}_i \\ &\qquad {} + \beta_2 \times \text{tree\_age}_i + \beta_3 \times \text{fertilizer}_i + \epsilon_i. \end{aligned}\]

If we set the columns of \(X\) to be

  1. all 1s
  2. leaf_area
  3. tree_age
  4. fertilizer

and \(\beta = (\beta_0, \beta_1, \beta_2, \beta_3),\) then this is \[ \text{yield}_i = (X \beta)_i + \epsilon_i . \]

Or, \[ \text{yield} = X \beta + \epsilon . \]

Recall that the model.matrix() function builds an appropriate \(X\) matrix for you, handling factors correctly as well.
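For instance (with a small made-up data frame, just to illustrate):

d <- data.frame(leaf_area = c(1.2, 0.8, 2.1),
                fertilizer = factor(c("none", "low", "high")))
model.matrix(~ leaf_area + fertilizer, data=d)   # intercept column, leaf_area, and dummy columns for the factor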

Generalized Linear Models

Ingredients of a GLM:

  1. A probability distribution, \(Y\).

(the “family”; describes the output)

  2. A linear predictor, \(X \beta\).

(connects input to output)

  3. A link function (usually, \(h(\E[Y]) = X\beta\)).

(connects linear predictor to probability distribution)

Common choices:

  • Linear: \[ Y \sim \Normal(X \beta, \sigma) .\]

  • Binomial (“Logistic”): \[ Y \sim \Binom(n, \logistic(X\beta)) .\]

  • Poisson: \[ Y \sim \Poisson(\exp(X\beta)) .\]

  • Gamma: \[ Y \sim \Gam(\text{scale}=\exp(X\beta)/k, \text{shape}=k) .\]
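As a quick illustration of how the link connects the pieces, here's a sketch simulating from the Poisson model above (parameter values are made up):

n <- 50
X <- cbind(1, runif(n, 0, 2))            # model matrix: intercept plus one predictor
beta <- c(0.5, 1.0)                      # made-up coefficients
y <- rpois(n, lambda = exp(X %*% beta))  # the mean is the exponential of the linear predictor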

Logistic GLM, in R

glm(y ~ x, family=binomial)
From the R help page (?family):

family {stats}

Family objects provide a convenient way to specify the details of the models
used by functions such as glm.

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")

Arguments

link: a specification for the model link function. This can be a
name/expression, a literal character string, a length-one character vector, or
an object of class "link-glm" (such as generated by make.link) provided it is
not specified via one of the standard names given next.

The gaussian family accepts the links (as names) identity, log and inverse; the
binomial family the links logit, probit, cauchit, (corresponding to logistic,
normal and Cauchy CDFs respectively) log and cloglog (complementary log-log);
the Gamma family the links inverse, identity and log; the poisson family the
links log, identity, and sqrt; and the inverse.gaussian family the links
1/mu^2, inverse, identity and log.

Note that the link function is the inverse of the function we’d write down in the model: for instance, if the mean of the response is the exponential of the linear predictor, i.e., if \[ \E[Y] = \exp(X\beta) , \] then we have

link = "log"

Other inverses:

  • logistic has inverse logit
  • identity has inverse identity
  • x^2 (“squared”) has inverse sqrt
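You can check a link/inverse pair directly with make.link() (in the stats package):

l <- make.link("log")
l$linkfun(10)              # the link function: log(10)
l$linkinv(l$linkfun(10))   # its inverse undoes it: back to 10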

Unpacking the logistic GLM

Simulate data

100 trials, where probability of success depends on \(x\):

alpha <- -7; beta <- 1.2                    # true intercept and slope
df <- data.frame( x = runif(100, 0, 10) )
df$y <- alpha + beta * df$x                 # linear predictor
df$p <- 1 / (1 + exp(-df$y))                # inverse logit: probability of success
df$z <- rbinom(100, size=1, prob=df$p)      # binary response
plot(z ~ x, data=df)
curve(1/(1+exp(-(alpha + beta * x))), 0, 10, col='red', add=TRUE)   # the true curve

(Figure: simulated binary responses against \(x\), with the true logistic curve in red.)

glm()

zz <- cbind(df$z, 1-df$z)   # two-column response: (successes, failures)
summary(
    glm_fit <- glm(zz ~ df$x, family='binomial')
)
## 
## Call:
## glm(formula = zz ~ df$x, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79584  -0.14355   0.01418   0.07181   2.58115  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.7512     2.4701  -3.948 7.89e-05 ***
## df$x          1.9531     0.4849   4.028 5.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.269  on 99  degrees of freedom
## Residual deviance:  28.807  on 98  degrees of freedom
## AIC: 32.807
## 
## Number of Fisher Scoring iterations: 7
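The estimates are on the logit scale; to get back to probabilities, ask predict() for the response scale (a quick check, adding the fitted curve to the earlier plot):

coef(glm_fit)                                 # estimates, on the logit scale
phat <- predict(glm_fit, type="response")     # fitted success probabilities
lines(sort(df$x), phat[order(df$x)], lty=2)   # fitted curve, added to the earlier plot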

brms::brm()

( brm_fit <- brm(z | trials(1) ~ x, family='binomial', data=df) )
## Only 2 levels detected so that family 'bernoulli' might be a more efficient choice.
## Compiling Stan program...
## Start sampling
##  Family: binomial 
##   Links: mu = logit 
## Formula: z | trials(1) ~ x 
##    Data: df (Number of observations: 100) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept   -11.07      2.81   -17.54    -6.67 1.00     2275     2103
## x             2.21      0.55     1.35     3.45 1.00     2053     1833
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

brms: posterior distributions

# mcmc_hist() is from the bayesplot package; geom_vline() and aes() are from ggplot2
mcmc_hist(brm_fit, pars=c("b_Intercept", "b_x")) +
    geom_vline(data=data.frame(Parameter=c("b_Intercept", "b_x"), x=c(alpha, beta)),
               aes(xintercept = x), colour='red')   # true values in red

(Figure: posterior histograms of b_Intercept and b_x, with the true values marked in red.)
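To work with the posterior directly, you can pull out the draws themselves (as_draws_df() is from the posterior package, re-exported by recent versions of brms):

draws <- as_draws_df(brm_fit)
mean(draws$b_x)        # posterior mean of the slope
mean(draws$b_x > 0)    # posterior probability that the slope is positive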

Identify the GLM

\[\begin{aligned} Z &\sim \Gam(\text{shape}=1, \text{scale}=\mu) \\ \mu &=\exp(-\beta_1 X - \beta_2 Y) \end{aligned}\]

What is

  1. the probability distribution? (describes the output)

  2. the linear predictor? (connects input to output)

  3. the link function? (connects linear predictor to probability distribution)

Then, simulate from it.

IN CLASS

n <- 100
df <- data.frame(
    x = runif(n, 0, 10),
    y = rbinom(n, size=1, prob=0.5)   # a binary covariate
)
beta1 <- 2.5
beta2 <- -0.5
df$mu <- exp(-beta1 * df$x - beta2 * df$y)   # the mean, via the inverse (exp) link
df$z <- rgamma(n, shape=1, scale=df$mu)      # Gamma response with mean mu (since shape=1)
plot(z ~ x, data=df, col=ifelse(y, 'red', 'black'), pch=20)

(Figure: simulated Gamma responses against \(x\), colored by \(y\).)

gfit <- glm(z ~ x + y + 0, family=Gamma(link="log"), data=df)
summary(gfit)
## 
## Call:
## glm(formula = z ~ x + y + 0, family = Gamma(link = "log"), data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8205  -0.9298  -0.3615   0.3247   2.0758  
## 
## Coefficients:
##   Estimate Std. Error  t value Pr(>|t|)    
## x -2.46608    0.02082 -118.467  < 2e-16 ***
## y  0.48068    0.17177    2.798  0.00618 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.9292648)
## 
##     Null deviance: 2297.80  on 100  degrees of freedom
## Residual deviance:  102.01  on  98  degrees of freedom
## AIC: -2171.4
## 
## Number of Fisher Scoring iterations: 6
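As a sanity check, the estimated coefficients should be close to the negatives of the simulation parameters, since \(\log \E[Z] = -\beta_1 x - \beta_2 y\):

coef(gfit)                   # fitted coefficients, on the log scale
c(x = -beta1, y = -beta2)    # true values: -2.5 and 0.5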