\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\MVN}{MVN} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Random effects, and mixed models

Peter Ralph

Advanced Biological Statistics

Random effects

An example: urchins eat algae

From Logan:

To investigate density-dependent grazing effects of sea urchin Andrew and Underwood (1993) on filamentous algae measured the percentage of filamentous algae within five quadrats randomly positioned within each of four random patches of reef that were in turn nested within four sea urchin density treatments (no urchins, 33% of natural density, 66% natural density and 100% natural density). The sea urchin density treatment was considered a fixed factor and patch within density treatment as well as the individual quadrats were treated as random factors.

An example: urchins eat algae

andrew_data <- read.table('../Datasets/Logan_data/andrew.tsv', header=T, sep='\t')
head(andrew_data)
##   TREAT PATCH QUAD ALGAE
## 1    0%     1    1    46
## 2    0%     1    2    44
## 3    0%     1    3    41
## 4    0%     1    4    29
## 5    0%     1    5    11
## 6    0%     2    1    65

There are four variables: TREAT, PATCH, QUAD and ALGAE

Main effect factor: TREAT

Both QUAD and PATCH are factors:

andrew_data$QUAD <- factor(andrew_data$QUAD)
andrew_data$PATCH <- factor(andrew_data$PATCH)
andrew_data$TREAT <- factor(andrew_data$TREAT, levels=c("0%", "33%", "66%", "100%"))

Experimental design

with(andrew_data, table(PATCH, QUAD, TREAT))
## , , TREAT = 0%
## 
##      QUAD
## PATCH 1 2 3 4 5
##    1  1 1 1 1 1
##    2  1 1 1 1 1
##    3  1 1 1 1 1
##    4  1 1 1 1 1
##    5  0 0 0 0 0
##    6  0 0 0 0 0
##    7  0 0 0 0 0
##    8  0 0 0 0 0
##    9  0 0 0 0 0
##    10 0 0 0 0 0
##    11 0 0 0 0 0
##    12 0 0 0 0 0
##    13 0 0 0 0 0
##    14 0 0 0 0 0
##    15 0 0 0 0 0
##    16 0 0 0 0 0
## 
## , , TREAT = 33%
## 
##      QUAD
## PATCH 1 2 3 4 5
##    1  0 0 0 0 0
##    2  0 0 0 0 0
##    3  0 0 0 0 0
##    4  0 0 0 0 0
##    5  1 1 1 1 1
##    6  1 1 1 1 1
##    7  1 1 1 1 1
##    8  1 1 1 1 1
##    9  0 0 0 0 0
##    10 0 0 0 0 0
##    11 0 0 0 0 0
##    12 0 0 0 0 0
##    13 0 0 0 0 0
##    14 0 0 0 0 0
##    15 0 0 0 0 0
##    16 0 0 0 0 0
## 
## , , TREAT = 66%
## 
##      QUAD
## PATCH 1 2 3 4 5
##    1  0 0 0 0 0
##    2  0 0 0 0 0
##    3  0 0 0 0 0
##    4  0 0 0 0 0
##    5  0 0 0 0 0
##    6  0 0 0 0 0
##    7  0 0 0 0 0
##    8  0 0 0 0 0
##    9  1 1 1 1 1
##    10 1 1 1 1 1
##    11 1 1 1 1 1
##    12 1 1 1 1 1
##    13 0 0 0 0 0
##    14 0 0 0 0 0
##    15 0 0 0 0 0
##    16 0 0 0 0 0
## 
## , , TREAT = 100%
## 
##      QUAD
## PATCH 1 2 3 4 5
##    1  0 0 0 0 0
##    2  0 0 0 0 0
##    3  0 0 0 0 0
##    4  0 0 0 0 0
##    5  0 0 0 0 0
##    6  0 0 0 0 0
##    7  0 0 0 0 0
##    8  0 0 0 0 0
##    9  0 0 0 0 0
##    10 0 0 0 0 0
##    11 0 0 0 0 0
##    12 0 0 0 0 0
##    13 1 1 1 1 1
##    14 1 1 1 1 1
##    15 1 1 1 1 1
##    16 1 1 1 1 1

Response distribution

plot(ALGAE ~ TREAT, data=andrew_data)
points(ALGAE ~ jitter(as.numeric(TREAT)), data=andrew_data, pch=20, col=1+as.numeric(PATCH)%%4)

plot of chunk r boxit

Why is this wrong?

summary(lm(ALGAE ~ TREAT, data=andrew_data))
## 
## Call:
## lm(formula = ALGAE ~ TREAT, data = andrew_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.20 -19.00  -1.30  12.72  57.45 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   39.200      5.152   7.608 6.17e-11 ***
## TREAT33%     -20.200      7.287  -2.772   0.0070 ** 
## TREAT66%     -17.650      7.287  -2.422   0.0178 *  
## TREAT100%    -37.900      7.287  -5.201 1.62e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.04 on 76 degrees of freedom
## Multiple R-squared:  0.2634, Adjusted R-squared:  0.2343 
## F-statistic: 9.059 on 3 and 76 DF,  p-value: 3.362e-05
xkcd:2533

What we really want: \[ \text{(algae)} = \text{(mean for treatment)} + \text{(mean offset for patch)} + \text{("noise")} . \]

We could do:

ALGAE ~ TREAT + PATCH

… but do we care about all those patch means?

summary(lm(ALGAE ~ TREAT + PATCH, data=andrew_data))
## 
## Call:
## lm(formula = ALGAE ~ TREAT + PATCH, data = andrew_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -36.80  -2.60  -0.50   4.25  42.20 
## 
## Coefficients: (3 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   34.200      7.728   4.426 3.82e-05 ***
## TREAT33%       1.600     10.929   0.146  0.88406    
## TREAT66%     -14.200     10.929  -1.299  0.19850    
## TREAT100%    -31.600     10.929  -2.891  0.00523 ** 
## PATCH2        27.800     10.929   2.544  0.01339 *  
## PATCH3       -32.000     10.929  -2.928  0.00472 ** 
## PATCH4        24.200     10.929   2.214  0.03038 *  
## PATCH5       -33.200     10.929  -3.038  0.00345 ** 
## PATCH6       -35.800     10.929  -3.276  0.00170 ** 
## PATCH7         1.800     10.929   0.165  0.86970    
## PATCH8            NA         NA      NA       NA    
## PATCH9         8.400     10.929   0.769  0.44495    
## PATCH10       16.800     10.929   1.537  0.12917    
## PATCH11      -19.000     10.929  -1.739  0.08693 .  
## PATCH12           NA         NA      NA       NA    
## PATCH13       -1.000     10.929  -0.092  0.92738    
## PATCH14       -2.600     10.929  -0.238  0.81272    
## PATCH15       -1.600     10.929  -0.146  0.88406    
## PATCH16           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.28 on 64 degrees of freedom
## Multiple R-squared:  0.6512, Adjusted R-squared:  0.5694 
## F-statistic: 7.964 on 15 and 64 DF,  p-value: 1.05e-09

Random effects

Small modification: \[ \text{(algae)} = \text{(mean for treatment)} + \text{(random offset for patch)} + \text{("noise")} . \]

We add a random intercept:

ALGAE ~ TREAT + (1|PATCH)

library(lme4)
alglm <- lmer(ALGAE ~ TREAT + (1|PATCH), data=andrew_data)
summary(alglm)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ALGAE ~ TREAT + (1 | PATCH)
##    Data: andrew_data
## 
## REML criterion at convergence: 682.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.9808 -0.3106 -0.1093  0.2831  2.5910 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  PATCH    (Intercept) 294.3    17.16   
##  Residual             298.6    17.28   
## Number of obs: 80, groups:  PATCH, 16
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   39.200      9.408   4.167
## TREAT33%     -20.200     13.305  -1.518
## TREAT66%     -17.650     13.305  -1.327
## TREAT100%    -37.900     13.305  -2.849
## 
## Correlation of Fixed Effects:
##           (Intr) TREAT3 TREAT6
## TREAT33%  -0.707              
## TREAT66%  -0.707  0.500       
## TREAT100% -0.707  0.500  0.500

anova(
      lmer(ALGAE ~ TREAT + (1|PATCH), data=andrew_data),
      lm(ALGAE ~ TREAT, data=andrew_data))
## refitting model(s) with ML (instead of REML)
## Data: andrew_data
## Models:
## lm(ALGAE ~ TREAT, data = andrew_data): ALGAE ~ TREAT
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data): ALGAE ~ TREAT + (1 | PATCH)
##                                                       npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## lm(ALGAE ~ TREAT, data = andrew_data)                    5 734.90 746.81 -362.45   724.90                         
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data)    6 718.83 733.12 -353.42   706.83 18.069  1   2.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(
      lmer(ALGAE ~ TREAT + (1|PATCH), data=andrew_data),
      lmer(ALGAE ~ (1|PATCH), data=andrew_data))
## refitting model(s) with ML (instead of REML)
## Data: andrew_data
## Models:
## lmer(ALGAE ~ (1 | PATCH), data = andrew_data): ALGAE ~ (1 | PATCH)
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data): ALGAE ~ TREAT + (1 | PATCH)
##                                                       npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## lmer(ALGAE ~ (1 | PATCH), data = andrew_data)            3 721.12 728.27 -357.56   715.12                       
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data)    6 718.83 733.12 -353.42   706.83 8.2938  3    0.04031 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What are the random effects?

ranef(alglm)
## $PATCH
##    (Intercept)
## 1   -4.1565746
## 2   18.9539802
## 3  -30.7586522
## 4   15.9612465
## 5  -13.6335647
## 6  -15.7949835
## 7   15.4624576
## 8   13.9660907
## 9    5.6945072
## 10  12.6775526
## 11 -17.0835217
## 12  -1.2885381
## 13   0.2493945
## 14  -1.0807094
## 15  -0.2493945
## 16   1.0807094
## 
## with conditional variances for "PATCH"

rfs <- ranef(alglm)$PATCH
ses <- rfs[,1] + outer(sqrt(as.vector(attr(rfs, "postVar"))), c(-2, 2), "*")
plot(rfs[,1], 1:nrow(rfs), xlab='patch mean', xlim=range(ses), ylab='')
segments(x0=ses[,1], x1=ses[,2], y0=1:nrow(rfs))
abline(v=0, col='red')

plot of chunk r plot_ranef

Notes on mixed models

The math is a lot harder.

For a simple linear model fit by lm( ) (with fixed effects), the log-likelihood function is just the sum of the squared residuals.

But with a mixed model, the likelihood averages over the values of the random effects, which makes everything more difficult.

You sometimes have to worry about convergence.

Since the math is harder, mixed-model-fitting functions like lmer( ) have to use various sorts of numerical optimization methods to find the best-fitting parameters.

Sometimes, these may fail.

Notably, many use the REML approximation:

Usage:

     lmer(formula, data = NULL, REML = TRUE, control = lmerControl(),
          start = NULL, verbose = 0L, subset, weights, na.action,
          offset, contrasts = NULL, devFunOnly = FALSE, ...)

Hypothesis testing?

With fixed effects, for a factor f, the comparison

anova( lm(y ~ f - 1), lm(y ~ 1) )

uses the model that \[ y_i = \beta_{f_i} + \epsilon_i \] to test against the null hypothesis that \[ H_0 : \beta_1 = \beta_2 = \cdots = \beta_m = 0. \]

With random effects,

anova( lm(y ~ (1|f) - 1), lm(y ~ 1) )

uses the model that \[\begin{aligned} y_i &= \beta_{f_i} + \epsilon_i \\ \beta_a &\sim \Normal(0, \eta) \end{aligned}\] to test against the null hypothesis that \[ H_0 : \eta = 0. \]

Back to the pumpkins

Your turn

  1. Adjust the simulations so that there is an effect of plot, drawn randomly from Normal(0, 1).
  2. Fit a model including a random effect for plot.
  3. How big is the “plot” effect? Is it estimated correctly?
  4. Assess significance by using anova( ) to compare to a nested model.

Simulate data

(Here’s our previous code:)

pumpkins <- expand.grid(
          fertilizer=c("low", "medium", "high"),
          water=c("no water", "water"),
          plot=1:4,
          plant=1:5,
          weight=NA)

params <- list(
    mu = 5, # kg
    alpha = c("low" = 0,
              "medium" = +0.5,
              "high" = +1
    ),
    beta = c("no water" = 0,
             "water" = +1
    ),
    gamma = c("high,water" = -1),
    sigma = 1.5
)

inter_name <- paste(pumpkins$fertilizer, pumpkins$water, sep=",")
gamma <- ifelse(
        inter_name %in% names(params$gamma),
        params$gamma[inter_name],
        0
)
pumpkins$mean_weight <- (
    params$mu
    + params$alpha[as.character(pumpkins$fertilizer)]
    + params$beta[as.character(pumpkins$water)]
    + gamma
)

pumpkins$weight <- rnorm(
    nrow(pumpkins),
    mean=pumpkins$mean_weight,
    sd=params$sigma
)