Peter Ralph
27 October – Advanced Biological Statistics
From Logan:
To investigate density-dependent grazing effects of sea urchin Andrew and Underwood (1993) on filamentous algae measured the percentage of filamentous algae within five quadrats randomly positioned within each of four random patches of reef that were in turn nested within four sea urchin density treatments (no urchins, 33% of natural density, 66% natural density and 100% natural density). The sea urchin density treatment was considered a fixed factor and patch within density treatment as well as the individual quadrats were treated as random factors.
andrew_data <- read.table('../Datasets/Logan_data/andrew.tsv', header=T, sep='\t')
head(andrew_data)
## TREAT PATCH QUAD ALGAE
## 1 0% 1 1 46
## 2 0% 1 2 44
## 3 0% 1 3 41
## 4 0% 1 4 29
## 5 0% 1 5 11
## 6 0% 2 1 65
There are four variables: TREAT
, PATCH
, QUAD
and ALGAE
Main effect factor: TREAT
## , , TREAT = 0%
##
## QUAD
## PATCH 1 2 3 4 5
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 3 1 1 1 1 1
## 4 1 1 1 1 1
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 7 0 0 0 0 0
## 8 0 0 0 0 0
## 9 0 0 0 0 0
## 10 0 0 0 0 0
## 11 0 0 0 0 0
## 12 0 0 0 0 0
## 13 0 0 0 0 0
## 14 0 0 0 0 0
## 15 0 0 0 0 0
## 16 0 0 0 0 0
##
## , , TREAT = 33%
##
## QUAD
## PATCH 1 2 3 4 5
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 1 1 1 1 1
## 6 1 1 1 1 1
## 7 1 1 1 1 1
## 8 1 1 1 1 1
## 9 0 0 0 0 0
## 10 0 0 0 0 0
## 11 0 0 0 0 0
## 12 0 0 0 0 0
## 13 0 0 0 0 0
## 14 0 0 0 0 0
## 15 0 0 0 0 0
## 16 0 0 0 0 0
##
## , , TREAT = 66%
##
## QUAD
## PATCH 1 2 3 4 5
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 7 0 0 0 0 0
## 8 0 0 0 0 0
## 9 1 1 1 1 1
## 10 1 1 1 1 1
## 11 1 1 1 1 1
## 12 1 1 1 1 1
## 13 0 0 0 0 0
## 14 0 0 0 0 0
## 15 0 0 0 0 0
## 16 0 0 0 0 0
##
## , , TREAT = 100%
##
## QUAD
## PATCH 1 2 3 4 5
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 7 0 0 0 0 0
## 8 0 0 0 0 0
## 9 0 0 0 0 0
## 10 0 0 0 0 0
## 11 0 0 0 0 0
## 12 0 0 0 0 0
## 13 1 1 1 1 1
## 14 1 1 1 1 1
## 15 1 1 1 1 1
## 16 1 1 1 1 1
Why is this wrong?
##
## Call:
## lm(formula = ALGAE ~ TREAT, data = andrew_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.20 -19.00 -1.30 12.72 57.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.200 5.152 7.608 6.17e-11 ***
## TREAT33% -20.200 7.287 -2.772 0.0070 **
## TREAT66% -17.650 7.287 -2.422 0.0178 *
## TREAT100% -37.900 7.287 -5.201 1.62e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.04 on 76 degrees of freedom
## Multiple R-squared: 0.2634, Adjusted R-squared: 0.2343
## F-statistic: 9.059 on 3 and 76 DF, p-value: 3.362e-05
What we really want: \[ \text{(algae)} = \text{(mean for treatment)} + \text{(mean offset for patch)} + \text{("noise")} . \]
We could do:
ALGAE ~ TREAT + PATCH
… but do we care about all those patch means?
##
## Call:
## lm(formula = ALGAE ~ TREAT + PATCH, data = andrew_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.80 -2.60 -0.50 4.25 42.20
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.200 7.728 4.426 3.82e-05 ***
## TREAT33% 1.600 10.929 0.146 0.88406
## TREAT66% -14.200 10.929 -1.299 0.19850
## TREAT100% -31.600 10.929 -2.891 0.00523 **
## PATCH2 27.800 10.929 2.544 0.01339 *
## PATCH3 -32.000 10.929 -2.928 0.00472 **
## PATCH4 24.200 10.929 2.214 0.03038 *
## PATCH5 -33.200 10.929 -3.038 0.00345 **
## PATCH6 -35.800 10.929 -3.276 0.00170 **
## PATCH7 1.800 10.929 0.165 0.86970
## PATCH8 NA NA NA NA
## PATCH9 8.400 10.929 0.769 0.44495
## PATCH10 16.800 10.929 1.537 0.12917
## PATCH11 -19.000 10.929 -1.739 0.08693 .
## PATCH12 NA NA NA NA
## PATCH13 -1.000 10.929 -0.092 0.92738
## PATCH14 -2.600 10.929 -0.238 0.81272
## PATCH15 -1.600 10.929 -0.146 0.88406
## PATCH16 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.28 on 64 degrees of freedom
## Multiple R-squared: 0.6512, Adjusted R-squared: 0.5694
## F-statistic: 7.964 on 15 and 64 DF, p-value: 1.05e-09
Small modification: \[ \text{(algae)} = \text{(mean for treatment)} + \text{(random offset for patch)} + \text{("noise")} . \]
We add a random intercept:
ALGAE ~ TREAT + (1|PATCH)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ALGAE ~ TREAT + (1 | PATCH)
## Data: andrew_data
##
## REML criterion at convergence: 682.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9808 -0.3106 -0.1093 0.2831 2.5910
##
## Random effects:
## Groups Name Variance Std.Dev.
## PATCH (Intercept) 294.3 17.16
## Residual 298.6 17.28
## Number of obs: 80, groups: PATCH, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 39.200 9.408 4.167
## TREAT33% -20.200 13.305 -1.518
## TREAT66% -17.650 13.305 -1.327
## TREAT100% -37.900 13.305 -2.849
##
## Correlation of Fixed Effects:
## (Intr) TREAT3 TREAT6
## TREAT33% -0.707
## TREAT66% -0.707 0.500
## TREAT100% -0.707 0.500 0.500
## refitting model(s) with ML (instead of REML)
## Data: andrew_data
## Models:
## lm(ALGAE ~ TREAT, data = andrew_data): ALGAE ~ TREAT
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data): ALGAE ~ TREAT + (1 | PATCH)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lm(ALGAE ~ TREAT, data = andrew_data) 5 734.90 746.81 -362.45 724.90
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data) 6 718.83 733.12 -353.42 706.83 18.069 1 2.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(
lmer(ALGAE ~ TREAT + (1|PATCH), data=andrew_data),
lmer(ALGAE ~ (1|PATCH), data=andrew_data))
## refitting model(s) with ML (instead of REML)
## Data: andrew_data
## Models:
## lmer(ALGAE ~ (1 | PATCH), data = andrew_data): ALGAE ~ (1 | PATCH)
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data): ALGAE ~ TREAT + (1 | PATCH)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lmer(ALGAE ~ (1 | PATCH), data = andrew_data) 3 721.12 728.27 -357.56 715.12
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data) 6 718.83 733.12 -353.42 706.83 8.2938 3 0.04031 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## $PATCH
## (Intercept)
## 1 -4.1565746
## 2 18.9539802
## 3 -30.7586522
## 4 15.9612465
## 5 -13.6335647
## 6 -15.7949835
## 7 15.4624576
## 8 13.9660907
## 9 5.6945072
## 10 12.6775526
## 11 -17.0835217
## 12 -1.2885381
## 13 0.2493945
## 14 -1.0807094
## 15 -0.2493945
## 16 1.0807094
##
## with conditional variances for "PATCH"
For a simple linear model fit by lm( )
(with fixed effects), the log-likelihood function is just the sum of the squared residuals.
But with a mixed model, the likelihood averages over the values of the random effects, which makes everything more difficult.
Since the math is harder, mixed-model-fitting functions like lmer( )
have to use various sorts of numerical optimization methods to find the best-fitting parameters.
Sometimes, these may fail.
Notably, many use the REML
approximation:
Usage:
lmer(formula, data = NULL, REML = TRUE, control = lmerControl(),
start = NULL, verbose = 0L, subset, weights, na.action,
offset, contrasts = NULL, devFunOnly = FALSE, ...)
With fixed effects, for a factor f
, the comparison
anova( lm(y ~ f - 1), lm(y ~ 1) )
uses the model that \[ y_i = \beta_{f_i} + \epsilon_i \] to test against the null hypothesis that \[ H_0 : \beta_1 = \beta_2 = \cdots = \beta_m = 0. \]
With random effects,
anova( lm(y ~ (1|f) - 1), lm(y ~ 1) )
uses the model that \[\begin{aligned} y_i &= \beta_{f_i} + \epsilon_i \\ \beta_a &\sim \Normal(0, \eta) \end{aligned}\] to test against the null hypothesis that \[ H_0 : \eta = 0. \]
plot
to the model.anova( )
to compare to a nested model.Data: data/pumpkins.tsv
(1) Add a random effect to the linear model.
library(lme4)
fixed_fit <- lm( weight ~ water * fertilizer, data=pumpkins)
mixed_fit <- lmer( weight ~ water * fertilizer + (1|plot:water:fertilizer), data=pumpkins)
summary(mixed_fit)
## Linear mixed model fit by REML ['lmerMod']
## Formula: weight ~ water * fertilizer + (1 | plot:water:fertilizer)
## Data: pumpkins
##
## REML criterion at convergence: 369.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.64930 -0.54801 0.03808 0.70427 1.69905
##
## Random effects:
## Groups Name Variance Std.Dev.
## plot:water:fertilizer (Intercept) 0.06128 0.2476
## Residual 1.23871 1.1130
## Number of obs: 120, groups: plot:water:fertilizer, 24
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 20.00462 0.27795 71.972
## waterwater 0.02462 0.39308 0.063
## fertilizerlow -12.08324 0.39308 -30.740
## fertilizermedium -6.13599 0.39308 -15.610
## waterwater:fertilizerlow -1.84651 0.55590 -3.322
## waterwater:fertilizermedium 2.67808 0.55590 4.818
##
## Correlation of Fixed Effects:
## (Intr) wtrwtr frtlzrl frtlzrm wtrwtr:frtlzrl
## waterwater -0.707
## fertilizrlw -0.707 0.500
## fertilzrmdm -0.707 0.500 0.500
## wtrwtr:frtlzrl 0.500 -0.707 -0.707 -0.354
## wtrwtr:frtlzrm 0.500 -0.707 -0.354 -0.707 0.500
(2) How big is the “plot” effect? The estimated standard deviation of the plot effects is 0.2475554. We can also look at the estimated random effects visually, by plotting the estimated mean plot effects, with a bar showing plus/minus 2 standard errors.
Overall, our best guess is that plots differ by around 0.2 pounds, but no more than 0.6 pounds from each other (in terms of average pumpkin weight, with 95% confidence).
(3) Compare the models with/without random effects, to see if the random effects seem to have added anything useful.
## refitting model(s) with ML (instead of REML)
## Data: pumpkins
## Models:
## fixed_fit: weight ~ water * fertilizer
## mixed_fit: weight ~ water * fertilizer + (1 | plot:water:fertilizer)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## fixed_fit 7 378.68 398.19 -182.34 364.68
## mixed_fit 8 380.68 402.98 -182.34 364.68 0 1 1
Hm, a \(p\)-value of 1. That’s suspicious. We know there shouldn’t be an effect of plot, but are things really working right?
See what happens if there is a plot effect
sim_pumpkins <- function (plot_sd=0) {
resid_sd <- 1.2
plants_per_plot <- 5
plots_per_treatment <- 4
pumpkins <- expand.grid(
fertilizer=c("low", "medium", "high"),
water=c("no water", "water"),
plot=1:plots_per_treatment,
plant=1:plants_per_plot,
weight=NA)
# true values
mu <- 20
alpha <- c('high'=0, 'medium'=-6, 'low'=-12)
beta <- c('no water'=0, 'water'=0)
gamma <- c('high.no water'=0,
'high.water'=0,
'medium.no water'=0,
'medium.water'=2,
'low.no water'=0,
'low.water'=-2)
plot_delta <- rnorm(plots_per_treatment
* nlevels(pumpkins$fertilizer)
* nlevels(pumpkins$water),
mean=0, sd=plot_sd)
k <- 1
for (p in 1:plots_per_treatment) {
for (f in levels(pumpkins$fertilizer)) {
for (w in levels(pumpkins$water)) {
names(plot_delta)[k] <- paste(p, f, w, sep='.')
k <- k + 1
}
}
}
pumpkins$mean_weight <- (mu
+ alpha[as.character(pumpkins$fertilizer)]
+ beta[as.character(pumpkins$water)]
+ gamma[paste(pumpkins$fertilizer, pumpkins$water, sep='.')]
+ plot_delta[paste(pumpkins$plot, pumpkins$fertilizer, pumpkins$water, sep='.')])
pumpkins$weight <- rnorm(nrow(pumpkins),
mean=pumpkins$mean_weight,
sd=resid_sd)
return(pumpkins)
}
new_pumpkins <- sim_pumpkins(plot_sd=3)
new_fixed_fit <- lm(
weight ~ water * fertilizer, data=new_pumpkins)
new_mixed_fit <- lmer(
weight ~ water * fertilizer + (1|plot:water:fertilizer),
data=new_pumpkins)
anova(new_mixed_fit, new_fixed_fit)
## refitting model(s) with ML (instead of REML)
## Data: new_pumpkins
## Models:
## new_fixed_fit: weight ~ water * fertilizer
## new_mixed_fit: weight ~ water * fertilizer + (1 | plot:water:fertilizer)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## new_fixed_fit 7 584.57 604.08 -285.29 570.57
## new_mixed_fit 8 466.63 488.93 -225.32 450.63 119.94 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, everything seems to be working right!
In conclusion, there’s no good evidence that we need a random effect for plots (i.e., that plots are inducing unmodeled correlation in our data). But it’s not hurting our model to include them, either.