Peter Ralph
Advanced Biological Statistics
From Logan:
To investigate density-dependent grazing effects of sea urchin Andrew and Underwood (1993) on filamentous algae measured the percentage of filamentous algae within five quadrats randomly positioned within each of four random patches of reef that were in turn nested within four sea urchin density treatments (no urchins, 33% of natural density, 66% natural density and 100% natural density). The sea urchin density treatment was considered a fixed factor and patch within density treatment as well as the individual quadrats were treated as random factors.
andrew_data <- read.table('../Datasets/Logan_data/andrew.tsv', header=T, sep='\t')
head(andrew_data)
## TREAT PATCH QUAD ALGAE
## 1 0% 1 1 46
## 2 0% 1 2 44
## 3 0% 1 3 41
## 4 0% 1 4 29
## 5 0% 1 5 11
## 6 0% 2 1 65
There are four variables: TREAT
, PATCH
,
QUAD
and ALGAE
Main effect factor: TREAT
## , , TREAT = 0%
##
## QUAD
## PATCH 1 2 3 4 5
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 3 1 1 1 1 1
## 4 1 1 1 1 1
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 7 0 0 0 0 0
## 8 0 0 0 0 0
## 9 0 0 0 0 0
## 10 0 0 0 0 0
## 11 0 0 0 0 0
## 12 0 0 0 0 0
## 13 0 0 0 0 0
## 14 0 0 0 0 0
## 15 0 0 0 0 0
## 16 0 0 0 0 0
##
## , , TREAT = 33%
##
## QUAD
## PATCH 1 2 3 4 5
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 1 1 1 1 1
## 6 1 1 1 1 1
## 7 1 1 1 1 1
## 8 1 1 1 1 1
## 9 0 0 0 0 0
## 10 0 0 0 0 0
## 11 0 0 0 0 0
## 12 0 0 0 0 0
## 13 0 0 0 0 0
## 14 0 0 0 0 0
## 15 0 0 0 0 0
## 16 0 0 0 0 0
##
## , , TREAT = 66%
##
## QUAD
## PATCH 1 2 3 4 5
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 7 0 0 0 0 0
## 8 0 0 0 0 0
## 9 1 1 1 1 1
## 10 1 1 1 1 1
## 11 1 1 1 1 1
## 12 1 1 1 1 1
## 13 0 0 0 0 0
## 14 0 0 0 0 0
## 15 0 0 0 0 0
## 16 0 0 0 0 0
##
## , , TREAT = 100%
##
## QUAD
## PATCH 1 2 3 4 5
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## 7 0 0 0 0 0
## 8 0 0 0 0 0
## 9 0 0 0 0 0
## 10 0 0 0 0 0
## 11 0 0 0 0 0
## 12 0 0 0 0 0
## 13 1 1 1 1 1
## 14 1 1 1 1 1
## 15 1 1 1 1 1
## 16 1 1 1 1 1
Why is this wrong?
##
## Call:
## lm(formula = ALGAE ~ TREAT, data = andrew_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.20 -19.00 -1.30 12.72 57.45
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.200 5.152 7.608 6.17e-11 ***
## TREAT33% -20.200 7.287 -2.772 0.0070 **
## TREAT66% -17.650 7.287 -2.422 0.0178 *
## TREAT100% -37.900 7.287 -5.201 1.62e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.04 on 76 degrees of freedom
## Multiple R-squared: 0.2634, Adjusted R-squared: 0.2343
## F-statistic: 9.059 on 3 and 76 DF, p-value: 3.362e-05
What we really want: \[ \text{(algae)} = \text{(mean for treatment)} + \text{(mean offset for patch)} + \text{("noise")} . \]
We could do:
ALGAE ~ TREAT + PATCH
… but do we care about all those patch means?
##
## Call:
## lm(formula = ALGAE ~ TREAT + PATCH, data = andrew_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.80 -2.60 -0.50 4.25 42.20
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.200 7.728 4.426 3.82e-05 ***
## TREAT33% 1.600 10.929 0.146 0.88406
## TREAT66% -14.200 10.929 -1.299 0.19850
## TREAT100% -31.600 10.929 -2.891 0.00523 **
## PATCH2 27.800 10.929 2.544 0.01339 *
## PATCH3 -32.000 10.929 -2.928 0.00472 **
## PATCH4 24.200 10.929 2.214 0.03038 *
## PATCH5 -33.200 10.929 -3.038 0.00345 **
## PATCH6 -35.800 10.929 -3.276 0.00170 **
## PATCH7 1.800 10.929 0.165 0.86970
## PATCH8 NA NA NA NA
## PATCH9 8.400 10.929 0.769 0.44495
## PATCH10 16.800 10.929 1.537 0.12917
## PATCH11 -19.000 10.929 -1.739 0.08693 .
## PATCH12 NA NA NA NA
## PATCH13 -1.000 10.929 -0.092 0.92738
## PATCH14 -2.600 10.929 -0.238 0.81272
## PATCH15 -1.600 10.929 -0.146 0.88406
## PATCH16 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.28 on 64 degrees of freedom
## Multiple R-squared: 0.6512, Adjusted R-squared: 0.5694
## F-statistic: 7.964 on 15 and 64 DF, p-value: 1.05e-09
Small modification: \[ \text{(algae)} = \text{(mean for treatment)} + \text{(random offset for patch)} + \text{("noise")} . \]
We add a random intercept:
ALGAE ~ TREAT + (1|PATCH)
## Linear mixed model fit by REML ['lmerMod']
## Formula: ALGAE ~ TREAT + (1 | PATCH)
## Data: andrew_data
##
## REML criterion at convergence: 682.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9808 -0.3106 -0.1093 0.2831 2.5910
##
## Random effects:
## Groups Name Variance Std.Dev.
## PATCH (Intercept) 294.3 17.16
## Residual 298.6 17.28
## Number of obs: 80, groups: PATCH, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 39.200 9.408 4.167
## TREAT33% -20.200 13.305 -1.518
## TREAT66% -17.650 13.305 -1.327
## TREAT100% -37.900 13.305 -2.849
##
## Correlation of Fixed Effects:
## (Intr) TREAT3 TREAT6
## TREAT33% -0.707
## TREAT66% -0.707 0.500
## TREAT100% -0.707 0.500 0.500
## refitting model(s) with ML (instead of REML)
## Data: andrew_data
## Models:
## lm(ALGAE ~ TREAT, data = andrew_data): ALGAE ~ TREAT
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data): ALGAE ~ TREAT + (1 | PATCH)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lm(ALGAE ~ TREAT, data = andrew_data) 5 734.90 746.81 -362.45 724.90
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data) 6 718.83 733.12 -353.42 706.83 18.069 1 2.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(
lmer(ALGAE ~ TREAT + (1|PATCH), data=andrew_data),
lmer(ALGAE ~ (1|PATCH), data=andrew_data))
## refitting model(s) with ML (instead of REML)
## Data: andrew_data
## Models:
## lmer(ALGAE ~ (1 | PATCH), data = andrew_data): ALGAE ~ (1 | PATCH)
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data): ALGAE ~ TREAT + (1 | PATCH)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## lmer(ALGAE ~ (1 | PATCH), data = andrew_data) 3 721.12 728.27 -357.56 715.12
## lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew_data) 6 718.83 733.12 -353.42 706.83 8.2938 3 0.04031 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## $PATCH
## (Intercept)
## 1 -4.1565746
## 2 18.9539802
## 3 -30.7586522
## 4 15.9612465
## 5 -13.6335647
## 6 -15.7949835
## 7 15.4624576
## 8 13.9660907
## 9 5.6945072
## 10 12.6775526
## 11 -17.0835217
## 12 -1.2885381
## 13 0.2493945
## 14 -1.0807094
## 15 -0.2493945
## 16 1.0807094
##
## with conditional variances for "PATCH"
For a simple linear model fit by lm( )
(with fixed
effects), the log-likelihood function is just the sum of
the squared residuals.
But with a mixed model, the likelihood averages over the values of the random effects, which makes everything more difficult.
Since the math is harder, mixed-model-fitting functions like
lmer( )
have to use various sorts of numerical
optimization methods to find the best-fitting parameters.
Sometimes, these may fail.
Notably, many use the REML
approximation:
Usage:
lmer(formula, data = NULL, REML = TRUE, control = lmerControl(),
start = NULL, verbose = 0L, subset, weights, na.action,
offset, contrasts = NULL, devFunOnly = FALSE, ...)
With fixed effects, for a factor f
, the comparison
anova( lm(y ~ f - 1), lm(y ~ 1) )
uses the model that \[ y_i = \beta_{f_i} + \epsilon_i \] to test against the null hypothesis that \[ H_0 : \beta_1 = \beta_2 = \cdots = \beta_m = 0. \]
With random effects,
anova( lm(y ~ (1|f) - 1), lm(y ~ 1) )
uses the model that \[\begin{aligned} y_i &= \beta_{f_i} + \epsilon_i \\ \beta_a &\sim \Normal(0, \eta) \end{aligned}\] to test against the null hypothesis that \[ H_0 : \eta = 0. \]
plot
, drawn randomly from Normal(0, 1).plot
.anova( )
to compare to a
nested model.(Here’s our previous code:)
pumpkins <- expand.grid(
fertilizer=c("low", "medium", "high"),
water=c("no water", "water"),
plot=1:4,
plant=1:5,
weight=NA)
params <- list(
mu = 5, # kg
alpha = c("low" = 0,
"medium" = +0.5,
"high" = +1
),
beta = c("no water" = 0,
"water" = +1
),
gamma = c("high,water" = -1),
sigma = 1.5
)
inter_name <- paste(pumpkins$fertilizer, pumpkins$water, sep=",")
gamma <- ifelse(
inter_name %in% names(params$gamma),
params$gamma[inter_name],
0
)
pumpkins$mean_weight <- (
params$mu
+ params$alpha[as.character(pumpkins$fertilizer)]
+ params$beta[as.character(pumpkins$water)]
+ gamma
)
pumpkins$weight <- rnorm(
nrow(pumpkins),
mean=pumpkins$mean_weight,
sd=params$sigma
)