Peter Ralph
Advanced Biological Statistics
Linear models…
Say we have \(n\) observations coming from combinations of two factors, so that the \(k\)th observation in the \(i\)th group of factor \(A\) and the \(j\)th group of factor \(B\) is \[\begin{equation} X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk} , \end{equation}\] where
In words, \[\begin{equation} \begin{split} \text{(value)} &= \text{(overall mean)} + \text{(A group mean)} \\ &\qquad {} + \text{(B group mean)} + \text{(AB group mean)} + \text{(residual)} \end{split}\end{equation}\]
We’re looking at how mean pumpkin weight depends on both
So, we
draw the pictures
Ignore any “plant” and “plot” effects.
(e.g., only one pumpkin per vine and one plot per combination of conditions)
Say that \(i=1, 2, 3\) indexes fertilizer levels (low to high), and \(j=1, 2\) indexes late watering (no or yes), and \[\begin{equation}\begin{split} X_{ijk} &= \text{(weight of $k$th pumpkin in plot with conditions $i$, $j$)} \\ &= \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk} , \end{split}\end{equation}\] where
A good way to get a better concrete understanding of something is by simulating it –
by writing code to generate a (random) dataset that you design to look, more or less like what you expect the real data to look like.
This lets you explore statistical power, choose sample sizes, etcetera… but also makes you realize things you hadn’t, previously.
pumpkins <- expand.grid(
fertilizer=c("low", "medium", "high"),
water=c("no water", "water"),
plot=1:4,
plant=1:5,
weight=NA)
head(pumpkins)
## fertilizer water plot plant weight
## 1 low no water 1 1 NA
## 2 medium no water 1 1 NA
## 3 high no water 1 1 NA
## 4 low water 1 1 NA
## 5 medium water 1 1 NA
## 6 high water 1 1 NA
We’ll first get the mean valeus in there. First, we’ll do this using a for loop (easier to think through!):
# for loop
pumpkins$mean_weight <- NA
for (j in 1:nrow(pumpkins)) {
fert <- pumpkins$fertilizer[j]
water <- pumpkins$water[j]
gamma <- params$gamma[paste(fert, water, sep=',')]
if (is.na(gamma)) {
gamma <- 0
}
pumpkins$mean_weight[j] <- (params$mu
+ params$alpha[fert]
+ params$beta[water]
+ gamma
)
}
Here’s an alternative implementation, vectorized:
# not-for-loop version
interaction <- paste(
pumpkins$fertilizer,
pumpkins$water,
sep=','
)
mw <- (
params$mu
+ params$alpha[pumpkins$fertilizer]
+ params$beta[pumpkins$water]
+ ifelse(interaction %in% names(params$gamma),
params$gamma[interaction],
0
)
)
# check it agrees with the previous code:
stopifnot(all( mw == pumpkins$mean_weight ))
Finally, we can draw the random values:
##
## Call:
## lm(formula = weight ~ fertilizer + water, data = pumpkins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.65344 -0.56985 -0.01622 0.66858 1.59462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4415 0.1507 2.930 0.00408 **
## fertilizermedium 1.0873 0.1845 5.893 3.8e-08 ***
## fertilizerhigh 3.5027 0.1845 18.984 < 2e-16 ***
## waterwater 4.1258 0.1507 27.386 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8252 on 116 degrees of freedom
## Multiple R-squared: 0.9067, Adjusted R-squared: 0.9043
## F-statistic: 375.9 on 3 and 116 DF, p-value: < 2.2e-16
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 257.1 128.6 188.8 <2e-16 ***
## water 1 510.7 510.7 750.0 <2e-16 ***
## Residuals 116 79.0 0.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Or equivalently,
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 257.14 128.57 188.83 < 2.2e-16 ***
## water 1 510.66 510.66 749.99 < 2.2e-16 ***
## Residuals 116 78.98 0.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: this table assumes exactly \(n\) observations in every cell.
John Tukey has a method for that.
> ?TukeyHSD
TukeyHSD package:stats R Documentation
Compute Tukey Honest Significant Differences
Description:
Create a set of confidence intervals on the differences between
the means of the levels of a factor with the specified family-wise
probability of coverage. The intervals are based on the
Studentized range statistic, Tukey's ‘Honest Significant
Difference’ method.
...
When comparing the means for the levels of a factor in an analysis
of variance, a simple comparison using t-tests will inflate the
probability of declaring a significant difference when it is not
in fact present. This because the intervals are calculated with a
given coverage probability for each interval but the
interpretation of the coverage is usually with respect to the
entire family of intervals.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight ~ fertilizer + water, data = pumpkins)
##
## $fertilizer
## diff lwr upr p adj
## medium-low 1.087309 0.6492477 1.525371 1e-07
## high-low 3.502706 3.0646438 3.940767 0e+00
## high-medium 2.415396 1.9773344 2.853458 0e+00
##
## $water
## diff lwr upr p adj
## water-no water 4.125756 3.827369 4.424143 0
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 257.1 128.6 575.4 <2e-16 ***
## water 1 510.7 510.7 2285.3 <2e-16 ***
## fertilizer:water 2 53.5 26.8 119.7 <2e-16 ***
## Residuals 114 25.5 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = weight ~ fertilizer * water, data = pumpkins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.23696 -0.26951 0.00257 0.32312 0.85524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9672 0.1057 9.151 2.64e-15 ***
## fertilizermedium 0.9780 0.1495 6.543 1.79e-09 ***
## fertilizerhigh 2.0347 0.1495 13.611 < 2e-16 ***
## waterwater 3.0742 0.1495 20.566 < 2e-16 ***
## fertilizermedium:waterwater 0.2186 0.2114 1.034 0.303
## fertilizerhigh:waterwater 2.9360 0.2114 13.888 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4727 on 114 degrees of freedom
## Multiple R-squared: 0.9699, Adjusted R-squared: 0.9686
## F-statistic: 735.1 on 5 and 114 DF, p-value: < 2.2e-16
Or equivalently,
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 257.14 128.57 575.37 < 2.2e-16 ***
## water 1 510.66 510.66 2285.29 < 2.2e-16 ***
## fertilizer:water 2 53.51 26.75 119.73 < 2.2e-16 ***
## Residuals 114 25.47 0.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Me: Hey, I made our model more complicated, and look, it fits better!
You: Yeah, of course it does. How much better?
Me: How can we tell?
You: Well, does it reduce the residual variance more than you’d expect by chance?
To compare two models, \[\begin{aligned} F &= \frac{\text{(explained variance)}}{\text{(residual variance)}} \\ &= \frac{\text{(mean square model)}}{\text{(mean square residual)}} \\ &= \frac{\frac{\text{RSS}_1 - \text{RSS}_2}{p_2-p_1}}{\frac{\text{RSS}_2}{n-p_2}} \end{aligned}\]
anova(
lm(weight ~ water, data=pumpkins),
lm(weight ~ fertilizer + water, data=pumpkins),
lm(weight ~ fertilizer * water, data=pumpkins)
)
## Analysis of Variance Table
##
## Model 1: weight ~ water
## Model 2: weight ~ fertilizer + water
## Model 3: weight ~ fertilizer * water
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 118 336.12
## 2 116 78.98 2 257.138 575.37 < 2.2e-16 ***
## 3 114 25.47 2 53.509 119.73 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Do a stepwise model comparison of nested linear models, including plant
and plot
in the analysis. Think about what order to do the comparison in. Make sure they are nested!
Data: data/pumpkins.tsv
Recall that plot
is nested within the treatments, so we should get one “plot” effect for each plot level in each combination of treatments. (Doing weight ~ water * fertilizer + plot
is wrong - why?) First we check we get the right thing:
pumpkins$plot <- factor(pumpkins$plot)
summary(
lm(weight ~ water * fertilizer / plot, data=pumpkins)
)
##
## Call:
## lm(formula = weight ~ water * fertilizer/plot, data = pumpkins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25797 -0.25966 0.02709 0.29666 0.89940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.138302 0.207391 5.489 3.30e-07 ***
## waterwater 3.242620 0.293294 11.056 < 2e-16 ***
## fertilizermedium 1.002982 0.293294 3.420 0.000922 ***
## fertilizerhigh 2.103538 0.293294 7.172 1.53e-10 ***
## waterwater:fertilizermedium -0.384763 0.414781 -0.928 0.355926
## waterwater:fertilizerhigh 2.564374 0.414781 6.182 1.53e-08 ***
## waterno water:fertilizerlow:plot2 -0.427292 0.293294 -1.457 0.148415
## waterwater:fertilizerlow:plot2 -0.294556 0.293294 -1.004 0.317758
## waterno water:fertilizermedium:plot2 -0.157379 0.293294 -0.537 0.592792
## waterwater:fertilizermedium:plot2 0.359624 0.293294 1.226 0.223140
## waterno water:fertilizerhigh:plot2 -0.492073 0.293294 -1.678 0.096650 .
## waterwater:fertilizerhigh:plot2 0.127707 0.293294 0.435 0.664234
## waterno water:fertilizerlow:plot3 -0.247479 0.293294 -0.844 0.400885
## waterwater:fertilizerlow:plot3 -0.246562 0.293294 -0.841 0.402625
## waterno water:fertilizermedium:plot3 -0.469528 0.293294 -1.601 0.112690
## waterwater:fertilizermedium:plot3 0.458365 0.293294 1.563 0.121386
## waterno water:fertilizerhigh:plot3 -0.377854 0.293294 -1.288 0.200734
## waterwater:fertilizerhigh:plot3 -0.019712 0.293294 -0.067 0.946555
## waterno water:fertilizerlow:plot4 -0.009474 0.293294 -0.032 0.974299
## waterwater:fertilizerlow:plot4 -0.816747 0.293294 -2.785 0.006454 **
## waterno water:fertilizermedium:plot4 -0.157182 0.293294 -0.536 0.593253
## waterwater:fertilizermedium:plot4 0.137664 0.293294 0.469 0.639867
## waterno water:fertilizerhigh:plot4 -0.089737 0.293294 -0.306 0.760296
## waterwater:fertilizerhigh:plot4 -0.254597 0.293294 -0.868 0.387526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4637 on 96 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9698
## F-statistic: 167 on 23 and 96 DF, p-value: < 2.2e-16
And, the ANOVA:
anova(
lm(weight ~ water * fertilizer, data=pumpkins),
lm(weight ~ water * fertilizer / plot, data=pumpkins)
)
## Analysis of Variance Table
##
## Model 1: weight ~ water * fertilizer
## Model 2: weight ~ water * fertilizer/plot
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 114 25.474
## 2 96 20.645 18 4.8285 1.2474 0.2409
Also: what are the lines when you call anova( )
on just one linear model? Look for the “Sum of Sq(uares)” entries that agree (F-statistics will not because the denominator differs):
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 257.14 128.57 575.37 < 2.2e-16 ***
## water 1 510.66 510.66 2285.29 < 2.2e-16 ***
## fertilizer:water 2 53.51 26.75 119.73 < 2.2e-16 ***
## Residuals 114 25.47 0.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
First line is comparing adding fertilizer to not:
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 257.14 128.57 575.37 < 2.2e-16 ***
## Analysis of Variance Table
##
## Model 1: weight ~ water
## Model 2: weight ~ water + fertilizer
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 118 336.12
## 2 116 78.98 2 257.14 188.83 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Second line is comparing adding water to not:
## Df Sum Sq Mean Sq F value Pr(>F)
## water 1 510.66 510.66 2285.29 < 2.2e-16 ***
## Analysis of Variance Table
##
## Model 1: weight ~ fertilizer
## Model 2: weight ~ fertilizer + water
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 117 589.64
## 2 116 78.98 1 510.66 749.99 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Third line is comparison with and without interactions:
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer:water 2 53.51 26.75 119.73 < 2.2e-16 ***
anova(
lm(weight ~ fertilizer + water, data=pumpkins),
lm(weight ~ fertilizer + water + fertilizer: water, data=pumpkins)
)
## Analysis of Variance Table
##
## Model 1: weight ~ fertilizer + water
## Model 2: weight ~ fertilizer + water + fertilizer:water
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 116 78.983
## 2 114 25.474 2 53.509 119.73 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1