Model comparison with ANOVA

Peter Ralph

27 October – Advanced Biological Statistics

Model comparison with ANOVA

Recall the pumpkins

pumpkins <- read.table("data/pumpkins.tsv", header=TRUE)
pumpkins$plot <- factor(pumpkins$plot)
head(pumpkins)

##   fertilizer    water plot plant    weight mean_weight
## 1        low no water    1     1  8.231855           8
## 2     medium no water    1     1 13.478381          14
## 3       high no water    1     1 21.095921          20
## 4        low    water    1     1  8.152066           6
## 5     medium    water    1     1 17.195926          16
## 6       high    water    1     1 21.328989          20

Questions:

How does mean weight depend on fertilizer?
… or, on late-season water?
Does the effect of fertilizer depend on late-season water?
How much does mean weight differ between different plants in the same conditions?
… and, between plots of the same conditions?

Additive model

summary(lm(weight ~ fertilizer + water, data=pumpkins))

## 
## Call:
## lm(formula = weight ~ fertilizer + water, data = pumpkins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5293 -1.0821  0.1251  1.0006  3.0004 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       19.8660     0.2682  74.084   <2e-16 ***
## fertilizerlow    -13.0065     0.3284 -39.603   <2e-16 ***
## fertilizermedium  -4.7970     0.3284 -14.606   <2e-16 ***
## waterwater         0.3018     0.2682   1.126    0.263    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.469 on 116 degrees of freedom
## Multiple R-squared:  0.9326, Adjusted R-squared:  0.9309 
## F-statistic: 535.2 on 3 and 116 DF,  p-value: < 2.2e-16

Does the effect of fertilizer depend on late-season water?

… and, with interactions

summary(lm(weight ~ fertilizer * water, data=pumpkins))

## 
## Call:
## lm(formula = weight ~ fertilizer * water, data = pumpkins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.96871 -0.64682  0.00617  0.77549  2.05257 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  20.00462    0.25368  78.857  < 2e-16 ***
## fertilizerlow               -12.08324    0.35876 -33.680  < 2e-16 ***
## fertilizermedium             -6.13599    0.35876 -17.103  < 2e-16 ***
## waterwater                    0.02462    0.35876   0.069 0.945402    
## fertilizerlow:waterwater     -1.84651    0.50736  -3.639 0.000412 ***
## fertilizermedium:waterwater   2.67808    0.50736   5.278  6.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.135 on 114 degrees of freedom
## Multiple R-squared:  0.9605, Adjusted R-squared:  0.9588 
## F-statistic: 554.3 on 5 and 114 DF,  p-value: < 2.2e-16

The idea

Me: Hey, I made our model more complicated, and look, it fits better!

You: Yeah, of course it does. How much better?

Me: How can we tell?

You: Well, does it reduce the residual variance more than you’d expect by chance?

The \(F\) statistic

To compare two models, \[\begin{aligned} F &= \frac{\text{(explained variance)}}{\text{(residual variance)}} \\ &= \frac{\text{(mean square model)}}{\text{(mean square residual)}} \\ &= \frac{\frac{\text{RSS}_1 - \text{RSS}_2}{p_2-p_1}}{\frac{\text{RSS}_2}{n-p_2}} \end{aligned}\]

Nested model analysis

anova(
      lm(weight ~ water, data=pumpkins),
      lm(weight ~ fertilizer + water, data=pumpkins),
      lm(weight ~ fertilizer * water, data=pumpkins)
)

## Analysis of Variance Table
## 
## Model 1: weight ~ water
## Model 2: weight ~ fertilizer + water
## Model 3: weight ~ fertilizer * water
##   Res.Df    RSS Df Sum of Sq        F    Pr(>F)    
## 1    118 3711.3                                    
## 2    116  250.2  2    3461.0 1344.507 < 2.2e-16 ***
## 3    114  146.7  2     103.5   40.212 6.095e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Your turn

Do a stepwise model comparison of nested linear models, including plot in the analysis. Think about what order to do the comparison in. Make sure they are nested!

Data: data/pumpkins.tsv