
Multivariate ANOVA

Peter Ralph

Advanced Biological Statistics

Outline

Linear models…

  1. with categorical variables (multivariate ANOVA)
  2. with continuous variables (least-squares linear models)
  3. and likelihood (where does “least-squares” come from?).

Multivariate ANOVA

The factorial ANOVA model

Say we have \(n\) observations coming from combinations of two factors, so that the \(k\)th observation in the \(i\)th group of factor \(A\) and the \(j\)th group of factor \(B\) is \[\begin{equation} X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk} , \end{equation}\] where

  • \(\mu\): overall mean
  • \(\alpha_i\): mean deviation of group \(i\) of factor A from \(\mu\), averaged over the levels of B,
  • \(\beta_j\): mean deviation of group \(j\) of factor B from \(\mu\), averaged over the levels of A,
  • \(\gamma_{ij}\): mean deviation of the combination \((i, j)\) from \(\mu + \alpha_i + \beta_j\), and
  • \(\epsilon_{ijk}\): what’s left over (“error”, or “residual”)

In words, \[\begin{equation} \begin{split} \text{(value)} &= \text{(overall mean)} + \text{(A group mean)} \\ &\qquad {} + \text{(B group mean)} + \text{(AB group mean)} + \text{(residual)} \end{split}\end{equation}\]
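
In R’s formula notation this is the model you get from a formula like X ~ A * B. A quick preview, using hypothetical column names X, A, and B in a data frame df:

lm(X ~ A * B, data=df)   # A * B expands to A + B + A:B, i.e. both main effects
                         # plus their interaction: mu + alpha + beta + gamma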

Example: pumpkin pie

We’re looking at how mean pumpkin weight depends on both

  • fertilizer input, and
  • late-season watering

So, we

  1. divide a large field into many plots
  2. randomly assign plots to either “high”, “medium”, or “low” fertilizer, and
  3. independently, assign plots to either “no late water” or “late water”; then
  4. plant a fixed number of plants per plot,
  5. grow pumpkins and measure their weight.

Questions:

  1. How does mean weight depend on fertilizer?
  2. … or, on late-season water?
  3. Does the effect of fertilizer depend on late-season water?
  4. How much does mean weight differ between different plants in the same conditions?
  5. … and, between plots of the same conditions?
  6. How much does weight of different pumpkins on the same plant differ?

draw the pictures

First, a simplification

\(\Rightarrow\) Ignore any “plant” and “plot” effects.

(e.g., only one pumpkin per vine and one plot per combination of conditions)

Say that \(i=1, 2, 3\) indexes fertilizer levels (low to high), and \(j=1, 2\) indexes late watering (no or yes), and \[\begin{equation}\begin{split} X_{ijk} &= \text{(weight of $k$th pumpkin in plot with conditions $i$, $j$)} \\ &= \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk} , \end{split}\end{equation}\] where

  • \(\mu\):
  • \(\alpha_i\):
  • \(\beta_j\):
  • \(\gamma_{ij}\):
  • \(\epsilon_{ijk}\):

Making it real with simulation

A good way to get a concrete understanding of something is to simulate it –

that is, to write code that generates a (random) dataset designed to look, more or less, like what you expect the real data to look like.

This lets you explore statistical power, choose sample sizes, etcetera… but it also makes you notice things you hadn’t thought of before.

First, make up some numbers

  • \(\mu = 5\) kg
  • \(\alpha_\text{low} = 0\) kg
  • \(\alpha_\text{med} = 0.5\) kg
  • \(\alpha_\text{high} = 1.0\) kg
  • \(\beta_\text{no water} = 0\) kg
  • \(\beta_\text{water} = 1\) kg
  • \(\gamma_{\text{high, water}} = -1\) kg
  • \(\epsilon_{ijk} \sim\) Normal\((\text{mean}=0, \text{sd}=1.5)\) kg

(picture of the board with parameter values on it)

Next, a data format

pumpkins <- expand.grid(
          fertilizer=c("low", "medium", "high"),
          water=c("no water", "water"),
          plot=1:4,
          plant=1:5,
          weight=NA)
head(pumpkins)
##   fertilizer    water plot plant weight
## 1        low no water    1     1     NA
## 2     medium no water    1     1     NA
## 3       high no water    1     1     NA
## 4        low    water    1     1     NA
## 5     medium    water    1     1     NA
## 6       high    water    1     1     NA
# true parameters
params <- list(
    mu = 5, # kg
    alpha = c("low" = 0,
              "medium" = +0.5,
              "high" = +1
    ),
    beta = c("no water" = 0,
             "water" = +1
    ),
    gamma = c("high,water" = -1),
    sigma = 1.5
)

In class

pumpkins$mean_weight <- NA
for (j in 1:nrow(pumpkins)) {
    f <- as.character(pumpkins$fertilizer[j])
    w <- as.character(pumpkins$water[j])
    fw <- paste(f, w, sep=",")
    if (fw %in% names(params$gamma)) {
        gamma <- params$gamma[fw]
    } else {
        gamma <- 0
    }
    pumpkins$mean_weight[j] <- (
        params$mu
        + params$alpha[f]
        + params$beta[w]
        + gamma
    )
}

Or, equivalently,

inter_name <- paste(pumpkins$fertilizer, pumpkins$water, sep=",")
gamma <- ifelse(
        inter_name %in% names(params$gamma),
        params$gamma[inter_name],
        0
)
pumpkins$mean_weight <- (
    params$mu
    + params$alpha[as.character(pumpkins$fertilizer)]
    + params$beta[as.character(pumpkins$water)]
    + gamma
)
pumpkins$weight <- rnorm(
    nrow(pumpkins),
    mean=pumpkins$mean_weight,
    sd=params$sigma
)
library(ggplot2)  # for ggplot() below
ggplot(pumpkins) + geom_boxplot(aes(y=weight, fill=water)) +
    labs(x="", y="weight (kg)", title="pumpkin weights") +
    facet_wrap(~ fertilizer)

(plot: boxplots of simulated pumpkin weights, colored by watering, faceted by fertilizer)

Having drawn the random values above, we save the simulated data and take another look:

write.table(pumpkins, file="data/pumpkins.tsv", sep="\t", row.names=FALSE)
    
ggplot(pumpkins) + geom_boxplot(aes(x=fertilizer, fill=water, y=weight))

(plot: boxplots of weight by fertilizer, colored by watering)

Questions that (linear models and) ANOVA can answer

What are (estimates of) the coefficients?

summary(lm(weight ~ fertilizer + water, data=pumpkins))
## 
## Call:
## lm(formula = weight ~ fertilizer + water, data = pumpkins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1705 -0.8935 -0.0265  1.1688  2.6672 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.4992     0.2663  20.649  < 2e-16 ***
## fertilizerlow     -0.5081     0.3262  -1.558  0.12199    
## fertilizermedium   0.2538     0.3262   0.778  0.43805    
## waterwater         1.0439     0.2663   3.920  0.00015 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.459 on 116 degrees of freedom
## Multiple R-squared:  0.1534, Adjusted R-squared:  0.1315 
## F-statistic: 7.009 on 3 and 116 DF,  p-value: 0.0002254
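
To connect those coefficients back to group means, one option (a sketch, reusing the pumpkins data from above) is to ask the fitted model for its predicted mean at each combination of factor levels:

fit <- lm(weight ~ fertilizer + water, data=pumpkins)
pred <- expand.grid(
    fertilizer=c("low", "medium", "high"),
    water=c("no water", "water")
)
pred$predicted_mean <- predict(fit, newdata=pred)
pred   # predicted mean weight (kg) for each fertilizer x water combination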

Do different fertilizer levels differ? water?

summary(aov(weight ~ fertilizer + water, data=pumpkins))
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## fertilizer    2  12.04    6.02    2.83 0.06311 .  
## water         1  32.69   32.69   15.37 0.00015 ***
## Residuals   116 246.81    2.13                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Or equivalently,

anova(lm(weight ~ fertilizer + water, data=pumpkins))
## Analysis of Variance Table
## 
## Response: weight
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## fertilizer   2  12.042   6.021  2.8298 0.0631062 .  
## water        1  32.694  32.694 15.3660 0.0001503 ***
## Residuals  116 246.811   2.128                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What are all those numbers?

Note: this table assumes exactly \(n\) observations in every cell.

Quinn & Keough, table 9.8

What do they mean?

Quinn & Keough, table 9.9

Which levels are different from which other ones?

John Tukey has a method for that.

> ?TukeyHSD

TukeyHSD                 package:stats                 R Documentation

Compute Tukey Honest Significant Differences

Description:

     Create a set of confidence intervals on the differences between
     the means of the levels of a factor with the specified family-wise
     probability of coverage.  The intervals are based on the
     Studentized range statistic, Tukey's ‘Honest Significant
     Difference’ method.

...

     When comparing the means for the levels of a factor in an analysis
     of variance, a simple comparison using t-tests will inflate the
     probability of declaring a significant difference when it is not
     in fact present.  This because the intervals are calculated with a
     given coverage probability for each interval but the
     interpretation of the coverage is usually with respect to the
     entire family of intervals.

Example

TukeyHSD(aov(weight ~ fertilizer + water, data=pumpkins))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ fertilizer + water, data = pumpkins)
## 
## $fertilizer
##                   diff        lwr       upr     p adj
## low-high    -0.5081166 -1.2824907 0.2662575 0.2681118
## medium-high  0.2538116 -0.5205625 1.0281857 0.7171494
## medium-low   0.7619282 -0.0124459 1.5363023 0.0548354
## 
## $water
##                    diff       lwr      upr     p adj
## water-no water 1.043934 0.5164675 1.571401 0.0001503
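
These intervals can also be plotted: the base plot method for TukeyHSD objects draws the family-wise confidence intervals for each factor.

tuk <- TukeyHSD(aov(weight ~ fertilizer + water, data=pumpkins))
plot(tuk)   # 95% family-wise confidence intervals for each pairwise difference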

Does the effect of fertilizer depend on water?

summary(aov(weight ~ fertilizer * water, data=pumpkins))
##                   Df Sum Sq Mean Sq F value Pr(>F)    
## fertilizer         2  12.04    6.02   2.994 0.0540 .  
## water              1  32.69   32.69  16.257 0.0001 ***
## fertilizer:water   2  17.55    8.77   4.363 0.0149 *  
## Residuals        114 229.26    2.01                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(weight ~ fertilizer * water, data=pumpkins))
## 
## Call:
## lm(formula = weight ~ fertilizer * water, data = pumpkins)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7109 -0.8085  0.0077  0.9694  2.5657 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  6.00577    0.31710  18.940  < 2e-16 ***
## fertilizerlow               -1.10405    0.44845  -2.462  0.01532 *  
## fertilizermedium            -0.66999    0.44845  -1.494  0.13794    
## waterwater                   0.03078    0.44845   0.069  0.94540    
## fertilizerlow:waterwater     1.19187    0.63421   1.879  0.06276 .  
## fertilizermedium:waterwater  1.84760    0.63421   2.913  0.00431 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.418 on 114 degrees of freedom
## Multiple R-squared:  0.2136, Adjusted R-squared:  0.1791 
## F-statistic: 6.194 on 5 and 114 DF,  p-value: 4.075e-05

Or equivalently,

anova(lm(weight ~ fertilizer * water, data=pumpkins))
## Analysis of Variance Table
## 
## Response: weight
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## fertilizer         2  12.042   6.021  2.9939 0.0540477 .  
## water              1  32.694  32.694 16.2569 0.0001003 ***
## fertilizer:water   2  17.547   8.774  4.3626 0.0149398 *  
## Residuals        114 229.264   2.011                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model comparison with ANOVA

The idea

Me: Hey, I made our model more complicated, and look, it fits better!

You: Yeah, of course it does. How much better?

Me: How can we tell?

You: Well, does it reduce the residual variance more than you’d expect by chance?

The \(F\) statistic

To compare two models, \[\begin{aligned} F &= \frac{\text{(explained variance)}}{\text{(residual variance)}} \\ &= \frac{\text{(mean square model)}}{\text{(mean square residual)}} \\ &= \frac{\frac{\text{RSS}_1 - \text{RSS}_2}{p_2-p_1}}{\frac{\text{RSS}_2}{n-p_2}} \end{aligned}\]
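
As a concrete check, here is a minimal sketch of computing that \(F\) statistic by hand for two of the pumpkin models above (the additive model as model 1, the model with interaction as model 2); it should match the last row of the nested-model table below.

fit1 <- lm(weight ~ fertilizer + water, data=pumpkins)   # p1 = 4 coefficients
fit2 <- lm(weight ~ fertilizer * water, data=pumpkins)   # p2 = 6 coefficients
rss1 <- sum(resid(fit1)^2)
rss2 <- sum(resid(fit2)^2)
p1 <- length(coef(fit1))
p2 <- length(coef(fit2))
n <- nrow(pumpkins)
F_stat <- ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))
p_value <- pf(F_stat, df1=p2 - p1, df2=n - p2, lower.tail=FALSE)
c(F=F_stat, p=p_value)   # compare to anova(fit1, fit2)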

Nested model analysis

anova(
      lm(weight ~ water, data=pumpkins),
      lm(weight ~ fertilizer + water, data=pumpkins),
      lm(weight ~ fertilizer * water, data=pumpkins)
)
## Analysis of Variance Table
## 
## Model 1: weight ~ water
## Model 2: weight ~ fertilizer + water
## Model 3: weight ~ fertilizer * water
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    118 258.85                              
## 2    116 246.81  2    12.042 2.9939 0.05405 .
## 3    114 229.26  2    17.547 4.3626 0.01494 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Your turn

Do a stepwise model comparison of nested linear models, including plant and plot in the analysis. Think about what order to do the comparison in. Make sure plot is nested within treatment!

Data: pumpkins.tsv
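
A minimal sketch of one way to get started, under a few assumptions: the file is read from the path used in write.table() above, the plot_id variable is something we construct here (plot numbers repeat across treatments, so each physical plot needs its own label), and this is only one possible ordering of the comparisons.

pumpkins <- read.table("data/pumpkins.tsv", header=TRUE, sep="\t")
# give each physical plot a unique label, so plot is nested within treatment
pumpkins$plot_id <- factor(paste(pumpkins$fertilizer, pumpkins$water, pumpkins$plot, sep="."))
anova(
    lm(weight ~ water, data=pumpkins),
    lm(weight ~ fertilizer + water, data=pumpkins),
    lm(weight ~ fertilizer * water, data=pumpkins),
    lm(weight ~ fertilizer * water + plot_id, data=pumpkins)
)
# plot_id determines the treatment combination, so lm() will report some aliased
# (NA) coefficients in the last model; the residual df and F tests are still correct.
# note: with one pumpkin weighed per plant in these data, a separate plant effect
# cannot be distinguished from the residual term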