ANOVA, and contingency tables

Peter Ralph

21 January 2018 – Advanced Biological Statistics

ANOVA

Metric data from groups

Suppose we have numerical observations coming from \(m\) different groups. We’d like to know how different the means are between groups, compared to variation within groups.

For example: dry leaf mass after 10d of growth of \(k=5\) genotypes in in \(m=4\) different conditions. (data here)

Dry leaf mass after 10d of growth of \(k=5\) genotypes in in \(m=4\) different conditions.

leafs <- read.table("leaf_mass.tsv", 
                    header=TRUE)
boxplot(y ~ geno + treat, 
        main='dry leaf mass', 
        xlab='genotype.treatment',
        data=leafs)

plot of chunk plotdata_ra

Each group has a different mean; noise about these is Normal with the same SD. Group means are random deviations from a mean determined additively by genotype and treatment.

Make the model.
What posterior(s) will we look at?

plot of chunk plotdata_ra4

a stan model

anova_model <- stan_model(model_code="
data {
    int N;
    int ngeno;
    int ntreat;
    vector[N] leaf;
    int geno[N];
    int treat[N];
}
parameters {
    real mu;
    vector[ngeno] alpha;
    vector[ntreat] beta;
    matrix[ngeno, ntreat] gamma;
    matrix<lower=0>[ngeno, ntreat] sigma;
    real<lower=0> sigma_gamma;
}
model {
    vector[N] muvec;
    vector[N] sigmavec;
    muvec = mu + alpha[geno] + beta[treat];
    for (k in 1:N) {
        muvec[k] += sigma_gamma * gamma[geno[k], treat[k]];
        sigmavec[k] = sigma[geno[k], treat[k]];
    }
    leaf ~ normal(muvec, sigmavec);
    // DOESN'T WORK: leaf ~ normal(muvec, sigma[geno, treat]);
    mu ~ normal(205, 10);
    alpha ~ normal(0, 10);
    beta ~ normal(0, 10);
    for (k in 1:ngeno) {
        // using Matt's trick here
        gamma[k] ~ normal(0, 1);
        sigma[k] ~ gamma(20, 10);
    }
    sigma_gamma ~ gamma(0.5, 0.5);
}
")

fit <- sampling(anova_model,
            data=list(N=nrow(leafs),
                      ngeno=ngeno,
                      ntreat=ntreat,
                      leaf=leafs$y,
                      geno=as.numeric(leafs$geno),
                      treat=as.numeric(leafs$treat)))

print(fit)

## Inference for Stan model: 585d0bae0af15bb83752f81372fca2b7.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##                mean se_mean   sd    2.5%     25%     50%     75%   97.5%
## mu           205.53    0.16 5.58  194.69  201.60  205.64  209.30  216.46
## alpha[1]      -2.15    0.12 4.19  -10.27   -4.93   -2.26    0.65    6.32
## alpha[2]      -5.46    0.12 4.19  -13.58   -8.27   -5.57   -2.69    2.91
## alpha[3]       1.26    0.12 4.19   -6.88   -1.55    1.15    4.05    9.70
## alpha[4]       5.58    0.12 4.19   -2.54    2.73    5.47    8.33   13.96
## alpha[5]       1.39    0.12 4.18   -6.77   -1.44    1.31    4.11    9.76
## beta[1]        3.29    0.13 4.56   -5.50    0.11    3.32    6.34   12.14
## beta[2]       -3.53    0.13 4.56  -12.28   -6.67   -3.51   -0.43    5.37
## beta[3]        3.10    0.13 4.56   -5.66   -0.03    3.15    6.20   11.89
## beta[4]       -1.75    0.13 4.56  -10.53   -4.91   -1.74    1.30    7.08
## gamma[1,1]     0.03    0.01 0.98   -1.89   -0.65    0.04    0.69    1.97
## gamma[1,2]     0.18    0.01 1.03   -1.85   -0.54    0.18    0.89    2.18
## gamma[1,3]    -0.13    0.01 1.00   -2.05   -0.81   -0.15    0.53    1.88
## gamma[1,4]    -0.03    0.01 0.98   -1.99   -0.69   -0.02    0.66    1.90
## gamma[2,1]    -0.32    0.01 1.01   -2.26   -1.01   -0.33    0.37    1.65
## gamma[2,2]    -0.06    0.01 0.97   -1.95   -0.72   -0.08    0.60    1.86
## gamma[2,3]     0.23    0.01 0.97   -1.70   -0.42    0.22    0.88    2.10
## gamma[2,4]     0.13    0.01 0.96   -1.74   -0.54    0.14    0.78    1.99
## gamma[3,1]     0.08    0.01 0.97   -1.83   -0.57    0.07    0.76    1.97
## gamma[3,2]     0.07    0.02 0.99   -1.94   -0.59    0.06    0.73    2.07
## gamma[3,3]    -0.08    0.01 0.97   -1.91   -0.72   -0.11    0.56    1.85
## gamma[3,4]    -0.06    0.01 0.94   -1.91   -0.69   -0.05    0.57    1.76
## gamma[4,1]     0.11    0.01 0.98   -1.84   -0.53    0.11    0.78    2.01
## gamma[4,2]    -0.14    0.01 0.98   -2.07   -0.77   -0.14    0.52    1.77
## gamma[4,3]    -0.08    0.01 0.96   -1.96   -0.73   -0.08    0.56    1.88
## gamma[4,4]     0.10    0.01 0.94   -1.75   -0.55    0.10    0.76    1.90
## gamma[5,1]     0.15    0.01 0.99   -1.82   -0.49    0.15    0.80    2.14
## gamma[5,2]    -0.12    0.01 0.98   -2.02   -0.81   -0.13    0.54    1.82
## gamma[5,3]     0.07    0.01 0.94   -1.78   -0.56    0.07    0.72    1.90
## gamma[5,4]    -0.12    0.02 1.02   -2.10   -0.83   -0.13    0.57    1.87
## sigma[1,1]     2.11    0.00 0.33    1.54    1.88    2.08    2.31    2.85
## sigma[1,2]     1.93    0.01 0.35    1.34    1.68    1.91    2.14    2.70
## sigma[1,3]     2.06    0.00 0.34    1.48    1.81    2.03    2.27    2.80
## sigma[1,4]     2.07    0.00 0.31    1.53    1.85    2.04    2.25    2.78
## sigma[2,1]     1.78    0.00 0.32    1.25    1.55    1.75    1.98    2.50
## sigma[2,2]     1.74    0.01 0.35    1.16    1.49    1.71    1.95    2.54
## sigma[2,3]     2.09    0.00 0.30    1.57    1.87    2.06    2.27    2.75
## sigma[2,4]     2.09    0.00 0.33    1.52    1.86    2.07    2.29    2.81
## sigma[3,1]     1.54    0.00 0.36    0.95    1.29    1.51    1.76    2.35
## sigma[3,2]     1.92    0.01 0.37    1.29    1.65    1.88    2.15    2.71
## sigma[3,3]     2.46    0.00 0.33    1.88    2.23    2.44    2.67    3.17
## sigma[3,4]     2.24    0.00 0.33    1.68    2.00    2.21    2.44    2.96
## sigma[4,1]     2.10    0.01 0.34    1.52    1.86    2.08    2.32    2.86
## sigma[4,2]     2.03    0.00 0.29    1.52    1.83    2.01    2.21    2.66
## sigma[4,3]     1.93    0.00 0.30    1.41    1.72    1.91    2.11    2.60
## sigma[4,4]     2.03    0.01 0.37    1.42    1.77    2.00    2.26    2.85
## sigma[5,1]     2.01    0.00 0.30    1.49    1.79    1.99    2.20    2.63
## sigma[5,2]     2.16    0.00 0.32    1.62    1.92    2.14    2.36    2.85
## sigma[5,3]     2.06    0.00 0.32    1.50    1.83    2.03    2.26    2.76
## sigma[5,4]     2.48    0.00 0.34    1.90    2.24    2.45    2.69    3.21
## sigma_gamma    0.16    0.00 0.17    0.00    0.02    0.10    0.23    0.61
## lp__        -387.87    0.13 5.17 -398.68 -391.22 -387.56 -384.18 -378.50
##             n_eff Rhat
## mu           1196    1
## alpha[1]     1161    1
## alpha[2]     1156    1
## alpha[3]     1149    1
## alpha[4]     1159    1
## alpha[5]     1178    1
## beta[1]      1257    1
## beta[2]      1254    1
## beta[3]      1262    1
## beta[4]      1256    1
## gamma[1,1]   4724    1
## gamma[1,2]   5409    1
## gamma[1,3]   5401    1
## gamma[1,4]   5242    1
## gamma[2,1]   4666    1
## gamma[2,2]   5058    1
## gamma[2,3]   4708    1
## gamma[2,4]   4677    1
## gamma[3,1]   5166    1
## gamma[3,2]   4382    1
## gamma[3,3]   6297    1
## gamma[3,4]   4911    1
## gamma[4,1]   4899    1
## gamma[4,2]   4357    1
## gamma[4,3]   5379    1
## gamma[4,4]   5253    1
## gamma[5,1]   5357    1
## gamma[5,2]   5478    1
## gamma[5,3]   6638    1
## gamma[5,4]   4535    1
## sigma[1,1]   5446    1
## sigma[1,2]   4795    1
## sigma[1,3]   4907    1
## sigma[1,4]   4876    1
## sigma[2,1]   4408    1
## sigma[2,2]   4658    1
## sigma[2,3]   5212    1
## sigma[2,4]   5156    1
## sigma[3,1]   5319    1
## sigma[3,2]   4825    1
## sigma[3,3]   5145    1
## sigma[3,4]   4856    1
## sigma[4,1]   4652    1
## sigma[4,2]   4516    1
## sigma[4,3]   5362    1
## sigma[4,4]   5103    1
## sigma[5,1]   4977    1
## sigma[5,2]   5359    1
## sigma[5,3]   5076    1
## sigma[5,4]   5091    1
## sigma_gamma  2364    1
## lp__         1666    1
## 
## Samples were drawn using NUTS(diag_e) at Thu Jan 24 23:29:11 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

stan_trace(fit, pars=c("mu", "sigma_gamma", "alpha", "beta"))

plot of chunk plot_trace

stan_plot(fit, pars=c("mu", "sigma_gamma", "alpha", "beta"))

## ci_level: 0.8 (80% intervals)

## outer_level: 0.95 (95% intervals)

plot of chunk plot_trace

stan_trace(fit, pars=c("gamma", "sigma"))

plot of chunk plot_trace2

samples <- extract(fit)

# let's get the posterior distribution of
# the group mean for (genotype a, treatment 1)

samples <- extract(fit)

post_a1 <- samples$mu + samples$alpha[,1] + samples$beta[,1] + samples$gamma[,1,1]

post_means <- as.vector(samples$mu) + samples$gamma
for (k in 1:ngeno) {
    post_means[,k,] <- post_means[,k,] + samples$alpha[,k]
}
for (k in 1:ntreat) {
    post_means[,,k] <- post_means[,,k] + samples$beta[,k]
}

dim(post_means) <- c(dim(post_means)[1], prod(dim(post_means)[2:3]))
colnames(post_means) <- outer(levels(leafs$geno), levels(leafs$treat), paste, sep=".")
post_means <- as.data.frame(post_means)

Stochastic minute

The Dirichlet distribution

A collection of \(n\) nonnegative random numbers \((P_1, \ldots, P_n)\) that sums to 1 has a Dirichlet(\(\alpha_1, \ldots, \alpha_n\)) distribution if it has probability density \[ \frac{1}{B(\alpha)} \prod_{i=1}^n p_i^{\alpha_i - 1} . \]

Facts:

If \(U \sim \Beta(\alpha, \beta)\) then \((U, 1-U) \sim \Dirichlet(\alpha, \beta)\).
If \(Y_1, \ldots, Y_n\) are independent, Exponential with rates \(\alpha_1, \ldots, \alpha_n\) then \[ Y / \sum_{i=1}^n Y_i \sim \Dirichlet(\alpha_1, \ldots, \alpha_n) . \]
Useful? Yes, if you need a distribution on numbers that sum to one (e.g., class proportions).

Exercise

Simulate from a

\(\Dirichlet(1, 1, 1)\)
\(\Dirichlet(100, 100, 100)\)

and describe the difference. (Hint: plot these in the \((P_1, P_2)\) plane.)

Categorical data

Hair and Eye color

data(HairEyeColor)

HairEyeColor             package:datasets              R Documentation

Hair and Eye Color of Statistics Students

Description:

     Distribution of hair and eye color and sex in 592 statistics
     students.

Usage:

     HairEyeColor
     
Format:

     A 3-dimensional array resulting from cross-tabulating 592
     observations on 3 variables.  The variables and their levels are
     as follows:

       No  Name  Levels                    
        1  Hair  Black, Brown, Red, Blond  
        2  Eye   Brown, Blue, Hazel, Green 
        3  Sex   Male, Female              
      
Details:

     The Hair x Eye table comes from a survey of students at the
     University of Delaware reported by Snee (1974).  The split by
     ‘Sex’ was added by Friendly (1992a) for didactic purposes.

     This data set is useful for illustrating various techniques for
     the analysis of contingency tables, such as the standard
     chi-squared test or, more generally, log-linear modelling, and
     graphical methods such as mosaic plots, sieve diagrams or
     association plots.

Source:

     <URL:
     http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas>

     Snee (1974) gives the two-way table aggregated over ‘Sex’.  The
     ‘Sex’ split of the ‘Brown hair, Brown eye’ cell was changed to
     agree with that used by Friendly (2000).

References:

     Snee, R. D. (1974).  Graphical display of two-way contingency
     tables.  _The American Statistician_, *28*, 9-12.  doi:
     10.2307/2683520 (URL: http://doi.org/10.2307/2683520).

## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

(haireye <- data.frame(hair = dimnames(HairEyeColor)[[1]][slice.index(HairEyeColor, 1)],
                      eye = dimnames(HairEyeColor)[[2]][slice.index(HairEyeColor, 2)],
                      sex = dimnames(HairEyeColor)[[3]][slice.index(HairEyeColor, 3)],
                      number = HairEyeColor))

##     hair   eye    sex number.Hair number.Eye number.Sex number.Freq
## 1  Black Brown   Male       Black      Brown       Male          32
## 2  Brown Brown   Male       Brown      Brown       Male          53
## 3    Red Brown   Male         Red      Brown       Male          10
## 4  Blond Brown   Male       Blond      Brown       Male           3
## 5  Black  Blue   Male       Black       Blue       Male          11
## 6  Brown  Blue   Male       Brown       Blue       Male          50
## 7    Red  Blue   Male         Red       Blue       Male          10
## 8  Blond  Blue   Male       Blond       Blue       Male          30
## 9  Black Hazel   Male       Black      Hazel       Male          10
## 10 Brown Hazel   Male       Brown      Hazel       Male          25
## 11   Red Hazel   Male         Red      Hazel       Male           7
## 12 Blond Hazel   Male       Blond      Hazel       Male           5
## 13 Black Green   Male       Black      Green       Male           3
## 14 Brown Green   Male       Brown      Green       Male          15
## 15   Red Green   Male         Red      Green       Male           7
## 16 Blond Green   Male       Blond      Green       Male           8
## 17 Black Brown Female       Black      Brown     Female          36
## 18 Brown Brown Female       Brown      Brown     Female          66
## 19   Red Brown Female         Red      Brown     Female          16
## 20 Blond Brown Female       Blond      Brown     Female           4
## 21 Black  Blue Female       Black       Blue     Female           9
## 22 Brown  Blue Female       Brown       Blue     Female          34
## 23   Red  Blue Female         Red       Blue     Female           7
## 24 Blond  Blue Female       Blond       Blue     Female          64
## 25 Black Hazel Female       Black      Hazel     Female           5
## 26 Brown Hazel Female       Brown      Hazel     Female          29
## 27   Red Hazel Female         Red      Hazel     Female           7
## 28 Blond Hazel Female       Blond      Hazel     Female           5
## 29 Black Green Female       Black      Green     Female           2
## 30 Brown Green Female       Brown      Green     Female          14
## 31   Red Green Female         Red      Green     Female           7
## 32 Blond Green Female       Blond      Green     Female           8

Questions:

Are hair and eye color independent in this sample?
Do hair and eye color proportions differ by sex?

Independence and multiplicativity

If hair and eye color are independent, then probabilities of combinations are multiplicative: \[\begin{aligned} \P\{\text{black hair and blue eyes}\} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\given\text{black hair}\} \\ \end{aligned}\]

which by independence is \[\begin{aligned} &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\} \end{aligned}\]

Multiplicativity to additivity

A model of independence will have a multiplicative form: \[ p_{ab} = p_a \times p_b . \]

Set \(\lambda = \log(p)\), so that \[ \lambda_{ab} = \lambda_a + \lambda_b . \]

Stochastic facts

Poisson additivity

If we have Poisson-many things of two categories: \[\begin{aligned} A &\sim \Poisson(a) \\ B &\sim \Poisson(b) \end{aligned}\] then the total number of things is also Poisson: \[ A + B \sim \Poisson(a + b) \] and each thing chooses its type independently: \[ A \given A + B \sim \Binom\left(A+B, \frac{a}{a+b}\right) . \]

Back to hair and eye colors

A model

For hair color \(i\) and eye color \(j\), the number of students with that combination is \[ N_{ij} \sim \Poisson(\mu_{ij}) . \]

If hair and eye color are independent of each other and of sex, then \[\begin{aligned} \mu_{ij} &= \exp\left( \alpha_i + \beta_j \right) . \end{aligned}\]

Nonindependence?

\[\begin{aligned} \mu_{ij} &= \exp\left( \alpha_i + \beta_j + \gamma_{ij} \right) ? \end{aligned}\]

Is this identifiable?

Nonindependence.

\[\begin{aligned} \mu_{ij} &= \exp\left( \delta_{ij} \right) \\ \delta_{ij} &\sim \Normal( \alpha_i + \beta_j, \sigma) \end{aligned}\]

Pick appropriate priors.