Advanced Biological Statistics Week 6

10/30/2018 & 11/01/2018

Goals for this week

One factor ANOVA
Git and GitHub
Means tests in ANOVA
Experimental Design
Power analyses
Multi-factor ANOVA

ANOVA

Stands for ANalysis of VAriance
Core statistical procedure in biology
Developed by R.A. Fisher in the early 20th Century
The core idea is to ask how much variation exists within vs. among groups
ANOVAs are linear models that have categorical predictor and continuous response variables
The categorical predictors are often called factors, and can have two or more levels (important to specify in R)
Each factor will have a hypothesis test
The levels of each factor may also need to be tested

ANOVA

Let’s start with an example

Percent time that male mice experiencing discomfort spent “stretching”.
Data are from an experiment in which mice experiencing mild discomfort (the result of an injection of 0.9% acetic acid into the abdomen) were kept:
- in isolation
- with a companion mouse that was not injected, or
- with a companion mouse that was also injected and exhibiting “stretching” behaviors associated with discomfort
The results suggest that mice stretch the most when a companion mouse is also experiencing mild discomfort. Mice experiencing pain appear to “empathize” with co-housed mice also in pain.

From Langford, D. J., et al. 2006. Science 312: 1967-1970

ANOVA

Let’s start with an example

In words:

stretching = intercept + treatment

- The model statement includes a response variable, a constant, and an explanatory variable.
- The only difference from regression is that here the explanatory variable is categorical.
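
As a rough illustration of the model statement above, the sketch below fits the same kind of model to simulated data. The numbers are made up and are not the Langford et al. measurements; the names `treatment`, `stretching`, and the group labels are assumptions chosen only for this example.

```r
# Simulated data (NOT the published measurements): three housing treatments
set.seed(1)
trt_levels <- c("isolated", "companion_uninjected", "companion_injected")
treatment  <- factor(rep(trt_levels, each = 10), levels = trt_levels)
stretching <- rnorm(30, mean = rep(c(20, 25, 40), each = 10), sd = 8)

# Same fitting function as regression; the predictor is simply a factor
fit <- lm(stretching ~ treatment)

# The design matrix shows how R codes the factor: an intercept column plus
# one 0/1 indicator column for each non-reference level
head(model.matrix(fit), 3)

# With the default coding, the intercept is the mean of the reference level
# and the other coefficients are differences from that mean
coef(fit)

# ANOVA table for the treatment factor
anova(fit)
```

With R's default treatment coding, the "constant" in the model statement is the mean of the first factor level, which is why the order of the levels matters.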

ANOVA

Conceptually similar to regression

ANOVA

Statistical results table

ANOVA

F-ratio calculation
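
A minimal sketch of the calculation on simulated data, assuming the usual one-factor layout: the F-ratio is the mean square among groups divided by the mean square within groups. The data and the names `group`, `y`, and `f_ratio` are made up for illustration.

```r
# Simulated one-factor data: 3 groups, 8 observations each
set.seed(2)
group <- factor(rep(c("a", "b", "c"), each = 8))
y <- rnorm(24, mean = rep(c(10, 12, 15), each = 8), sd = 2)

grand_mean  <- mean(y)
group_means <- tapply(y, group, mean)
n_per_group <- tapply(y, group, length)

# Among-group sum of squares: group means around the grand mean
ss_among <- sum(n_per_group * (group_means - grand_mean)^2)
# Within-group sum of squares: observations around their own group mean
ss_within <- sum((y - ave(y, group))^2)

# Degrees of freedom
df_among  <- nlevels(group) - 1
df_within <- length(y) - nlevels(group)

# Mean squares and their ratio
ms_among  <- ss_among / df_among
ms_within <- ss_within / df_within
f_ratio   <- ms_among / ms_within

# Upper-tail probability from the F distribution
p_value <- pf(f_ratio, df_among, df_within, lower.tail = FALSE)

# Compare with the table that aov() produces
tab <- summary(aov(y ~ group))[[1]]
c(by_hand = f_ratio, from_aov = tab[1, "F value"])
```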

R INTERLUDE

One way ANOVA

Use the RNAseq_lip.tsv data again.
Let’s test for an effect of Population on Gene01 expression levels
First, let’s look at how the data are distributed

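# Read the tab-delimited data; make Population a factor so R treats it as categorical
RNAseq_Data <- read.table('RNAseq_lip.tsv', header=TRUE, sep='\t')
g1 <- RNAseq_Data$Gene01
Pop <- factor(RNAseq_Data$Population)
# Box plot of Gene01 expression for each population
boxplot(g1~Pop, col=c("blue","green"))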

Or, to plot all points:

stripchart(g1~Pop, vertical=T, pch=19, col=c("blue","green"),
           at=c(1.25,1.75), method="jitter", jitter=0.05)

Now fit the one-way ANOVA and look at the results table

Pop_Anova <- aov(g1 ~ Pop)
summary(Pop_Anova)

ANOVA

One or more predictor variables

One-way ANOVAs just have a single factor
Multi-factor ANOVAs
- Factorial - two or more factors and their interactions
- Nested - the levels of one factor are contained within the levels of another factor
- The models can be quite complex
ANOVAs use an F-statistic to test factors in a model
- Ratio of two variances (numerator and denominator)
- The numerator and denominator d.f. need to be included (e.g. \(F_{1, 34} = 29.43\))
Determining the appropriate test ratios for complex ANOVAs takes some work
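
As a small sketch of a factorial design and of reporting F with both degrees of freedom, the example below uses simulated data. The factor names `genotype` and `diet` and all the values are hypothetical.

```r
# Simulated 2 x 2 factorial design with 6 replicates per cell
set.seed(3)
d <- expand.grid(rep = 1:6,
                 genotype = c("wt", "mut"),
                 diet = c("low", "high"))
d$genotype <- factor(d$genotype)
d$diet     <- factor(d$diet)
d$y <- rnorm(nrow(d),
             mean = 10 + 2 * (d$genotype == "mut") + 1 * (d$diet == "high"),
             sd = 1.5)

# genotype * diet expands to both main effects plus their interaction
fit <- aov(y ~ genotype * diet, data = d)
tab <- summary(fit)[[1]]
tab

# Report each F with its numerator and denominator d.f.
resid_df <- tab$Df[nrow(tab)]
for (i in seq_len(nrow(tab) - 1)) {
  cat(sprintf("%s: F(%d, %d) = %.2f, p = %.3g\n",
              trimws(rownames(tab)[i]),
              as.integer(tab$Df[i]), as.integer(resid_df),
              tab[i, "F value"], tab[i, "Pr(>F)"]))
}
```

Here every factor is treated as fixed and tested against the residual mean square. Nested designs or designs with random factors may need a different denominator.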

ANOVA

Assumptions

Normally distributed groups
- robust to non-normality if equal variances and sample sizes
Equal variances across groups
- okay if largest-to-smallest variance ratio < 3:1
- problematic if there is a mean-variance relationship among groups
Observations in a group are independent
- randomly selected
- don’t confound group with another factor
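
A few quick checks of these assumptions might look like the sketch below. It uses simulated data, and applying the formal tests to the residuals is one common convention rather than the only option. All names and values are made up.

```r
# Simulated data: 3 groups with somewhat different spreads
set.seed(4)
group <- factor(rep(c("a", "b", "c"), each = 12))
y <- rnorm(36, mean = rep(c(5, 6, 8), each = 12),
               sd   = rep(c(1, 1.2, 1.5), each = 12))

fit <- aov(y ~ group)

# Equal variances: ratio of largest to smallest group variance
group_vars <- tapply(y, group, var)
var_ratio  <- max(group_vars) / min(group_vars)
var_ratio

# Mean-variance relationship: do the variances climb with the means?
cbind(mean = tapply(y, group, mean), variance = group_vars)

# Normality of the residuals (Shapiro-Wilk test)
shapiro.test(residuals(fit))

# A formal test of equal variances (Bartlett's test)
bartlett.test(y ~ group)
```

Calling `plot(fit)` gives the graphical versions of these checks (residuals vs. fitted values and a normal Q-Q plot), which are often more informative than the tests alone.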

Different ways to include factors in models

ANOVA

Fixed effects of factors

Groups are predetermined, of direct interest, repeatable.
For example:
- medical treatments in a clinical trial
- predetermined doses of a toxin
- age groups in a population
- habitat, season, etc.
Any conclusions reached in the study about differences among groups can be applied only to the groups included in the study.
The results cannot be generalized to other treatments, habitats, etc. not included in the study.

ANOVA

Random effects of factors

Measurements that come in groups. A group can be:
- a family made up of siblings
- a subject measured repeatedly
- a transect of quadrats in a sampling survey
- a block of an experiment done at a given time
Groups are assumed to be randomly sampled from a population of groups.
Therefore, conclusions reached about groups can be generalized to the population of groups.
With random effects, the variance among groups is the main quantity of interest, not the specific group attributes.

ANOVA

Random effects of factors

Below are cases where you are likely to treat factors as random effects:
Whenever your sampling design is nested
- quadrats within transects
- transects within woodlots
- woodlots within districts
Whenever you divide up plots and apply separate treatments to subplots
Whenever your replicates are grouped spatially or temporally
- in blocks
- in batches
Whenever you take measurements on related individuals
Whenever you measure subjects or other sampling units repeatedly

ANOVA

Random effects - test your understanding

Factor is sex (Male vs. Female)
Factor is fish tank (10 tanks in an experiment)
Factor is family (measure multiple sibs per family)
Factor is temperature (10 arbitrary temps over natural range)

ANOVA

Caution about fixed vs. random effects

Using fixed vs. random effects changes the way that statistical tests are performed in ANOVA
Most statistical packages assume that all factors are fixed unless you instruct them otherwise
Designating factors as random takes extra work and probably a read of the manual
In R, lm assumes that all effects are fixed
For random effects, use lme instead (part of the nlme package)
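
A minimal sketch of a random-effects fit with `lme`, using simulated sibling data. The names `family`, `trait`, and `sibs` and all values are assumptions made up for this example.

```r
library(nlme)  # nlme is distributed with R as a recommended package

# Simulated data: 10 families, 5 siblings measured per family
set.seed(5)
n_fam <- 10
n_sib <- 5
family <- factor(rep(seq_len(n_fam), each = n_sib))
family_effect <- rnorm(n_fam, sd = 2)   # among-family variation
trait <- 50 + family_effect[as.integer(family)] +
         rnorm(n_fam * n_sib, sd = 1)   # within-family variation
sibs <- data.frame(family = family, trait = trait)

# Overall mean as the only fixed effect; a random intercept for each family
fit <- lme(trait ~ 1, random = ~ 1 | family, data = sibs)

# Variance components: among families (Intercept) and within families (Residual)
VarCorr(fit)
```

The output focuses on the variance among families rather than on the mean of any particular family, which matches the point that the among-group variance is the main quantity of interest for a random effect.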

Git and GitHub

https://learngitbranching.js.org/

Clone the repository

First make a new directory into which you will clone our course repository
Open the terminal and navigate to the directory and type the following

git clone https://github.com/wcresko/UO_ABS.git

Now to update the repository you just need to use these commands

git fetch origin

git status

git merge origin/master

The first command downloads any new commits from GitHub without changing your files
The second tells you whether your local copy is behind
If so, do the third!

Means test to compare levels of a factor

Means tests for more than two factor levels?

The F-ratio test for a single-factor ANOVA tests for any difference among groups.
If we want to understand specific differences, we need further “contrasts”.
Unplanned comparisons (post hoc):
- Multiple comparisons carried out after the results are obtained.
- Used to find where the differences lie (which means differ from which other means)
- Comparisons require protection against inflated Type 1 error rates:
  - Tukey tests: compare all pairs of means and control for multiple comparisons
  - Scheffé contrasts: compare all combinations of means
Planned comparisons (a priori):
- Comparisons between group means that were decided when the experiment was designed (not after the data were in)
- Must be few in number to avoid inflating Type 1 error rates
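
To get a feel for why unplanned comparisons need protection, the rough sketch below counts the pairwise comparisons among four groups and approximates the chance of at least one false positive. It assumes the tests are independent, which is only an approximation for pairwise comparisons. The p-values are made up.

```r
alpha <- 0.05
k <- 4                       # number of groups (hypothetical)

# Number of pairwise comparisons among k means
n_pairs <- choose(k, 2)
n_pairs

# Approximate chance of at least one Type 1 error across all comparisons,
# assuming independent tests
fwer <- 1 - (1 - alpha)^n_pairs
fwer

# Made-up raw p-values from six pairwise tests, and two standard adjustments
p_raw <- c(0.004, 0.012, 0.030, 0.041, 0.200, 0.650)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_holm <- p.adjust(p_raw, method = "holm")
cbind(p_raw, p_bonf, p_holm)
```

With six comparisons the approximate familywise error rate is about 0.26 rather than 0.05, which is the inflation that Tukey and Scheffé procedures are designed to control.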

Planned (a priori) contrasts

A well-planned experiment often dictates which comparisons of means are of most interest, whereas other comparisons are of no interest.
By restricting the comparisons to just the ones of interest, researchers can mitigate the multiple testing problem associated with post-hoc tests.
Some statisticians argue that planned comparisons allow researchers to avoid adjusting p-values altogether, because each test addresses its own distinct, pre-specified hypothesis.
Contrasts can also allow more complicated tests of the relationships among means.
Coding a priori contrasts in R is quite easy and just depends upon writing the right series of coefficient contrasts.

Planned (a priori) contrasts

Understand the coefficients table

R INTERLUDE

Planned contrasts

Take the RNAseq data you’ve examined before and create a new four-level factor by combining genotype and microbiota treatment into a single variable
Think about how to do this using dplyr functions.
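
One possible way to build the combined variable is sketched below in base R so that it runs without extra packages. The column names `Genotype` and `Microbiota`, the new name `Geno_Micro`, and the toy values are assumptions, not the real columns of the RNAseq file.

```r
# Toy stand-in for the RNAseq data (hypothetical column names and values)
toy <- data.frame(
  Genotype   = rep(c("WT", "Mut"), each = 4),
  Microbiota = rep(c("Conv", "GF"), times = 4)
)

# interaction() crosses the two variables into a single factor
toy$Geno_Micro <- interaction(toy$Genotype, toy$Microbiota, sep = "_")

# The combined factor should have four levels, one per combination
levels(toy$Geno_Micro)
table(toy$Geno_Micro)

# A dplyr version could wrap the same call inside mutate(), for example:
# toy <- dplyr::mutate(toy, Geno_Micro = interaction(Genotype, Microbiota, sep = "_"))
```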

RNAseq_Data <- read.table("RNAseq.tsv", header=TRUE, sep='')

# x must be a factor, otherwise contrasts(x) will fail
x <- factor(RNAseq_Data$categorical_var)
y <- RNAseq_Data$continuous_var1
z <- RNAseq_Data$continuous_var2

Set up the a priori contrasts specifically testing one group mean against another
These are just examples - you should figure out the logic of the contrasts

contrasts(x) <- cbind(c(0, 1, 0, -1), c(2, -1, 0, -1), c(-1, -1, 3, -1))

Confirm that the contrasts are orthogonal

round(crossprod(contrasts(x)), 2)
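
To see what the check above is looking for, here is a small sketch on a made-up four-level factor. The levels A to D and the column labels are hypothetical. Each column of the contrast matrix sums to zero, and orthogonal contrasts give a cross-product matrix whose off-diagonal entries are all zero.

```r
# Hypothetical four-level factor
grp <- factor(rep(c("A", "B", "C", "D"), each = 3))

# The same coefficients as above, with labels describing each comparison
cm <- cbind(B_vs_D   = c( 0,  1, 0, -1),   # level B against level D
            A_vs_BD  = c( 2, -1, 0, -1),   # level A against the mean of B and D
            C_vs_ABD = c(-1, -1, 3, -1))   # level C against the mean of A, B and D
contrasts(grp) <- cm

# Each contrast should sum to zero
colSums(contrasts(grp))

# Off-diagonal zeros mean every pair of contrasts is orthogonal
xp <- crossprod(contrasts(grp))
xp
```

Which comparison each column makes depends on the order of the factor levels, so it is worth printing `levels(grp)` before writing the coefficients.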

R INTERLUDE

Planned contrasts

Define the contrast labels

rnaseq_data_list <- list(x = list('xxx vs. xxx' = 1, 'xxx vs. xxx' = 2, 'xxx vs. xxx' = 3))

Then fit the fixed effect model

RNAseq_aov_fixed <- aov(y ~ x)
plot(RNAseq_aov_fixed)
boxplot(y ~ x)
summary(RNAseq_aov_fixed, split = rnaseq_data_list)

R INTERLUDE

Unplanned contrasts

Remember that these are the comparisons you make when you had no hypotheses about differences in means in advance
Read in the perchlorate data from Week 3
Let’s assess the effects of the 4 perchlorate levels on T4
Which perchlorate levels differ in their effect on T4?

perc <- read.table('perchlorate_data.tsv', header=T, sep='\t')

# Treat the perchlorate level as a factor so that the Tukey comparisons below work
x <- factor(perc$Perchlorate_Level)
y <- log10(perc$T4_Hormone_Level)

MyANOVA <- aov(y ~ x)
summary(MyANOVA)
boxplot(y ~ x)

install.packages("multcomp")  # only needs to be run once
library(multcomp)

summary(glht(MyANOVA, linfct = mcp(x = "Tukey")))