Design of Experiments, and Analysis of Variance

Peter Ralph

8 October – Advanced Biological Statistics

Outline

Goal for the week

To compare means of something between groups. (Tools will be applicable to other tasks also.)

Using statistics.

When can you do it, and how well? Power/sensitivity, false positive rate.
How can experiments best do it? Experimental design.
Methods: two-sample \(t\)-test, (one-way) ANOVA, permutation tests

Comparing means

Example:

How different are AirBnB prices between neighbourhoods?

airbnb <- read.csv("../Datasets/portland-airbnb-listings.csv")
airbnb$price <- as.numeric(gsub("$", "", airbnb$price, fixed=TRUE))

## Warning: NAs introduced by coercion

airbnb$neighbourhood[airbnb$neighbourhood == ""] <- NA
(neighbourhood_counts <- sort(table(airbnb$neighbourhood), decreasing=TRUE))

## 
##             Richmond   Northwest District            Concordia 
##                  318                  238                  230 
##             Downtown              Buckman                 King 
##                  221                  205                  188 
##            Sunnyside          Boise-Eliot    Hosford-Abernethy 
##                  166                  160                  159 
##             Overlook            Mt. Tabor            Irvington 
##                  156                  145                  134 
##    Sellwood-Moreland           Montavilla                Pearl 
##                  133                  129                  120 
##             Humboldt                Kerns          Arbor Lodge 
##                  112                  104                  103 
##             Piedmont            Woodstock                Cully 
##                   98                   91                   87 
##       South Portland             Woodlawn            St. Johns 
##                   87                   85                   81 
##               Kenton                Eliot       Rose City Park 
##                   80                   78                   76 
##                Sabin          South Tabor               Vernon 
##                   75                   73                   73 
##   Creston-Kenilworth            Hillsdale   Old Town/Chinatown 
##                   68                   59                   59 
##    Beaumont-Wilshire              Roseway             Brooklyn 
##                   58                   55                   53 
##             N. Tabor                Lents Brentwood-Darlington 
##                   53                   50                   49 
##          Mount Scott      University Park        Foster-Powell 
##                   48                   48                   47 
##          Laurelhurst            Multnomah     Sullivan's Gulch 
##                   45                   45                   45 
##            Hazelwood  Powellhurst-Gilbert      Southwest Hills 
##                   44                   44                   41 
##             Hayhurst        Madison South           Portsmouth 
##                   38                   37                   37 
##              Alameda          Forest Park            Homestead 
##                   35                   34                   34 
##       Cathedral Park         Goose Hollow                 Reed 
##                   32                   32                   30 
##         Eastmoreland           Grant Park             Parkrose 
##                   28                   27                   26 
##             Hillside             Ashcreek      Pleasant Valley 
##                   25                   24                   21 
##     Parkrose Heights   West Portland Park           Bridlemile 
##                   19                   18                   15 
##            Mill Park     South Burlingame            Bridgeton 
##                   15                   14                   12 
##               Sumner       Lloyd District              Markham 
##                   12                   11                   11 
##                Argay         Collins View               Wilkes 
##                   10                   10                   10 
##         Arnold Creek             Glenfair     Sylvan-Highlands 
##                    9                    9                    8 
##    Arlington Heights        Far Southwest        Hayden Island 
##                    7                    7                    7 
##            Hollywood            Maplewood        Marshall Park 
##                    7                    7                    7 
##              Russell        East Columbia Northwest Industrial 
##                    7                    4                    4 
##            Crestwood           Sunderland        Healy Heights 
##                    3                    3                    1 
##        Woodland Park                      
##                    1                    0

Let’s take only the ten biggest neighbourhoods:

big_neighbourhoods <- names(neighbourhood_counts)[1:10]
sub_bnb <- subset(airbnb, !is.na(price) & neighbourhood %in% big_neighbourhoods)
sub_bnb <- droplevels(sub_bnb[, c("price", "neighbourhood", "host_id")])
nrow(sub_bnb)

## [1] 2023

Look at the data:

par(mar=c(7, 3, 1, 1)+.1)
plot(price ~ neighbourhood, data=sub_bnb, fill=grey(0.8), las=2, xlab='')

plot of chunk plot_boxes

Preliminary conclusions?

Formal questions?

ANOVA

The ANOVA model

The price \(P_{ij}\) of the \(j\)th room in neighbourhood \(i\) is \[\begin{equation} P_{ij} = \mu + \alpha_i + \epsilon_{ij} , \end{equation}\] where

\(\mu\) is the overall mean
\(\alpha_i\) is the mean deviation of neighborhood \(i\) from \(\mu\)
\(\epsilon_{ij}\) is what’s left over (“error”, or “residual”)

In words, \[\begin{equation} \text{(price)} = \text{(group mean)} + \text{(residual)} \end{equation}\]

ANOVA

Stands for ANalysis of VAriance
Core statistical procedure in biology
Developed by R.A. Fisher in the early 20th Century
The core idea is to ask how much variation exists within vs. among groups
ANOVAs are linear models that have categorical predictor and continuous response variables
The categorical predictors are often called factors, and can have two or more levels (important to specify in R)
Each factor will have a hypothesis test
The levels of each factor may also need to be tested

Question 1: what are the means?

summary(lm(formula = price ~ neighbourhood, data = sub_bnb))

## 
## Call:
## lm(formula = price ~ neighbourhood, data = sub_bnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.70  -48.12  -23.16   17.28  762.30 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     118.15625    8.13700  14.521   <2e-16 ***
## neighbourhoodBuckman             11.10976   10.88102   1.021   0.3074    
## neighbourhoodConcordia           -5.53016   10.59577  -0.522   0.6018    
## neighbourhoodDowntown           118.53940   10.83458  10.941   <2e-16 ***
## neighbourhoodHosford-Abernethy   14.56073   11.52553   1.263   0.2066    
## neighbourhoodKing                 3.14322   11.08430   0.284   0.7768    
## neighbourhoodNorthwest District  23.42358   10.52246   2.226   0.0261 *  
## neighbourhoodOverlook           -13.53446   11.58099  -1.169   0.2427    
## neighbourhoodRichmond            -0.03638    9.98145  -0.004   0.9971    
## neighbourhoodSunnyside           -3.90324   11.40300  -0.342   0.7322    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.9 on 2013 degrees of freedom
## Multiple R-squared:  0.1108, Adjusted R-squared:  0.1068 
## F-statistic: 27.86 on 9 and 2013 DF,  p-value: < 2.2e-16

Question 2: is there group heterogeneity?

anova(lm(formula = price ~ neighbourhood, data = sub_bnb))

## Analysis of Variance Table
## 
## Response: price
##                 Df   Sum Sq Mean Sq F value    Pr(>F)    
## neighbourhood    9  2655967  295107  27.857 < 2.2e-16 ***
## Residuals     2013 21325161   10594                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One or more predictor variables

One-way ANOVAs just have a single factor
Multi-factor ANOVAs
- Factorial - two or more factors and their interactions
- Nested - the levels of one factor are contained within another level
- The models can be quite complex
ANOVAs use an \(F\)-statistic to test factors in a model
- Ratio of two variances (numerator and denominator)
- The numerator and denominator d.f. need to be included (e.g. (F_{1, 34} = 29.43))
Determining the appropriate test ratios for complex ANOVAs takes some work

Assumptions

Normally distributed groups
- robust to non-normality if equal variances and sample sizes
Equal variances across groups
- okay if largest-to-smallest variance ratio < 3:1
- problematic if there is a mean-variance relationship among groups
Observations in a group are independent
- randomly selected
- don’t confound group with another factor

Experimental design

Goals of an experiment

What do we want to know?

How do we measure it?

This can be done by observation or experiment.

What’s an experiment?

An experiment is a study in which the values of important variables (e.g., group membership, dosage) are determined by the experimenters.

Otherwise, it is an observational study.

Note that controlling the set-up doesn’t necessarily make it a good experiment.

A biological example to get us started

Say you perform an experiment on two different strains of stickleback fish, one from an ocean population (RS) and one from a freshwater lake (BP) by making them microbe free. Microbes in the gut are known to interact with the gut epithelium in ways that lead to a proper maturation of the immune system.

Experimental setup: You decide to carry out an experiment by treating multiple fish from each strain so that some of them have a conventional microbiota, and some of them are inoculated with only one bacterial species. You then measure the levels of gene expression in the stickleback gut using RNA-seq. Because you have a suspicion that the sex of the fish might be important, you track it too.

What makes a good study?

Will we have the power to detect the effect of interest?
- What are the sources of noise?
- How big do we expect the effect to be?
How generalizable will the results be?
- How representative is the sample? Of what group?
What are possible causal explanations?
- What are possible confounding factors?

Considerations

Where do the samples come from?
Sample size, replication, and balance across groups
Controls: setting up good comparisons
Randomization!

For (2), remember that \[\begin{equation} \text{(margin of error)} \propto \frac{\sigma}{\sqrt{n}} . \end{equation}\]

Tidy data

Rules of thumb for data tidiness

Store a copy of data in nonproprietary software and hardware formats, such as plain ASCII text (aka a flat file)
Leave an uncorrected file when doing analyses
Use descriptive names for your data files and variables
Include a header line with descriptive variable names
Maintain effective metadata about the data
When you add observations to a database, add rows
When you add variables to a database, add columns, not rows
A column of data should contain only one data type
All measurements of the same type should be in the same column

Example?

Have a look at the AirBnB dataset. Is it “tidy”?

Exercise

Design a tidy data format for the stickleback experiment:

Tools for tidy data

Tidying data is hard!

… and often requires expert input.

Many common data wrangling operations are made easier by the tidyverse.

The “tidyverse”

packages that do many of the same things as base functions in R
designed to do them more “cleanly”
also includes ggplot (for “Grammar of Graphics”)

A tibble is a data frame

Key functions in dplyr for vectors

Pick observations by their values with filter().
Reorder the rows with arrange().
Pick variables by their names with select().
Create new variables with functions of existing variables with mutate().
Collapse many values down to a single summary with summarise().

`filter()`, `arrange()` and `select()`

a1 <- select(airbnb, neighbourhood, price, host_id, beds, bathrooms)

a2 <- filter(a1, neighbourhood == "Richmond" 
                 | neighbourhood == "Woodlawn"
                 | neighbourhood == "Downtown")

a3 <- arrange(a2, price, neighbourhood)

`mutate()` and `transmutate()`

Add new variables:

mutate(a3,
    price_per_bed = price / beds,
    price_per_bath = price / bathrooms)

Or, make an entirely new data frame:

transmute(airbnb,
    price = price,
    price_per_bed = price / beds,
    price_per_bath = price / bathrooms)

`group_by()` and `summarize()`

group_by() aggregates data by category, e.g.:

by_hood <- group_by(a3, neighbourhood)

Now, you can calculate summaries of other variables within each group, e.g.:

summarise(by_hood, price = mean(price, na.rm = TRUE))

Your turn

Make a data frame only including rooms in the top ten neighbourhoods. Then, using only these neighbourhoods…
Find the mean price, cleaning_fee, and ratio of cleaning fee to price, by neighbourhood.
Edit your code in (2) to add variables for the 25% and 75% quantile of price (use quantile( )).
Do as in (2) and (3) but splitting by both neighbourhood and room_type (e.g., finding the mean price of private rooms in Woodlawn).
Edit your code in (1) to add a new variable giving the number of characters in the house_rules (use nchar( )).

Design of Experiments, and Analysis of Variance

Outline

Goal for the week

Comparing means

Example:

ANOVA

The ANOVA model

ANOVA

Question 1: what are the means?

Question 2: is there group heterogeneity?

One or more predictor variables

Assumptions

Experimental design

Goals of an experiment

What’s an experiment?

A biological example to get us started

What makes a good study?

Considerations

Tidy data

Rules of thumb for data tidiness

Example?

Exercise

Tools for tidy data

The “tidyverse”

A tibble is a data frame

A tibble is a data frame

Key functions in dplyr for vectors

filter(), arrange() and select()

mutate() and transmutate()

group_by() and summarize()

Your turn

`filter()`, `arrange()` and `select()`

`mutate()` and `transmutate()`

`group_by()` and `summarize()`