
Analysis of Variance

Peter Ralph

6 October – Advanced Biological Statistics

Outline

Goal

To compare means of something between groups.

Related topics:

  • When can you do it, and how well? Power, false positive rate.
  • How can experiments best do it? Experimental design.
  • Methods: two-sample \(t\)-test, (one-way) ANOVA, permutation tests

Comparing means

Example:

How different are AirBnB prices between neighbourhoods?

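# Read the listings; text columns become factors
airbnb <- read.csv("../Datasets/portland-airbnb-listings.csv", stringsAsFactors=TRUE)
# Strip the dollar sign and convert price to a number (anything that still fails to parse becomes NA)
airbnb$price <- as.numeric(gsub("$", "", airbnb$price, fixed=TRUE))
# Treat blank neighbourhood names as missing
airbnb$neighbourhood[airbnb$neighbourhood == ""] <- NA
# Number of listings per neighbourhood, largest first
(neighbourhood_counts <- sort(table(airbnb$neighbourhood), decreasing=TRUE))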
## 
##             Richmond   Northwest District            Concordia             Downtown              Buckman                 King            Sunnyside          Boise-Eliot    Hosford-Abernethy             Overlook 
##                  318                  238                  230                  221                  205                  188                  166                  160                  159                  156 
##            Mt. Tabor            Irvington    Sellwood-Moreland           Montavilla                Pearl             Humboldt                Kerns          Arbor Lodge             Piedmont            Woodstock 
##                  145                  134                  133                  129                  120                  112                  104                  103                   98                   91 
##                Cully       South Portland             Woodlawn            St. Johns               Kenton                Eliot       Rose City Park                Sabin          South Tabor               Vernon 
##                   87                   87                   85                   81                   80                   78                   76                   75                   73                   73 
##   Creston-Kenilworth            Hillsdale   Old Town/Chinatown    Beaumont-Wilshire              Roseway             Brooklyn             N. Tabor                Lents Brentwood-Darlington          Mount Scott 
##                   68                   59                   59                   58                   55                   53                   53                   50                   49                   48 
##      University Park        Foster-Powell          Laurelhurst            Multnomah     Sullivan's Gulch            Hazelwood  Powellhurst-Gilbert      Southwest Hills             Hayhurst        Madison South 
##                   48                   47                   45                   45                   45                   44                   44                   41                   38                   37 
##           Portsmouth              Alameda          Forest Park            Homestead       Cathedral Park         Goose Hollow                 Reed         Eastmoreland           Grant Park             Parkrose 
##                   37                   35                   34                   34                   32                   32                   30                   28                   27                   26 
##             Hillside             Ashcreek      Pleasant Valley     Parkrose Heights   West Portland Park           Bridlemile            Mill Park     South Burlingame            Bridgeton               Sumner 
##                   25                   24                   21                   19                   18                   15                   15                   14                   12                   12 
##       Lloyd District              Markham                Argay         Collins View               Wilkes         Arnold Creek             Glenfair     Sylvan-Highlands    Arlington Heights        Far Southwest 
##                   11                   11                   10                   10                   10                    9                    9                    8                    7                    7 
##        Hayden Island            Hollywood            Maplewood        Marshall Park              Russell        East Columbia Northwest Industrial            Crestwood           Sunderland        Healy Heights 
##                    7                    7                    7                    7                    7                    4                    4                    3                    3                    1 
##        Woodland Park                      
##                    1                    0

Let’s take only the ten biggest neighbourhoods:

big_neighbourhoods <- names(neighbourhood_counts)[1:10]
sub_bnb <- subset(airbnb, !is.na(price) & neighbourhood %in% big_neighbourhoods)
sub_bnb <- droplevels(sub_bnb[, c("price", "neighbourhood", "host_id")])
nrow(sub_bnb)
## [1] 2023

Look at the data:

par(mar=c(9, 3, 1, 1)+.1)
plot(price ~ neighbourhood, data=sub_bnb, col=grey(0.8), las=2, xlab='')

(Figure: boxplots of price by neighbourhood.)

Preliminary conclusions? Formal questions?

ANOVA

The ANOVA model

The price \(P_{ij}\) of the \(j\)th room in neighbourhood \(i\) is \[\begin{equation} P_{ij} = \mu + \alpha_i + \epsilon_{ij} , \end{equation}\] where

  • \(\mu\) is the overall mean
  • \(\alpha_i\) is the deviation of the mean of neighbourhood \(i\) from \(\mu\)
  • \(\epsilon_{ij}\) is what’s left over (“error”, or “residual”)

In words, \[\begin{equation} \text{(price)} = \text{(group mean)} + \text{(residual)} \end{equation}\]
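To make the model concrete, here is a minimal simulation sketch. All the numbers (the overall mean, the three group deviations, the residual standard deviation, the seed) are made up for illustration and have nothing to do with the AirBnB data.

```r
set.seed(1)                          # arbitrary seed, for reproducibility
mu <- 100                            # hypothetical overall mean
alpha <- c(A = -20, B = 0, C = 20)   # hypothetical group deviations (they sum to zero)
n_per_group <- 200
group <- factor(rep(names(alpha), each = n_per_group))
epsilon <- rnorm(length(group), mean = 0, sd = 10)   # residuals
price <- mu + alpha[as.character(group)] + epsilon   # P_ij = mu + alpha_i + epsilon_ij

# The sample group means should be close to mu + alpha_i
group_means <- tapply(price, group, mean)
rbind(true = mu + alpha, estimated = group_means)
```

With 200 observations per group and a residual SD of 10, each estimated group mean should land within a unit or two of its true value.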

ANOVA

  • Stands for ANalysis Of VAriance
  • Core statistical procedure in biology
  • Developed by R.A. Fisher in the early 20th Century
  • Core idea: ask how much variation exists within vs. among groups
  • ANOVAs are linear models with categorical predictors and a continuous response variable
  • The categorical predictors are often called factors, and each can have two or more levels

Question 1: what are the means?

summary(lm(formula = price ~ neighbourhood, data = sub_bnb))
## 
## Call:
## lm(formula = price ~ neighbourhood, data = sub_bnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -211.70  -48.12  -23.16   17.28  762.30 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     118.15625    8.13700  14.521   <2e-16 ***
## neighbourhoodBuckman             11.10976   10.88102   1.021   0.3074    
## neighbourhoodConcordia           -5.53016   10.59577  -0.522   0.6018    
## neighbourhoodDowntown           118.53940   10.83458  10.941   <2e-16 ***
## neighbourhoodHosford-Abernethy   14.56073   11.52553   1.263   0.2066    
## neighbourhoodKing                 3.14322   11.08430   0.284   0.7768    
## neighbourhoodNorthwest District  23.42358   10.52246   2.226   0.0261 *  
## neighbourhoodOverlook           -13.53446   11.58099  -1.169   0.2427    
## neighbourhoodRichmond            -0.03638    9.98145  -0.004   0.9971    
## neighbourhoodSunnyside           -3.90324   11.40300  -0.342   0.7322    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102.9 on 2013 degrees of freedom
## Multiple R-squared:  0.1108, Adjusted R-squared:  0.1068 
## F-statistic: 27.86 on 9 and 2013 DF,  p-value: < 2.2e-16
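Reading this table takes a little care: with R's default (treatment) contrasts, the intercept is the mean of the reference level, which here is Boise-Eliot, the neighbourhood missing from the coefficient list. Every other coefficient is a difference from that reference. So, for instance, the estimated mean price Downtown is about 118.16 + 118.54 = 236.70. The sketch below checks the same relationship on R's built-in `PlantGrowth` data (a stand-in, since the AirBnB file may not be at hand).

```r
fit <- lm(weight ~ group, data = PlantGrowth)   # built-in data: plant weight in 3 groups
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# Under treatment contrasts, the intercept is the mean of the first
# (reference) level, and each other coefficient is that group's mean
# minus the reference mean.
from_coefs <- coef(fit)[1] + c(0, coef(fit)[-1])
names(from_coefs) <- levels(PlantGrowth$group)
cbind(group_means, from_coefs)   # the two columns should agree
```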

Question 2: is there group heterogeneity?

That is: do mean prices differ by neighbourhood?

How would you do this?

Design a statistic that would be big if mean prices differ between neighbourhoods, and small if all neighbourhoods have the same mean price.
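One possible answer, as a sketch rather than the only choice: use the standard deviation of the group means as the statistic, and judge how big is "big" with a permutation test. The helper name `sd_of_means`, the seed, and the number of permutations are all arbitrary choices made here, and the built-in `PlantGrowth` data stands in for the AirBnB data.

```r
# Statistic: standard deviation of the group means
# (large if group means differ, small if they are all similar).
sd_of_means <- function(y, g) sd(tapply(y, g, mean))

set.seed(123)   # arbitrary seed
observed <- sd_of_means(PlantGrowth$weight, PlantGrowth$group)

# Shuffle the group labels to see what the statistic looks like
# when group membership is unrelated to the response.
permuted <- replicate(2000,
    sd_of_means(PlantGrowth$weight, sample(PlantGrowth$group)))

# Proportion of shuffles with a statistic at least as large as observed
p_value <- mean(permuted >= observed)
c(observed = observed, p_value = p_value)
```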

Question 2, answered by ANOVA

anova(lm(formula = price ~ neighbourhood, data = sub_bnb))
## Analysis of Variance Table
## 
## Response: price
##                 Df   Sum Sq Mean Sq F value    Pr(>F)    
## neighbourhood    9  2655967  295107  27.857 < 2.2e-16 ***
## Residuals     2013 21325161   10594                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
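The arithmetic in this table can be checked by hand: each mean square is a sum of squares divided by its degrees of freedom (2655967 / 9 ≈ 295107 and 21325161 / 2013 ≈ 10594), and the F value is their ratio (295107 / 10594 ≈ 27.86). The sketch below does the same computation from scratch on the built-in `PlantGrowth` data, used here as a stand-in, and compares it with `anova()`.

```r
fit <- lm(weight ~ group, data = PlantGrowth)
y <- PlantGrowth$weight
g <- PlantGrowth$group

group_mean <- ave(y, g)                      # each observation's group mean
ss_between <- sum((group_mean - mean(y))^2)  # variation among group means
ss_within  <- sum((y - group_mean)^2)        # variation within groups
df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)

# F is the ratio of the two mean squares
F_by_hand <- (ss_between / df_between) / (ss_within / df_within)
p_by_hand <- pf(F_by_hand, df_between, df_within, lower.tail = FALSE)
c(F = F_by_hand, p = p_by_hand)
anova(fit)   # should report the same F value and p-value
```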

F definitions

more on F
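For reference, here is a sketch of the usual one-way definition in the notation of the model above. The symbols \(k\) (number of groups), \(n_i\) (number of observations in group \(i\)), \(N = \sum_i n_i\), \(\bar P_{i\cdot}\) (mean of group \(i\)) and \(\bar P_{\cdot\cdot}\) (overall mean) are introduced here for illustration. \[\begin{equation} F = \frac{ \sum_{i=1}^k n_i (\bar P_{i\cdot} - \bar P_{\cdot\cdot})^2 \big/ (k-1) }{ \sum_{i=1}^k \sum_{j=1}^{n_i} (P_{ij} - \bar P_{i\cdot})^2 \big/ (N-k) } \end{equation}\]

The numerator is the mean square among groups and the denominator is the mean square within groups. If all \(\alpha_i = 0\) and the residuals are independent and Normal with a common variance, then \(F\) has an \(F\) distribution with \(k-1\) and \(N-k\) degrees of freedom. For the AirBnB example, \(k = 10\) and \(N = 2023\), giving the 9 and 2013 degrees of freedom in the table.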

One or more predictor variables

  • One-way ANOVAs just have a single factor

  • Multi-factor ANOVAs

    • Factorial - two or more factors and their interactions
    • Nested - the levels of one factor are contained within the levels of another factor
    • The models can be quite complex
  • ANOVAs use an \(F\)-statistic to test factors in a model

    • Ratio of two variances (numerator and denominator)
    • The numerator and denominator d.f. need to be reported along with it (e.g. \(F_{1, 34} = 29.43\))
  • Determining the appropriate test ratios for complex ANOVAs takes some work
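As an illustration of a factorial design, here is a minimal sketch using R's built-in `ToothGrowth` data (not part of the AirBnB example), with two factors and their interaction. Treating `dose` as a categorical factor is a choice made here for the example.

```r
tg <- ToothGrowth                 # built-in data: tooth length by supplement and dose
tg$dose <- factor(tg$dose)        # treat dose as a categorical factor

# `supp * dose` expands to supp + dose + supp:dose
# (both main effects plus their interaction)
fit <- lm(len ~ supp * dose, data = tg)
tab <- anova(fit)
tab

# Report an F statistic together with its numerator and denominator d.f.
sprintf("F_{%d, %d} = %.2f",
        tab["supp:dose", "Df"], tab["Residuals", "Df"], tab["supp:dose", "F value"])
```

Each row of the table is tested against the residual mean square, which is appropriate for this simple fixed-effects layout; nested or mixed designs may call for different denominators.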

Assumptions

  • Normally distributed groups

    • robust to non-normality if equal variances and sample sizes
  • Equal variances across groups

    • okay if largest-to-smallest variance ratio < 3:1
    • problematic if there is a mean-variance relationship among groups
  • Observations in a group are independent

    • randomly selected
    • don’t confound group with another factor
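A few of these assumptions can be checked informally from a fitted model. The following is a rough sketch on the built-in `PlantGrowth` data, used as a stand-in; the plots are diagnostics to look at, not formal tests.

```r
fit <- lm(weight ~ group, data = PlantGrowth)

# Equal variances: compare the largest and smallest within-group variance
group_vars <- tapply(PlantGrowth$weight, PlantGrowth$group, var)
variance_ratio <- max(group_vars) / min(group_vars)
variance_ratio   # the rule of thumb above asks for this to be below about 3

# Normality: QQ plot of the residuals, which should fall near a straight line
qqnorm(resid(fit)); qqline(resid(fit))

# Mean-variance relationship: residual spread should not change with the fitted group mean
plot(fitted(fit), resid(fit), xlab = "fitted group mean", ylab = "residual")
```

Independence cannot be checked this way: it depends on how the data were collected.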