Multifactor ANOVA, and visualization

Peter Ralph

15 October – Advanced Biological Statistics


  1. Permutation tests
  2. Visualization
  3. Means in many combinations of groups, i.e., multi-way ANOVA

Permutation tests

##  Welch Two Sample t-test
## data:  airbnb$price[airbnb$instant_bookable] and airbnb$price[!airbnb$instant_bookable]
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.475555 14.872518
## sample estimates:
## mean of x mean of y 
##  124.6409  114.9668

But the \(t\) test relies on Normality. Is the distribution of AirBnB prices too “weird”? How can we be sure?


  1. Remove the big values and try again.

  2. Use a nonparametric test.

Remove the big values
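
A minimal sketch of this approach, using simulated prices as a stand-in for the real `airbnb` data. The cutoff of 500 and the simulated values are assumptions for illustration only.

```r
# Simulated stand-in for the AirBnB data (hypothetical values, not the real dataset)
set.seed(1)
n <- 2000
airbnb <- data.frame(
  instant_bookable = rep(c(TRUE, FALSE), each = n / 2),
  price = c(rlnorm(n / 2, meanlog = 4.7, sdlog = 0.6),
            rlnorm(n / 2, meanlog = 4.6, sdlog = 0.6))
)

# Drop listings at or above an arbitrary cutoff, then rerun Welch's t-test
cutoff <- 500
trimmed <- subset(airbnb, price < cutoff)
tt <- t.test(price ~ instant_bookable, data = trimmed)
print(tt)
```

If the conclusion changes a lot as the cutoff moves, the result depends on a handful of large values and deserves a closer look.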


The permutation test

Observation: If there were no meaningful difference in prices between “instant bookable” listings and the rest, then randomly shuffling that label would not change the difference in means by more than chance variation.


  1. Shuffle the instant_bookable column.
  2. Compute the difference in means.
  3. Repeat, many times.
  4. Compare: the \(p\)-value is the proportion of “shuffled” values more extreme than observed.
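
The four steps above can be sketched as follows. The data here are simulated (with no true difference between groups), and the helper `mean_diff` and the choice of 1000 replicates are illustrative assumptions, not taken from the slides.

```r
# Simulated stand-in for the AirBnB data (hypothetical values)
set.seed(2)
n <- 1000
airbnb <- data.frame(
  instant_bookable = sample(c(TRUE, FALSE), n, replace = TRUE),
  price = rlnorm(n, meanlog = 4.6, sdlog = 0.6)
)

# Difference in mean price: instant bookable minus not
mean_diff <- function(price, label) {
  mean(price[label]) - mean(price[!label])
}
observed <- mean_diff(airbnb$price, airbnb$instant_bookable)

# Steps 1-3: shuffle the label, recompute the difference, repeat many times
nreps <- 1000
shuffled <- replicate(
  nreps,
  mean_diff(airbnb$price, sample(airbnb$instant_bookable))
)

# Step 4: proportion of shuffled differences at least as extreme as observed
p_value <- mean(abs(shuffled) >= abs(observed))
print(p_value)
```

With a finite number of shuffles, a proportion of exactly 0 only tells us that \(p\) is smaller than about one over the number of replicates.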

\(\Rightarrow\) Why is this a \(p\)-value? For what hypothesis?

Shuffle once

## [1] 2.837541

Many times

plot of chunk many_shuf

How surprising was the real value?

## [1] 0

The difference in price between instant bookable and not instant bookable is highly statistically significant (\(p \approx 0.001\), permutation test).

Your turn

Do the analogous thing for the ANOVA comparing price between neighbourhoods:

## Analysis of Variance Table
## Response: price
##                 Df   Sum Sq Mean Sq F value    Pr(>F)    
## neighbourhood   91  6015248   66102  7.6277 < 2.2e-16 ***
## Residuals     5510 47749952    8666                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
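
One possible way to set this up is to shuffle the `neighbourhood` labels and recompute the \(F\) statistic each time. This is a sketch on simulated data; the helper `f_stat`, the eight made-up neighbourhoods, and the 500 replicates are assumptions.

```r
# Simulated stand-in: 400 listings in 8 hypothetical neighbourhoods
set.seed(3)
airbnb <- data.frame(
  neighbourhood = factor(sample(paste0("nbhd", 1:8), 400, replace = TRUE)),
  price = rlnorm(400, meanlog = 4.6, sdlog = 0.6)
)

# F statistic from the one-way ANOVA of price on neighbourhood
f_stat <- function(df) {
  anova(lm(price ~ neighbourhood, data = df))[["F value"]][1]
}
observed_f <- f_stat(airbnb)

# Shuffle the neighbourhood labels and recompute F, many times
nreps <- 500
shuffled_f <- replicate(nreps, {
  fake <- airbnb
  fake$neighbourhood <- sample(fake$neighbourhood)
  f_stat(fake)
})

# Large F is "more extreme", so this is a one-sided comparison
p_value <- mean(shuffled_f >= observed_f)
print(p_value)
```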

plot of chunk in_class

Visualization



  • pattern discovery

  • efficient summary of information

  • visual/spatial analogy for quantitative patterns

aim to maximize information and minimize ink

paraphrased from Edward Tufte


  • Is the visual analogy appropriate for the type of data?

counts? quantities? multivariate? relationships?

  • Are important comparisons clear?

between groups? differences? time trend?

  • Are units easily interpretable?

meters? dollars? percent? relative change? is it isometric?

Principles of effective display

  • Show the data
  • Encourage the eye to compare differences
  • Represent magnitudes honestly and accurately
  • Draw graphical elements clearly, minimizing clutter
  • Make displays easy to interpret

Above all else show the data.

Tufte 1983


Case study:

Distributions of litter sizes by Order and Family in the PanTHERIA dataset:

##   Microbiotheria    Tubulidentata       Dermoptera Notoryctemorphia 
##                1                1                2                2 
##      Proboscidea       Hyracoidea      Monotremata          Sirenia 
##                3                4                5                5 
## Paucituberculata        Pholidota           Pilosa    Macroscelidea 
##                6                8               10               15 
##   Perissodactyla       Scandentia        Cingulata  Peramelemorphia 
##               17               20               21               21 
##   Erinaceomorpha     Afrosoricida   Dasyuromorphia          Cetacea 
##               24               51               71               84 
##  Didelphimorphia       Lagomorpha    Diprotodontia     Artiodactyla 
##               87               92              143              240 
##        Carnivora         Primates     Soricomorpha       Chiroptera 
##              286              376              428             1116 
##         Rodentia 
##             2277

note the pipe

##           Order                  Family              Genus     
##  Artiodactyla:178   Muridae         : 242   Microtus    :  38  
##  Carnivora   :209   Cricetidae      : 239   Myotis      :  38  
##  Chiroptera  :465   Sciuridae       : 158   Crocidura   :  36  
##  Primates    :209   Vespertilionidae: 135   Peromyscus  :  32  
##  Rodentia    :883   Bovidae         : 110   Sorex       :  32  
##  Soricomorpha:116   Phyllostomidae  : 106   Spermophilus:  31  
##                     (Other)         :1070   (Other)     :1853  
##    Species            LitterSize    
##  Length:2060        Min.   : 0.960  
##  Class :character   1st Qu.: 1.000  
##  Mode  :character   Median : 1.970  
##                     Mean   : 2.489  
##                     3rd Qu.: 3.490  
##                     Max.   :11.300  

stem-and-leaf “plot”

##   The decimal point is at the |
##    0 | 
##    1 | 00000000000000000000000000000000000000000000000000000000000000000000+937
##    2 | 00000000000000000000000000000000000000000000000000000000000000000000+259
##    3 | 00000000000000000000000000000000000000000000000000000000000000000000+230
##    4 | 00000000000000000000000000000000000000000011111111122222222222222233+103
##    5 | 00000000000000000000000000000111111222222222222223333333333333444444+35
##    6 | 00000000000111122223333444455555555666777788899999
##    7 | 001111235555567888899
##    8 | 000011155557789
##    9 | 0000249
##   10 | 0
##   11 | 23
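
A display like this comes from base R's `stem()`. A small sketch on simulated litter sizes (hypothetical values, not the PanTHERIA data):

```r
# Simulated litter sizes, rounded to one decimal place
set.seed(4)
litter_size <- round(1 + rexp(200, rate = 0.7), 1)

# Each row is a stem (the integer part); each digit is one observation's leaf
stem(litter_size)
```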

five(-ish) number summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.960   1.000   1.970   2.489   3.490  11.300
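
The “-ish” is because `summary()` reports six numbers: the five quantiles plus the mean. Base R's `fivenum()` gives Tukey's five-number summary. A sketch on simulated values (an assumption, not the real data):

```r
# Simulated litter sizes (hypothetical values)
set.seed(5)
litter_size <- round(1 + rexp(200, rate = 0.7), 2)

# Six numbers: min, 1st quartile, median, mean, 3rd quartile, max
s <- summary(litter_size)
print(s)

# Five numbers: min, lower hinge, median, upper hinge, max
f <- fivenum(litter_size)
print(f)
```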


plot of chunk points

Points, sorted

plot of chunk points2

Points, sorted and colored

plot of chunk points3
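
A plot of this kind can be sketched in base R by sorting the values and coloring each point by its group. The data frame below is simulated; only the column names `Order` and `LitterSize` come from the summary shown earlier.

```r
# Simulated stand-in for the PanTHERIA data (hypothetical values)
set.seed(6)
pantheria <- data.frame(
  Order = factor(sample(c("Rodentia", "Chiroptera", "Primates"),
                        300, replace = TRUE)),
  LitterSize = round(1 + rexp(300, rate = 0.7), 2)
)

# Sort by litter size, then plot value against rank, colored by Order
ord <- order(pantheria$LitterSize)
plot(pantheria$LitterSize[ord],
     col = as.integer(pantheria$Order[ord]),
     pch = 20, xlab = "rank", ylab = "litter size")
legend("topleft", legend = levels(pantheria$Order),
       col = seq_along(levels(pantheria$Order)), pch = 20)
```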


plot of chunk hist



plot of chunk many_hist

Overlaid histograms

plot of chunk do_stacked_hists
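
One way to overlay histograms in base R is to use the same breaks for every group and semi-transparent colors. A sketch with two simulated groups (the values and the bin width of 0.5 are assumptions):

```r
# Two simulated groups of litter sizes (hypothetical values)
set.seed(7)
rodents <- 1 + rexp(300, rate = 0.4)
bats <- 1 + rexp(300, rate = 1.5)

# Shared breaks make the bars directly comparable
breaks <- seq(0, ceiling(max(rodents, bats)), by = 0.5)
h1 <- hist(rodents, breaks = breaks, plot = FALSE)
h2 <- hist(bats, breaks = breaks, plot = FALSE)

# Draw the first histogram, then add the second on top of it
plot(h1, col = adjustcolor("blue", alpha.f = 0.5),
     ylim = c(0, max(h1$counts, h2$counts)),
     main = "", xlab = "litter size")
plot(h2, col = adjustcolor("red", alpha.f = 0.5), add = TRUE)
legend("topright", legend = c("rodents", "bats"),
       fill = adjustcolor(c("blue", "red"), alpha.f = 0.5))
```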


plot of chunk boxplot

introduced by Mary Eleanor Spear

Many boxes

plot of chunk boxplot3

plot of chunk boxplot4
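
Boxplots for many groups at once can be sketched with the formula interface of base R's `boxplot()`. The data are simulated; only the column names `Order` and `LitterSize` come from the summary shown earlier.

```r
# Simulated stand-in for the PanTHERIA data (hypothetical values)
set.seed(8)
pantheria <- data.frame(
  Order = factor(sample(c("Rodentia", "Chiroptera", "Primates"),
                        300, replace = TRUE)),
  LitterSize = round(1 + rexp(300, rate = 0.7), 2)
)

# One box per Order; the returned list holds the plotted statistics
b <- boxplot(LitterSize ~ Order, data = pantheria,
             ylab = "litter size")
print(b$stats)  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
```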

Your turn

Challenge: visualize LitterSize by TeatNumber, using a boxplot.

The Grammar of Graphics

or, “gg”

Ingredients of a visualization

  • data

  • coordinate axes

  • a geometric representation of numbers

  • a mapping from (summaries of) variables to properties of the geoms

  • maybe more plots
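
A sketch of how these ingredients might map onto ggplot2 calls. The data are simulated, and the particular geom and coordinate choices are just one illustrative combination.

```r
library(ggplot2)

# Simulated stand-in for the PanTHERIA data (hypothetical values)
set.seed(9)
pantheria <- data.frame(
  Order = factor(sample(c("Rodentia", "Chiroptera", "Primates"),
                        300, replace = TRUE)),
  LitterSize = round(1 + rexp(300, rate = 0.7), 2)
)

p <- ggplot(pantheria) +                               # data
  geom_jitter(                                         # geometric representation
    aes(x = Order, y = LitterSize, colour = Order),    # variables -> geom properties
    width = 0.2, height = 0
  ) +
  coord_flip()                                         # coordinate axes
# "maybe more plots": adding facet_wrap(~ Order) would give one panel per Order
print(p)
```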

basic template

more options

Reference: the ggplot2 book.


plot of chunk ggpoints


plot of chunk gghist

Histogram, stacked

plot of chunk gghist2


plot of chunk boxplot2
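
The stacked histogram and the boxplot could be produced along these lines in ggplot2. This is a sketch on simulated data, with an assumed bin width of 0.5.

```r
library(ggplot2)

# Simulated stand-in for the PanTHERIA data (hypothetical values)
set.seed(10)
pantheria <- data.frame(
  Order = factor(sample(c("Rodentia", "Chiroptera", "Primates"),
                        300, replace = TRUE)),
  LitterSize = round(1 + rexp(300, rate = 0.7), 2)
)

# Stacked histogram: mapping fill to Order stacks the groups within each bin
p_hist <- ggplot(pantheria, aes(x = LitterSize, fill = Order)) +
  geom_histogram(binwidth = 0.5, position = "stack")
print(p_hist)

# Boxplot: one box per Order
p_box <- ggplot(pantheria, aes(x = Order, y = LitterSize)) +
  geom_boxplot()
print(p_box)
```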

Your turn, again

Challenge: make this plot.

plot of chunk fancyplot

The cheatsheet might be helpful.