Peter Ralph
6 October – Advanced Biological Statistics
ANOVA (analysis of variance): a way to compare means of something between groups.
Related topics:
How different are AirBnB prices between neighbourhoods?
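# Read the listings; text columns (including price) become factors
airbnb <- read.csv("../Datasets/portland-airbnb-listings.csv", stringsAsFactors=TRUE)
# Strip the "$" and convert to numbers; a price containing a comma (e.g. "$1,000.00") becomes NA
airbnb$price <- as.numeric(gsub("$", "", airbnb$price, fixed=TRUE))
# Treat empty neighbourhood names as missing (the empty level remains, with count 0)
airbnb$neighbourhood[airbnb$neighbourhood == ""] <- NA
# Number of listings per neighbourhood, largest first
(neighbourhood_counts <- sort(table(airbnb$neighbourhood), decreasing=TRUE))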
##
##             Richmond   Northwest District            Concordia             Downtown              Buckman                 King            Sunnyside          Boise-Eliot    Hosford-Abernethy             Overlook
##                  318                  238                  230                  221                  205                  188                  166                  160                  159                  156
##            Mt. Tabor            Irvington    Sellwood-Moreland           Montavilla                Pearl             Humboldt                Kerns          Arbor Lodge             Piedmont            Woodstock
##                  145                  134                  133                  129                  120                  112                  104                  103                   98                   91
##                Cully       South Portland             Woodlawn            St. Johns               Kenton                Eliot       Rose City Park                Sabin          South Tabor               Vernon
##                   87                   87                   85                   81                   80                   78                   76                   75                   73                   73
##   Creston-Kenilworth            Hillsdale   Old Town/Chinatown    Beaumont-Wilshire              Roseway             Brooklyn             N. Tabor                Lents Brentwood-Darlington          Mount Scott
##                   68                   59                   59                   58                   55                   53                   53                   50                   49                   48
##      University Park        Foster-Powell          Laurelhurst            Multnomah     Sullivan's Gulch            Hazelwood  Powellhurst-Gilbert      Southwest Hills             Hayhurst        Madison South
##                   48                   47                   45                   45                   45                   44                   44                   41                   38                   37
##           Portsmouth              Alameda          Forest Park            Homestead       Cathedral Park         Goose Hollow                 Reed         Eastmoreland           Grant Park             Parkrose
##                   37                   35                   34                   34                   32                   32                   30                   28                   27                   26
##             Hillside             Ashcreek      Pleasant Valley     Parkrose Heights   West Portland Park           Bridlemile            Mill Park     South Burlingame            Bridgeton               Sumner
##                   25                   24                   21                   19                   18                   15                   15                   14                   12                   12
##       Lloyd District              Markham                Argay         Collins View               Wilkes         Arnold Creek             Glenfair     Sylvan-Highlands    Arlington Heights        Far Southwest
##                   11                   11                   10                   10                   10                    9                    9                    8                    7                    7
##        Hayden Island            Hollywood            Maplewood        Marshall Park              Russell        East Columbia Northwest Industrial            Crestwood           Sunderland        Healy Heights
##                    7                    7                    7                    7                    7                    4                    4                    3                    3                    1
##        Woodland Park
##                    1                    0
Let’s take only the ten biggest neighbourhoods:
big_neighbourhoods <- names(neighbourhood_counts)[1:10]
sub_bnb <- subset(airbnb, !is.na(price) & neighbourhood %in% big_neighbourhoods)
sub_bnb <- droplevels(sub_bnb[, c("price", "neighbourhood", "host_id")])
nrow(sub_bnb)
## [1] 2023
Look at the data:
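One way to look at data like this is to tabulate group sizes, means, and standard deviations. The sketch below does so on simulated prices rather than the real listings: the data frame `sim_bnb`, the three group names, and the means and SDs are all made up for illustration.

```r
# Simulated stand-in for sub_bnb: made-up prices in three hypothetical groups
set.seed(1)
sim_bnb <- data.frame(
    neighbourhood = factor(rep(c("A", "B", "C"), each=50)),
    price = c(rnorm(50, mean=100, sd=30),
              rnorm(50, mean=110, sd=30),
              rnorm(50, mean=200, sd=30))
)

# Split prices by group, then summarise each group
prices_by_group <- split(sim_bnb$price, sim_bnb$neighbourhood)
group_summary <- rbind(
    n = sapply(prices_by_group, length),
    mean = sapply(prices_by_group, mean),
    sd = sapply(prices_by_group, sd)
)
print(round(group_summary, 1))
```

A plot such as `boxplot(price ~ neighbourhood, data=sim_bnb)` shows the same comparison graphically, one box per group.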
Preliminary conclusions? Formal questions?
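The price \(P_{ij}\) of the \(j\)th room in neighbourhood \(i\) is \[ P_{ij} = \mu + \alpha_i + \epsilon_{ij} , \] where \(\mu\) is a baseline mean price, \(\alpha_i\) is the offset of neighbourhood \(i\)’s mean from that baseline, and \(\epsilon_{ij}\) is the residual.
In words, \[ \text{(price)} = \text{(group mean)} + \text{(residual)} , \] with the group mean equal to \(\mu + \alpha_i\).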
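To see how the pieces of this model show up in R, here is a minimal sketch on simulated data; the group names and the values of `mu`, `alpha`, and the noise SD are made up. With R's default treatment contrasts, the intercept estimates the mean of the first group and each remaining coefficient estimates that group's difference from it, so intercept plus coefficient recovers the group mean.

```r
set.seed(2)
# Hypothetical parameters: baseline mu, per-group offsets alpha_i
mu <- 120
alpha <- c(A=0, B=10, C=90)
n_per_group <- 200
group <- factor(rep(names(alpha), each=n_per_group))

# Simulate P_ij = mu + alpha_i + epsilon_ij, with Normal residuals
price <- mu + alpha[as.character(group)] + rnorm(length(group), mean=0, sd=40)

fit <- lm(price ~ group)
print(coef(fit))

# Group means implied by the fit: intercept, then intercept + each offset
fitted_means <- coef(fit)[1] + c(0, coef(fit)[-1])
names(fitted_means) <- levels(group)

# Plain sample means of each group, for comparison
sample_means <- sapply(split(price, group), mean)
print(cbind(fitted_means, sample_means))
```

The two columns agree: for a model with a single factor, the least-squares fit reproduces each group's sample mean.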
##
## Call:
## lm(formula = price ~ neighbourhood, data = sub_bnb)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -211.70  -48.12  -23.16   17.28  762.30
##
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     118.15625    8.13700  14.521   <2e-16 ***
## neighbourhoodBuckman             11.10976   10.88102   1.021   0.3074
## neighbourhoodConcordia           -5.53016   10.59577  -0.522   0.6018
## neighbourhoodDowntown           118.53940   10.83458  10.941   <2e-16 ***
## neighbourhoodHosford-Abernethy   14.56073   11.52553   1.263   0.2066
## neighbourhoodKing                 3.14322   11.08430   0.284   0.7768
## neighbourhoodNorthwest District  23.42358   10.52246   2.226   0.0261 *
## neighbourhoodOverlook           -13.53446   11.58099  -1.169   0.2427
## neighbourhoodRichmond            -0.03638    9.98145  -0.004   0.9971
## neighbourhoodSunnyside           -3.90324   11.40300  -0.342   0.7322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.9 on 2013 degrees of freedom
## Multiple R-squared: 0.1108, Adjusted R-squared: 0.1068
## F-statistic: 27.86 on 9 and 2013 DF, p-value: < 2.2e-16
The question: do mean prices differ by neighbourhood?
How would you do this?
Design a statistic that would be big if mean prices differ between neighbourhoods, and small if all neighbourhoods have the same mean price.
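One possible answer, sketched on simulated data (the groups, sample sizes, and means are made up): take the standard deviation of the group means as the statistic, and find out how big it tends to be when all groups are the same by shuffling the group labels. This is a permutation test, offered here as one option and not necessarily the statistic the lecture settles on.

```r
set.seed(3)
# Simulated prices: group C has a higher mean than A and B (made-up values)
group <- factor(rep(c("A", "B", "C"), each=40))
true_means <- c(A=100, B=100, C=130)
price <- rnorm(length(group), mean=true_means[as.character(group)], sd=40)

# Statistic: spread of the group means (big if the means differ)
spread_of_means <- function (x, g) {
    sd(sapply(split(x, g), mean))
}
observed <- spread_of_means(price, group)

# Null distribution: shuffling labels removes any real group differences
shuffled <- replicate(2000, spread_of_means(price, sample(group)))

# Proportion of shuffles with a statistic at least as big as the observed one
p_value <- mean(shuffled >= observed)
print(c(observed=observed, p_value=p_value))
```

With equal group sizes, as here, ranking shuffles by this statistic gives the same order as ranking them by the \(F\)-statistic, so the two lead to the same permutation p-value.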
## Analysis of Variance Table
##
## Response: price
##                 Df   Sum Sq Mean Sq F value    Pr(>F)
## neighbourhood    9  2655967  295107  27.857 < 2.2e-16 ***
## Residuals     2013 21325161   10594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
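The F value in this table is the ratio of the two mean squares, 295107 / 10594 ≈ 27.86, and each mean square is a sum of squares divided by its degrees of freedom. The sketch below computes these quantities by hand on simulated data (made-up groups, sizes, and means) and checks them against `anova()`.

```r
set.seed(4)
# Simulated data with unequal group sizes (all values made up)
group <- factor(rep(c("A", "B", "C", "D"), times=c(30, 40, 50, 60)))
true_means <- c(A=100, B=110, C=120, D=150)
price <- rnorm(length(group), mean=true_means[as.character(group)], sd=50)

grand_mean <- mean(price)
group_means <- sapply(split(price, group), mean)
group_sizes <- sapply(split(price, group), length)

# Between-group sum of squares: group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group (residual) sum of squares: observations around their group mean
ss_within <- sum((price - group_means[as.character(group)])^2)

df_between <- nlevels(group) - 1
df_within <- length(price) - nlevels(group)

# F = (mean square between) / (mean square within)
f_by_hand <- (ss_between / df_between) / (ss_within / df_within)
p_by_hand <- pf(f_by_hand, df_between, df_within, lower.tail=FALSE)

aov_table <- anova(lm(price ~ group))
print(c(by_hand=f_by_hand, from_anova=aov_table[1, "F value"]))
```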
One-way ANOVAs just have a single factor
Multi-factor ANOVAs
ANOVAs use an \(F\)-statistic to test factors in a model
Determining the appropriate test ratios for complex ANOVAs takes some work
Assumptions of ANOVA:
Normally distributed groups
Equal variances across groups
Observations in a group are independent
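The first two conditions (normality and equal variances) can be checked at least roughly from a fitted model. A minimal sketch on simulated data, with made-up groups and one group deliberately given a larger spread: compare residual standard deviations by group, run Bartlett's test for equal variances, and run a Shapiro-Wilk test on the residuals. Independence usually cannot be checked from residuals alone; it depends on how the data were collected (for instance, listings that share a `host_id` might not be independent).

```r
set.seed(5)
group <- factor(rep(c("A", "B", "C"), each=60))
# Made-up data: same mean everywhere, but group C has three times the spread
true_sds <- c(A=20, B=20, C=60)
price <- rnorm(length(group), mean=100, sd=true_sds[as.character(group)])
fit <- lm(price ~ group)

# Equal variances? Compare residual SDs by group, then Bartlett's test
resid_sd <- sapply(split(resid(fit), group), sd)
print(round(resid_sd, 1))
bt <- bartlett.test(price ~ group)
print(bt$p.value)

# Normal residuals? Shapiro-Wilk test on the pooled residuals
sw <- shapiro.test(resid(fit))
print(sw$p.value)
```

Bartlett's test is itself sensitive to non-normality, so a graphical check such as `qqnorm(resid(fit))` is a useful companion to these tests.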