Peter Ralph
8 October – Advanced Biological Statistics
To compare means of something between groups. (Tools will be applicable to other tasks also.)
Using statistics.
How different are AirBnB prices between neighbourhoods?
airbnb <- read.csv("../Datasets/portland-airbnb-listings.csv")
airbnb$price <- as.numeric(gsub("$", "", airbnb$price, fixed=TRUE))
## Warning: NAs introduced by coercion
airbnb$neighbourhood[airbnb$neighbourhood == ""] <- NA
(neighbourhood_counts <- sort(table(airbnb$neighbourhood), decreasing=TRUE))
##
## Richmond Northwest District Concordia
## 318 238 230
## Downtown Buckman King
## 221 205 188
## Sunnyside Boise-Eliot Hosford-Abernethy
## 166 160 159
## Overlook Mt. Tabor Irvington
## 156 145 134
## Sellwood-Moreland Montavilla Pearl
## 133 129 120
## Humboldt Kerns Arbor Lodge
## 112 104 103
## Piedmont Woodstock Cully
## 98 91 87
## South Portland Woodlawn St. Johns
## 87 85 81
## Kenton Eliot Rose City Park
## 80 78 76
## Sabin South Tabor Vernon
## 75 73 73
## Creston-Kenilworth Hillsdale Old Town/Chinatown
## 68 59 59
## Beaumont-Wilshire Roseway Brooklyn
## 58 55 53
## N. Tabor Lents Brentwood-Darlington
## 53 50 49
## Mount Scott University Park Foster-Powell
## 48 48 47
## Laurelhurst Multnomah Sullivan's Gulch
## 45 45 45
## Hazelwood Powellhurst-Gilbert Southwest Hills
## 44 44 41
## Hayhurst Madison South Portsmouth
## 38 37 37
## Alameda Forest Park Homestead
## 35 34 34
## Cathedral Park Goose Hollow Reed
## 32 32 30
## Eastmoreland Grant Park Parkrose
## 28 27 26
## Hillside Ashcreek Pleasant Valley
## 25 24 21
## Parkrose Heights West Portland Park Bridlemile
## 19 18 15
## Mill Park South Burlingame Bridgeton
## 15 14 12
## Sumner Lloyd District Markham
## 12 11 11
## Argay Collins View Wilkes
## 10 10 10
## Arnold Creek Glenfair Sylvan-Highlands
## 9 9 8
## Arlington Heights Far Southwest Hayden Island
## 7 7 7
## Hollywood Maplewood Marshall Park
## 7 7 7
## Russell East Columbia Northwest Industrial
## 7 4 4
## Crestwood Sunderland Healy Heights
## 3 3 1
## Woodland Park
## 1 0
Let’s take only the ten biggest neighbourhoods:
big_neighbourhoods <- names(neighbourhood_counts)[1:10]
sub_bnb <- subset(airbnb, !is.na(price) & neighbourhood %in% big_neighbourhoods)
sub_bnb <- droplevels(sub_bnb[, c("price", "neighbourhood", "host_id")])
nrow(sub_bnb)
## [1] 2023
Look at the data:
Preliminary conclusions?
Formal questions?
The price \(P_{ij}\) of the \(j\)th room in neighbourhood \(i\) is \[\begin{equation} P_{ij} = \mu + \alpha_i + \epsilon_{ij} , \end{equation}\] where
In words, \[\begin{equation} \text{(price)} = \text{(group mean)} + \text{(residual)} \end{equation}\]
##
## Call:
## lm(formula = price ~ neighbourhood, data = sub_bnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -211.70 -48.12 -23.16 17.28 762.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.15625 8.13700 14.521 <2e-16 ***
## neighbourhoodBuckman 11.10976 10.88102 1.021 0.3074
## neighbourhoodConcordia -5.53016 10.59577 -0.522 0.6018
## neighbourhoodDowntown 118.53940 10.83458 10.941 <2e-16 ***
## neighbourhoodHosford-Abernethy 14.56073 11.52553 1.263 0.2066
## neighbourhoodKing 3.14322 11.08430 0.284 0.7768
## neighbourhoodNorthwest District 23.42358 10.52246 2.226 0.0261 *
## neighbourhoodOverlook -13.53446 11.58099 -1.169 0.2427
## neighbourhoodRichmond -0.03638 9.98145 -0.004 0.9971
## neighbourhoodSunnyside -3.90324 11.40300 -0.342 0.7322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102.9 on 2013 degrees of freedom
## Multiple R-squared: 0.1108, Adjusted R-squared: 0.1068
## F-statistic: 27.86 on 9 and 2013 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## neighbourhood 9 2655967 295107 27.857 < 2.2e-16 ***
## Residuals 2013 21325161 10594
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One-way ANOVAs just have a single factor
Multi-factor ANOVAs
ANOVAs use an \(F\)-statistic to test factors in a model
Determining the appropriate test ratios for complex ANOVAs takes some work
Normally distributed groups
Equal variances across groups
Observations in a group are independent
What do we want to know?
How do we measure it?
This can be done by observation or experiment.
An experiment is a study in which the values of important variables (e.g., group membership, dosage) are determined by the experimenters.
Otherwise, it is an observational study.
Note that controlling the set-up doesn’t necessarily make it a good experiment.
Say you perform an experiment on two different strains of stickleback fish, one from an ocean population (RS) and one from a freshwater lake (BP) by making them microbe free. Microbes in the gut are known to interact with the gut epithelium in ways that lead to a proper maturation of the immune system.
Experimental setup: You decide to carry out an experiment by treating multiple fish from each strain so that some of them have a conventional microbiota, and some of them are inoculated with only one bacterial species. You then measure the levels of gene expression in the stickleback gut using RNA-seq. Because you have a suspicion that the sex of the fish might be important, you track it too.
Will we have the power to detect the effect of interest?
How generalizable will the results be?
What are possible causal explanations?
For (2), remember that \[\begin{equation} \text{(margin of error)} \propto \frac{\sigma}{\sqrt{n}} . \end{equation}\]
Have a look at the AirBnB dataset. Is it “tidy”?
Design a tidy data format for the stickleback experiment:
Tidying data is hard!
… and often requires expert input.
Many common data wrangling operations are made easier by the tidyverse.
packages that do many of the same things as base functions in R
designed to do them more “cleanly”
also includes ggplot
(for “Grammar of Graphics”)
filter()
.arrange()
.select()
.mutate()
.summarise()
.filter()
, arrange()
and select()
a1 <- select(airbnb, neighbourhood, price, host_id, beds, bathrooms)
a2 <- filter(a1, neighbourhood == "Richmond"
| neighbourhood == "Woodlawn"
| neighbourhood == "Downtown")
a3 <- arrange(a2, price, neighbourhood)
mutate()
and transmutate()
Add new variables:
mutate(a3,
price_per_bed = price / beds,
price_per_bath = price / bathrooms)
Or, make an entirely new data frame:
transmute(airbnb,
price = price,
price_per_bed = price / beds,
price_per_bath = price / bathrooms)
group_by()
and summarize()
group_by()
aggregates data by category, e.g.:
by_hood <- group_by(a3, neighbourhood)
Now, you can calculate summaries of other variables within each group, e.g.:
summarise(by_hood, price = mean(price, na.rm = TRUE))
Make a data frame only including rooms in the top ten neighbourhoods. Then, using only these neighbourhoods…
Find the mean price
, cleaning_fee
, and ratio of cleaning fee to price, by neighbourhood.
Edit your code in (2) to add variables for the 25% and 75% quantile of price
(use quantile( )
).
Do as in (2) and (3) but splitting by both neighbourhood
and room_type
(e.g., finding the mean price of private rooms in Woodlawn).
Edit your code in (1) to add a new variable giving the number of characters in the house_rules
(use nchar( )
).