Peter Ralph
13 October – Advanced Biological Statistics
##
## Welch Two Sample t-test
##
## data: airbnb$price[airbnb$instant_bookable] and airbnb$price[!airbnb$instant_bookable]
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
But the \(t\) test relies on Normality. Is the distribution of AirBnB prices too “weird”? How can we be sure?
Methods:
Remove the big values and try again.
Use a nonparametric test.
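Both methods can be sketched in a few lines. The example below does not use the AirBnB data: the data frame `sim`, its simulated log-normal prices, and the cutoff of 500 are all assumptions made for illustration. The Wilcoxon rank-sum test is just one possible choice of nonparametric test.

```r
set.seed(1)
# simulated stand-in for the airbnb data (hypothetical values)
n <- 2000
sim <- data.frame(
    instant_bookable = sample(c(TRUE, FALSE), n, replace=TRUE)
)
# log-normal prices: right-skewed, with a few very large values
sim$price <- rlnorm(n, meanlog=4.5 + 0.1 * sim$instant_bookable, sdlog=0.7)

# Method 1: remove the big values and try again
# (the cutoff of 500 is an arbitrary choice for illustration)
small <- subset(sim, price < 500)
trimmed_test <- t.test(small$price[small$instant_bookable],
                       small$price[!small$instant_bookable])

# Method 2: a nonparametric test (here, the Wilcoxon rank-sum test)
rank_test <- wilcox.test(price ~ instant_bookable, data=sim)

trimmed_test$p.value
rank_test$p.value
```

If both approaches tell the same story as the original \(t\) test, the conclusion probably does not hinge on the large values.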
Observation: If there was no meaningful difference in prices between “instant bookable” and not, then randomly shuffling that label won’t change anything.
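Strategy:
Shuffle the instant_bookable column.
Compute the difference in mean price between the shuffled groups.
Repeat many times.
The proportion of shuffled differences at least as extreme as the real one is the \(p\)-value.
Why is this a \(p\)-value? For what hypothesis?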
fake_is_instant <- sample(airbnb$instant_bookable)
(mean(airbnb$price[fake_is_instant], na.rm=TRUE) -
mean(airbnb$price[!fake_is_instant], na.rm=TRUE))
## [1] 2.837541
real_diff <- (mean(airbnb$price[airbnb$instant_bookable], na.rm=TRUE)
- mean(airbnb$price[!airbnb$instant_bookable], na.rm=TRUE))
permuted_diffs <- replicate(10000, {
fake_is_instant <- sample(airbnb$instant_bookable)
(mean(airbnb$price[fake_is_instant], na.rm=TRUE)
- mean(airbnb$price[!fake_is_instant], na.rm=TRUE))
} )
hist(permuted_diffs, xlab="shuffled differences in mean", xlim=range(c(permuted_diffs, real_diff)))
abline(v=real_diff, col='red', lwd=3)
## [1] 3e-04
The difference in price between instant bookable and not instant bookable is highly statistically significant (\(p \approx 0.0003\), permutation test).
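The line that turns the shuffled differences into a \(p\)-value is not shown above, so here is a sketch of one way it might be computed. It assumes a two-sided comparison (absolute values) and uses simulated prices rather than the AirBnB data; `diff_in_means`, `is_instant`, and `price` are names made up for this example.

```r
set.seed(2)
# simulated stand-in for the airbnb data (hypothetical values)
n <- 500
is_instant <- sample(c(TRUE, FALSE), n, replace=TRUE)
price <- rlnorm(n, meanlog=4.5, sdlog=0.7)   # no true difference between groups

# difference in mean price between the two groups defined by `label`
diff_in_means <- function (x, label) {
    mean(x[label]) - mean(x[!label])
}

real_diff <- diff_in_means(price, is_instant)
permuted_diffs <- replicate(2000, diff_in_means(price, sample(is_instant)))

# two-sided p-value: proportion of shuffled differences
# at least as large in absolute value as the observed one
p_value <- mean(abs(permuted_diffs) >= abs(real_diff))
p_value
```

With a finite number of shuffles the result is a multiple of one over the number of replicates, so a value of 0 is better reported as "less than one over the number of replicates".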
Let’s do the analogous thing for the ANOVA comparing price between neighbourhoods:
## Analysis of Variance Table
##
## Response: price
## Df Sum Sq Mean Sq F value Pr(>F)
## neighbourhood 91 6015248 66102 7.6277 < 2.2e-16 ***
## Residuals 5510 47749952 8666
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
do_perm_test <- function (dataset) {
anova_true <- anova(lm(price ~ neighbourhood, data=dataset))
true_F <- anova_true[["F value"]][1]
# do it once
shuffled_hood <- sample(dataset$neighbourhood)
perm_F <- anova(lm(price ~ shuffled_hood, data=dataset))[["F value"]][1]
# do it lots of times
perm_F_multiple <- replicate(1000, {
shuffled_hood <- sample(dataset$neighbourhood)
anova(lm(price ~ shuffled_hood, data=dataset))[["F value"]][1]
})
# get a p-value = proportion of permuted
# F statistics that are at least as big as
# the observed value
return(mean(perm_F_multiple >= true_F))
}
# look at the values (perm_F_multiple is local to do_perm_test,
# so this only works from inside the function)
# hist(perm_F_multiple, breaks=40,
# xlab='permuted F statistic',
# main='sampling distribution of F')
# get the p-value:
do_perm_test(airbnb)
## [1] 0
There is strongly statistically significant heterogeneity in prices between neighbourhoods (\(p < 0.001\), permutation test).
## [1] 0
There remains significant heterogeneity even after removing Downtown.
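The code for this last check is not shown, so the following is only a sketch of how dropping one neighbourhood and rerunning the permutation test might look. The data are simulated: the neighbourhood names other than "Downtown", the group means, and the helper `perm_F_test` are assumptions for illustration.

```r
set.seed(3)
# simulated stand-in for the airbnb data (hypothetical values)
hoods <- c("Downtown", "A", "B", "C")
sim <- data.frame(neighbourhood = sample(hoods, 400, replace=TRUE))
hood_mean <- c(Downtown=200, A=150, B=100, C=100)
sim$price <- rnorm(400, mean=hood_mean[as.character(sim$neighbourhood)], sd=30)

# permutation test for the ANOVA F statistic
# (fewer replicates than above, for speed)
perm_F_test <- function (dataset, nreps=200) {
    true_F <- anova(lm(price ~ neighbourhood, data=dataset))[["F value"]][1]
    perm_F <- replicate(nreps, {
        shuffled <- dataset
        shuffled$neighbourhood <- sample(shuffled$neighbourhood)
        anova(lm(price ~ neighbourhood, data=shuffled))[["F value"]][1]
    })
    # proportion of permuted F statistics at least as big as the observed one
    mean(perm_F >= true_F)
}

# drop one neighbourhood and rerun the test on what remains
no_downtown <- subset(sim, neighbourhood != "Downtown")
p_no_downtown <- perm_F_test(no_downtown)
p_no_downtown
```

Shuffling a copy of the data frame, rather than creating a separate shuffled vector, keeps the model formula the same for the real and permuted fits.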