Peter Ralph
Advanced Biological Statistics
##
## Welch Two Sample t-test
##
## data: airbnb$price[airbnb$instant_bookable] and airbnb$price[!airbnb$instant_bookable]
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
But, the \(t\) test relies on Normality (a little). Is the distribution of AirBnB prices too “weird”? How can we be sure?
Methods:
Remove the big values and try again.
Use a nonparametric test.
Observation: If there was no meaningful difference in prices between “instant bookable” and not, then randomly shuffling that label won’t change anything.
Strategy:
instant_bookable
column.Why is this a \(p\)-value? For what hypothesis?
fake_is_instant <- sample(airbnb$instant_bookable)
(mean(airbnb$price[fake_is_instant], na.rm=TRUE) -
mean(airbnb$price[!fake_is_instant], na.rm=TRUE))
## [1] 2.837541
real_diff <- (mean(airbnb$price[airbnb$instant_bookable], na.rm=TRUE)
- mean(airbnb$price[!airbnb$instant_bookable], na.rm=TRUE))
permuted_diffs <- replicate(10000, {
fake_is_instant <- sample(airbnb$instant_bookable)
(mean(airbnb$price[fake_is_instant], na.rm=TRUE)
- mean(airbnb$price[!fake_is_instant], na.rm=TRUE))
} )
hist(permuted_diffs, xlab="shuffled differences in mean", xlim=range(c(permuted_diffs, real_diff)))
abline(v=real_diff, col='red', lwd=3)
## [1] 3e-04
The difference in price between instant bookable and not instant bookable is highly statistically significant (\(p \approx 0.0003\), permutation test).
Here’s a nonparametric method to see if omnivores tend to be bigger than herbivores in the PanTHERIA data:
Find pairs of species in the same family for which one is a omnivore and the other an herbivore.
Compute within each pair (some statistic) comparing omnivore to herbivore size, and average these (the test statistic).
Randomly reassign “trophic level” within families many times, recomputing this statistic.
The \(p\)-value is the proportion of shuffled statistics greater than the observed statistic.
First, let’s subset down to only the herbivores and omnivores, with non-missing data.
pantheria <- subset(pantheria,
TrophicLevel %in% c("herbivore", "omnivore")
&
!is.na(AdultBodyMass_g)
) |> droplevels()
We still have 1430 species remaining.
ratios <- rep(NA, nlevels(pantheria$Family))
names(ratios) <- levels(pantheria$Family)
for (fam in levels(pantheria$Family)) {
this_fam <- subset(pantheria, Family == fam)
omnis <- which(this_fam$TrophicLevel == "omnivore")
herbs <- which(this_fam$TrophicLevel == "herbivore")
if (length(omnis) > 0 && length(herbs) > 0) {
herb <- herbs[sample.int(length(herbs), 1)]
omni <- omnis[sample.int(length(omnis), 1)]
out <- this_fam$AdultBodyMass_g[herb] / this_fam$AdultBodyMass_g[omni]
} else {
out <- NA
}
ratios[fam] <- out
}
shuffled_mean_ratios <- replicate(100, {
shuffled_ratios <- rep(NA, nlevels(pantheria$Family))
names(ratios) <- levels(pantheria$Family)
for (fam in levels(pantheria$Family)) {
this_fam <- subset(pantheria, Family == fam)
# shuffle up trophic level column
this_fam$TrophicLevel <- sample(this_fam$TrophicLevel)
omnis <- which(this_fam$TrophicLevel == "omnivore")
herbs <- which(this_fam$TrophicLevel == "herbivore")
if (length(omnis) > 0 && length(herbs) > 0) {
herb <- herbs[sample.int(length(herbs), 1)]
omni <- omnis[sample.int(length(omnis), 1)]
out <- this_fam$AdultBodyMass_g[herb] / this_fam$AdultBodyMass_g[omni]
} else {
out <- NA
}
shuffled_ratios[fam] <- out
}
mean(shuffled_ratios, na.rm=TRUE)
})
On average the herbivore of the pair was 3.2959199 times bigger than the omnivore; a mean ratio this big occurred 0.08 of the time in the permuted datasets. So, this is not strong evidence that herbivores tend to be bigger than omnivores (p = 0.08, permutation test).
However, we only did one random draw of the pairs to get the observed values.
observed_mean_ratio_vector <- replicate(100, {
ratios <- rep(NA, nlevels(pantheria$Family))
names(ratios) <- levels(pantheria$Family)
for (fam in levels(pantheria$Family)) {
this_fam <- subset(pantheria, Family == fam)
omnis <- which(this_fam$TrophicLevel == "omnivore")
herbs <- which(this_fam$TrophicLevel == "herbivore")
if (length(omnis) > 0 && length(herbs) > 0) {
herb <- herbs[sample.int(length(herbs), 1)]
omni <- omnis[sample.int(length(omnis), 1)]
out <- this_fam$AdultBodyMass_g[herb] / this_fam$AdultBodyMass_g[omni]
} else {
out <- NA
}
ratios[fam] <- out
}
mean(ratios, na.rm=TRUE)
})
xh <- hist(c(observed_mean_ratio_vector, shuffled_mean_ratios), plot=FALSE)
hist(shuffled_mean_ratios, breaks=xh$breaks, col=adjustcolor('red', 0.5))
hist(observed_mean_ratio_vector, breaks=xh$breaks, add=TRUE, col=adjustcolor("grey", 0.5))
legend("topright", col=c("white", "red"),
legend=c("observed", "shuffled")
)
Conclusion? Well, a t-test says:
##
## Welch Two Sample t-test
##
## data: observed_mean_ratio_vector and shuffled_mean_ratios
## t = 4.4179, df = 194.6, p-value = 1.653e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3398443 0.8879671
## sample estimates:
## mean of x mean of y
## 2.649445 2.035539
… but we need to think about this more.