Peter Ralph
Advanced Biological Statistics
Recall our AirBnB example:
airbnb <- read.csv("../Datasets/portland-airbnb-listings.csv")
airbnb$price <- as.numeric(gsub("$", "", airbnb$price, fixed=TRUE))
airbnb$instant_bookable <- (airbnb$instant_bookable == "t")
instant <- airbnb$price[airbnb$instant_bookable]
not_instant <- airbnb$price[!airbnb$instant_bookable]
(tt <- t.test(instant, not_instant))
##
## Welch Two Sample t-test
##
## data: instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
The central limit theorem.
The number of standard errors that the sample mean is away from the true mean has a \(t\) distribution with \(n - 1\) degrees of freedom, where \(n\) is the sample size.
For instance, the probability that the sample mean is within 2 standard errors of the true mean is approximately
\[\begin{aligned} \int_{-2}^2 \frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{(n-1) \pi}\,\Gamma\left(\frac{n-1}{2}\right)} \left(1 + \frac{x^2}{n-1}\right)^{-\frac{n}{2}} dx . \end{aligned}\]
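For example, this probability can be computed directly from the \(t\) distribution in R; a quick sketch for \(n = 20\):

# Sketch: probability that the sample mean falls within 2 standard errors
# of the true mean, for a sample of size n = 20, using the t distribution
# with n - 1 degrees of freedom.
n <- 20
pt(2, df = n - 1) - pt(-2, df = n - 1)   # approximately 0.94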
A 95% confidence interval for an estimate is constructed so that, no matter what the true values are, 95% of the confidence intervals you construct will overlap the truth.
In other words, if we collect 100 independent samples from a population with true mean \(\mu\), and construct a 95% confidence interval for the mean from each, then about 95 of these intervals should overlap \(\mu\).
Let’s take independent samples of size \(n=20\) from a Normal distribution with \(\mu = 0\). Example:
## [1] -0.4019083 0.5862080
## attr(,"conf.level")
## [1] 0.95
## [1] 0.05
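One way to check this by simulation (a minimal sketch; the seed and number of replicates here are arbitrary choices): draw many samples of size 20, compute a 95% confidence interval from each with t.test(), and see how often the interval misses the true mean of 0.

# Sketch: simulate coverage of 95% t confidence intervals for samples of
# size 20 from a Normal(0, 1) distribution.
set.seed(1)       # arbitrary seed, for reproducibility
nreps <- 1000
misses <- replicate(nreps, {
    x <- rnorm(20, mean = 0, sd = 1)
    ci <- t.test(x)$conf.int
    ci[1] > 0 | ci[2] < 0   # TRUE if the interval misses the true mean of 0
})
mean(misses)   # should be close to 0.05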
Suppose we survey 100 random UO students, find that 10 have been to a party recently, and so get a 95% confidence interval of 4%-16% for the percentage of UO students who have been to a party recently.
There is a 95% chance that the true proportion of UO students who have been to a party recently is between 4% and 16%.
Not so good: the true proportion is a fixed number, so it doesn’t make sense to talk about a probability here.
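One way an interval like 4%-16% can arise is the usual normal approximation \(\hat p \pm 1.96 \sqrt{\hat p (1 - \hat p)/n}\); a quick sketch:

# Sketch: normal-approximation 95% CI for a proportion,
# with 10 "yes" answers out of 100 students surveyed.
phat <- 10 / 100
se <- sqrt(phat * (1 - phat) / 100)
phat + c(-1, 1) * 1.96 * se   # roughly 0.04 to 0.16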
Statistical power measures how good our statistical methods are at finding things out.
Formally: the probability of identifying a true effect.
Example: Suppose two snail species' speeds differ by 3 cm/h. What's the chance our experiment will identify the difference?
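For instance, base R's power.t.test() can answer this sort of question; in the sketch below, the within-species SD (5 cm/h) and sample size (10 snails per species) are made-up values for illustration.

# Sketch: power to detect a 3 cm/h difference in mean speed between two
# snail species. The SD (5 cm/h) and group size (10 per species) are
# hypothetical numbers, chosen only to illustrate the calculation.
power.t.test(n = 10, delta = 3, sd = 5, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")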
Suppose that we’re going to do a survey of room prices of an AirBnB competitor. How do our power and accuracy depend on sample size? Supposing that prices roughly match AirBnB’s: mean \(\mu =\) $120 and SD \(\sigma =\) $98, estimate:
1. The size of the difference between the mean price of a random sample of size \(n\) and the (true) mean price.
2. The probability that a sample of \(n\) rooms has a sample mean within $10 of the (true) mean price.
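For a rough theoretical baseline, here is a sketch assuming the CLT approximation that the sample mean is Normal with standard deviation \(\sigma/\sqrt{n}\); the sample sizes below are just example values.

# Sketch: rough theoretical answers via the CLT, treating the sample mean
# as approximately Normal with mean 120 and SD 98 / sqrt(n).
mu <- 120
sigma <- 98
n <- c(10, 50, 100, 500)                 # example sample sizes
se <- sigma / sqrt(n)                    # standard error of the sample mean
expected_abs_diff <- se * sqrt(2 / pi)   # E|Normal(0, se^2)| = se * sqrt(2/pi)
prob_within_10 <- pnorm(10 / se) - pnorm(-10 / se)
data.frame(n, se, expected_abs_diff, prob_within_10)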
Answer those questions empirically: by taking random samples from the price column of the airbnb data, make two plots:
1. Expected difference between the mean price of a random sample of \(n\) Portland AirBnB rooms and the (true) mean price of all rooms, as a function of \(n\).
2. Probability that a sample of \(n\) Portland AirBnB rooms has a sample mean within $10 of the (true) mean price of all rooms, as a function of \(n\).
airbnb <- read.csv("../Datasets/portland-airbnb-listings.csv")
airbnb$price <- as.numeric(gsub("$", "", airbnb$price, fixed=TRUE))
## Warning: NAs introduced by coercion
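A minimal sketch of one possible approach (the sample sizes and number of replicates below are arbitrary choices): repeatedly subsample the observed prices, compare each subsample mean to the mean over all listings, and plot the results against \(n\).

# Sketch: estimate both quantities by simulation for a range of sample sizes.
prices <- airbnb$price[!is.na(airbnb$price)]   # drop prices lost to coercion
true_mean <- mean(prices)
sample_sizes <- c(5, 10, 20, 50, 100, 200, 500)
nreps <- 1000
mean_abs_diff <- numeric(length(sample_sizes))
prob_within_10 <- numeric(length(sample_sizes))
for (k in seq_along(sample_sizes)) {
    diffs <- replicate(nreps, mean(sample(prices, sample_sizes[k])) - true_mean)
    mean_abs_diff[k] <- mean(abs(diffs))
    prob_within_10[k] <- mean(abs(diffs) <= 10)
}
plot(sample_sizes, mean_abs_diff, type = "b", log = "x",
     xlab = "sample size (n)", ylab = "mean |sample mean - true mean| ($)")
plot(sample_sizes, prob_within_10, type = "b", log = "x",
     xlab = "sample size (n)", ylab = "proportion of samples within $10 of the true mean")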