Peter Ralph
1 October – Advanced Biological Statistics
Care, or at least think, about the data.
Look at the data.
Query the data.
Sanity check.
Communicate.
Often “statistics” focuses on querying. Doing that effectively requires all the other steps, too.
See the course website.
A statistic: a numerical description of a dataset.
A parameter: a numerical attribute of a model of reality.
Often, statistics are used to estimate parameters.
estimating parameters, with uncertainty (confidence intervals)
evaluating (in-)consistency with a particular situation (\(p\)-values)
This week: digging in, with simple examples.
AirBnB hosts in Portland, OR: website and download link.
Questions: how much does an AirBnB typically cost in Portland? Do “instant bookable” ones cost more?
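The code that loads the data isn’t shown here; the following is a minimal sketch, assuming the listings were downloaded as a CSV (the file name is a guess):
airbnb <- read.csv("listings.csv", stringsAsFactors=TRUE)  # file name is a guess
nrow(airbnb)    # how many listings?
names(airbnb)   # what information do we have about each?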
## [1] 5634
## [1] "id"
## [2] "listing_url"
## [3] "scrape_id"
## [4] "last_scraped"
## [5] "name"
## [6] "summary"
## [7] "space"
## [8] "description"
## [9] "experiences_offered"
## [10] "neighborhood_overview"
## [11] "notes"
## [12] "transit"
## [13] "access"
## [14] "interaction"
## [15] "house_rules"
## [16] "thumbnail_url"
## [17] "medium_url"
## [18] "picture_url"
## [19] "xl_picture_url"
## [20] "host_id"
## [21] "host_url"
## [22] "host_name"
## [23] "host_since"
## [24] "host_location"
## [25] "host_about"
## [26] "host_response_time"
## [27] "host_response_rate"
## [28] "host_acceptance_rate"
## [29] "host_is_superhost"
## [30] "host_thumbnail_url"
## [31] "host_picture_url"
## [32] "host_neighbourhood"
## [33] "host_listings_count"
## [34] "host_total_listings_count"
## [35] "host_verifications"
## [36] "host_has_profile_pic"
## [37] "host_identity_verified"
## [38] "street"
## [39] "neighbourhood"
## [40] "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed"
## [42] "city"
## [43] "state"
## [44] "zipcode"
## [45] "market"
## [46] "smart_location"
## [47] "country_code"
## [48] "country"
## [49] "latitude"
## [50] "longitude"
## [51] "is_location_exact"
## [52] "property_type"
## [53] "room_type"
## [54] "accommodates"
## [55] "bathrooms"
## [56] "bedrooms"
## [57] "beds"
## [58] "bed_type"
## [59] "amenities"
## [60] "square_feet"
## [61] "price"
## [62] "weekly_price"
## [63] "monthly_price"
## [64] "security_deposit"
## [65] "cleaning_fee"
## [66] "guests_included"
## [67] "extra_people"
## [68] "minimum_nights"
## [69] "maximum_nights"
## [70] "minimum_minimum_nights"
## [71] "maximum_minimum_nights"
## [72] "minimum_maximum_nights"
## [73] "maximum_maximum_nights"
## [74] "minimum_nights_avg_ntm"
## [75] "maximum_nights_avg_ntm"
## [76] "calendar_updated"
## [77] "has_availability"
## [78] "availability_30"
## [79] "availability_60"
## [80] "availability_90"
## [81] "availability_365"
## [82] "calendar_last_scraped"
## [83] "number_of_reviews"
## [84] "number_of_reviews_ltm"
## [85] "first_review"
## [86] "last_review"
## [87] "review_scores_rating"
## [88] "review_scores_accuracy"
## [89] "review_scores_cleanliness"
## [90] "review_scores_checkin"
## [91] "review_scores_communication"
## [92] "review_scores_location"
## [93] "review_scores_value"
## [94] "requires_license"
## [95] "license"
## [96] "jurisdiction_names"
## [97] "instant_bookable"
## [98] "is_business_travel_ready"
## [99] "cancellation_policy"
## [100] "require_guest_profile_picture"
## [101] "require_guest_phone_verification"
## [102] "calculated_host_listings_count"
## [103] "calculated_host_listings_count_entire_homes"
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms"
## [106] "reviews_per_month"
## $75.00 $100.00 $80.00 $125.00 $65.00 $95.00 $150.00
## 256 240 177 174 168 166 162
## $85.00 $99.00 $60.00 $90.00 $120.00 $50.00 $70.00
## 159 145 144 134 129 129 120
## $45.00 $55.00 $110.00 $200.00 $199.00 $89.00 $79.00
## 105 99 86 74 67 67 66
## $115.00 $300.00 $40.00 $175.00 $59.00 $35.00 $135.00
## 65 65 64 63 60 59 56
## $69.00 $250.00 $180.00 $130.00 $149.00 $105.00 $119.00
## 56 51 50 49 48 46 43
## $140.00 $49.00 $68.00 $109.00 $145.00 $225.00 $72.00
## 40 40 37 34 34 33 33
## $160.00 $78.00 $215.00 $82.00 $88.00 $98.00 $129.00
## 31 31 30 30 30 29 28
## $30.00 $195.00 $74.00 $350.00 $73.00 $165.00 $36.00
## 28 26 26 25 23 21 21
## $48.00 $71.00 $159.00 $169.00 $39.00 $58.00 $87.00
## 21 21 20 20 20 20 20
## $400.00 $52.00 $64.00 $83.00 $999.00 $42.00 $54.00
## 19 19 19 19 19 18 18
## $92.00 $1,023.00 $108.00 $275.00 $33.00 $43.00 $155.00
## 18 17 17 17 17 17 16
## $179.00 $209.00 $44.00 $47.00 $77.00 $94.00 $97.00
## 16 16 16 16 16 16 16
## $185.00 $38.00 $62.00 $67.00 $41.00 $84.00 $37.00
## 15 15 15 15 14 14 13
## $375.00 $57.00 $128.00 $32.00 $139.00 $325.00 $34.00
## 13 13 12 12 11 11 11
## $91.00 (Other)
## 11 663
## f t
## 2960 2674
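The cleaning code isn’t shown; below is a sketch of commands consistent with the warning and summaries that follow (the exact calls are assumptions):
# prices like "$1,023.00" are stored as text: strip "$" and "," and coerce to numbers
airbnb$price <- as.numeric(gsub("[$,]", "", as.character(airbnb$price)))   # non-numeric entries become NA, hence the warning
# instant_bookable is stored as "t"/"f": convert to logical
airbnb$instant_bookable <- (airbnb$instant_bookable == "t")
summary(airbnb$price)
summary(airbnb$instant_bookable)
mean(airbnb$price, na.rm=TRUE)   # mean nightly price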
## Warning: NAs introduced by coercion
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 69.0 95.0 119.5 136.0 999.0 32
## Mode FALSE TRUE
## logical 2960 2674
##
## Airbed Couch Futon Pull-out Sofa Real Bed
## 9 7 42 21 5555
## [1] 119.5396
Conclusion?
instant <- airbnb$price[airbnb$instant_bookable]
not_instant <- airbnb$price[!airbnb$instant_bookable]
(tt <- t.test(instant, not_instant))
##
## Welch Two Sample t-test
##
## data: instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
Instant bookable hosts cost more than others (P=0.00027, t-test with df=5039.7695486).
Critique this conclusion.
Care, or at least think, about the data.
Communicate.
How big is the difference? How sure are we?
Statistical significance does not imply real-world significance.
The mean nightly price of Portland AirBnB hosts is $120, with a standard deviation of $98 and a range of $0 to $999. “Instant bookable” hosts charged on average $9.7 more than others, a difference that is statistically significant (P=0.00027, t-test with df=5039.7695486).
Critiques?
So: what did we just do?
the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.
Usually, this means choosing a test statistic (the “result”), a notion of what counts as “surprising”, and a “null hypothesis”, all of which can be defined to suit the situation.
If the null hypothesis were true, then you’d be really unlikely to see something like what you actually did.
So, either the “null hypothesis” is not a good description of reality or something surprising happened.
How useful this is depends on the null hypothesis.
##
## Welch Two Sample t-test
##
## data: instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
##
## One Sample t-test
##
## data: airbnb$price
## t = 91.32, df = 5601, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 116.9734 122.1058
## sample estimates:
## mean of x
## 119.5396
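For reference, the second output above (the one-sample test) tests whether the true mean price equals zero; it was presumably produced by something like the following (the exact call is an assumption):
t.test(airbnb$price)   # default null hypothesis: the true mean price is 0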
Is that \(p\)-value useful?
(class survey) How many people have a longer index finger on the hand they write with?
We want to know \[\begin{equation} \theta = \P(\text{random person has writing finger longer}) . \end{equation}\]
Everyone make a fake dataset with \(\theta = 1/2\), e.g.:
n <- 35 # class size
sum(rbinom(n, 1, 1/2) > 0)
Now we can estimate the \(p\)-value for the hypothesis that \(\theta = 1/2\). Conclusions?
replicate(1000, sum(rbinom(n, 1, 1/2) > 0))
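To turn these simulations into a \(p\)-value we need the observed class count; the value 20 below is hypothetical (replace it with the actual survey result):
sims <- replicate(1000, sum(rbinom(n, 1, 1/2) > 0))   # counts under theta = 1/2
observed <- 20   # hypothetical: use the actual class count
# two-sided p-value: proportion of simulations at least as far from n/2 as the observed count
mean(abs(sims - n/2) >= abs(observed - n/2))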
How do we compute \(p\)-values like this in general? Either with math, or with computers (maybe math, maybe simulation, maybe both).
The \(t\) statistic computed from a collection of \(n\) numbers is the sample mean divided by the estimated standard error of the mean, which is the sample SD divided by \(\sqrt{n}\).
If \(x_1, \ldots, x_n\) are numbers, then \[\begin{aligned} \text{(sample mean)} \qquad \bar x &= \frac{1}{n}\sum_{i=1}^n x_i \\ \text{(sample SD)} \qquad s &= \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2} \end{aligned}\] so \[\begin{equation} t(x) = \frac{\bar x}{s / \sqrt{n}} . \end{equation}\]
## t
## 1.318919 1.318919
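For reference, a sketch of the comparison that likely produced output of this form (the sample x below is made up, so the numbers will differ):
x <- 2 * runif(20) - 1                           # a made-up sample
by_hand <- mean(x) / (sd(x) / sqrt(length(x)))   # t statistic computed from the definition
c(by_hand, t.test(x)$statistic)                  # the two values should agree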
Fact: If \(X_1, \ldots, X_n\) are independent random samples from a distribution with mean \(\mu\), then \[\begin{equation} t(X - \mu) = \frac{\bar x - \mu}{s/\sqrt{n}} \approx \StudentsT(n-1) , \end{equation}\] as long as \(n\) is not too small and the distribution isn’t too weird.
Let’s check this, by doing:
find the sample \(t\) score of 100 random draws from some distribution
lots of times, and looking at the distribution of those \(t\) scores.
Claim: no matter the distribution we sample from, it should look close to \(t\).
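The code referred to in the exercise below isn’t reproduced here; this is a sketch of the kind of simulation meant, using shifted-Uniform draws:
n <- 100
tvals <- replicate(10000, {
    x <- 2 * runif(n) - 1             # draws from a distribution with mean 0
    mean(x) / (sd(x) / sqrt(n))       # the sample t score
})
hist(tvals, breaks=50, freq=FALSE, main="sample t scores, Uniform draws")
curve(dt(x, df=n - 1), add=TRUE, col="red", lwd=2)   # Student's t density for comparison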
Do this again (use my code), except using x <- rexp(n) - 1 instead of 2 * runif(n) - 1.
##
## Welch Two Sample t-test
##
## data: instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
A 95% confidence interval for an estimate is constructed so that no matter what the true value, the confidence interval overlaps the truth 95% of the time.
In other words, if we collect 1,000 independent samples from a population with true mean \(\mu\), and construct confidence intervals for the mean from each, then about 950 of these should overlap \(\mu\).
Let’s take independent samples of size \(n=20\) from a Normal distribution with \(\mu = 0\). Example:
## [1] -0.5259054 0.3753652
## attr(,"conf.level")
## [1] 0.95
## [1] 0.05666667
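A sketch of code that would produce output like the above (the results are random, so the numbers will differ; the replicate count of 3000 is a guess):
t.test(rnorm(20))$conf.int        # one 95% CI for the mean, from a sample of size 20
misses <- replicate(3000, {
    ci <- t.test(rnorm(20))$conf.int
    (ci[1] > 0) | (ci[2] < 0)     # TRUE if the CI misses the true mean, 0
})
mean(misses)                      # proportion of CIs that miss; should be close to 0.05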
How does the margin of error change with sample size? By taking random samples from the price column of the airbnb data, make two plots:
Probability that a sample of size n of Portland AirBnB rooms has a sample mean within $10 of the (true) mean price of all rooms, as a function of n.
Expected difference between the mean price of a random sample of n Portland AirBnB rooms and the (true) mean price of all rooms, as a function of n.
x <- subset(airbnb, !is.na(price))$price  # to save typing
nvals <- 10 * (2:20)
props <- rep(NA, length(nvals))
for (k in seq_along(nvals)) {
    many_means <- replicate(1e4, mean(sample(x, nvals[k])))
    props[k] <- mean(abs(many_means - mean(x)) < 10)
}
plot(nvals, 1 - props,
     ylab='proportion of sample means not within $10',
     xlab='sample size',
     ylim=c(0, 1), col='plum', pch=20, cex=2)
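A sketch for the second plot, reusing x and nvals from above; here “expected difference” is interpreted as the mean absolute difference:
diffs <- rep(NA, length(nvals))
for (k in seq_along(nvals)) {
    many_means <- replicate(1e4, mean(sample(x, nvals[k])))
    diffs[k] <- mean(abs(many_means - mean(x)))   # mean absolute error of the sample mean
}
plot(nvals, diffs,
     ylab='mean absolute difference from true mean ($)',
     xlab='sample size',
     col='plum', pch=20, cex=2)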
The Central Limit Theorem says, roughly, that the sum of a bunch of small, independent random things can be well-approximated by a Gaussian distribution, almost regardless of the details.
For instance: say \(X_1, X_2, \ldots, X_n\) are independent, random draws with mean \(\mu\) and standard deviation \(\sigma\).
Then, the difference between the “true” mean, \(\mu\), and the sample mean is Gaussian, \[\begin{aligned} \bar x = \frac{1}{n}\sum_{i=1}^n X_i \approx \Normal\left(\mu, \frac{\sigma}{\sqrt{n}}\right) . \end{aligned}\]
Also called the Normal distribution: see previous slide.
Saying that a random number \(Z\) “is Normal”: \[\begin{equation} Z \sim \Normal(\mu, \sigma) \end{equation}\] means that \[\begin{equation} \P\left\{\frac{Z - \mu}{\sigma} \ge x\right\} = \int_x^\infty \frac{1}{\sqrt{2 \pi}} e^{-u^2/2} du . \end{equation}\]
What to remember:
rnorm(10, mean=3, sd=2) # random simulations
pnorm(5, mean=3, sd=2) # probabilities
qnorm(0.975, mean=3, sd=2) # quantiles
Let’s check this, by doing:
find the sample mean of 100 random draws from some distribution
lots of times, and looking at the distribution of those sample means.
Claim: no matter the distribution we sample from, it should look close to Normal.
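A sketch of this check, using Exponential(1) draws (which have mean 1 and SD 1); any distribution would do:
n <- 100
sample_means <- replicate(10000, mean(rexp(n)))   # sample means of n Exponential draws
hist(sample_means, breaks=50, freq=FALSE, main="sample means, Exponential draws")
curve(dnorm(x, mean=1, sd=1/sqrt(n)), add=TRUE, col="red", lwd=2)   # CLT prediction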
If \(Y\) and \(Z_1, \ldots, Z_n\) are independent \(\Normal(0, \sigma)\), and \[\begin{equation} X = \frac{Y}{ \sqrt{\frac{1}{n}\sum_{j=1}^n Z_j^2} } \end{equation}\] then \[\begin{equation} X \sim \StudentsT(n) . \end{equation}\]
More usefully, a sample mean divided by its standard error is\(^*\) \(t\) distributed.
This is thanks to the Central Limit Theorem. (\(^*\) usually, approximately)
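A quick simulation check of this construction (a sketch; the choice of n = 10 is arbitrary):
n <- 10
sims <- replicate(10000, {
    y <- rnorm(1)             # Y ~ Normal(0, 1)
    z <- rnorm(n)             # Z_1, ..., Z_n ~ Normal(0, 1)
    y / sqrt(mean(z^2))
})
hist(sims, breaks=50, freq=FALSE, main="Y / sqrt(mean(Z^2))")
curve(dt(x, df=n), add=TRUE, col="red", lwd=2)   # Student's t with n degrees of freedom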
statistics describe data and estimate parameters.
\(p\)-values assess (in)consistency with specific models (ie, hypotheses)
confidence intervals give a measure of uncertainty
A sample mean, minus the true mean and divided by its standard error (the sample SD over \(\sqrt{n}\)), is approximately \(t\)-distributed,
which means that sample means are typically no more than a few multiples of \(\sigma/\sqrt{n}\) away from the true mean.
A permutation test gives a way of testing hypotheses with fewer assumptions.