Peter Ralph
1 October 2020 – Advanced Biological Statistics
Care, or at least think, about the data.
Look at the data.
Query the data.
Check the results.
Communicate.
Often “statistics” focuses on querying. Doing that effectively requires all the other steps, too.
We’ll be assuming that you have some familiarity with
For instance, you should be able to figure out what this means:
##
## Welch Two Sample t-test
##
## data: x and y
## t = -1.3761, df = 5.4988, p-value = 0.2222
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.031728 2.331728
## sample estimates:
## mean of x mean of y
## 3.75 6.60
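As a reminder of where each piece of that printout lives, here is a minimal sketch using made-up vectors (the values of `x` and `y` below are hypothetical, not the data behind the output above). It runs the same Welch test and pulls out the statistic, degrees of freedom, p-value, confidence interval, and group means by name.

```r
# Hypothetical data, chosen only for illustration.
x <- c(1.2, 3.4, 2.8, 7.6)
y <- c(4.1, 9.3, 5.0, 6.2, 8.4)

# t.test() defaults to the Welch (unequal variance) two-sample test.
result <- t.test(x, y)

result$statistic   # the t statistic
result$parameter   # the (fractional) degrees of freedom
result$p.value     # the two-sided p-value
result$conf.int    # 95% confidence interval for mean(x) - mean(y)
result$estimate    # the two sample means
```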
See the course website.
Please take 10 minutes to
Questions?
a numerical description of a dataset.
a numerical attribute of a model of reality.
Often, statistics are used to estimate parameters.
estimating parameters, with uncertainty (confidence intervals)
evaluating (in-)consistency with a particular situation (\(p\)-values)
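The distinction between a statistic and a parameter can be sketched with simulated data, where the parameter is known because we chose it. The numbers below (`true_mean`, `true_sd`, and the sample size of 50) are assumptions made up for this illustration.

```r
set.seed(1)

# The parameter: an attribute of the model, chosen by us here.
true_mean <- 100
true_sd <- 20

# A dataset simulated from that model.
x <- rnorm(50, mean = true_mean, sd = true_sd)

# The statistic: a numerical description of the dataset.
estimate <- mean(x)

# Uncertainty: a 95% confidence interval based on the t distribution.
se <- sd(x) / sqrt(length(x))
ci <- estimate + c(-1, 1) * qt(0.975, df = length(x) - 1) * se

print(estimate)
print(ci)
```

The interval computed by hand here should agree with `t.test(x)$conf.int`.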
This week: digging in, with simple examples.
is uncertainty
thanks to randomness.
How do we understand randomness, concretely and quantitatively?
With models.
AirBnB hosts in Portland, OR: website and download link.
## [1] 5634
## [1] "id" "listing_url" "scrape_id" "last_scraped"
## [5] "name" "summary" "space" "description"
## [9] "experiences_offered" "neighborhood_overview" "notes" "transit"
## [13] "access" "interaction" "house_rules" "thumbnail_url"
## [17] "medium_url" "picture_url" "xl_picture_url" "host_id"
## [21] "host_url" "host_name" "host_since" "host_location"
## [25] "host_about" "host_response_time" "host_response_rate" "host_acceptance_rate"
## [29] "host_is_superhost" "host_thumbnail_url" "host_picture_url" "host_neighbourhood"
## [33] "host_listings_count" "host_total_listings_count" "host_verifications" "host_has_profile_pic"
## [37] "host_identity_verified" "street" "neighbourhood" "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed" "city" "state" "zipcode"
## [45] "market" "smart_location" "country_code" "country"
## [49] "latitude" "longitude" "is_location_exact" "property_type"
## [53] "room_type" "accommodates" "bathrooms" "bedrooms"
## [57] "beds" "bed_type" "amenities" "square_feet"
## [61] "price" "weekly_price" "monthly_price" "security_deposit"
## [65] "cleaning_fee" "guests_included" "extra_people" "minimum_nights"
## [69] "maximum_nights" "minimum_minimum_nights" "maximum_minimum_nights" "minimum_maximum_nights"
## [73] "maximum_maximum_nights" "minimum_nights_avg_ntm" "maximum_nights_avg_ntm" "calendar_updated"
## [77] "has_availability" "availability_30" "availability_60" "availability_90"
## [81] "availability_365" "calendar_last_scraped" "number_of_reviews" "number_of_reviews_ltm"
## [85] "first_review" "last_review" "review_scores_rating" "review_scores_accuracy"
## [89] "review_scores_cleanliness" "review_scores_checkin" "review_scores_communication" "review_scores_location"
## [93] "review_scores_value" "requires_license" "license" "jurisdiction_names"
## [97] "instant_bookable" "is_business_travel_ready" "cancellation_policy" "require_guest_profile_picture"
## [101] "require_guest_phone_verification" "calculated_host_listings_count" "calculated_host_listings_count_entire_homes" "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms" "reviews_per_month"
Questions: how much does an AirBnB typically cost in Portland? Do “instant bookable” ones cost more?
## Length Class Mode
## 5634 character character
## Length Class Mode
## 5634 character character
## Warning: NAs introduced by coercion
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 69.0 95.0 119.5 136.0 999.0 32
## Mode FALSE TRUE
## logical 2960 2674
##
## Airbed Couch Futon Pull-out Sofa Real Bed
## 9 7 42 21 5555
## [1] 119.5396
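The parsing code is not shown, so the following is only a guess at where the coercion warning and the missing prices could come from: if prices are stored as strings like `"$1,250.00"` and only the dollar sign is removed, any value with a thousands separator fails to convert. A minimal sketch with hypothetical strings:

```r
# Hypothetical price strings; the real column's format is an assumption.
raw_price <- c("$85.00", "$120.00", "$1,250.00", "$60.00")

# Removing only the dollar sign leaves the comma, so "1,250.00" becomes NA.
dollar_only <- suppressWarnings(
  as.numeric(gsub("$", "", raw_price, fixed = TRUE))
)

# Removing both dollar signs and commas converts every value.
clean_price <- as.numeric(gsub("[$,]", "", raw_price))

print(dollar_only)
print(clean_price)
```

If this is what happened, the values turned into `NA` would all be prices of $1,000 or more, which would pull the reported mean down.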
hist(airbnb$price, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='AirBnB prices in Portland, OR')
Conclusion?
layout(1:2)
instant <- airbnb$price[airbnb$instant_bookable]
not_instant <- airbnb$price[!airbnb$instant_bookable]
hist(not_instant, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='not instant bookable')
hist(instant, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='instant bookable')
##
## Welch Two Sample t-test
##
## data: instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
Instant bookable hosts cost more than others (P=0.00027, t-test with df=5039.7695486).
Critique this conclusion, and write your own.
Scribe: the person with the smallest sample.int(1000, 1).
Care, or at least think, about the data.
Communicate.
How big is the difference? How sure are we?
Statistical significance does not imply real-world significance.
Instant bookable hosts cost on average $10 more than not instant bookable ones, with a 95% confidence interval of $4.50 to $15. The distributions of prices in the two groups were very similar: for instance, the first and third quartiles of instant bookable hosts are $70 and $145, and those of not instant bookable hosts are $68 and $130, respectively. The average instant bookable cost was about $125, with a 95% confidence interval of +/- about $4; non-instant bookable hosts cost on average $115 per night, with a 95% CI of about +/- $3. Note that the difference of $10 is smallish compared to the price of a room, but the difference is highly significant (p=.0003, t-test with about 5040 degrees of freedom) because of the large sample sizes.
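The ingredients of a summary like that one (difference in means, its confidence interval, quartiles of each group, and the p-value) can be gathered in one place. This is a sketch on simulated stand-in data: `fake_instant` and `fake_not_instant` are made-up lognormal samples, not the AirBnB prices, so the numbers it prints are not the ones quoted above.

```r
set.seed(2)

# Stand-in data: right-skewed fake prices, NOT the AirBnB dataset.
fake_instant <- rlnorm(500, meanlog = log(100), sdlog = 0.5)
fake_not_instant <- rlnorm(600, meanlog = log(95), sdlog = 0.5)

tt <- t.test(fake_instant, fake_not_instant)

report <- list(
  # How big is the difference?
  difference = unname(tt$estimate[1] - tt$estimate[2]),
  # How sure are we?
  ci = as.numeric(tt$conf.int),
  # What do the two distributions look like?
  quartiles_instant = quantile(fake_instant, c(0.25, 0.75), na.rm = TRUE),
  quartiles_not_instant = quantile(fake_not_instant, c(0.25, 0.75), na.rm = TRUE),
  p_value = tt$p.value
)

# Round to three significant digits for reporting.
print(lapply(report, signif, digits = 3))
```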
So: what did we just do?
the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.
Usually, this means
which can all be defined to suit the situation.
If the null hypothesis were true, then you’d be really unlikely to see something like what you actually did.
So, either the “null hypothesis” is not a good description of reality or something surprising happened.
How useful this is depends on the null hypothesis.
##
## Welch Two Sample t-test
##
## data: instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
##
## One Sample t-test
##
## data: airbnb$price
## t = 91.32, df = 5601, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 116.9734 122.1058
## sample estimates:
## mean of x
## 119.5396
Is that \(p\)-value useful?
My hypothesis: People tend to have longer index fingers on the hand they write with because writing stretches the ligaments.
(class survey) How many people have a longer index finger on the hand they write with?
(class survey) Everyone flip a coin:
ifelse(runif(1) < 0.5, "H", "T")
and put the result in this google doc.
We want to estimate the parameter
\[\begin{equation} \theta = \P(\text{random person has writing finger longer}) , \end{equation}\]
and now we have a fake dataset with \(\theta = 1/2\).
Let’s get some more data:
n <- 37 # class size
sum(ifelse(runif(n) < 1/2, "H", "T") == "H")
and put the result in the same google doc.
Now we can estimate the \(p\)-value for the hypothesis that \(\theta = 1/2\).
A faster method:
replicate(1000, sum(rbinom(n, 1, 1/2) > 0))
or, equivalently,
rbinom(1000, n, 1/2)
## [1] 0.3057
Here, we’ve estimated that the difference in numbers of people with a longer finger on each hand is not statistically significant (\(p \approx 0.3\), by simulation).
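One way to turn simulated counts into a \(p\)-value is sketched below. The class count `observed` is a hypothetical stand-in (the survey result is not recorded here), and "at least as surprising" is taken to mean "at least as far from \(n/2\) as what we saw", which is one reasonable choice among several.

```r
set.seed(3)

n <- 37          # class size
observed <- 22   # hypothetical number with a longer writing-hand finger

# Simulate the count many times under the null hypothesis theta = 1/2.
sims <- rbinom(10000, n, 1/2)

# Proportion of simulations at least as far from n/2 as the observed count.
p_sim <- mean(abs(sims - n/2) >= abs(observed - n/2))

# The same two-sided p-value by math, for comparison.
p_exact <- binom.test(observed, n, p = 1/2)$p.value

print(c(simulated = p_sim, exact = p_exact))
```

The simulated value wobbles from run to run; more simulations bring it closer to the exact one.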
Either math:
Or, computers. (maybe math, maybe simulation, maybe both)
So, where did this \(p\)-value come from?
##
## Welch Two Sample t-test
##
## data: instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
The \(t\) distribution! (see separate slides)
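As a quick check, the printed \(p\)-value can be recovered from the printed \(t\) statistic and degrees of freedom using `pt()`, the cumulative distribution function of the \(t\) distribution. The sketch below uses only numbers copied from the output above; the variable names are ours. It also backs out the standard error from the confidence interval to show that the \(t\) statistic is the difference in means divided by that standard error.

```r
# Values copied from the printed Welch t-test output.
t_stat <- 3.6482
dof <- 5039.8
ci <- c(4.475555, 14.872518)

# Two-sided p-value: probability of a t value at least this far from zero.
p_value <- 2 * pt(-abs(t_stat), df = dof)

# The CI is (difference) +/- qt(0.975, dof) * SE, so its half-width gives the SE.
se <- diff(ci) / (2 * qt(0.975, df = dof))

# The midpoint of the CI is the estimated difference in means.
t_from_ci <- mean(ci) / se

print(signif(p_value, 4))
print(round(t_from_ci, 3))
```

Both values should come out close to the ones in the printout, up to rounding of the inputs.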