Peter Ralph
28 September 2021 – Advanced Biological Statistics
Care, or at least think, about the data.
Look at the data.
Query the data.
Check the results.
Communicate.
Often “statistics” focuses on querying. Doing that effectively requires all the other steps, too.
We’ll be assuming that you have some familiarity with
For instance, you should be able to figure out what this means:
##
## Welch Two Sample t-test
##
## data: x and y
## t = -1.3761, df = 5.4988, p-value = 0.2222
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.031728 2.331728
## sample estimates:
## mean of x mean of y
## 3.75 6.60
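Output of this shape comes from R's `t.test()`. Here is a minimal sketch using made-up vectors `x` and `y`; the data behind the output above are not shown, so the numbers printed here will differ.

```r
# Hypothetical data: NOT the values behind the output above.
x <- c(2.1, 5.3, 3.8, 4.0)
y <- c(4.9, 9.2, 6.1, 3.5, 8.7)

# Welch's t-test is the default (var.equal=FALSE).
tt <- t.test(x, y)
print(tt)

# The pieces of the printout are also available by name:
tt$statistic   # the t statistic
tt$parameter   # approximate degrees of freedom
tt$p.value     # two-sided p-value
tt$conf.int    # 95% confidence interval for mean(x) - mean(y)
tt$estimate    # the two sample means
```

Each line of the printout corresponds to one of these components, which is a handy way to check your reading of it.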
See the course website.
Please take 10 minutes to
Questions?
A statistic is a numerical description of a dataset.
A parameter is a numerical attribute of a model of reality.
Often, statistics are used to estimate parameters.
estimating parameters, with uncertainty (confidence intervals)
evaluating (in-)consistency with a particular situation (\(p\)-values)
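To make the distinction concrete, here is a small sketch in which the model is known because we simulate from it. The Normal model and the values of `mu`, `sigma`, and the sample size are assumptions chosen for illustration, not anything from the course data.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Model of reality (assumed for illustration): values are Normal
# with mean mu and standard deviation sigma. These are parameters.
mu <- 10
sigma <- 3

# A dataset of 50 draws from that model.
obs <- rnorm(50, mean=mu, sd=sigma)

# Statistics: numerical descriptions of the dataset.
xbar <- mean(obs)
s <- sd(obs)

# The statistic xbar estimates the parameter mu; a 95% confidence
# interval expresses the uncertainty in that estimate.
ci <- t.test(obs)$conf.int
cat("estimate:", xbar, " 95% CI:", ci[1], "to", ci[2], "\n")

# A p-value evaluates consistency with a particular situation,
# here "the true mean is 10".
pval <- t.test(obs, mu=10)$p.value
pval
```

With real data we never get to see `mu`; we only have `xbar` and the interval around it.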
This week: digging in, with simple examples.
is uncertainty
thanks to randomness.
How do we understand randomness, concretely and quantitatively?
With models.
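One way to see this, sketched under an assumed model: simulate many datasets and watch how much a statistic varies between them. The Exponential model with mean 100, the sample size, and the number of replicates are all arbitrary choices for illustration.

```r
set.seed(2)  # arbitrary seed, for reproducibility only

n <- 25          # size of each simulated dataset
nreps <- 10000   # number of simulated datasets

# Model (an assumption): each value is Exponential with mean 100.
sim_means <- replicate(nreps, mean(rexp(n, rate=1/100)))

# How much does the sample mean vary from dataset to dataset?
sd(sim_means)

# Theory says: (sd of a single value) / sqrt(n) = 100 / sqrt(25)
100 / sqrt(n)
```

The two numbers should be close: the model tells us, quantitatively, how much noise to expect in the sample mean.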
AirBnB hosts in Portland, OR: data file (source: website and download link)
## [1] 5634
## [1] "id" "listing_url" "scrape_id" "last_scraped"
## [5] "name" "summary" "space" "description"
## [9] "experiences_offered" "neighborhood_overview" "notes" "transit"
## [13] "access" "interaction" "house_rules" "thumbnail_url"
## [17] "medium_url" "picture_url" "xl_picture_url" "host_id"
## [21] "host_url" "host_name" "host_since" "host_location"
## [25] "host_about" "host_response_time" "host_response_rate" "host_acceptance_rate"
## [29] "host_is_superhost" "host_thumbnail_url" "host_picture_url" "host_neighbourhood"
## [33] "host_listings_count" "host_total_listings_count" "host_verifications" "host_has_profile_pic"
## [37] "host_identity_verified" "street" "neighbourhood" "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed" "city" "state" "zipcode"
## [45] "market" "smart_location" "country_code" "country"
## [49] "latitude" "longitude" "is_location_exact" "property_type"
## [53] "room_type" "accommodates" "bathrooms" "bedrooms"
## [57] "beds" "bed_type" "amenities" "square_feet"
## [61] "price" "weekly_price" "monthly_price" "security_deposit"
## [65] "cleaning_fee" "guests_included" "extra_people" "minimum_nights"
## [69] "maximum_nights" "minimum_minimum_nights" "maximum_minimum_nights" "minimum_maximum_nights"
## [73] "maximum_maximum_nights" "minimum_nights_avg_ntm" "maximum_nights_avg_ntm" "calendar_updated"
## [77] "has_availability" "availability_30" "availability_60" "availability_90"
## [81] "availability_365" "calendar_last_scraped" "number_of_reviews" "number_of_reviews_ltm"
## [85] "first_review" "last_review" "review_scores_rating" "review_scores_accuracy"
## [89] "review_scores_cleanliness" "review_scores_checkin" "review_scores_communication" "review_scores_location"
## [93] "review_scores_value" "requires_license" "license" "jurisdiction_names"
## [97] "instant_bookable" "is_business_travel_ready" "cancellation_policy" "require_guest_profile_picture"
## [101] "require_guest_phone_verification" "calculated_host_listings_count" "calculated_host_listings_count_entire_homes" "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms" "reviews_per_month"
Questions: how much does an AirBnB typically cost in Portland? Do “instant bookable” ones cost more?
## Length Class Mode
## 5634 character character
## chr [1:5634] "$65.00" "$275.00" "$200.00" "$125.00" "$29.00" "$130.00" "$55.00" "$79.00" "$61.00" "$78.00" "$95.00" "$40.00" "$160.00" "$90.00" "$60.00" "$175.00" "$425.00" "$85.00" "$75.00" "$55.00" ...
## Length Class Mode
## 5634 character character
## chr [1:5634] "f" "t" "f" "f" "t" "t" "f" "t" "f" "f" "f" "t" "f" "f" "f" "f" "f" "t" "f" "t" "t" "f" "f" "f" "f" "f" "f" "f" "t" "f" "f" "t" "f" "t" "t" "f" "f" "t" "f" "t" "f" "f" "f" "f" "f" "t" "f" "f" ...
## Warning: NAs introduced by coercion
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 69.0 95.0 119.5 136.0 999.0 32
## Mode FALSE TRUE
## logical 2960 2674
##
## Airbed Couch Futon Pull-out Sofa Real Bed
## 9 7 42 21 5555
## [1] 119.5396
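The summaries above suggest that `price` arrives as text like "$65.00" and `instant_bookable` as "f"/"t", and that both were converted before summarizing. The conversion code is not shown, so the following is only a sketch on a toy data frame (`toy`, with made-up values). It also illustrates one common cause of "NAs introduced by coercion", a thousands separator; whether that explains the 32 NAs here is an assumption worth checking against the data.

```r
# Toy stand-in for the real data (hypothetical values).
toy <- data.frame(
    price=c("$65.00", "$275.00", "$1,200.00", "$29.00"),
    instant_bookable=c("f", "t", "f", "t"),
    stringsAsFactors=FALSE
)

# Stripping only the dollar sign leaves "1,200.00", which as.numeric()
# cannot parse: it becomes NA (the warning is suppressed here).
price_naive <- suppressWarnings(as.numeric(gsub("$", "", toy$price, fixed=TRUE)))
price_naive

# Stripping both "$" and "," keeps every value.
toy$price <- as.numeric(gsub("[$,]", "", toy$price))

# Convert "t"/"f" to TRUE/FALSE.
toy$instant_bookable <- (toy$instant_bookable == "t")

summary(toy$price)
table(toy$instant_bookable)
mean(toy$price, na.rm=TRUE)
```

If the missing prices in the real data are the most expensive listings, dropping them would pull the mean down, which is one reason to look at which rows became `NA`.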
hist(airbnb$price, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='AirBnB prices in Portland, OR')
Conclusion?
layout(1:2)
instant <- airbnb$price[airbnb$instant_bookable]
not_instant <- airbnb$price[!airbnb$instant_bookable]
hist(not_instant, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='not instant bookable')
hist(instant, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='instant bookable')
##
## Welch Two Sample t-test
##
## data: instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.475555 14.872518
## sample estimates:
## mean of x mean of y
## 124.6409 114.9668
Instant bookable hosts cost more than others (P=0.00027, t-test with df=5039.7695486).
Critique this conclusion, and write your own.
Scribe: person with the smallest sample.int(1000, 1).
Care, or at least think, about the data.
Communicate.
How big is the difference? How sure are we?
Statistical significance does not imply real-world significance.
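Both questions can be answered from the numbers already printed in the t-test output. The sketch below copies those numbers by hand (the variable names are ours) and rebuilds the size of the difference and its confidence interval; small discrepancies from the printed interval come from rounding in the copied values.

```r
# Numbers copied from the t-test output above.
mean_instant <- 124.6409
mean_not <- 114.9668
t_stat <- 3.6482
deg_free <- 5039.8

# How big is the difference?
diff_means <- mean_instant - mean_not   # about $9.67 per night

# How sure are we? Recover the standard error from t = difference / SE,
# then rebuild the 95% confidence interval.
se <- diff_means / t_stat
ci <- diff_means + c(-1, 1) * qt(0.975, deg_free) * se

# Size of the difference relative to the non-instant mean.
rel <- diff_means / mean_not

cat(sprintf("difference: $%.2f (95%% CI: $%.2f to $%.2f), about %.0f%% higher\n",
            diff_means, ci[1], ci[2], 100 * rel))
```

A statement like "about $10 more per night, give or take $5" tells the reader far more than the p-value alone, and lets them judge whether the difference matters in practice.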
So: what did we just do?
“Hypothesis testing and \(p\)-values”