
Uncertainty: (how to) deal with it

Peter Ralph

Advanced Biological Statistics

Course overview

a box of tools

image: Frank Klausz, woodandshop.com

Steps in data analysis

  1. Care, or at least think, about the data.

  2. Look at the data.

  3. Query the data.

  4. Check the results.

  5. Communicate.

Often “statistics” focuses on querying. Doing that effectively requires all the other steps, too.

Prerequisites

We’ll be assuming that you have some familiarity with

  • programming, and
  • statistics

For instance, you should be able to figure out what this means:

x = c(2, 4, 3, 6)
y = c(5, 12, 4, 10, 2)
t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -1.3761, df = 5.4988, p-value = 0.2222
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.031728  2.331728
## sample estimates:
## mean of x mean of y 
##      3.75      6.60
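To unpack that output, here is a sketch that recomputes each number by hand, using the standard Welch formulas. The variable names are just for illustration.

```r
# By-hand version of the Welch two-sample t-test, on the same data.
x <- c(2, 4, 3, 6)
y <- c(5, 12, 4, 10, 2)

# squared standard error contributed by each sample
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# t statistic: difference in means, in units of its standard error
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
dof <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# two-sided p-value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), dof)
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, dof) * sqrt(vx + vy)

c(t = t_stat, df = dof, p = p_value)
ci
```

These should agree with the `t.test(x, y)` output, up to rounding.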

Overview and mechanics

See the course website.

Break

Please take 10 minutes to

  1. answer the “Welcome Survey” on Canvas,
  2. get the course repository from GitHub,
  3. install RStudio, and/or
  4. move around.

Questions?

Some core statistical concepts

Statistics or parameters?

A statistic is

a numerical description of a dataset.

A parameter is

a numerical attribute of a model of reality.

Often, statistics are used to estimate parameters.
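A minimal sketch of the distinction, with a made-up model. The values of `mu` and `sigma` are hypothetical, chosen only for illustration.

```r
set.seed(1)

# The model: measurements are Normal with mean mu and standard deviation sigma.
mu <- 10      # a parameter (hypothetical value)
sigma <- 2    # another parameter (hypothetical value)

# A dataset generated by the model
x <- rnorm(25, mean = mu, sd = sigma)

# A statistic: computed from the data alone, here used to estimate mu
xbar <- mean(x)

c(parameter = mu, statistic = xbar)
```

The statistic changes each time new data are collected; the parameter does not.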

The two heads of classical statistics

estimating parameters, with uncertainty (confidence intervals)

evaluating (in-)consistency with a particular situation (\(p\)-values)

  1. What do these data tell us about the world?
  2. How strongly do we believe it?

This week: digging in, with simple examples.

Lurking, behind everything:

is uncertainty

thanks to randomness.

How do we understand randomness, concretely and quantitatively?

With models.
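One way to make this concrete is to simulate from a model many times and see how much a statistic varies. This is only a sketch: the Normal model and its values (mean 10, SD 2, sample size 25) are assumptions made up for the example.

```r
set.seed(2)

n <- 25         # size of each simulated dataset
sigma <- 2      # standard deviation in the (made-up) model

# 1000 replicate datasets from the model; keep the sample mean of each
means <- replicate(1000, mean(rnorm(n, mean = 10, sd = sigma)))

sd(means)          # observed spread of the sample mean across replicates
sigma / sqrt(n)    # what theory predicts for that spread
```

The two numbers should be close: the model tells us how uncertain the sample mean is.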

A quick look at some data

Some data

AirBnB hosts in Portland, OR: data file (source: website and download link)

airbnb <- read.csv("../Datasets/portland-airbnb-listings.csv")
nrow(airbnb)
## [1] 5634
names(airbnb)
##   [1] "id"                                           "listing_url"                                  "scrape_id"                                    "last_scraped"                                
##   [5] "name"                                         "summary"                                      "space"                                        "description"                                 
##   [9] "experiences_offered"                          "neighborhood_overview"                        "notes"                                        "transit"                                     
##  [13] "access"                                       "interaction"                                  "house_rules"                                  "thumbnail_url"                               
##  [17] "medium_url"                                   "picture_url"                                  "xl_picture_url"                               "host_id"                                     
##  [21] "host_url"                                     "host_name"                                    "host_since"                                   "host_location"                               
##  [25] "host_about"                                   "host_response_time"                           "host_response_rate"                           "host_acceptance_rate"                        
##  [29] "host_is_superhost"                            "host_thumbnail_url"                           "host_picture_url"                             "host_neighbourhood"                          
##  [33] "host_listings_count"                          "host_total_listings_count"                    "host_verifications"                           "host_has_profile_pic"                        
##  [37] "host_identity_verified"                       "street"                                       "neighbourhood"                                "neighbourhood_cleansed"                      
##  [41] "neighbourhood_group_cleansed"                 "city"                                         "state"                                        "zipcode"                                     
##  [45] "market"                                       "smart_location"                               "country_code"                                 "country"                                     
##  [49] "latitude"                                     "longitude"                                    "is_location_exact"                            "property_type"                               
##  [53] "room_type"                                    "accommodates"                                 "bathrooms"                                    "bedrooms"                                    
##  [57] "beds"                                         "bed_type"                                     "amenities"                                    "square_feet"                                 
##  [61] "price"                                        "weekly_price"                                 "monthly_price"                                "security_deposit"                            
##  [65] "cleaning_fee"                                 "guests_included"                              "extra_people"                                 "minimum_nights"                              
##  [69] "maximum_nights"                               "minimum_minimum_nights"                       "maximum_minimum_nights"                       "minimum_maximum_nights"                      
##  [73] "maximum_maximum_nights"                       "minimum_nights_avg_ntm"                       "maximum_nights_avg_ntm"                       "calendar_updated"                            
##  [77] "has_availability"                             "availability_30"                              "availability_60"                              "availability_90"                             
##  [81] "availability_365"                             "calendar_last_scraped"                        "number_of_reviews"                            "number_of_reviews_ltm"                       
##  [85] "first_review"                                 "last_review"                                  "review_scores_rating"                         "review_scores_accuracy"                      
##  [89] "review_scores_cleanliness"                    "review_scores_checkin"                        "review_scores_communication"                  "review_scores_location"                      
##  [93] "review_scores_value"                          "requires_license"                             "license"                                      "jurisdiction_names"                          
##  [97] "instant_bookable"                             "is_business_travel_ready"                     "cancellation_policy"                          "require_guest_profile_picture"               
## [101] "require_guest_phone_verification"             "calculated_host_listings_count"               "calculated_host_listings_count_entire_homes"  "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms"  "reviews_per_month"

Questions: how much does an AirBnB typically cost in Portland? Do “instant bookable” ones cost more?

Second, look at the data

summary(airbnb$price)
##    Length     Class      Mode 
##      5634 character character
str(airbnb$price)
##  chr [1:5634] "$65.00" "$275.00" "$200.00" "$125.00" "$29.00" "$130.00" "$55.00" "$79.00" "$61.00" "$78.00" "$95.00" "$40.00" "$160.00" "$90.00" "$60.00" "$175.00" "$425.00" "$85.00" "$75.00" "$55.00" ...

summary(airbnb$instant_bookable)
##    Length     Class      Mode 
##      5634 character character
str(airbnb$instant_bookable)
##  chr [1:5634] "f" "t" "f" "f" "t" "t" "f" "t" "f" "f" "f" "t" "f" "f" "f" "f" "f" "t" "f" "t" "t" "f" "f" "f" "f" "f" "f" "f" "t" "f" "f" "t" "f" "t" "t" "f" "f" "t" "f" "t" "f" "f" "f" "f" "f" "t" "f" "f" ...

Whoops

airbnb$price <- as.numeric(gsub("$", "", airbnb$price, fixed=TRUE))
## Warning: NAs introduced by coercion
airbnb$instant_bookable <- (airbnb$instant_bookable == "t")
summary(airbnb$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    69.0    95.0   119.5   136.0   999.0      32
summary(airbnb$instant_bookable)
##    Mode   FALSE    TRUE 
## logical    2960    2674
table(airbnb$bed_type) # hm
## 
##        Airbed         Couch         Futon Pull-out Sofa      Real Bed 
##             9             7            42            21          5555
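About that coercion warning: 32 prices became `NA`, and the largest remaining price is 999. One plausible cause (an assumption, not checked against the file here) is thousands separators, as in "$1,250.00". A sketch with made-up values in the same format:

```r
# Toy values in the same format as airbnb$price (made up, not from the dataset)
price_chr <- c("$65.00", "$275.00", "$1,250.00")

# Stripping only "$" leaves the comma behind, so the last value becomes NA
dollars_only <- suppressWarnings(as.numeric(gsub("$", "", price_chr, fixed = TRUE)))
dollars_only

# Which original strings failed to parse?
price_chr[is.na(dollars_only)]

# Stripping both "$" and "," parses all three
price_num <- as.numeric(gsub("[$,]", "", price_chr))
price_num
```

To check the real data the same way, keep a copy of the original character column before overwriting it.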

How much is a typical night?

mean(airbnb$price, na.rm=TRUE)
## [1] 119.5396
hist(airbnb$price, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='AirBnB prices in Portland, OR')

[Figure: histogram of AirBnB nightly prices in Portland, OR]
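Recall from `summary(airbnb$price)` that the median (95.0) is well below the mean (119.5). That gap is what a long right tail does. A sketch with made-up, right-skewed "prices" (log-normal, not the AirBnB data):

```r
set.seed(3)

# Made-up right-skewed prices; the log-normal shape is an assumption
fake_price <- rlnorm(5000, meanlog = log(95), sdlog = 0.6)

mean(fake_price)                      # pulled upward by a few expensive listings
median(fake_price)                    # half of listings cost less than this
quantile(fake_price, c(0.25, 0.75))   # the middle half of listings
```

For skewed data, the median and quartiles often describe "typical" better than the mean does.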

Conclusion?

Do “instant bookable” listings charge more?

layout(1:2)
instant <- airbnb$price[airbnb$instant_bookable]
not_instant <- airbnb$price[!airbnb$instant_bookable]
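# use the same x-axis range in both panels so they can be compared directly
hist(not_instant, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='not instant bookable')
hist(instant, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='instant bookable')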

[Figure: histograms of nightly price for listings that are not instant bookable (top) and instant bookable (bottom)]

(tt <- t.test(instant, not_instant))
## 
##  Welch Two Sample t-test
## 
## data:  instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.475555 14.872518
## sample estimates:
## mean of x mean of y 
##  124.6409  114.9668

Conclusion

Instant bookable hosts cost more than others (P=0.00027, t-test with df=5039.7695486).

Critique this conclusion, and write your own.

Scribe: person with the smallest sample.int(1000, 1).

Don’t forget Steps 1 and 5!

  1. Care, or at least think, about the data.

  5. Communicate.

How big is the difference? How sure are we?
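Both answers are already in the `t.test()` output. A sketch that recovers them from the printed numbers (copied by hand, so expect small rounding differences):

```r
# Numbers copied from the t.test() output above
mean_instant <- 124.6409
mean_not     <- 114.9668
t_stat <- 3.6482
dof    <- 5039.8

# How big? Estimated difference in mean nightly price, in dollars
delta <- mean_instant - mean_not

# How sure? Standard error of the difference, and a 95% confidence interval
se <- delta / t_stat
ci <- delta + c(-1, 1) * qt(0.975, dof) * se

# The two-sided p-value, for comparison
p <- 2 * pt(-abs(t_stat), dof)

round(c(difference = delta, se = se, lower = ci[1], upper = ci[2]), 2)
```

So the estimated difference is about $9.67 per night, with a 95% confidence interval of roughly $4.48 to $14.87. With the fitted object, `tt$estimate` and `tt$conf.int` hold the same quantities.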

Statistical significance does not imply real-world significance.
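A simulated illustration, with made-up numbers: two groups whose true means differ by fifty cents on a $100 night. With enough data, even that tiny difference is "significant".

```r
set.seed(4)

n <- 1e5   # observations per group (hypothetical)
a <- rnorm(n, mean = 100.0, sd = 10)
b <- rnorm(n, mean = 100.5, sd = 10)   # true difference: 0.5, i.e. 5% of one SD

res <- t.test(a, b)
res$p.value     # tiny: strong evidence that the means differ
res$conf.int    # 95% CI for mean(a) - mean(b): the difference itself is small
```

The p-value says how sure we are that there is a difference; the confidence interval says how big it is. Whether fifty cents matters is not a statistical question.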

Revised conclusion (in class)

So: what did we just do?

“Hypothesis testing and \(p\)-values”