\[
%%
% Add your macros here; they'll be included in pdf and html output.
%%
\newcommand{\R}{\mathbb{R}} % reals
\newcommand{\E}{\mathbb{E}} % expectation
\renewcommand{\P}{\mathbb{P}} % probability
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\logistic}{logistic}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\Normal}{Normal}
\DeclareMathOperator{\Poisson}{Poisson}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Binom}{Binomial}
\DeclareMathOperator{\Gam}{Gamma}
\DeclareMathOperator{\Exp}{Exponential}
\DeclareMathOperator{\Cauchy}{Cauchy}
\DeclareMathOperator{\Unif}{Unif}
\DeclareMathOperator{\Dirichlet}{Dirichlet}
\DeclareMathOperator{\Wishart}{Wishart}
\DeclareMathOperator{\StudentsT}{StudentsT}
\newcommand{\given}{\;\vert\;}
\]

Uncertainty: (how to) deal with it

Peter Ralph

1 October – Advanced Biological Statistics

Course overview

a box of tools

Steps in data analysis

  1. Care, or at least think, about the data.

  2. Look at the data.

  3. Query the data.

  4. Sanity check.

  5. Communicate.

Often “statistics” focuses on querying. Doing that effectively requires all the other steps, too.

Overview and mechanics

See the course website.

Some core statistical concepts

Statistics or parameters?

A statistic is

a numerical description of a dataset.

A parameter is

a numerical attribute of a model of reality.

Often, statistics are used to estimate parameters.

The two heads of classical statistics

estimating parameters, with uncertainty (confidence intervals)

evaluating (in-)consistency with a particular situation (\(p\)-values)

  1. What do these data tell us about the world?
  2. How strongly do we believe it?

This week: digging in, with simple examples.

A quick look at some data

Some data

AirBnB hosts in Portland, OR: website and download link.

Questions: how much does an AirBnB typically cost in Portland? Do “instant bookable” ones cost more?

## [1] 5634
##   [1] "id"                                          
##   [2] "listing_url"                                 
##   [3] "scrape_id"                                   
##   [4] "last_scraped"                                
##   [5] "name"                                        
##   [6] "summary"                                     
##   [7] "space"                                       
##   [8] "description"                                 
##   [9] "experiences_offered"                         
##  [10] "neighborhood_overview"                       
##  [11] "notes"                                       
##  [12] "transit"                                     
##  [13] "access"                                      
##  [14] "interaction"                                 
##  [15] "house_rules"                                 
##  [16] "thumbnail_url"                               
##  [17] "medium_url"                                  
##  [18] "picture_url"                                 
##  [19] "xl_picture_url"                              
##  [20] "host_id"                                     
##  [21] "host_url"                                    
##  [22] "host_name"                                   
##  [23] "host_since"                                  
##  [24] "host_location"                               
##  [25] "host_about"                                  
##  [26] "host_response_time"                          
##  [27] "host_response_rate"                          
##  [28] "host_acceptance_rate"                        
##  [29] "host_is_superhost"                           
##  [30] "host_thumbnail_url"                          
##  [31] "host_picture_url"                            
##  [32] "host_neighbourhood"                          
##  [33] "host_listings_count"                         
##  [34] "host_total_listings_count"                   
##  [35] "host_verifications"                          
##  [36] "host_has_profile_pic"                        
##  [37] "host_identity_verified"                      
##  [38] "street"                                      
##  [39] "neighbourhood"                               
##  [40] "neighbourhood_cleansed"                      
##  [41] "neighbourhood_group_cleansed"                
##  [42] "city"                                        
##  [43] "state"                                       
##  [44] "zipcode"                                     
##  [45] "market"                                      
##  [46] "smart_location"                              
##  [47] "country_code"                                
##  [48] "country"                                     
##  [49] "latitude"                                    
##  [50] "longitude"                                   
##  [51] "is_location_exact"                           
##  [52] "property_type"                               
##  [53] "room_type"                                   
##  [54] "accommodates"                                
##  [55] "bathrooms"                                   
##  [56] "bedrooms"                                    
##  [57] "beds"                                        
##  [58] "bed_type"                                    
##  [59] "amenities"                                   
##  [60] "square_feet"                                 
##  [61] "price"                                       
##  [62] "weekly_price"                                
##  [63] "monthly_price"                               
##  [64] "security_deposit"                            
##  [65] "cleaning_fee"                                
##  [66] "guests_included"                             
##  [67] "extra_people"                                
##  [68] "minimum_nights"                              
##  [69] "maximum_nights"                              
##  [70] "minimum_minimum_nights"                      
##  [71] "maximum_minimum_nights"                      
##  [72] "minimum_maximum_nights"                      
##  [73] "maximum_maximum_nights"                      
##  [74] "minimum_nights_avg_ntm"                      
##  [75] "maximum_nights_avg_ntm"                      
##  [76] "calendar_updated"                            
##  [77] "has_availability"                            
##  [78] "availability_30"                             
##  [79] "availability_60"                             
##  [80] "availability_90"                             
##  [81] "availability_365"                            
##  [82] "calendar_last_scraped"                       
##  [83] "number_of_reviews"                           
##  [84] "number_of_reviews_ltm"                       
##  [85] "first_review"                                
##  [86] "last_review"                                 
##  [87] "review_scores_rating"                        
##  [88] "review_scores_accuracy"                      
##  [89] "review_scores_cleanliness"                   
##  [90] "review_scores_checkin"                       
##  [91] "review_scores_communication"                 
##  [92] "review_scores_location"                      
##  [93] "review_scores_value"                         
##  [94] "requires_license"                            
##  [95] "license"                                     
##  [96] "jurisdiction_names"                          
##  [97] "instant_bookable"                            
##  [98] "is_business_travel_ready"                    
##  [99] "cancellation_policy"                         
## [100] "require_guest_profile_picture"               
## [101] "require_guest_phone_verification"            
## [102] "calculated_host_listings_count"              
## [103] "calculated_host_listings_count_entire_homes" 
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms" 
## [106] "reviews_per_month"

Second, look at the data

##    $75.00   $100.00    $80.00   $125.00    $65.00    $95.00   $150.00 
##       256       240       177       174       168       166       162 
##    $85.00    $99.00    $60.00    $90.00   $120.00    $50.00    $70.00 
##       159       145       144       134       129       129       120 
##    $45.00    $55.00   $110.00   $200.00   $199.00    $89.00    $79.00 
##       105        99        86        74        67        67        66 
##   $115.00   $300.00    $40.00   $175.00    $59.00    $35.00   $135.00 
##        65        65        64        63        60        59        56 
##    $69.00   $250.00   $180.00   $130.00   $149.00   $105.00   $119.00 
##        56        51        50        49        48        46        43 
##   $140.00    $49.00    $68.00   $109.00   $145.00   $225.00    $72.00 
##        40        40        37        34        34        33        33 
##   $160.00    $78.00   $215.00    $82.00    $88.00    $98.00   $129.00 
##        31        31        30        30        30        29        28 
##    $30.00   $195.00    $74.00   $350.00    $73.00   $165.00    $36.00 
##        28        26        26        25        23        21        21 
##    $48.00    $71.00   $159.00   $169.00    $39.00    $58.00    $87.00 
##        21        21        20        20        20        20        20 
##   $400.00    $52.00    $64.00    $83.00   $999.00    $42.00    $54.00 
##        19        19        19        19        19        18        18 
##    $92.00 $1,023.00   $108.00   $275.00    $33.00    $43.00   $155.00 
##        18        17        17        17        17        17        16 
##   $179.00   $209.00    $44.00    $47.00    $77.00    $94.00    $97.00 
##        16        16        16        16        16        16        16 
##   $185.00    $38.00    $62.00    $67.00    $41.00    $84.00    $37.00 
##        15        15        15        15        14        14        13 
##   $375.00    $57.00   $128.00    $32.00   $139.00   $325.00    $34.00 
##        13        13        12        12        11        11        11 
##    $91.00   (Other) 
##        11       663
##    f    t 
## 2960 2674

Whoops

## Warning: NAs introduced by coercion
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    69.0    95.0   119.5   136.0   999.0      32
##    Mode   FALSE    TRUE 
## logical    2960    2674
## 
##        Airbed         Couch         Futon Pull-out Sofa      Real Bed 
##             9             7            42            21          5555
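The "NAs introduced by coercion" warning is what `as.numeric()` gives when some strings still contain non-numeric characters. One plausible cause, assumed here rather than taken from the course code, is that only the dollar sign was stripped, so prices like `$1,023.00` kept their comma. A minimal sketch with made-up vectors `price_raw` and `instant_raw`:

```r
# Hypothetical raw price strings, in the same format as the table above.
price_raw <- c("$75.00", "$100.00", "$1,023.00", "$999.00")

# Stripping only the dollar sign leaves the comma in "1,023.00",
# so as.numeric() returns NA for that entry (and would warn).
price_bad <- suppressWarnings(as.numeric(gsub("$", "", price_raw, fixed = TRUE)))

# Stripping both "$" and "," converts every entry.
price_ok <- as.numeric(gsub("[$,]", "", price_raw))

print(price_bad)
print(price_ok)

# A column coded "t"/"f" becomes a logical by comparing to "t".
instant_raw <- c("t", "f", "f", "t")
instant <- instant_raw == "t"
print(instant)
```

If this is what happened, the maximum of 999 in the summary would be an artifact: every price of $1,000 or more was turned into an `NA`.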

How much is a typical night?

## [1] 119.5396

plot of chunk airbnb_hist

Conclusion?

Do “instant bookable” charge more?

plot of chunk airbnb_hist2

## 
##  Welch Two Sample t-test
## 
## data:  instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.475555 14.872518
## sample estimates:
## mean of x mean of y 
##  124.6409  114.9668

Conclusion

Instant bookable hosts cost more than others (P=0.00027, t-test with df=5039.8).

Critique this conclusion.

Don’t forget Steps 1 and 5!

  1. Care, or at least think, about the data.

  5. Communicate.

How big is the difference? How sure are we?

Statistical significance does not imply real-world significance.
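To see how the two can come apart, here is a small simulation with made-up numbers (not the AirBnB data): with very large samples, a difference of $1 on prices with an SD of $100 still gives a tiny \(p\)-value.

```r
set.seed(1)
n <- 1e6  # a very large sample in each group

# Two simulated groups whose true means differ by only 1 (SD 100);
# these values are chosen for illustration.
a <- rnorm(n, mean = 120, sd = 100)
b <- rnorm(n, mean = 121, sd = 100)

res <- t.test(b, a)  # Welch two-sample t-test, as in the output above
diff_est <- unname(res$estimate[1] - res$estimate[2])

print(res$p.value)   # very small
print(diff_est)      # close to 1: "significant", but does it matter?
print(res$conf.int)  # the interval says how big the difference could be
```

The \(p\)-value only says the difference is probably not zero; the estimate and confidence interval say how big it is.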

Revised conclusion

The mean nightly price of Portland AirBnB hosts is $120, with a standard deviation of $98 and a range of $0 to $999. “Instant bookable” hosts charged on average $9.7 more than others, a difference that is statistically significant (P=0.00027, t-test with df=5039.8).

Critiques?

So: what did we just do?

Hypothesis testing and \(p\)-values

A \(p\)-value is

the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.

Usually, this means

  • a result - numerical value of a statistic
  • surprising - big
  • null hypothesis - the model we use to calculate the \(p\)-value

which can all be defined to suit the situation.

What does a small \(p\)-value mean?

If the null hypothesis were true, then you’d be really unlikely to see something like what you actually did.

So, either the “null hypothesis” is not a good description of reality or something surprising happened.

How useful this is depends on the null hypothesis.

For instance

## 
##  Welch Two Sample t-test
## 
## data:  instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.475555 14.872518
## sample estimates:
## mean of x mean of y 
##  124.6409  114.9668

Also for instance

## 
##  One Sample t-test
## 
## data:  airbnb$price
## t = 91.32, df = 5601, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  116.9734 122.1058
## sample estimates:
## mean of x 
##  119.5396

Is that \(p\)-value useful?

Exercise:

(class survey) How many people have a longer index finger on the hand they write with?

We want to know \[\begin{equation} \theta = \P(\text{random person has writing finger longer}) . \end{equation}\]

Everyone make a fake dataset with \(\theta = 1/2\), e.g.:

n <- 35 # class size
sum(rbinom(n, 1, 1/2) > 0)

Now we can estimate the \(p\)-value for the hypothesis that \(\theta = 1/2\). Conclusions?

replicate(1000, sum(rbinom(n, 1, 1/2) > 0))
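One way to turn the simulated counts into a \(p\)-value is sketched below. The value of `observed` is a hypothetical survey result, used only for illustration; "surprising" is taken to mean "at least as far from \(n/2\) as what we saw".

```r
set.seed(123)
n <- 35          # class size, as above
observed <- 24   # hypothetical survey count, for illustration only

# Counts of "longer" under the null hypothesis theta = 1/2.
null_counts <- replicate(10000, sum(rbinom(n, 1, 1/2) > 0))

# Two-sided p-value: how often is a simulated count at least as far
# from n/2 as the observed count?
p_sim <- mean(abs(null_counts - n / 2) >= abs(observed - n / 2))

# Exact answer from the binomial distribution, for comparison.
p_exact <- binom.test(observed, n, p = 1/2)$p.value

print(c(simulated = p_sim, exact = p_exact))
```

The simulated and exact values should agree up to simulation noise, which shrinks as the number of replicates grows.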

So, where do \(p\)-values come from?

Either math:

table of p-values from a t distribution

Or, computers. (maybe math, maybe simulation, maybe both)

Stochastic minute: the \(t\) distribution

The \(t\) statistic

The \(t\) statistic computed from a collection of \(n\) numbers is the sample mean divided by the estimated standard error of the mean, which is the sample SD divided by \(\sqrt{n}\).

If \(x_1, \ldots, x_n\) are numbers, then \[\begin{aligned} \text{(sample mean)} \qquad \bar x &= \frac{1}{n}\sum_{i=1}^n x_i \\ \text{(sample SD)} \qquad s &= \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2} \end{aligned}\] so \[\begin{equation} t(x) = \frac{\bar x}{s / \sqrt{n}} . \end{equation}\]

Sanity check

##        t          
## 1.318919 1.318919
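A check like this one can be reproduced along the following lines. This is a sketch on made-up data; the function name `t_stat` is an assumption, not the course code.

```r
set.seed(42)
x <- rnorm(20, mean = 0.3)  # made-up data

# t statistic computed directly from the definition:
# sample mean divided by (sample SD / sqrt(n)).
t_stat <- function(x) {
  n <- length(x)
  mean(x) / (sd(x) / sqrt(n))
}

# t.test() with its default mu = 0 reports the same statistic.
print(c(by_hand = t_stat(x), t.test = unname(t.test(x)$statistic)))
```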

The \(t\) approximation

Fact: If \(X_1, \ldots, X_n\) are independent random samples from a distribution with mean \(\mu\), then \[\begin{equation} t(X - \mu) = \frac{\bar x - \mu}{s/\sqrt{n}} \approx \StudentsT(n-1) , \end{equation}\] as long as \(n\) is not too small and the distribution isn’t too weird.

A demonstration

Let’s check this, by doing:

find the sample \(t\) score of 100 random draws from some distribution

lots of times, and looking at the distribution of those \(t\) scores.

Claim: no matter the distribution we sample from, it should look close to \(t\).
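A minimal sketch of this demonstration, assuming draws from a mean-zero Uniform(-1, 1) distribution and comparing against Student's \(t\) with \(n - 1\) degrees of freedom (the helper `t_score` is a name introduced here):

```r
set.seed(7)
n <- 100
t_score <- function(x) mean(x) / (sd(x) / sqrt(length(x)))

# 10,000 sample t scores, each from n draws of Uniform(-1, 1).
t_scores <- replicate(10000, t_score(2 * runif(n) - 1))

# Compare simulated quantiles with those of Student's t with n - 1 df.
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
print(rbind(simulated = quantile(t_scores, probs),
            theory = qt(probs, df = n - 1)))

# Fraction of t scores beyond the theoretical 97.5% quantile in absolute
# value; this should be about 0.05 if the approximation is good.
tail_frac <- mean(abs(t_scores) > qt(0.975, df = n - 1))
print(tail_frac)
```

A histogram of `t_scores` with `dt(x, df = n - 1)` drawn on top gives the visual version of the same comparison.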

One sample

plot of chunk t_one_smaple

More samples

plot of chunk t_more_samples

Distribution of 1,000 sample \(t\) scores

plot of chunk t_sampling_dist

Distribution of 1,000 sample \(t\) scores

plot of chunk t_smpling_dist2

Exercise:

Do this again (use my code) except using

x <- rexp(n) - 1

instead of 2 * runif(n) - 1.

Confident in confidence intervals?

## 
##  Welch Two Sample t-test
## 
## data:  instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.475555 14.872518
## sample estimates:
## mean of x mean of y 
##  124.6409  114.9668

Confidence intervals

A 95% confidence interval for an estimate is constructed so that no matter what the true value, the confidence interval overlaps the truth 95% of the time.

In other words, if we collect 1,000 independent samples from a population with true mean \(\mu\), and construct confidence intervals for the mean from each, then about 950 of these should overlap \(\mu\).

How’s that work?

plot of chunk plot_t

Check this.

if we collect 1,000 independent samples from a population with true mean \(\mu\), and construct confidence intervals from each, then about 950 of these should overlap \(\mu\).

Let’s take independent samples of size \(n=20\) from a Normal distribution with \(\mu = 0\). Example:

## [1] -0.5259054  0.3753652
## attr(,"conf.level")
## [1] 0.95

## [1] 0.05666667
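A check of this kind can be sketched as follows. The number of replicates `nreps` is an arbitrary choice here; the fraction of intervals that miss the true mean should come out near 0.05.

```r
set.seed(2024)
n <- 20
nreps <- 2000

# For each replicate: draw a sample with true mean 0, compute the 95% CI
# from t.test(), and record whether the interval misses 0.
missed <- replicate(nreps, {
  ci <- t.test(rnorm(n, mean = 0))$conf.int
  ci[1] > 0 || ci[2] < 0
})

miss_rate <- mean(missed)
print(miss_rate)
```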

plot of chunk many_conf_int_plot

Sensitivity analysis

Group exercise

How does the margin of error change with sample size? By taking random samples from the price column of the airbnb data, make two plots:

  1. Probability that a sample of size n of Portland AirBnB rooms has a sample mean within $10 of the (true) mean price of all rooms, as a function of n.

  2. Expected difference between the mean price of a random sample of n Portland AirBnB rooms and the (true) mean price of all rooms, as a function of n.
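One possible way to set this up is sketched below. It uses simulated, right-skewed stand-in prices rather than the real `price` column, and reads "expected difference" as the mean absolute difference; both are assumptions made for illustration.

```r
set.seed(11)
# Stand-in for airbnb$price: simulated "prices" (NOT the real data).
prices <- round(rgamma(5000, shape = 2, scale = 60))
true_mean <- mean(prices)

sample_sizes <- c(10, 20, 50, 100, 200, 500)
res <- data.frame(n = sample_sizes, prob_within = NA, mean_abs_diff = NA)

for (i in seq_along(sample_sizes)) {
  # 1,000 sample means from random samples of this size
  means <- replicate(1000, mean(sample(prices, sample_sizes[i])))
  # 1. probability the sample mean is within $10 of the true mean
  res$prob_within[i] <- mean(abs(means - true_mean) < 10)
  # 2. average absolute difference from the true mean
  res$mean_abs_diff[i] <- mean(abs(means - true_mean))
}
print(res)
```

The two plots are then `plot(res$n, res$prob_within, type = "b")` and `plot(res$n, res$mean_abs_diff, type = "b")`.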

In class

plot of chunk sampling

Stochastic minute: the Central Limit Theorem and the Normal distribution

The CLT

The Central Limit Theorem says, roughly, that the net effect of the sum of a bunch of small, independent random things can be well-approximated by a Gaussian distribution, almost regardless of the details.

For instance: say \(X_1, X_2, \ldots, X_n\) are independent, random draws with mean \(\mu\) and standard deviation \(\sigma\).

Then the sample mean is approximately Gaussian, centered on the “true” mean \(\mu\) with standard deviation \(\sigma/\sqrt{n}\): \[\begin{aligned} \bar x = \frac{1}{n}\sum_{i=1}^n X_i \approx \Normal\left(\mu, \frac{\sigma}{\sqrt{n}}\right) . \end{aligned}\]

The Gaussian distribution

Also called the Normal distribution: see previous slide.

Saying that a random number \(Z\) “is Normal”: \[\begin{equation} Z \sim \Normal(\mu, \sigma) \end{equation}\] means that \[\begin{equation} \P\left\{\frac{Z - \mu}{\sigma} \ge x\right\} = \int_x^\infty \frac{1}{\sqrt{2 \pi}} e^{-u^2/2} du . \end{equation}\]

What to remember:

  1. \(Z\) is probably no more than a few times \(\sigma\) away from \(\mu\)
  2. Using R,
rnorm(10, mean=3, sd=2)    # random simulations
pnorm(5, mean=3, sd=2)     # probabilities
qnorm(0.975, mean=3, sd=2) # quantiles
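To connect these functions to the two points above, here is a short worked example with the same mean and SD (the variable names are introduced here):

```r
# Z ~ Normal(mean = 3, sd = 2), matching the calls above.
p <- pnorm(5, mean = 3, sd = 2)      # P(Z <= 5); 5 is one SD above the mean
q <- qnorm(0.975, mean = 3, sd = 2)  # value with 97.5% of the probability below it

# Standardizing gives the same answers from the standard Normal.
print(c(p, pnorm((5 - 3) / 2)))
print(c(q, 3 + 2 * qnorm(0.975)))

# "A few times sigma": probability of landing within 2 and within 3 SDs
# of the mean (about 0.954 and 0.997).
within2 <- pnorm(2) - pnorm(-2)
within3 <- pnorm(3) - pnorm(-3)
print(c(within2, within3))
```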

A demonstration

Let’s check this, by doing:

find the sample mean of 100 random draws from some distribution

lots of times, and looking at the distribution of those sample means.

Claim: no matter the distribution we sample from, it should look close to Normal.
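A minimal sketch of this demonstration, assuming draws from an Exponential(1) distribution (mean 1, SD 1), which is quite skewed:

```r
set.seed(99)
n <- 100

# 10,000 sample means, each of n draws from Exponential(1).
means <- replicate(10000, mean(rexp(n)))

# The CLT predicts the means are roughly Normal(1, 1 / sqrt(n)).
print(c(mean = mean(means), sd = sd(means), predicted_sd = 1 / sqrt(n)))

# Fraction of sample means within 2 predicted SDs of the true mean;
# a Normal distribution gives about 0.95.
frac_within2 <- mean(abs(means - 1) < 2 / sqrt(n))
print(frac_within2)
```

A histogram of `means` with `dnorm(x, mean = 1, sd = 1 / sqrt(n))` drawn on top gives the visual version.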

One sample

plot of chunk one_smaple

More samples

plot of chunk more_samples

Distribution of 1,000 sample means

plot of chunk smpling_dist

Distribution of 1,000 sample means

plot of chunk smpling_dist2

Relationship to the \(t\) distribution

If \(Y\) and \(Z_1, \ldots, Z_n\) are independent \(\Normal(0, \sigma)\), and \[\begin{equation} X = \frac{Y}{ \sqrt{\frac{1}{n}\sum_{j=1}^n Z_j^2} } \end{equation}\] then \[\begin{equation} X \sim \StudentsT(n) . \end{equation}\]

More usefully, a sample mean divided by its standard error is\(^*\) \(t\) distributed.

This is thanks to the Central Limit Theorem. (\(^*\) usually, approximately)
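The first statement can be checked by simulation. In this sketch the values of `n`, `sigma`, and `nreps` are arbitrary choices, and the ratio is built exactly as in the formula above.

```r
set.seed(5)
n <- 5
sigma <- 3       # arbitrary; the ratio does not depend on it
nreps <- 20000

# Build X = Y / sqrt(mean(Z^2)) from independent Normal(0, sigma) draws.
X <- replicate(nreps, {
  Y <- rnorm(1, mean = 0, sd = sigma)
  Z <- rnorm(n, mean = 0, sd = sigma)
  Y / sqrt(mean(Z^2))
})

# Compare simulated quantiles with those of Student's t with n df.
probs <- c(0.025, 0.25, 0.75, 0.975)
print(rbind(simulated = quantile(X, probs), theory = qt(probs, df = n)))

# Tail fraction beyond the theoretical 97.5% quantile: about 0.05.
tail_frac <- mean(abs(X) > qt(0.975, df = n))
print(tail_frac)
```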

Recap

  • statistics describe data and estimate parameters.

  • \(p\)-values assess (in)consistency with specific models (i.e., hypotheses)

  • confidence intervals give a measure of uncertainty

  • A sample mean scaled by (its sample SD over \(\sqrt{n}\)) is approximately \(t\)-distributed,

  • which means that sample means are typically within a few multiples of \(\sigma/\sqrt{n}\) of the true mean.

  • A permutation test gives a way of testing hypotheses with fewer assumptions.
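The permutation test named in the last point can be sketched as follows, on made-up data. The idea: if group labels don't matter (the null hypothesis), shuffling them should give differences in means like the one observed.

```r
set.seed(3)
# Made-up data: two groups whose true means differ by 2.
x <- rnorm(30, mean = 2)
y <- rnorm(30, mean = 0)
observed <- mean(x) - mean(y)

# Under the null hypothesis the labels are exchangeable, so shuffle the
# pooled values and recompute the difference in group means.
pooled <- c(x, y)
perm_diffs <- replicate(5000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:30]) - mean(shuffled[31:60])
})

# Two-sided p-value: fraction of shuffles at least as extreme as observed.
p_perm <- mean(abs(perm_diffs) >= abs(observed))
print(c(observed = observed, p = p_perm))
```

No \(t\) distribution is needed here: the reference distribution comes from the data themselves.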