Uncertainty: (how to) deal with it

Peter Ralph

1 October – Advanced Biological Statistics

Course overview

Steps in data analysis

Care, or at least think, about the data.
Look at the data.
Query the data.
Sanity check.
Communicate.

Often “statistics” focuses on querying. Doing that effectively requires all the other steps, too.

Overview and mechanics

See the course website.

Some core statistical concepts

Statistics or parameters?

A statistic is: a numerical description of a dataset.

A parameter is: a numerical attribute of a model of reality.

Often, statistics are used to estimate parameters.

The two heads of classical statistics

estimating parameters, with uncertainty (confidence intervals)

evaluating (in-)consistency with a particular situation ($p$-values)

What do these data tell us about the world?
How strongly do we believe it?

This week: digging in, with simple examples.

A quick look at some data

Some data

AirBnB hosts in Portland, OR: website and download link.

Questions: how much does an AirBnB typically cost in Portland? Do “instant bookable” ones cost more?

airbnb <- read.csv("../Datasets/portland-airbnb-listings.csv")
nrow(airbnb)

## [1] 5634

names(airbnb)

##   [1] "id"                                          
##   [2] "listing_url"                                 
##   [3] "scrape_id"                                   
##   [4] "last_scraped"                                
##   [5] "name"                                        
##   [6] "summary"                                     
##   [7] "space"                                       
##   [8] "description"                                 
##   [9] "experiences_offered"                         
##  [10] "neighborhood_overview"                       
##  [11] "notes"                                       
##  [12] "transit"                                     
##  [13] "access"                                      
##  [14] "interaction"                                 
##  [15] "house_rules"                                 
##  [16] "thumbnail_url"                               
##  [17] "medium_url"                                  
##  [18] "picture_url"                                 
##  [19] "xl_picture_url"                              
##  [20] "host_id"                                     
##  [21] "host_url"                                    
##  [22] "host_name"                                   
##  [23] "host_since"                                  
##  [24] "host_location"                               
##  [25] "host_about"                                  
##  [26] "host_response_time"                          
##  [27] "host_response_rate"                          
##  [28] "host_acceptance_rate"                        
##  [29] "host_is_superhost"                           
##  [30] "host_thumbnail_url"                          
##  [31] "host_picture_url"                            
##  [32] "host_neighbourhood"                          
##  [33] "host_listings_count"                         
##  [34] "host_total_listings_count"                   
##  [35] "host_verifications"                          
##  [36] "host_has_profile_pic"                        
##  [37] "host_identity_verified"                      
##  [38] "street"                                      
##  [39] "neighbourhood"                               
##  [40] "neighbourhood_cleansed"                      
##  [41] "neighbourhood_group_cleansed"                
##  [42] "city"                                        
##  [43] "state"                                       
##  [44] "zipcode"                                     
##  [45] "market"                                      
##  [46] "smart_location"                              
##  [47] "country_code"                                
##  [48] "country"                                     
##  [49] "latitude"                                    
##  [50] "longitude"                                   
##  [51] "is_location_exact"                           
##  [52] "property_type"                               
##  [53] "room_type"                                   
##  [54] "accommodates"                                
##  [55] "bathrooms"                                   
##  [56] "bedrooms"                                    
##  [57] "beds"                                        
##  [58] "bed_type"                                    
##  [59] "amenities"                                   
##  [60] "square_feet"                                 
##  [61] "price"                                       
##  [62] "weekly_price"                                
##  [63] "monthly_price"                               
##  [64] "security_deposit"                            
##  [65] "cleaning_fee"                                
##  [66] "guests_included"                             
##  [67] "extra_people"                                
##  [68] "minimum_nights"                              
##  [69] "maximum_nights"                              
##  [70] "minimum_minimum_nights"                      
##  [71] "maximum_minimum_nights"                      
##  [72] "minimum_maximum_nights"                      
##  [73] "maximum_maximum_nights"                      
##  [74] "minimum_nights_avg_ntm"                      
##  [75] "maximum_nights_avg_ntm"                      
##  [76] "calendar_updated"                            
##  [77] "has_availability"                            
##  [78] "availability_30"                             
##  [79] "availability_60"                             
##  [80] "availability_90"                             
##  [81] "availability_365"                            
##  [82] "calendar_last_scraped"                       
##  [83] "number_of_reviews"                           
##  [84] "number_of_reviews_ltm"                       
##  [85] "first_review"                                
##  [86] "last_review"                                 
##  [87] "review_scores_rating"                        
##  [88] "review_scores_accuracy"                      
##  [89] "review_scores_cleanliness"                   
##  [90] "review_scores_checkin"                       
##  [91] "review_scores_communication"                 
##  [92] "review_scores_location"                      
##  [93] "review_scores_value"                         
##  [94] "requires_license"                            
##  [95] "license"                                     
##  [96] "jurisdiction_names"                          
##  [97] "instant_bookable"                            
##  [98] "is_business_travel_ready"                    
##  [99] "cancellation_policy"                         
## [100] "require_guest_profile_picture"               
## [101] "require_guest_phone_verification"            
## [102] "calculated_host_listings_count"              
## [103] "calculated_host_listings_count_entire_homes" 
## [104] "calculated_host_listings_count_private_rooms"
## [105] "calculated_host_listings_count_shared_rooms" 
## [106] "reviews_per_month"

Second, look at the data

summary(airbnb$price)

##    $75.00   $100.00    $80.00   $125.00    $65.00    $95.00   $150.00 
##       256       240       177       174       168       166       162 
##    $85.00    $99.00    $60.00    $90.00   $120.00    $50.00    $70.00 
##       159       145       144       134       129       129       120 
##    $45.00    $55.00   $110.00   $200.00   $199.00    $89.00    $79.00 
##       105        99        86        74        67        67        66 
##   $115.00   $300.00    $40.00   $175.00    $59.00    $35.00   $135.00 
##        65        65        64        63        60        59        56 
##    $69.00   $250.00   $180.00   $130.00   $149.00   $105.00   $119.00 
##        56        51        50        49        48        46        43 
##   $140.00    $49.00    $68.00   $109.00   $145.00   $225.00    $72.00 
##        40        40        37        34        34        33        33 
##   $160.00    $78.00   $215.00    $82.00    $88.00    $98.00   $129.00 
##        31        31        30        30        30        29        28 
##    $30.00   $195.00    $74.00   $350.00    $73.00   $165.00    $36.00 
##        28        26        26        25        23        21        21 
##    $48.00    $71.00   $159.00   $169.00    $39.00    $58.00    $87.00 
##        21        21        20        20        20        20        20 
##   $400.00    $52.00    $64.00    $83.00   $999.00    $42.00    $54.00 
##        19        19        19        19        19        18        18 
##    $92.00 $1,023.00   $108.00   $275.00    $33.00    $43.00   $155.00 
##        18        17        17        17        17        17        16 
##   $179.00   $209.00    $44.00    $47.00    $77.00    $94.00    $97.00 
##        16        16        16        16        16        16        16 
##   $185.00    $38.00    $62.00    $67.00    $41.00    $84.00    $37.00 
##        15        15        15        15        14        14        13 
##   $375.00    $57.00   $128.00    $32.00   $139.00   $325.00    $34.00 
##        13        13        12        12        11        11        11 
##    $91.00   (Other) 
##        11       663

summary(airbnb$instant_bookable)

##    f    t 
## 2960 2674

Whoops

airbnb$price <- as.numeric(gsub("$", "", airbnb$price, fixed=TRUE))

## Warning: NAs introduced by coercion

airbnb$instant_bookable <- (airbnb$instant_bookable == "t")

summary(airbnb$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    69.0    95.0   119.5   136.0   999.0      32

summary(airbnb$instant_bookable)

##    Mode   FALSE    TRUE 
## logical    2960    2674

table(airbnb$bed_type) # hm

## 
##        Airbed         Couch         Futon Pull-out Sofa      Real Bed 
##             9             7            42            21          5555

How much is a typical night?

mean(airbnb$price, na.rm=TRUE)

## [1] 119.5396

hist(airbnb$price, breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='instant bookable')

plot of chunk airbnb_hist

Conclusion?

Do “instant bookable” charge more?

layout(1:2)
hist(airbnb$price[!airbnb$instant_bookable], breaks=40, xlab='nightly price ($)', col=grey(.8), xlim=range(airbnb$price, finite=TRUE), main='not instant bookable') 
hist(airbnb$price[airbnb$instant_bookable], breaks=40, xlab='nightly price ($)', col=grey(.8), main='instant bookable')

plot of chunk airbnb_hist2

instant <- airbnb$price[airbnb$instant_bookable]
not_instant <- airbnb$price[!airbnb$instant_bookable]
(tt <- t.test(instant, not_instant))

## 
##  Welch Two Sample t-test
## 
## data:  instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.475555 14.872518
## sample estimates:
## mean of x mean of y 
##  124.6409  114.9668

Conclusion

Instant bookable hosts cost more than others (P=0.00027, t-test with df=5039.7695486).

Critique this conclusion.

Don’t forget Steps 1 and 5!

Care, or at least think, about the data.
Communicate.

How big is the difference? How sure are we?

Statistical significance does not imply real-world significance.

Revised conclusion

The mean nightly price of Portland AirBnB hosts is $ 120, with a standard deviation of $98 and a range of $ 0 to $ 999. “Instant bookable” hosts charged on average $9.7 more than others, a difference that is statistically significant (P=0.00027, t-test with df=5039.7695486).

Critiques?

So: what did we just do?

Hypothesis testing and $p$-values

A $p$-value is

the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.

Usually, this means

a result - numerical value of a statistic
surprising - big
null hypothesis - the model we use to calculate the $p$-value

which can all be defined to suit the situation.

What does a small $p$-value mean?

If the null hypothesis were true, then you’d be really unlikely to see something like what you actually did.

So, either the “null hypothesis” is not a good description of reality or something surprising happened.

How useful this is depends on the null hypothesis.

For instance

## 
##  Welch Two Sample t-test
## 
## data:  instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.475555 14.872518
## sample estimates:
## mean of x mean of y 
##  124.6409  114.9668

Also for instance

t.test(airbnb$price)

## 
##  One Sample t-test
## 
## data:  airbnb$price
## t = 91.32, df = 5601, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  116.9734 122.1058
## sample estimates:
## mean of x 
##  119.5396

Is that $p$-value useful?

Exercise:

(class survey) How many people have a longer index finger on the hand they write with?

We want to know \[\begin{equation} \theta = \P(\text{random person has writing finger longer}) . \end{equation}\]

Everyone make a fake dataset with $\theta = 1/2$, e.g.:

n <- 35 # class size
sum(rbinom(n, 1, 1/2) > 0)

Now we can estimate the $p$-value for the hypothesis that $\theta = 1/2$. Conclusions?

replicate(1000, sum(rbinom(n, 1, 1/2) > 0))

So, where do $p$-values come from?

Either math:

Or, computers. (maybe math, maybe simulation, maybe both)

Stochastic minute: the $t$ distribution

The $t$ statistic

The $t$ statistic computed from a collection of $n$ numbers is the sample mean divided by the estimated standard error of the mean, which is the sample SD divided by $\sqrt{n}$.

If $x_1, \ldots, x_n$ are numbers, then \[\begin{aligned} \text{(sample mean)} \qquad \bar x &= \frac{1}{n}\sum_{i=1}^n x_i \\ \text{(sample SD)} \qquad s &= \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2} \end{aligned}\] so \[\begin{equation} t(x) = \frac{\bar x}{s / \sqrt{n}} . \end{equation}\]

Sanity check

n <- 20
x <- rnorm(n)
c(t.test(x)$statistic, 
  mean(x) / (sd(x) / sqrt(n)))

##        t          
## 1.318919 1.318919

The $t$ approximation

Fact: If $X_1, \ldots, X_n$ are independent random samples from a distribution with mean $\mu$, then \[\begin{equation} t(X - \mu) = \frac{\bar x - \mu}{s/\sqrt{n}} \approx \StudentsT(n-2) , \end{equation}\] as long as $n$ is not too small and the distribution isn’t too wierd.

A demonstration

Let’s check this, by doing:

find the sample $t$ score of 100 random draws from some distribution

lots of times, and looking at the distribution of those $t$ scores.

Claim: no matter the distribution we sample from, it should look close to $t$.

One sample

n <- 20
x <- 2 * runif(n) - 1
hist(x, xlab='value', col=grey(0.5),
     main=sprintf("t=%f", mean(x)*sqrt(n-1)/sd(x)))
abline(v=0, lwd=2, lty=3)
abline(v=mean(x), col='red', lwd=2)

plot of chunk t_one_smaple

More samples

plot of chunk t_more_samples

Distribution of 1,000 sample $t$ scores

xm <- replicate(1000, {
            x <- 2 * runif(n) - 1;
            mean(x) * sqrt(n-1) / sd(x) })
xh <- hist(xm, breaks=40, main=sprintf('t of %d samples', n), col='red')

plot of chunk t_sampling_dist

Distribution of 1,000 sample $t$ scores

plot(xh, main=sprintf('t of %d samples', n), col='red')
xx <- xh$breaks
polygon(c(xx[-1] - diff(xx)/2, xx[1]),
        c(length(xm)* diff(pt(xx, df=(n-1))), 0),
        col=adjustcolor("blue", 0.4))

plot of chunk t_smpling_dist2

Exercise:

Do this again (use my code) except using

x <- rexp(n) - 1

instead of 2 * runif(n) - 1.

Confident in confidence intervals?

tt

## 
##  Welch Two Sample t-test
## 
## data:  instant and not_instant
## t = 3.6482, df = 5039.8, p-value = 0.0002667
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.475555 14.872518
## sample estimates:
## mean of x mean of y 
##  124.6409  114.9668

Confidence intervals

A 95% confidence interval for an estimate is constructed so that no matter what the true value, the confidence interval overlaps the truth 95% of the time.

In other words, if we collect 1,000 independent samples from a population with true mean $\mu$, and construct confidence intervals for the mean from each, then about 950 of these should overlap $\mu$.

How’s that work?

plot of chunk plot_t

Check this.

if we collect 1,000 independent samples from a population with true mean $\mu$, and construct confidence intervals from each, then about 950 of these should overlap $\mu$.

Let’s take independent samples of size $n=20$ from a Normal distribution with $\mu = 0$. Example:

n <- 20; mu <- 0
t.test(rnorm(n, mean=mu))$conf.int

## [1] -0.5259054  0.3753652
## attr(,"conf.level")
## [1] 0.95

tci <- replicate(300, t.test(rnorm(n, mean=mu))$conf.int)
mean(tci[1,] > 0 | tci[2,] < 0)

## [1] 0.05666667

plot of chunk many_conf_int_plot

Sensitivity analysis

Group exercise

How does the margin of error change with sample size? By taking random samples from the price column of the airbnb data, make two plots:

Probability that a sample of size n of Portland AirBnB rooms has a sample mean within $10 of the (true) mean price of all rooms, as a function of n.
Expected difference between the mean price of a random sample of n Portland AirBnB rooms and the (true) mean price of all rooms, as a function of n.

In class

x <- subset(airbnb, !is.na(price))$price # to save typing
nvals <- 10 * (2:20)
props <- rep(NA, length(nvals))
for (k in 1:length(nvals)) {
    many_means <- replicate(1e4, mean(sample(x, nvals[k])))
    props[k] <- mean(abs(many_means - mean(x)) < 10)
}
plot(nvals, 1-props,
     ylab='proportion of sample means not within $10',
     xlab='sample size',
     ylim=c(0,1), col='plum', pch=20, cex=2)

plot of chunk sampling

Stochastic minute: the Central Limit Theorem and the Normal distribution

The CLT

The Central Limit Theorem says, roughly, that net effect of the sum of a bunch of small, independent random things can be well-approximated by a Gaussian distribution, almost regardless of the details.

For instance: say $X_1, X_2, \ldots, X_n$ are independent, random draws with mean $\mu$ and standard deviation $\sigma$.

Then, the difference between the “true” mean, $\mu$, and the sample mean is Gaussian, \[\begin{aligned} \bar x = \frac{1}{n}\sum_{i=1}^n X_i \approx \Normal\left(\mu, \frac{\sigma}{\sqrt{n}}\right) . \end{aligned}\]

The Gaussian distribution

Also called the Normal distribution: see previous slide.

Saying that a random number $Z$ “is Normal”: \[\begin{equation} Z \sim \Normal(\mu, \sigma) \end{equation}\] means that \[\begin{equation} \P\left\{Z \ge \frac{x - \mu}{\sigma}\right\} = \int_x^\infty \frac{1}{\sqrt{2 \pi}} e^{-u^2/2} du . \end{equation}\]

What to remember:

$Z$ is probably no more than a few times $\sigma$ away from $\mu$
Using R,

rnorm(10, mean=3, sd=2)    # random simulations
pnorm(5, mean=3, sd=2)     # probabilities
qnorm(0.975, mean=3, sd=2) # quantiles

A demonstration

Let’s check this, by doing:

find the sample mean of 100 random draws from some distribution

lots of times, and looking at the distribution of those sample means.

Claim: no matter the distribution we sample from, it should look close to Normal.

One sample

n <- 100
x <- runif(n)
hist(x, xlab='value', main='sample', col=grey(0.5))
abline(v=mean(x), col='red', lwd=2)

plot of chunk one_smaple

More samples

plot of chunk more_samples

Distribution of 1,000 sample means

xm <- replicate(1000, mean(runif(n)))
xh <- hist(xm, breaks=40, main=sprintf('mean of %d samples', n), col='red')

plot of chunk smpling_dist

Distribution of 1,000 sample means

plot(xh, main=sprintf('mean of %d samples', n), col='red')
xx <- xh$breaks
polygon(c(xx[-1] - diff(xx)/2, xx[1]),
        c(length(xm)* diff(pnorm(xx, mean=0.5, sd=1/sqrt(n*12))), 0),
        col=adjustcolor("blue", 0.4))

plot of chunk smpling_dist2

Relationship to the $t$ distribution

If $Y$ and $Z_1, \ldots, Z_n$ are independent $\Normal(0, \sigma)$, and \[\begin{equation} X = \frac{Y}{ \sqrt{\frac{1}{n}\sum_{j=1}^n Z_j^2} } \end{equation}\] then \[\begin{equation} X \sim \StudentsT(n) . \end{equation}\]

More usefully, a sample mean divided by its standard error is$^*$ $t$ distributed.

This is thanks to the Central Limit Theorem. ($^*$ usually, approximately)

Recap

statistics describe data and estimate parameters.
$p$-values assess (in)consistency with specific models (ie, hypotheses)
confidence intervals give a measure of uncertainty
A sample mean scaled by (its sample SD over $\sqrt{n}$) is approximately $t$-distributed,
which means that sample means are typically a few multiples of $\sigma/\sqrt{n}$ away from the true mean.
A permutation test gives a way of testing hypotheses with fewer assumptions.

Uncertainty: (how to) deal with it

Course overview

Steps in data analysis

Overview and mechanics

Some core statistical concepts

Statistics or parameters?

The two heads of classical statistics

A quick look at some data

Some data

Second, look at the data

Whoops

How much is a typical night?

Do “instant bookable” charge more?

Conclusion

Don’t forget Steps 1 and 5!

Revised conclusion

Hypothesis testing and \(p\)-values

A \(p\)-value is

What does a small \(p\)-value mean?

For instance

Also for instance

Exercise:

So, where do \(p\)-values come from?

Stochastic minute: the \(t\) distribution

The \(t\) statistic

Sanity check

The \(t\) approximation

A demonstration

One sample

More samples

Distribution of 1,000 sample \(t\) scores

Distribution of 1,000 sample \(t\) scores

Exercise:

Confident in confidence intervals?

Confidence intervals

How’s that work?

Check this.

Sensitivity analysis

Group exercise

In class

Stochastic minute: the Central Limit Theorem and the Normal distribution

The CLT

The Gaussian distribution

A demonstration

One sample

More samples

Distribution of 1,000 sample means

Distribution of 1,000 sample means

Relationship to the \(t\) distribution

Recap