Peter Ralph
15 October – Advanced Biological Statistics
From a single set of numbers
\[ x_1, x_2, \ldots, x_n \]
we can get both a mean:
\[ \bar x = \frac{1}{n} \sum_{i=1}^n x_i \]
and an estimate of the variability of the mean, the standard error:
\[ \frac{s}{\sqrt{n}} = \sqrt{\frac{1}{n(n-1)}\sum_{i=1}^n \left(x_i - \bar x\right)^2 } .\]
This is amazing!
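In R, both quantities take one line each; here is a quick sketch (the data vector `x` below is made up for illustration):

```r
# hypothetical sample of n = 10 observations
x <- c(4.1, 5.2, 3.8, 6.0, 4.9, 5.5, 4.4, 5.1, 3.9, 5.8)

xbar <- mean(x)                 # the sample mean
se <- sd(x) / sqrt(length(x))   # the standard error of the mean
```

Note that `sd(x)` already divides by n - 1, so `sd(x) / sqrt(n)` matches the formula above.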
Sadly, it’s not so easy to simply compute a standard error for most estimates.
What to do?
Idea:
We’d like to get a whole new dataset, and repeat the estimation, to see how different the answer is.
And, well, our best guess at what the data look like is our dataset itself,
sooooo, let’s just resample from the dataset, with replacement, to make a “new” dataset!
If we resample and re-estimate lots of times, this should give us a good idea of the variability of the estimate.
To estimate the uncertainty of an estimate:

1. Use the computer to take a random sample of observations from the original data, with replacement.
2. Calculate the estimate from the resampled data set.
3. Repeat 1-2 many times.

The standard deviation of these estimates is the bootstrap standard error.
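The steps above can be sketched in a few lines of R; here the dataset and the choice of statistic (the median) are hypothetical:

```r
set.seed(123)
x <- rexp(50)   # hypothetical dataset of 50 observations

boot_medians <- replicate(1000, {
    # step 1: resample the data, with replacement
    xstar <- sample(x, size = length(x), replace = TRUE)
    # step 2: recompute the estimate on the resampled data
    median(xstar)
})
# steps 1-2 were repeated 1000 times by replicate();
# the bootstrap standard error is the SD of the re-estimates:
boot_se <- sd(boot_medians)
```

The same skeleton works for any statistic: just swap `median()` for whatever estimator you are using.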
The bootstrap:

- Applies to almost any statistic.
- Works when there's no simple formula for the standard error (e.g., median, trimmed mean, eigenvalue, etc.).
- Is nonparametric, so doesn't make specific assumptions about the distribution of the data.
- Applies even to complicated sampling procedures.
Use R to make 1000 "pseudo-samples" of size 10 (with replacement), and store the mean of each in a vector. Plot the histogram of the resampled means, and calculate their standard deviation (with sd()). How does this compare to the usual standard error of the mean, sd(x) / sqrt(length(x))?
The 2.5% and 97.5% percentiles of the bootstrap samples estimate a 95% confidence interval (use the quantile() function).

Exercise: get a 95% CI and compare it to that given by t.test().
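A sketch of that comparison, again with a made-up `x`:

```r
set.seed(42)
x <- rnorm(10, mean = 5)   # hypothetical data

boot_means <- replicate(1000,
    mean(sample(x, replace = TRUE)))

# bootstrap percentile 95% CI: the 2.5% and 97.5% quantiles
quantile(boot_means, c(0.025, 0.975))

# t-based 95% CI, for comparison
t.test(x)$conf.int
```

The two intervals won't match exactly, but for a well-behaved statistic like the mean they should be similar.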