Peter Ralph
Advanced Biological Statistics
If we want a point estimate:
These all convey “where the posterior distribution is”, more or less.
What about uncertainty?
Definition: A 95% credible region is a portion of parameter space having a total of 95% of the posterior probability.
(same with other numbers for “95%”)
If we construct a 95% credible interval for \(\theta\) for each of many datasets, and the coin in each dataset has \(\theta\) drawn independently from the prior, then the true \(\theta\) will fall in its credible interval for 95% of the datasets.
If we construct a 95% credible interval for \(\theta\) with a dataset, and the distribution of the coin’s true \(\theta\) across many parallel universes is given by the prior, then the true \(\theta\) will be in the credible interval in 95% of those universes.
Given my prior beliefs (prior distribution), the posterior distribution is the most rational\({}^*\) way to update my beliefs to account for the data.
\({}^*\) if you do this many times you will be wrong least often
\({}^*\) or you will be wrong in the fewest possible universes
Definition: The “95% highest density interval” (HDI) is the 95% credible interval whose posterior density is everywhere higher inside the interval than outside; equivalently, it is the shortest 95% credible interval.
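For a Beta posterior, a sketch of finding the 95% HDI numerically, using the fact that the HDI is the shortest 95% interval (the \(\Beta(7,3)\) posterior here is just an illustrative example):

```r
# 95% HDI for a Beta(7,3) posterior: the shortest interval holding 95% probability
a <- 7; b <- 3
# width of the interval whose lower endpoint cuts off probability p
width <- function(p) qbeta(p + 0.95, a, b) - qbeta(p, a, b)
p_lo <- optimize(width, c(0, 0.05))$minimum
c(qbeta(p_lo, a, b), qbeta(p_lo + 0.95, a, b))  # the HDI endpoints
```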
Suppose now we have data from \(n\) different coins from the same source. We don’t assume they all have the same \(\theta\), but we don’t know the distribution of \(\theta\) either, so we try to learn it.
\[\begin{aligned} &\left. \begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \end{aligned} \right\} \qquad \text{for } 1 \le i \le n \\ &\alpha \sim \Unif(0, 100) \\ &\beta \sim \Unif(0, 100) \end{aligned}\]
Goal: find the posterior distribution of \(\alpha\), \(\beta\).
What is different between:
Pick a single value of \(\theta\) at random from \(\Beta(3,1)\), and flip one thousand \(\theta\)-coins, 500 times each.
Pick one thousand random values \(\theta_i \sim \Beta(3,1)\), and flip one thousand coins, one for each \(\theta_i\), 500 times each.
For instance: if you were given datasets from one of these situations, how would you tell which situation it was generated from?
ncoins <- 1000   # number of coins
nflips <- 500    # flips per coin, matching the setup above
# (a) one shared theta for all coins
theta1 <- rbeta(1, 3, 1)
binom_Z <- rbinom(ncoins, size=nflips, prob=theta1)
# (b) a separate theta_i for each coin
theta2 <- rbeta(ncoins, 3, 1)
bb_Z <- rbinom(ncoins, size=nflips, prob=theta2)
hist(binom_Z, breaks=30, col=adjustcolor("blue", 0.5), main='', xlim=c(0,nflips), freq=FALSE, xlab='number of Heads')
hist(bb_Z, breaks=30, col=adjustcolor("red", 0.5), add=TRUE, freq=FALSE)
legend("topleft", fill=adjustcolor(c('blue', 'red'), 0.5), legend=c('one theta', 'many thetas'))
Problem: Find the posterior distribution of \(\alpha\) and \(\beta\), given some data \((Z_1, N_1), \ldots, (Z_k, N_k)\), under the model: \[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \\ \alpha &\sim \Unif(0, 100) \\ \beta &\sim \Unif(0, 100) . \end{aligned}\]
Problem: we don’t have a nice mathematical expression for the posterior distribution.
data {
// stuff you input
}
transformed data {
// stuff that's calculated from the data (just once, at the start)
}
parameters {
// stuff you want to learn about
}
transformed parameters {
// stuff that's calculated from the parameters (at every step)
}
model {
// the action!
}
generated quantities {
// stuff you want computed also along the way
}
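As a sketch, the beta-binomial model above might fill in this skeleton like so (a hypothetical, untested translation, using recent Stan array syntax):

```stan
data {
  int<lower=0> n;            // number of coins
  array[n] int<lower=0> N;   // flips per coin
  array[n] int<lower=0> Z;   // heads per coin
}
parameters {
  real<lower=0, upper=100> alpha;
  real<lower=0, upper=100> beta;
  vector<lower=0, upper=1>[n] theta;   // per-coin heads probabilities
}
model {
  theta ~ beta(alpha, beta);   // prior on each coin's theta
  Z ~ binomial(N, theta);      // likelihood
  // the Uniform(0, 100) priors on alpha and beta are implied by their bounds
}
```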
Goal: given the data, “find”/ask questions of the posterior distribution on \(\theta\),
\[\begin{aligned} p(\theta \given D) = \frac{ p(D \given \theta) p(\theta) }{ p(D) } . \end{aligned}\]
Problem: usually we can’t write down an expression for this (because of the “\(p(D)\)”).
Solution: we’ll make up a way to draw random samples from it.
Toy example:
If we didn’t know about the pbeta( ) function, how could we find the probability that \(\theta < 0.5\)?
(math:)
(Monte Carlo:)
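A sketch of the Monte Carlo approach, assuming for concreteness that \(\theta \sim \Beta(3,1)\):

```r
# estimate P(theta < 0.5) for theta ~ Beta(3,1) by simulation
samples <- rbeta(1e6, 3, 1)
mean(samples < 0.5)   # Monte Carlo estimate
pbeta(0.5, 3, 1)      # exact answer, for comparison: 0.5^3 = 0.125
```

The Monte Carlo estimate is just the fraction of simulated draws falling below 0.5; it converges to the exact value as the number of samples grows.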
i.e., “random-walk-based stochastic integration”
Produces a random sequence of samples \(\theta_1, \theta_2, \ldots, \theta_N\).
At each step, starting at \(\theta_k\):
Propose a new location (nearby?): \(\theta_k'\)
Decide whether to accept it.
Set \(k \leftarrow k+1\); if \(k=N\) then stop.
The magic comes from doing proposals and acceptance so that the \(\theta\)’s are samples from the distribution we want.
Rules are chosen so that \(p(\theta \given D)\) is the stationary distribution (long-run average!) of the random walk (the “Markov chain”).
The chain must mix fast enough so the distribution of visited states converges to \(p(\theta \given D)\).
Because of autocorrelation, \((\theta_1, \theta_2, \ldots, \theta_N)\) are not \(N\) independent samples: they are roughly equivalent to \(N_\text{eff} < N\) independent samples.
For better mixing, acceptance probabilities should not be too high or too low.
Starting several chains far apart can help diagnose failure to mix: Gelman’s \(\hat R\) statistic quantifies how different they are.
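A sketch of how \(\hat R\) compares within-chain to between-chain variability (the toy chains here are just independent noise, so \(\hat R\) should be close to 1):

```r
# Gelman-Rubin R-hat from m chains of length n (toy, well-mixed chains)
chains <- replicate(4, rnorm(1000))    # one column per chain
n <- nrow(chains)
W <- mean(apply(chains, 2, var))       # mean within-chain variance
B <- n * var(colMeans(chains))         # between-chain variance
sqrt(((n - 1)/n * W + B/n) / W)        # R-hat: near 1 if the chains agree
```

Values of \(\hat R\) much above 1 suggest the chains have not yet converged to the same distribution.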
In each group, have one walker directed by others:
Start somewhere at “random”, then repeat:
Pick a random \(\{N,S,E,W\}\).
Take a step in that direction,
If online: take a screenshot of your path after:
Imagine the heatmap of how much time your “walker” has spent in each place in the library.
Question: What distribution does this sample from?
Now:
Pick a random \(\{N,S,E,W\}\).
Take a \(1+\Poisson(5)\) number of steps in that direction,
If online: again, take a screenshot of your path after:
Does it mix faster?
Would \(1 + \Poisson(50)\) steps be better?
Imagine the walkers are on a hill, and:
Pick a random \(\{N,S,E,W\}\).
If that step is uphill, take it; if downhill, take it only with some probability \(p\) (otherwise, stay put).
What would this do?
Thanks to Metropolis-Hastings, if “elevation” is \(p(\theta \given D)\), then setting \(p = p(\theta' \given D) / p(\theta \given D)\) makes the stationary distribution \(p(\theta \given D)\).
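A minimal sketch of this Metropolis rule in R, targeting a \(\Beta(3,1)\) density as a stand-in for \(p(\theta \given D)\) (all names and tuning choices here are illustrative):

```r
# random-walk Metropolis sampler targeting Beta(3,1)
target <- function(theta) dbeta(theta, 3, 1)   # "elevation"
N <- 10000
theta <- numeric(N)
theta[1] <- 0.5
for (k in 1:(N - 1)) {
    prop <- theta[k] + rnorm(1, 0, 0.1)        # propose a nearby location
    # accept with probability min(1, target(prop) / target(theta[k]));
    # proposals outside [0, 1] have density 0, so they are always rejected
    if (runif(1) < target(prop) / target(theta[k])) {
        theta[k + 1] <- prop                   # accept: move there
    } else {
        theta[k + 1] <- theta[k]               # reject: stay put
    }
}
mean(theta)   # should be near the Beta(3,1) mean, 3/4
```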
library(brms)
df <- data.frame(
Z = Z,
N = N
)
brmfit <- brm(
    Z | trials(N) ~ 1,   # number of heads out of N flips; intercept-only model
    data=df,
    family="binomial"
)
## Compiling Stan program...
## Start sampling
##
## SAMPLING FOR MODEL '782790eabbd10296f779d85fa03cf1e5' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.00012 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.2 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.366489 seconds (Warm-up)
## Chain 1: 0.333338 seconds (Sampling)
## Chain 1: 0.699827 seconds (Total)
## Chain 1:
##
## (chains 2-4 produce nearly identical output)
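Once sampling finishes, the posterior can be inspected with standard brms tools, for instance:

```r
summary(brmfit)        # posterior summaries, Rhat, effective sample sizes
# as_draws_df(brmfit)  # the raw posterior draws, one row per sample
```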