\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \newcommand{\given}{\;\vert\;} \]
Peter Ralph
20 November 2018 – Advanced Biological Statistics
Hierarchical coins
Introduction to MCMC with Stan
Sharing power, and shrinkage
Baseball
Suppose now we have data from \(n\) different coins from the same source. We don’t assume the coins all have the same \(\theta\); instead, each coin has its own \(\theta_i\), drawn from a common distribution. We don’t know what that distribution is, so we try to learn it.
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \\ \alpha &\sim \Unif(0, 100) \\ \beta &\sim \Unif(0, 100) \end{aligned}\]
Goal: find the posterior distribution of \(\alpha\), \(\beta\).
Problem: we don’t have a nice mathematical expression for this posterior distribution.
Goal: given data \(D\), a likelihood \(p(D \given \theta)\), and a prior \(p(\theta)\), “find” (that is, ask questions of) the posterior distribution on \(\theta\),
\[\begin{aligned} p(\theta \given D) = \frac{ p(D \given \theta) p(\theta) }{ p(D) } . \end{aligned}\]
Problem: usually we can’t write down an expression for this (because of the “\(p(D)\)”).
Solution: we’ll make up a way to draw random samples from it.
Toy example:
(from beta-binomial coin example)
Do we think that \(\theta < 0.5\)?
(before:)
(now:)
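A minimal sketch in R of the two ways to answer this question: with the exact Beta distribution function (presumably the "before" approach), and using nothing but random draws from the posterior (the "now" approach). The shape parameters 26 and 24 are an assumption for illustration: they are what a \(\Beta(20,20)\) prior and 6 heads in 10 flips would give, and the earlier coin example may have used different values.

```r
# Assumed posterior for illustration: theta ~ Beta(26, 24).
shape1 <- 26
shape2 <- 24

# Exact answer, from the Beta cumulative distribution function.
exact_prob <- pbeta(0.5, shape1, shape2)

# Sampling answer: draw from the posterior and count how often theta < 0.5.
set.seed(1)
theta_draws <- rbeta(1e5, shape1, shape2)
sampled_prob <- mean(theta_draws < 0.5)

print(c(exact = exact_prob, sampled = sampled_prob))
```

The two numbers agree up to Monte Carlo error, which is the point: any question about the posterior can be answered from samples alone.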
Markov chain Monte Carlo (MCMC), i.e., “random-walk-based stochastic integration”
Example: Gibbs sampling for uniform distribution on a region. (picture)
Produces a random sequence of samples \(\theta_1, \theta_2, \ldots, \theta_N\).
At each step, starting at \(\theta_k\):
Propose a new location (nearby?): \(\theta_k'\)
Decide whether to accept it.
Set \(k \leftarrow k+1\); if \(k=N\) then stop.
The magic comes from doing proposals and acceptance so that the \(\theta\)’s are samples from the distribution we want.
Rules are chosen so that \(p(\theta \given D)\) is the stationary distribution (long-run average!) of the random walk (the “Markov chain”).
The chain must mix fast enough so the distribution of visited states converges to \(p(\theta \given D)\).
Because of autocorrelation, \((\theta_1, \theta_2, \ldots, \theta_N)\) are not \(N\) independent samples: they are roughly equivalent to \(N_\text{eff} < N\) independent samples.
For better mixing, acceptance probabilities should not be too high or too low.
Starting several chains far apart can help diagnose failure to mix: the Gelman-Rubin statistic \(\hat R\) (“Rhat” in Stan’s output) quantifies how different they are.
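The two diagnostics above can be sketched in a few lines of R. This is a simplified illustration, not what Stan computes: the function `gelman_rhat` is a hypothetical helper implementing the basic (non-split) Gelman-Rubin formula, and the effective sample size uses the AR(1) approximation \(N_\text{eff} \approx N (1-\rho)/(1+\rho)\), where \(\rho\) is the lag-one autocorrelation. The simulated "chains" are stand-ins, not MCMC output.

```r
# Basic (non-split) Gelman-Rubin statistic for a matrix of draws:
# one column per chain, one row per iteration.
gelman_rhat <- function(draws) {
  n <- nrow(draws)
  within_var <- mean(apply(draws, 2, var))   # W: average within-chain variance
  between_var <- n * var(colMeans(draws))    # B: n times variance of chain means
  pooled_var <- (n - 1) / n * within_var + between_var / n
  sqrt(pooled_var / within_var)
}

set.seed(1)
n_iter <- 1000
# Three chains that all sample the same Normal(0, 1) target.
mixed <- matrix(rnorm(3 * n_iter), ncol = 3)
# Three chains stuck near different locations (centered at -3, 0, and 3).
stuck <- mixed + matrix(rep(c(-3, 0, 3), each = n_iter), ncol = 3)

rhat_mixed <- gelman_rhat(mixed)
rhat_stuck <- gelman_rhat(stuck)
print(c(mixed = rhat_mixed, stuck = rhat_stuck))

# Effective sample size of an autocorrelated chain (AR(1) approximation).
rho <- 0.9
ar_chain <- as.numeric(arima.sim(model = list(ar = rho), n = n_iter))
rho_hat <- acf(ar_chain, lag.max = 1, plot = FALSE)$acf[2]
n_eff_approx <- n_iter * (1 - rho_hat) / (1 + rho_hat)
print(c(N = n_iter, n_eff = n_eff_approx))
```

Well-mixed chains give a value near 1, while chains that disagree about where the mass is give a value well above 1. A strongly autocorrelated chain of 1000 steps is worth far fewer than 1000 independent samples.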
Three people, with randomness provided by others:
Pick a random \(\{N,S,E,W\}\).
Take a step in that direction,
Question: What distribution will this sample from?
Do this for 10 iterations. Have the chains mixed?
Now:
Pick a random \(\{N,S,E,W\}\).
Take a \(1+\Poisson(5)\) number of steps in that direction,
Does it mix faster?
Would \(1 + \Poisson(50)\) steps be better?
Imagine the walkers are on a hill, and:
Pick a random \(\{N,S,E,W\}\).
If
What would this do?
Thanks to Metropolis-Hastings, if “elevation” is \(p(\theta \given D)\), then setting \(p = p(\theta' \given D) / p(\theta \given D)\) makes the stationary distribution \(p(\theta \given D)\).
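As a concrete sketch of this rule, here is a random-walk Metropolis sampler in R for the coin used later in these notes (6 heads in 10 flips, \(\Beta(20,20)\) prior). The function names, the Normal proposal, the step size of 0.1, and the warmup length are all choices made for this illustration, and this is not the algorithm Stan uses. Note that only the ratio of posterior densities is needed, so the troublesome \(p(D)\) cancels.

```r
# Unnormalized log posterior: Binomial(N, theta) likelihood for Z heads,
# times a Beta(20, 20) prior. The constant p(D) is never needed.
log_post <- function(theta, Z = 6, N = 10) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  dbinom(Z, size = N, prob = theta, log = TRUE) +
    dbeta(theta, 20, 20, log = TRUE)
}

metropolis <- function(n_steps, start = 0.5, step_sd = 0.1) {
  theta <- numeric(n_steps)
  theta[1] <- start
  for (k in 1:(n_steps - 1)) {
    # Propose a nearby location (symmetric proposal).
    proposal <- theta[k] + rnorm(1, sd = step_sd)
    # Accept with probability min(1, p(proposal | D) / p(theta_k | D)).
    log_ratio <- log_post(proposal) - log_post(theta[k])
    if (log(runif(1)) < log_ratio) {
      theta[k + 1] <- proposal
    } else {
      theta[k + 1] <- theta[k]   # rejected: stay where we are
    }
  }
  theta
}

set.seed(1)
chain <- metropolis(20000)
kept <- chain[-(1:1000)]   # drop an initial stretch as warmup
print(c(mean = mean(kept), sd = sd(kept), prob_below_half = mean(kept < 0.5)))
```

Uphill proposals are always accepted and downhill proposals are accepted with probability equal to the density ratio. Proposals outside \((0,1)\) have density zero and are always rejected.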
data {
    // stuff you input
}
transformed data {
    // stuff that's calculated from the data (just once, at the start)
}
parameters {
    // stuff you want to learn the posterior distribution of
}
transformed parameters {
    // stuff that's calculated from the parameters (at every step)
}
model {
    // the action!
}
generated quantities {
    // stuff you want computed also along the way
}
How to do everything: see the user’s manual.
We’ve flipped a coin 10 times and got 6 Heads. We think the coin is close to fair, so we put a \(\Beta(20,20)\) prior on its probability of heads, and want the posterior distribution.
\[\begin{aligned} Z &\sim \Binom(10, \theta) \\ \theta &\sim \Beta(20, 20) \end{aligned}\]
Sample from \[\theta \given Z = 6\]
data {
    int N;   // number of flips
    int Z;   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta;
}
model {
    Z ~ binomial(N, theta);
    theta ~ beta(20, 20);
}
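One way to run this program from R, as a sketch that assumes the rstan package is installed. The variable names `coin_model`, `coin_data`, and `coin_fit` are made up for this example; the number of chains and iterations are taken from the summary output shown in these notes. The fit is guarded so that the snippet still runs, and simply skips sampling, where rstan is unavailable.

```r
coin_model <- "
data {
    int N;   // number of flips
    int Z;   // number of heads
}
parameters {
    real<lower=0,upper=1> theta;   // probability of heads
}
model {
    Z ~ binomial(N, theta);
    theta ~ beta(20, 20);
}
"
coin_data <- list(N = 10, Z = 6)

# Only runs if rstan is installed; compiling the model can take a minute.
if (requireNamespace("rstan", quietly = TRUE)) {
  library(rstan)
  coin_fit <- stan(model_code = coin_model, data = coin_data,
                   chains = 3, iter = 10000)
  print(coin_fit)                   # summary table, including n_eff and Rhat
  print(stan_trace(coin_fit))       # trace plots, one line per chain
  theta_draws <- extract(coin_fit)$theta
  print(mean(theta_draws < 0.5))    # posterior probability that theta < 0.5
}
```

The names in the data list must match the names declared in the `data` block of the Stan program.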
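lp__ is the log posterior density (up to an additive constant). Note n_eff: the 15000 post-warmup draws are worth only about 5200 independent samples of theta.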
## Inference for Stan model: 5f3115ac9593cded393e67d2d0e3ce84.
## 3 chains, each with iter=10000; warmup=5000; thin=1;
## post-warmup draws per chain=5000, total post-warmup draws=15000.
##
##         mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## theta   0.52    0.00 0.07   0.38   0.47   0.52   0.57   0.66  5195    1
## lp__  -35.14    0.01 0.74 -37.22 -35.30 -34.85 -34.67 -34.62  5631    1
##
## Samples were drawn using NUTS(diag_e) at Mon Nov 19 21:35:55 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Fuzzy caterpillars are good.
Stan uses ggplot2.
What’s the posterior probability that \(\theta < 0.5\)?
## [1] 0.3860667
## [1] 0.3555356
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \\ \alpha &\sim \Unif(0, 100) \\ \beta &\sim \Unif(0, 100) \end{aligned}\]
data {
    int n;      // number of coins
    int N[n];   // number of flips
    int Z[n];   // number of heads
}
parameters {
    // the parameters (output)
}
model {
    // how they are related
}
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \\ \alpha &\sim \Unif(0, 100) \\ \beta &\sim \Unif(0, 100) \end{aligned}\]
data {
    int n;      // number of coins
    int N[n];   // number of flips
    int Z[n];   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta[n];
    real<lower=0,upper=100> alpha;
    real<lower=0,upper=100> beta;
}
model {
    // how they are related
}
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \\ \alpha &\sim \Unif(0, 100) \\ \beta &\sim \Unif(0, 100) \end{aligned}\]
data {
    int n;      // number of coins
    int N[n];   // number of flips
    int Z[n];   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta[n];
    real<lower=0,upper=100> alpha;
    real<lower=0,upper=100> beta;
}
model {
    Z ~ binomial(N, theta);
    theta ~ beta(alpha, beta);
    // uniform priors "go without saying"
    // alpha ~ uniform(0, 100);
    // beta ~ uniform(0, 100);
}
Data:
set.seed(23)
ncoins <- 100
true_theta <- rbeta(ncoins, 20, 50)
N <- rep(50, ncoins)
Z <- rbinom(ncoins, size=N, prob=true_theta)
Find the posterior distribution on alpha and beta: check convergence with print() and stan_trace(), then plot using stan_hist() and/or stan_scat().
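What the hierarchy buys us ("sharing power, and shrinkage") can be previewed without running Stan. Conditional on \(\alpha\) and \(\beta\), each coin's posterior is \(\Beta(\alpha + Z_i, \beta + N_i - Z_i)\), so its posterior mean is pulled from the raw proportion \(Z_i/N_i\) toward the shared mean \(\alpha/(\alpha+\beta)\). The sketch below plugs in the true values 20 and 50 used to simulate the data, which is an assumption made only for illustration: in the real analysis \(\alpha\) and \(\beta\) are unknown and are sampled along with the \(\theta_i\).

```r
set.seed(23)
ncoins <- 100
true_theta <- rbeta(ncoins, 20, 50)
N <- rep(50, ncoins)
Z <- rbinom(ncoins, size = N, prob = true_theta)

# Assumption for illustration: pretend we know the hyperparameters.
prior_alpha <- 20
prior_beta <- 50

raw_est <- Z / N                                            # each coin alone
shrunk_est <- (Z + prior_alpha) / (N + prior_alpha + prior_beta)  # posterior mean

# Mean squared error of each estimate against the simulated truth.
mse <- function(est) mean((est - true_theta)^2)
print(c(raw = mse(raw_est), shrunk = mse(shrunk_est)))
print(c(sd_raw = sd(raw_est), sd_shrunk = sd(shrunk_est)))
```

The shrunken estimates are less spread out than the raw proportions and, on this simulated data, closer to the true values on average.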
We have a dataset of batting averages of baseball players, giving each player’s name, primary position, number of hits, and number of at-bats:
##          Player     PriPos Hits AtBats PlayerNumber PriPosNumber
## 1 Fernando Abad    Pitcher    1      7            1            1
## 2   Bobby Abreu Left Field   53    219            2            7
## 3    Tony Abreu   2nd Base   18     70            3            4
## 4 Dustin Ackley   2nd Base  137    607            4            4
## 5    Matt Adams   1st Base   21     86            5            3
## 6 Nathan Adcock    Pitcher    0      1            6            1
The overall batting average of the 948 players is 0.2546597.
Here is the average by position.
library(dplyr)  # provides %>%, group_by(), summarise(), select()
batting %>% group_by(PriPos) %>%
    summarise(num=n(), BatAvg=sum(Hits)/sum(AtBats)) %>%
    select(PriPos, num, BatAvg)
## # A tibble: 9 x 3
##   PriPos         num BatAvg
##   <fct>        <int>  <dbl>
## 1 1st Base        81  0.259
## 2 2nd Base        72  0.256
## 3 3rd Base        75  0.265
## 4 Catcher        103  0.247
## 5 Center Field    67  0.264
## 6 Left Field     103  0.259
## 7 Pitcher        324  0.129
## 8 Right Field     60  0.264
## 9 Shortstop       63  0.255
What’s the overall batting average?
Do some positions tend to be better batters?
How much variation is there?
first_model <- "
data {
    int N;
    int hits[N];
    int at_bats[N];
}
parameters {
    real<lower=0, upper=1> theta;
}
model {
    hits ~ binomial(at_bats, theta);
    theta ~ beta(1, 1);
} "
first_fit <- stan(model_code=first_model, chains=3, iter=1000,
                  data=list(N=nrow(batting),
                            hits=batting$Hits,
                            at_bats=batting$AtBats))
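This first model has a single shared \(\theta\) and a \(\Beta(1,1)\) prior, so its posterior is also available in closed form: \(\Beta(1 + \sum_i Z_i,\; 1 + \sum_i (N_i - Z_i))\), where \(Z_i\) is hits and \(N_i\) is at-bats. That makes it a handy check on the Stan output. The sketch below uses only the six rows printed above, not the full dataset of 948 players, so its numbers will not match a fit to the full data.

```r
# The six rows shown in the table above (an illustration, not the full data).
hits <- c(1, 53, 18, 137, 21, 0)
at_bats <- c(7, 219, 70, 607, 86, 1)

# Conjugate update: Beta(1, 1) prior plus pooled binomial data.
post_shape1 <- 1 + sum(hits)                  # 1 + total hits
post_shape2 <- 1 + sum(at_bats) - sum(hits)   # 1 + total at-bats without a hit

post_mean <- post_shape1 / (post_shape1 + post_shape2)
post_interval <- qbeta(c(0.025, 0.975), post_shape1, post_shape2)
print(c(mean = post_mean, lower = post_interval[1], upper = post_interval[2]))
```

Applied to the full data, the same two lines give the distribution that the draws of theta in first_fit should approximate.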
pos_model <- "
data {
    int N;
    int hits[N];
    int at_bats[N];
    int npos; // number of positions
    int position[N];
}
parameters {
    real<lower=0, upper=1> theta[npos];
}
model {
    // look up each player's theta from their position
    real theta_vec[N];
    for (k in 1:N) {
        theta_vec[k] = theta[position[k]];
    }
    hits ~ binomial(at_bats, theta_vec);
    theta ~ beta(1, 1);
} "
pos_fit <- stan(model_code=pos_model, chains=3, iter=1000,
                data=list(N=nrow(batting),
                          hits=batting$Hits,
                          at_bats=batting$AtBats,
                          npos=nlevels(batting$PriPos),
                          position=as.numeric(batting$PriPos)))
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha_{p_i}, \beta_{p_i}) \\ \alpha_p &= \omega_p \kappa_p \\ \beta_p &= (1-\omega_p) \kappa_p \\ \omega_p &\sim \Beta(1, 1) \\ \kappa_p &\sim \Gam(0.1, 0.1) . \end{aligned}\]
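The reparameterization in this model is easier to put priors on than \(\alpha\) and \(\beta\) directly: with \(\alpha = \omega\kappa\) and \(\beta = (1-\omega)\kappa\), the Beta distribution has mean \(\omega\) and variance \(\omega(1-\omega)/(\kappa+1)\), so \(\omega_p\) is the typical batting average at position \(p\) and \(\kappa_p\) controls how tightly players cluster around it. A quick numerical check in R, with values of \(\omega\) and \(\kappa\) that are hypothetical and chosen only for illustration:

```r
omega <- 0.25   # hypothetical mean batting average for one position
kappa <- 100    # hypothetical concentration

shape_alpha <- omega * kappa
shape_beta <- (1 - omega) * kappa

# Mean and variance from the usual Beta(alpha, beta) formulas.
beta_mean <- shape_alpha / (shape_alpha + shape_beta)
beta_var <- shape_alpha * shape_beta /
  ((shape_alpha + shape_beta)^2 * (shape_alpha + shape_beta + 1))

# The same quantities written in terms of omega and kappa.
print(c(mean = beta_mean, omega = omega))
print(c(sd = sqrt(beta_var), sd_from_kappa = sqrt(omega * (1 - omega) / (kappa + 1))))
```

Larger \(\kappa\) means less variation between players at the same position.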
Variable types in Stan:
int x;             // an integer
int y[10];         // ten integers
real z;            // a number
real z[2,5];       // a 2x5 array of numbers
vector[10] u;      // length 10 vector
matrix[10,10] v;   // 10x10 matrix
vector[10] w[10];  // ten length 10 vectors
Stan uses MCMC to sample from the posterior
which lets you fit realistic models.