Adding levels of randomness

Peter Ralph

12 November 2019 – Advanced Biological Statistics

Outline

Hierarchical coins
Introduction to MCMC with Stan
Sharing power, and shrinkage
Baseball

Hierarchical coins

Motivating problem: more coins

Suppose now we have data from \(n\) different coins from the same source. We don’t assume they have the same \(\theta\), but don’t know what its distribution is, so try to learn it.

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \\ \alpha &\sim \Unif(0, 100) \\ \beta &\sim \Unif(0, 100) \end{aligned}\]

Goal: find the posterior distribution of \(\alpha\), \(\beta\).

Problem: we don’t have a nice mathematical expression for this posterior distribution.

Markov Chain Monte Carlo

When you can’t do the integrals: MCMC

Goal: Given:

a model with parameters \(\theta\),
a prior distribution \(p(\theta)\) on \(\theta\), and
data, \(D\),

“find”/ask questions of the posterior distribution on \(\theta\),

\[\begin{aligned} p(\theta \given D) = \frac{ p(D \given \theta) p(\theta) }{ p(D) } . \end{aligned}\]

Problem: usually we can’t write down an expression for this (because of the “\(p(D)\)”).

Solution: we’ll make up a way to draw random samples from it.

Toy example:

(from beta-binomial coin example)

Do we think that \(\theta < 0.5\)?

(before:)

pbeta(0.5, post_a, post_b)

(now:)

mean(rbeta(1e6, post_a, post_b) < 0.5)

How? Markov chain Monte Carlo!

i.e., “random-walk-based stochastic integration”

Example: Gibbs sampling for uniform distribution on a region. (picture)

Overview of MCMC

Produces a random sequence of samples \(\theta_1, \theta_2, \ldots, \theta_N\).

Begin somewhere (at \(\theta_1\)).

At each step, starting at \(\theta_k\):

Propose a new location (nearby?): \(\theta_k'\)
Decide whether to accept it.
- if so: set \(\theta_{k+1} \leftarrow \theta_k'\)
- if not: set \(\theta_{k+1} \leftarrow \theta_k\)
Set \(k \leftarrow k+1\); if \(k=N\) then stop.

The magic comes from doing proposals and acceptance so that the \(\theta\)’s are samples from the distribution we want.

Key concepts

Rules are chosen so that \(p(\theta \given D)\) is the stationary distribution (long-run average!) of the random walk (the “Markov chain”).
The chain must mix fast enough so the distribution of visited states converges to \(p(\theta \given D)\).
Because of autocorrelation, \((\theta_1, \theta_2, \ldots, \theta_N)\) are not \(N\) independent samples: they are roughly equivalent to \(N_\text{eff} < N\) independent samples.
For better mixing, acceptance probabilities should not be too high or too low.
Starting several chains far apart can help diagnose failure to mix: Gelman’s \(r\) quantifies how different they are.

On your feet

Three people, with randomness provided by others:

Pick a random \(\{N,S,E,W\}\).
Take a step in that direction,
- unless you’d run into a wall or a table.

Question: What distribution will this sample from?

Do this for 10 iterations. Have the chains mixed?

Now:

Pick a random \(\{N,S,E,W\}\).
Take a \(1+\Poisson(5)\) number of steps in that direction,
- unless you’d run into a wall or a table.

Does it mix faster?

Would \(1 + \Poisson(50)\) steps be better?

How it works

Imagine the walkers are on a hill, and:

Pick a random \(\{N,S,E,W\}\).
If
- the step is uphill, then take it.
- the step is downhill, then flip a \(p\)-coin; if you get Heads, stay were you are.

What would this do?

Thanks to Metropolis-Hastings, if “elevation” is \(p(\theta \given D)\), then setting \(p = p(\theta' \given D) / p(\theta \given D)\) makes the stationary distribution \(p(\theta \given D)\).

MC Stan

“Quick and easy” MCMC: Stan

The skeletal Stan program

data {
    // stuff you input
}
transformed data {
    // stuff that's calculated from the data (just once, at the start)
}
parameters {
    // stuff you want to learn the posterior distribution of
}
transformed parameters {
    // stuff that's calculated from the parameters (at every step)
}
model {
    // the action!
}
generated quantities {
    // stuff you want computed also along the way
}

How to do everything: see the user’s manual.

Beta-Binomial with Stan

From before:

We’ve flipped a coin 10 times and got 6 Heads. We think the coin is close to fair, so put a \(\Beta(20,20)\) prior on it’s probability of heads, and want the posterior distribution.

\[\begin{aligned} Z &\sim \Binom(10, \theta) \\ \theta &\sim \Beta(20, 20) \end{aligned}\] Sample from \[\theta \given Z = 6\]

\[\begin{aligned} Z &\sim \Binom(10, \theta) \\ \theta &\sim \Beta(20, 20) \end{aligned}\]

Sample from \[\theta \given Z = 6\]

data {
    int N;   // number of flips
    int Z;   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta;
}
model {
    Z ~ binomial(N, theta);
    theta ~ beta(20, 20);
}

Running the MCMC: rstan

library(rstan)
fit <- stan(model_code=stan_block,  # stan block from above
            data=list(N=10, Z=6),
            chains=3, iter=10000)

lp__ is the log posterior density. Note n_eff.

print(fit)

## Inference for Stan model: 5f3115ac9593cded393e67d2d0e3ce84.
## 3 chains, each with iter=10000; warmup=5000; thin=1; 
## post-warmup draws per chain=5000, total post-warmup draws=15000.
## 
##         mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## theta   0.52    0.00 0.07   0.38   0.47   0.52   0.57   0.66  5672    1
## lp__  -35.12    0.01 0.70 -37.14 -35.28 -34.85 -34.67 -34.62  7093    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Nov  9 21:37:18 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Fuzzy caterpillars are good.

stan_trace(fit)

plot of chunk trace_rstan

Stan uses ggplot2.

stan_hist(fit, bins=20) + xlim(0,1)

## Warning: Removed 2 rows containing missing values (geom_bar).

plot of chunk plot_rstan

What’s the posterior probability that \(\theta < 0.5\)?

samples <- extract(fit)
mean(samples$theta < 0.5)

## [1] 0.3826667

# compare to analytic solution
pbeta(0.5, shape1=10+6, shape2=10+4)

## [1] 0.3555356

Hierarchical Coins

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \\ \alpha &\sim \Unif(0, 100) \\ \beta &\sim \Unif(0, 100) \end{aligned}\]

data {
    int n;      // number of coins
    int N[n];   // number of flips
    int Z[n];   // number of heads
}
parameters {
    // the parameters (output)
}
model {
    // how they are related
}

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \\ \alpha &\sim \Unif(0, 100) \\ \beta &\sim \Unif(0, 100) \end{aligned}\]

data {
    int n;      // number of coins
    int N[n];   // number of flips
    int Z[n];   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta[n];
    real<lower=0,upper=100> alpha;
    real<lower=0,upper=100> beta;
}
model {
    // how they are related
}

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha, \beta) \\ \alpha &\sim \Unif(0, 100) \\ \beta &\sim \Unif(0, 100) \end{aligned}\]

data {
    int n;      // number of coins
    int N[n];   // number of flips
    int Z[n];   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta[n];
    real<lower=0, upper=100> alpha;
    real<lower=0, upper=100> beta;
}
model {
    Z ~ binomial(N, theta);
    theta ~ beta(alpha, beta);
    // uniform priors "go without saying"
    // alpha ~ uniform(0, 100);
    // beta ~ uniform(0, 100);
}

Your turn

Data:

set.seed(23)
ncoins <- 100
true_theta <- rbeta(ncoins, 20, 50)
N <- rep(50, ncoins)
Z <- rbinom(ncoins, size=N, prob=true_theta)

Find the posterior distribution on alpha and beta: check convergence with print() and stan_trace(), then plot using stan_hist() and/or stan_scat()..

data {
    int n;      // number of coins
    int N[n];   // number of flips
    int Z[n];   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta[n];
    real<lower=0,upper=100> alpha;
    real<lower=0,upper=100> beta;
}
model {
    Z ~ binomial(N, theta);
    theta ~ beta(alpha, beta);
    // uniform priors 'go without saying'
    // alpha ~ uniform(0, 100);
    // beta ~ uniform(0, 100);
}

Baseball

We have a dataset of batting averages of baseball players, having

name
position
number of at bats
number of hits

batting <- read.csv("BattingAverage.csv", header=TRUE)
head(batting)

##          Player     PriPos Hits AtBats PlayerNumber PriPosNumber
## 1 Fernando Abad    Pitcher    1      7            1            1
## 2   Bobby Abreu Left Field   53    219            2            7
## 3    Tony Abreu   2nd Base   18     70            3            4
## 4 Dustin Ackley   2nd Base  137    607            4            4
## 5    Matt Adams   1st Base   21     86            5            3
## 6 Nathan Adcock    Pitcher    0      1            6            1

The overall batting average of the 948 players is 0.2546597.

Here is the average by position.

batting %>% group_by(PriPos) %>% 
    summarise(num=n(), BatAvg=sum(Hits)/sum(AtBats)) %>% 
    select(PriPos, num, BatAvg)

## # A tibble: 9 x 3
##   PriPos         num BatAvg
##   <fct>        <int>  <dbl>
## 1 1st Base        81  0.259
## 2 2nd Base        72  0.256
## 3 3rd Base        75  0.265
## 4 Catcher        103  0.247
## 5 Center Field    67  0.264
## 6 Left Field     103  0.259
## 7 Pitcher        324  0.129
## 8 Right Field     60  0.264
## 9 Shortstop       63  0.255

Questions?

What’s the overall batting average?
Do some positions tend to be better batters?
How much variation is there?

Everyone is the same

first_model <- "
data {
    int N;
    int hits[N];
    int at_bats[N];
}
parameters {
    real<lower=0, upper=1> theta;
}
model {
    hits ~ binomial(at_bats, theta);
    theta ~ beta(1, 1);
} "
first_fit <- stan(model_code=first_model, chains=3, iter=1000,
                  data=list(N=nrow(batting),
                            hits=batting$Hits,
                            at_bats=batting$AtBats))

stan_hist(first_fit, bins=20)

plot of chunk start_res

Every pitcher is the same

pos_model <- "
data {
    int N;
    int hits[N];
    int at_bats[N];
    int npos; // number of positions
    int position[N];
}
parameters {
    real<lower=0, upper=1> theta[npos];
}
model {
    real theta_vec[N];
    for (k in 1:N) {
        theta_vec[k] = theta[position[k]];
    }
    hits ~ binomial(at_bats, theta_vec);
    theta ~ beta(1, 1);
} "
pos_fit <- stan(model_code=pos_model, chains=3, iter=1000,
                  data=list(N=nrow(batting),
                            hits=batting$Hits,
                            at_bats=batting$AtBats,
                            npos=nlevels(batting$PriPos),
                            position=as.numeric(batting$PriPos)))

theta_samples <- extract(pos_fit)$theta
layout(matrix(1:9, nrow=3))
for (k in 1:ncol(theta_samples)) {
    hist(theta_samples[,k], main=levels(batting$PriPos)[k], xlim=c(0.1, 0.3),
         col=adjustcolor('red',0.6), xlab='batting avg', freq=FALSE)
}

plot of chunk pos_res

Your turn : Every individual different.

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\alpha_{p_i}, \beta_{p_i}) \\ \alpha_p &= \mu_p \kappa_p \\ \beta_p &= (1-\mu_p) \kappa_p \\ \mu_p &\sim \Beta(1, 1) \\ \kappa_p &\sim \Gam(0.1, 0.1) . \end{aligned}\]

Variable types in Stan:

int x;       // an integer
int y[10];   // ten integers
real z;      // a number
real z[2,5]; // a 2x5 array of numbers

vector[10] u;      // length 10 vector
matrix[10,10] v;   // 10x10 matrix
vector[10] w[10];  // ten length 10 vectors

don’t forget the ;
make sure R types match
read the error messages

In class

ind_model <- "
data {
    int N;
    int hits[N];
    int at_bats[N];
    int npos; // number of positions
    int position[N];
}
parameters {
    vector<lower=0, upper=1>[N] theta;
    vector<lower=0, upper=1>[npos] mu;
    vector<lower=0>[npos] kappa;
}
model {
    vector[npos] alpha;
    vector[N] alpha_vec;
    vector[npos] beta;
    vector[N] beta_vec;
    alpha = mu .* kappa;
    beta = (1-mu) .* kappa;
    alpha_vec = alpha[position];
    beta_vec = beta[position];
    hits ~ binomial(at_bats, theta);
    theta ~ beta(alpha_vec, beta_vec);
    mu ~ beta(1, 1);
    kappa ~ gamma(0.1, 0.1);
} "
# ind_model <- stan_model(model_code=ind_model)
# ind_fit <- sampling(ind_model,
#                    data=list(
#                              ))

Back to baseball

The model

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\mu_{p_i} \kappa_{p_i}, (1-\mu_{p_i})\kappa_{p_i}) \\ \mu &\sim \Beta(1, 1) \\ \kappa_p &\sim \Gam(0.1, 0.1) . \end{aligned}\]

pos_model <- "
data {
    int N;   // number of players
    int hits[N];
    int at_bats[N];
    int npos; // number of positions
    int position[N];
}
parameters {
    real<lower=0, upper=1> theta[N];
    real<lower=0, upper=1> mu[npos];
    real<lower=0> kappa[npos];
}
model {
    real alpha;
    real beta;
    hits ~ binomial(at_bats, theta);
    for (i in 1:N) {
        alpha = mu[position[i]] * kappa[position[i]];
        beta = (1 - mu[position[i]]) * kappa[position[i]];
        theta[i] ~ beta(alpha, beta);
    }
    mu ~ beta(1,1);
    kappa ~ gamma(0.1,0.1);
}"

batting <- read.csv("BattingAverage.csv", header=TRUE)
system.time(pos_fit <- stan(model_code=pos_model, chains=3, iter=100,
                            data=list(N=nrow(batting),
                                      hits=batting$Hits,
                                      at_bats=batting$AtBats,
                                      npos=nlevels(batting$PriPos),
                                      position=as.numeric(batting$PriPos))))

##    user  system elapsed 
## 112.200   2.931 108.126

Diagnostics

print(pos_fit, pars=c("mu", "kappa"))

## Inference for Stan model: db775bca82d7fc39d379745fb39b0819.
## 3 chains, each with iter=100; warmup=50; thin=1; 
## post-warmup draws per chain=50, total post-warmup draws=150.
## 
##            mean se_mean    sd  2.5%    25%    50%    75%  97.5% n_eff Rhat
## mu[1]      0.25    0.00  0.01  0.24   0.25   0.25   0.26   0.26   160 0.99
## mu[2]      0.25    0.00  0.01  0.24   0.24   0.25   0.25   0.26   116 1.02
## mu[3]      0.26    0.00  0.01  0.24   0.25   0.26   0.26   0.27    60 1.04
## mu[4]      0.24    0.00  0.01  0.23   0.24   0.24   0.24   0.25   116 1.01
## mu[5]      0.26    0.00  0.01  0.25   0.25   0.26   0.26   0.27   158 1.00
## mu[6]      0.25    0.00  0.01  0.24   0.25   0.25   0.25   0.26   231 1.00
## mu[7]      0.13    0.00  0.01  0.12   0.13   0.13   0.13   0.14    59 1.00
## mu[8]      0.26    0.00  0.01  0.25   0.25   0.26   0.26   0.27   161 1.00
## mu[9]      0.25    0.00  0.01  0.23   0.24   0.25   0.25   0.26   137 1.00
## kappa[1] 109.81    2.49 23.55 69.93  93.60 106.46 123.82 162.69    89 1.01
## kappa[2] 104.26    1.74 21.84 72.03  88.34 102.28 116.02 154.75   158 1.00
## kappa[3]  96.62    1.74 19.94 64.93  82.89  93.66 110.16 145.40   132 1.00
## kappa[4]  97.03    2.66 18.29 63.37  84.76  94.68 108.26 135.40    47 1.06
## kappa[5] 103.86    2.70 23.51 66.72  87.91 103.24 115.75 162.92    76 1.01
## kappa[6] 108.83    2.37 21.57 70.23  93.81 107.52 121.90 154.71    83 1.00
## kappa[7]  54.77    6.40 13.84 32.39  44.14  54.55  63.40  84.23     5 1.34
## kappa[8] 123.79    2.61 27.30 82.97 104.22 120.78 140.10 187.66   110 1.01
## kappa[9]  97.77    1.71 20.98 61.09  82.47  97.73 109.82 137.03   150 0.99
## 
## Samples were drawn using NUTS(diag_e) at Thu Nov 14 09:28:17 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Is it mixing?

stan_trace(pos_fit, pars="mu")

plot of chunk plot_trace

Run longer!

system.time(pos_fit <- stan(model_code=pos_model, chains=3, iter=1000,
                            control=list(max_treedepth=15),
                            data=list(N=nrow(batting),
                                      hits=batting$Hits,
                                      at_bats=batting$AtBats,
                                      npos=nlevels(batting$PriPos),
                                      position=as.numeric(batting$PriPos))))

## recompiling to avoid crashing R session

##     user   system  elapsed 
## 1099.156    2.965  535.504

print(pos_fit, pars=c("mu", "kappa"))

## Inference for Stan model: db775bca82d7fc39d379745fb39b0819.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##            mean se_mean    sd  2.5%   25%    50%    75%  97.5% n_eff Rhat
## mu[1]      0.25    0.00  0.01  0.24  0.25   0.25   0.26   0.26  1110 1.00
## mu[2]      0.25    0.00  0.01  0.24  0.24   0.25   0.25   0.26  1030 1.00
## mu[3]      0.26    0.00  0.01  0.24  0.25   0.26   0.26   0.27  1095 1.00
## mu[4]      0.24    0.00  0.01  0.23  0.24   0.24   0.24   0.25   935 1.00
## mu[5]      0.26    0.00  0.01  0.24  0.25   0.26   0.26   0.27  1032 1.00
## mu[6]      0.25    0.00  0.01  0.24  0.25   0.25   0.25   0.26   912 1.00
## mu[7]      0.13    0.00  0.01  0.12  0.12   0.13   0.13   0.14   239 1.00
## mu[8]      0.26    0.00  0.01  0.25  0.25   0.26   0.26   0.27  1116 1.00
## mu[9]      0.25    0.00  0.01  0.23  0.24   0.25   0.25   0.26  1144 1.00
## kappa[1] 110.61    0.73 22.33 72.48 94.97 108.77 123.51 159.60   930 1.00
## kappa[2] 105.84    1.02 23.97 66.36 88.65 103.45 119.66 160.69   555 1.01
## kappa[3]  98.04    0.69 21.60 62.37 82.70  95.80 110.93 147.55   968 1.00
## kappa[4]  94.23    0.77 19.24 59.56 80.58  92.43 106.35 135.62   620 1.00
## kappa[5] 101.32    0.77 23.37 61.50 85.42  99.19 115.20 153.30   914 1.00
## kappa[6] 108.69    0.93 23.20 71.92 91.83 105.89 123.72 158.34   620 1.00
## kappa[7]  57.00    2.20 15.76 34.11 45.68  54.23  65.65  95.72    52 1.09
## kappa[8] 119.79    0.81 27.49 73.26 99.95 116.87 136.70 181.07  1142 1.00
## kappa[9]  95.20    0.77 22.54 56.01 79.35  93.30 109.22 144.61   861 1.00
## 
## Samples were drawn using NUTS(diag_e) at Thu Nov 14 09:37:18 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Is it mixing?

stan_trace(pos_fit, pars="mu")

plot of chunk plot_trace_again

stan_trace(pos_fit, pars="kappa")

plot of chunk plot_kappa_again

Let’s look at the results!

stan_hist(pos_fit, pars="mu", bins=30) + xlim(0, 0.4)

## Warning: Removed 18 rows containing missing values (geom_bar).

plot of chunk first_hist

With labels: position means \(\mu_p\):

plot of chunk plot_mu

Position “concentrations”, \(\kappa_p\):

plot of chunk plot_kappa