Prior distributions and uncertainty

Peter Ralph

5 November – Advanced Biological Statistics

Biased coins

a motivating example

Suppose I have two trick coins:

one (coin A) comes up heads 75% of the time, and
the other (coin B) only 25% of the time.

But, I lost one and I don’t know which! So, I flip it 10 times and get 6 heads. Which is it, and how sure are you?

Possible answers:

Er, probably coin (A)?
Well, \[\begin{aligned} \P\{ \text{6 H in 10 flips} \given \text{coin A} \} &= \binom{10}{6} (.75)^6 (.25)^4 \\ &= 0.146 \end{aligned}\] and \[\begin{aligned} \P\{ \text{6 H in 10 flips} \given \text{coin B} \} &= \binom{10}{6} (.25)^6 (.75)^4 \\ &= 0.016 \end{aligned}\] … so, probably coin (A)?

For a precise answer…

Before flipping, each coin seems equally likely. Then

\[\begin{aligned} \P\{ \text{coin A} \given \text{6 H in 10 flips} \} &= \frac{ \frac{1}{2} \times 0.146 }{ \frac{1}{2} \times 0.146 + \frac{1}{2} \times 0.016 } \\ &= 0.9 \end{aligned}\]

Probability: review and notation

Probability rules:

Probabilities are proportions: $\hspace{2em} 0 \le \P\{A\} \le 1$
Everything: $\hspace{2em} \P\{ \Omega \} = 1$
Complements: $\hspace{2em} \P\{ \text{not } A\} = 1 - \P\{A\}$
Disjoint events: If $\hspace{2em} \P\{A \text{ and } B\} = 0$ then $\hspace{2em} \P\{A \text{ or } B\} = \P\{A\} + \P\{B\}$.
Independence: $A$ and $B$ are independent iff $\P\{A \text{ and } B\} = \P\{A\} \P\{B\}$.
Conditional probability: \[\P\{A \given B\} = \frac{\P\{A \text{ and } B\}}{ \P\{B\} }\]

Bayes’ rule

A consequence is

\[\P\{B \given A\} = \frac{\P\{B\} \P\{A \given B\}}{ \P\{A\} } .\]

In “Bayesian statistics”:

$B$: possible model
$A$: data
$\P\{B\}$: prior weight on model $B$
$\P\{A \given B\}$: likelihood of data under $B$
$\P\{B\} \P\{A \given B\}$: posterior weight on $B$
$\P\{A\}$: total sum of posterior weights

Bayes’ rule: for coins (in class)

A consequence is

\[\P\{B \given A\} = \frac{\P\{B\} \P\{A \given B\}}{ \P\{A\} } .\]

In our coin example

$B$: possible model (I have the 75% coin)
$A$: data (6 Heads out of 10 flips)
$\P\{B\}$: prior weight on model $B$ (how likely I think it is I got the 75% coin)
$\P\{A \given B\}$: likelihood of data under $B$ (chance of getting 6 H with 10 flips with the 75% coin)
$\P\{B\} \P\{A \given B\}$: posterior weight on $B$ (combined probabilityof getting 75% coin and then getting 6 H out of 10 flips)
$\P\{A\}$: total sum of posterior weights (the sum of the last thing over all possible coins)

Breaking it down with more coins

Suppose instead I had 9 coins, with probabilities 10%, 20%, …, 90%; as before I flipped one 10 times and got 6 heads. For each $\theta$ in $0.1, 0.2, \ldots, 0.8, 0.9,$ find \[\begin{aligned} \P\{\text{ coin has prob $\theta$ } \given \text{ 6 H in 10 flips } \} . \end{aligned}\]

Question: which coin(s) is it, and how sure are we? (And, what does it mean when we say how sure we are?)

Uniform prior

prior

$\times$

likelihood

$\propto$

posterior

plot of chunk the_prior

Weak prior

prior

$\times$

likelihood

$\propto$

posterior

plot of chunk weak_prior

Strong prior

prior

$\times$

likelihood

$\propto$

posterior

plot of chunk strong_prior

The likelihood: 6 H in 10 flips

prior

$\times$

likelihood

$\propto$

posterior

plot of chunk ten_flips

The likelihood: 30 H in 50 flips

prior

$\times$

likelihood

$\propto$

posterior

plot of chunk fifty_flips

The likelihood: 60 H in 100 flips

prior

$\times$

likelihood

$\propto$

posterior

plot of chunk 100_flips

The likelihood: 6,000 H in 10,000 flips

prior

$\times$

likelihood

$\propto$

posterior

plot of chunk ten_thou_flips

A question

What is the right answer to the “coin question”?

Recall: there were nine possible values of $\theta$.

Which coin is it, and how sure are you?

Possible types of answer:

“best guess”
“range of values”
“don’t know”

Give examples of when each type of answer is the right one.

theta <- (1:9)/10
prior <- rep(1/9, 9)
likelihood <- dbinom(6, size=10, prob=theta)
posterior <- prior*likelihood/sum(prior*likelihood)
names(posterior) <- theta
barplot(posterior, xlab='true prob of heads', main='posterior probability')

plot of chunk plot_pos

Stochastic Minute

The Beta Distribution

If \[P \sim \text{Beta}(a,b)\] then $P$ has probability density \[p(\theta) = \frac{ \theta^{a-1} (1 - \theta)^{b-1} }{ B(a,b) } . \]

Takes values between 0 and 1.
If $U_{(1)} < U_{(2)} < \cdots < U_{(n)}$ are sorted, independent $\text{Unif}[0,1]$ then $U_{(k)} \sim \text{Beta}(k, n-k+1)$.
Mean: $a/(a+b)$.
Larger $a+b$ is more tightly concentrated (like $1/\sqrt{a+b}$)

Exercise

In pairs, simulate:

One thousand “random coins” whose probabilities are drawn from a $\Beta(5,5)$ distribution. (rbeta()) Make a histogram of these probabilities.
Flip each coin ten times and record the number of heads. (rbinom())
Make a histogram of the probabilities of those coins that got exactly 3 heads, and compare to the first histogram.

Unknown coins

Motivating example

Now suppose we want to estimate the probability of heads for a coin without knowing the possible values. (or, a disease incidence, or error rate in an experiment, …)

We flip it $n$ times and get $z$ Heads.

The likelihood of this, given the prob-of-heads $\theta$, is: \[p(z \given \theta) = \binom{n}{z}\theta^z (1-\theta)^{n-z} . \]

How to weight the possible $\theta$? Need a flexible set of weighting functions, i.e., prior distributions on $[0,1]$.

Beta distributions.

What would we use if:

the coin is probably close to fair.
the disease is probably quite rare.
no idea whatsoever.

Beta-Binomial Bayesian analysis

If \[\begin{aligned} P &\sim \text{Beta}(a,b) \\ Z &\sim \text{Binom}(n,P) , \end{aligned}\] then by Bayes’ rule: \[\begin{aligned} \P\{ P = \theta \given Z = z\} &= \frac{\P\{Z = z \given P = \theta \} \P\{P = \theta\}}{\P\{Z = z\}} \\ &= \frac{ \binom{n}{z}\theta^z (1-\theta)^{n-z} \times \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)} }{ \text{(something)} } \\ &= \text{(something else)} \times \theta^{a + z - 1} (1-\theta)^{b + n - z - 1} . \end{aligned}\]

“Miraculously” (the Beta is the conjugate prior for the Binomial), \[\begin{aligned} (P \given Z = z) \sim \text{Beta}(a+z, b+n-z) . \end{aligned}\]

Discuss/demonstration:

We flip an odd-looking coin 100 times, and get 65 heads. What is it’s true* probability of heads?

True = ??
What prior to use?
Is it reasonable that $\theta = 1/2$?
Best guess at $\theta$?
How far off are we, probably?

Reporting uncertainty

How do we communicate results?

If we want a point estimate:

posterior mean,
posterior median, or
maximum a posteriori estimate (“MAP”: highest posterior density).

These all convey “where the posterior distribution is”, more or less.

What about uncertainty?

Credible region

Definition: A 95% credible region is a portion of parameter space having a total of 95% of the posterior probability.

(same with other numbers for “95%”)

Interpretation #1

If we construct a 95% credible interval for $\theta$ for each of many datasets; and the coin in each dataset has $\theta$ drawn independently from the prior, then the true $\theta$ will fall in its credible interval for 95% of the datasets.

Interpretation #2

If we construct a 95% credible interval for $\theta$ with a dataset, and the distribution of the coin’s true $\theta$ across many parallel universes is given by the prior, then the true $\theta$ will be in the credible interval in 95% of those universes.

Interpretation #3

Given my prior beliefs (prior distribution), the posterior distribution is the most rational${}^*$ way to update my beliefs to account for the data.

${}^*$ if you do this many times you will be wrong least often

${}^*$ or you will be wrong in the fewest possible universes

But which credible interval?

Definition: The “95% highest density interval” is the 95% credible interval with the highest posterior probability density at each point.

((back to the simple example))

Hierarchical coins

Motivating problem: more coins

Suppose now we have data from $n$ different coins from the same source. We don’t assume they have the same $\theta$, but don’t know what it’s distribution is, so try to learn it.

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\text{mode}=\omega, \text{conc}=\kappa) \\ \omega & = ? \\ \kappa & = ? \end{aligned}\]

note: The “mode” and “concentration” are related to the shape parameters by: $\alpha = \omega (\kappa - 2) + 1$ and $\beta = (1 - \omega) (\kappa - 2) + 1$.

Binomial versus Beta-Binomial

What is different between:

Pick a value of $\theta$ at random from $\Beta(3,1)$. Flip one thousand $\theta$-coins, 500 times each.
Pick one thousand random $\theta_i \sim \Beta(3,1)$ values. Flip one thousand coins, one for each $\theta_i$, 500 times each.

ncoins <- 1000
nflips <- 100
theta1 <- rbeta(1,3,1)
binom_Z <- rbinom(ncoins, size=nflips, prob=theta1)
theta2 <- rbeta(ncoins,3,1)
bb_Z <- rbinom(ncoins, size=nflips, prob=theta2)
hist(binom_Z, breaks=30, col=adjustcolor("blue", 0.5), main='', xlim=c(0,nflips), freq=FALSE, xlab='number of Heads')
hist(bb_Z, breaks=30, col=adjustcolor("red", 0.5), add=TRUE, freq=FALSE)
legend("topleft", fill=adjustcolor(c('blue', 'red'), 0.5), legend=c('one theta', 'many thetas'))

plot of chunk beta_or_binom

MC Stan

Stan

The skeletal Stan program

data {
    // stuff you input
}
transformed data {
    // stuff that's calculated from the data (just once, at the start)
}
parameters {
    // stuff you want to learn about
}
transformed parameters {
    // stuff that's calculated from the parameters (at every step)
}
model {
    // the action!
}
generated quantities {
    // stuff you want computed also along the way
}

Beta-Binomial with Stan

First, in words:

We’ve flipped a coin 10 times and got 6 Heads. We think the coin is close to fair, so put a $\Beta(20,20)$ prior on it’s probability of heads, and want the posterior distribution.

\[\begin{aligned} Z &\sim \Binom(10, \theta) \\ \theta &\sim \Beta(20, 20) \end{aligned}\] What’s our best guess at $\theta$?

\[\begin{aligned} Z &\sim \Binom(10, \theta) \\ \theta &\sim \Beta(20, 20) \end{aligned}\]

What’s our best guess at $\theta$?

data {
    // stuff you input
}
parameters {
    // stuff you want to learn 
    // the posterior distribution of
}
model {
    // the action!
}

\[\begin{aligned} Z &\sim \Binom(10, \theta) \\ \theta &\sim \Beta(20, 20) \end{aligned}\]

What’s our best guess at $\theta$?

data {
    int N;   // number of flips
    int Z;   // number of heads
}
parameters {
    // stuff you want to learn 
    // the posterior distribution of
}
model {
    // the action!
}

\[\begin{aligned} Z &\sim \Binom(10, \theta) \\ \theta &\sim \Beta(20, 20) \end{aligned}\]

What’s our best guess at $\theta$?

data {
    int N;   // number of flips
    int Z;   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta;  
}
model {
    // the action!
}

\[\begin{aligned} Z &\sim \Binom(10, \theta) \\ \theta &\sim \Beta(20, 20) \end{aligned}\]

What’s our best guess at $\theta$?

data {
    int N;   // number of flips
    int Z;   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta;
}
model {
    Z ~ binomial(N, theta);
    theta ~ beta(20, 20);
}

The Stan model

library(rstan)
stan_block <- "
data {
    int N;   // number of flips
    int Z;   // number of heads
}
parameters {
    // probability of heads
    real<lower=0,upper=1> theta;
}
model {
    Z ~ binomial(N, theta);
    theta ~ beta(20, 20);
}
"
bb_model <- stan_model(
               model_code=stan_block)

Optimization: maximum likelihood

With a uniform prior, the “maximum posterior” parameter values are also the maximum likelihood values.

> help(optimizing)

optimizing                package:rstan                R Documentation

Obtain a point estimate by maximizing the joint posterior

Description:

     Obtain a point estimate by maximizing the joint posterior from the
     model defined by class 'stanmodel'.

Usage:

     ## S4 method for signature 'stanmodel'
     optimizing(object, data = list(),

Maximum posterior

library(rstan)
(fit <- optimizing(bb_model,  # stan model from above
                   data=list(N=10, Z=6)))

## $par
##   theta 
## 0.52083 
## 
## $value
## [1] -33.22939
## 
## $return_code
## [1] 0

The answer!

The maximum a posteriori probability estimate (MAP) is 0.52083.

post_fun <- function (p) {
    n <- 10; z <- 6; a <- b <- 20
    lik <- dbinom(z, size=n, prob=p)
    prior <- dbeta(p, a, b)
    return( prior * lik )
}
curve(post_fun, 0, 1, xlab=expression(theta), ylab='posterior prob')
points(fit$par, post_fun(fit$par), col='red', pch=20)

Note: If the prior was uniform, then this would be the maximum likelihood estimate (MLE) also.

plot of chunk stan_check

Wrap-up

Summary

posterior $=$ prior $\times$ likelihood

“updating prior ideas based on (more) data”

Stan lets us find best fitting parameters for models.

Next time: uncertainty, with Stan.

Appendix

Exercise: check the Beta-Binomial

Simulate $10^6$ coin probabilities, called $\theta$, from Beta(5,5). (rbeta)
For each coin, simulate 10 flips. (rbinom)
Make a histogram of the true probabilities (values of $\theta$) of only those coins having 3 of 10 heads.
Compare the distribution to Beta($a$,$b$) – with what $a$ and $b$? (dbeta)
Explain what happened.