\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \newcommand{\given}{\;\vert\;} \]
Peter Ralph
13 and 15 November 2018 – Advanced Biological Statistics
Suppose I have two trick coins: coin (A) comes up heads 75% of the time, and coin (B) comes up heads only 25% of the time.
But I lost one of them, and I don’t know which one I have left! So, I flip it 10 times and get 6 heads. Which is it, and how sure are you?
Er, probably coin (A)?
Well, \[\begin{aligned} \P\{ \text{6 H in 10 flips} \given \text{coin A} \} &= \binom{10}{6} (.75)^6 (.25)^4 \\ &= 0.146 \end{aligned}\] and \[\begin{aligned} \P\{ \text{6 H in 10 flips} \given \text{coin B} \} &= \binom{10}{6} (.25)^6 (.75)^4 \\ &= 0.016 \end{aligned}\] … so, probably coin (A)?
For a precise answer…
Before flipping, each coin seems equally likely. Then
\[\begin{aligned} \P\{ \text{coin A} \given \text{6 H in 10 flips} \} &= \frac{ \frac{1}{2} \times 0.146 }{ \frac{1}{2} \times 0.146 + \frac{1}{2} \times 0.016 } \\ &= 0.9 \end{aligned}\]
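In R, a quick check of this computation:

lik_A <- dbinom(6, size=10, prob=0.75)   # 0.146
lik_B <- dbinom(6, size=10, prob=0.25)   # 0.016
(1/2 * lik_A) / (1/2 * lik_A + 1/2 * lik_B)   # 0.900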
Probabilities are proportions: \(\hspace{2em} 0 \le \P\{A\} \le 1\)
Everything: \(\hspace{2em} \P\{ \Omega \} = 1\)
Complements: \(\hspace{2em} \P\{ \text{not } A\} = 1 - \P\{A\}\)
Disjoint events: If \(\hspace{2em} \P\{A \text{ and } B\} = 0\) then \(\hspace{2em} \P\{A \text{ or } B\} = \P\{A\} + \P\{B\}\).
Independence: \(A\) and \(B\) are independent iff \(\P\{A \text{ and } B\} = \P\{A\} \P\{B\}\).
Conditional probability: \[\P\{A \given B\} = \frac{\P\{A \text{ and } B\}}{ \P\{B\} }\]
A consequence is Bayes’ rule:
\[\P\{B \given A\} = \frac{\P\{B\} \P\{A \given B\}}{ \P\{A\} } .\]
In “Bayesian statistics”: \(\P\{B\}\) is the prior, \(\P\{A \given B\}\) is the likelihood, and \(\P\{B \given A\}\) is the posterior.
Suppose instead I had 9 coins, with probabilities 10%, 20%, …, 90%; as before I flipped one 10 times and got 6 heads. For each \(\theta\) in \(0.1, 0.2, \ldots, 0.8, 0.9,\) find \[\begin{aligned} \P\{\text{ coin has prob $\theta$ } \given \text{ 6 H in 10 flips } \} . \end{aligned}\]
Question: which coin(s) is it, and how sure are we? (And, what does it mean when we say how sure we are?)
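One way to compute these nine posterior weights in R (a sketch; variable names are ours):

theta <- seq(0.1, 0.9, by=0.1)
prior <- rep(1/9, 9)                    # each coin equally likely, a priori
lik <- dbinom(6, size=10, prob=theta)   # P( 6 H in 10 flips | theta )
posterior <- prior * lik / sum(prior * lik)
rbind(theta, posterior=round(posterior, 3))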
prior \(\times\) likelihood \(\propto\) posterior

(figures: the prior, likelihood, and posterior across the nine possible values of \(\theta\))
Recall: there were nine possible values of \(\theta\).
Which coin is it, and how sure are you?
Possible types of answer: a single best guess; a range of plausible values; or the full posterior distribution.
Give examples of when each type of answer is the right one.
If \[P \sim \text{Beta}(a,b)\] then \(P\) has probability density \[p(\theta) = \frac{ \theta^{a-1} (1 - \theta)^{b-1} }{ B(a,b) } , \] where \(B(a,b)\) is the normalizing constant (the Beta function).
Takes values between 0 and 1.
If \(U_{(1)} < U_{(2)} < \cdots < U_{(n)}\) are sorted, independent \(\text{Unif}[0,1]\) then \(U_{(k)} \sim \text{Beta}(k, n-k+1)\).
Mean: \(a/(a+b)\).
Larger \(a+b\) means a more tightly concentrated distribution (the width shrinks like \(1/\sqrt{a+b}\)).
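To see the concentration effect, compare a few Beta densities with the same mean (a quick sketch of ours):

# same mean (1/2), increasing a+b
curve(dbeta(x, 2, 2), 0, 1, ylim=c(0, 9), xlab=expression(theta), ylab='density')
curve(dbeta(x, 10, 10), add=TRUE, lty=2)
curve(dbeta(x, 50, 50), add=TRUE, lty=3)
legend('topright', lty=1:3, legend=c('Beta(2,2)', 'Beta(10,10)', 'Beta(50,50)'))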
In pairs, simulate:

1. One thousand “random coins” whose probabilities are drawn from a \(\Beta(5,5)\) distribution (rbeta()). Make a histogram of these probabilities.
2. Flip each coin ten times and record the number of heads (rbinom()).
3. Make a histogram of the probabilities of those coins that got exactly 3 heads, and compare to the first histogram.
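One way to carry this out (a sketch; variable names are ours):

ncoins <- 1000
theta <- rbeta(ncoins, 5, 5)               # 1. random coin probabilities
hist(theta, breaks=30, xlab='probability of heads', main='all coins')
Z <- rbinom(ncoins, size=10, prob=theta)   # 2. ten flips per coin
hist(theta[Z == 3], breaks=30, xlab='probability of heads',
     main='coins with exactly 3 heads')    # 3. compare to the first histogram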
Now suppose we want to estimate the probability of heads for a coin without knowing the possible values in advance (or a disease incidence, or an error rate in an experiment, …).
We flip it \(n\) times and get \(z\) Heads.
The likelihood of this, given the prob-of-heads \(\theta\), is: \[p(z \given \theta) = \binom{n}{z}\theta^z (1-\theta)^{n-z} . \]
How to weight the possible \(\theta\)? Need a flexible set of weighting functions, i.e., prior distributions on \([0,1]\).
What would we use if:

- the coin is probably close to fair?
- the disease is probably quite rare?
- we have no idea whatsoever?
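For instance (these particular choices are ours, for illustration): a prior concentrated near 1/2, a prior piled up near 0, and a flat prior:

curve(dbeta(x, 20, 20), 0, 1, ylab='prior density')   # probably close to fair
curve(dbeta(x, 1, 20), 0, 1, ylab='prior density')    # probably quite rare
curve(dbeta(x, 1, 1), 0, 1, ylab='prior density')     # no idea whatsoever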
If \[\begin{aligned} P &\sim \text{Beta}(a,b) \\ Z &\sim \text{Binom}(n,P) , \end{aligned}\] then by Bayes’ rule: \[\begin{aligned} \P\{ P = \theta \given Z = z\} &= \frac{\P\{Z = z \given P = \theta \} \P\{P = \theta\}}{\P\{Z = z\}} \\ &= \frac{ \binom{n}{z}\theta^z (1-\theta)^{n-z} \times \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)} }{ \text{(something)} } \\ &= \text{(something else)} \times \theta^{a + z - 1} (1-\theta)^{b + n - z - 1} . \end{aligned}\]
“Miraculously” (the Beta is the conjugate prior for the Binomial), \[\begin{aligned} P \given Z = z \sim \text{Beta}(a+z, b+n-z) . \end{aligned}\]
We flip an odd-looking coin 100 times, and get 65 heads. What is its true* probability of heads?
True = ??
What prior to use?
Is it reasonable that \(\theta = 1/2\)?
Best guess at \(\theta\)?
How far off are we, probably?
If we want a point estimate: the posterior mean, the posterior median, or the posterior mode (the “maximum a posteriori” estimate, or MAP). These all convey “where the posterior distribution is”, more or less.
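For instance, for the 65-heads-in-100-flips example above, assuming a uniform \(\Beta(1,1)\) prior (our choice, for illustration), the posterior is \(\Beta(66, 36)\):

a <- 1 + 65; b <- 1 + 35           # posterior Beta(66, 36) under a Beta(1,1) prior
c(mean   = a / (a + b),            # 0.647
  median = qbeta(0.5, a, b),       # about 0.648
  mode   = (a - 1) / (a + b - 2))  # 0.65: the MAP, and here also the MLE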
What about uncertainty?
Definition: A 95% credible region is a portion of parameter space having a total of 95% of the posterior probability.
(same with other numbers for “95%”)
If we construct a 95% credible interval for \(\theta\) for each of many datasets; and the coin in each dataset has \(\theta\) drawn independently from the prior, then the true \(\theta\) will fall in its credible interval for 95% of the datasets.
If we construct a 95% credible interval for \(\theta\) with a dataset, and the distribution of the coin’s true \(\theta\) across many parallel universes is given by the prior, then the true \(\theta\) will be in the credible interval in 95% of those universes.
Given my prior beliefs (prior distribution), the posterior distribution is the most rational\({}^*\) way to update my beliefs to account for the data.
\({}^*\) if you do this many times you will be wrong least often
\({}^*\) or you will be wrong in the fewest possible universes
Definition: The “95% highest density interval” (HDI) is the 95% credible interval all of whose points have higher posterior probability density than any point outside it.
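Continuing the example above with an assumed uniform prior: an equal-tailed 95% credible interval comes directly from qbeta(), and an HDI can be computed from posterior samples (here using the HDInterval package, if installed):

a <- 1 + 65; b <- 1 + 35
qbeta(c(0.025, 0.975), a, b)                      # equal-tailed 95% credible interval
HDInterval::hdi(rbeta(1e6, a, b), credMass=0.95)  # approximate 95% HDI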
((back to the simple example))
Suppose now we have data from \(n\) different coins from the same source. We don’t assume they have the same \(\theta\), but we don’t know what its distribution is, so we try to learn it.
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &\sim \Beta(\text{mode}=\omega, \text{conc}=\kappa) \\ \omega & = ? \\ \kappa & = ? \end{aligned}\]
note: The “mode” and “concentration” are related to the shape parameters by: \(\alpha = \omega (\kappa - 2) + 1\) and \(\beta = (1 - \omega) (\kappa - 2) + 1\).
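A small R helper (the function name is ours) to convert mode and concentration into shape parameters:

# mode omega in (0,1), concentration kappa > 2
beta_shapes <- function (omega, kappa) {
    c(alpha = omega * (kappa - 2) + 1,
      beta  = (1 - omega) * (kappa - 2) + 1)
}
beta_shapes(omega=0.75, kappa=6)   # alpha=4, beta=2: mode of Beta(4,2) is 0.75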
What is different between:
Pick a value of \(\theta\) at random from \(\Beta(3,1)\). Flip one thousand \(\theta\)-coins, 500 times each.
Pick one thousand random \(\theta_i \sim \Beta(3,1)\) values. Flip one thousand coins, one for each \(\theta_i\), 500 times each.
ncoins <- 1000
nflips <- 500
# (1) one theta, shared by all coins
theta1 <- rbeta(1, 3, 1)
binom_Z <- rbinom(ncoins, size=nflips, prob=theta1)
# (2) a separate theta for each coin
theta2 <- rbeta(ncoins, 3, 1)
bb_Z <- rbinom(ncoins, size=nflips, prob=theta2)
# blue: shared theta; red: theta varying across coins
hist(binom_Z, breaks=30, col=adjustcolor("blue", 0.5), main='', xlim=c(0,nflips), freq=FALSE, xlab='number of Heads')
hist(bb_Z, breaks=30, col=adjustcolor("red", 0.5), add=TRUE, freq=FALSE)
data {
// stuff you input
}
transformed data {
// stuff that's calculated from the data (just once, at the start)
}
parameters {
// stuff you want to learn about
}
transformed parameters {
// stuff that's calculated from the parameters (at every step)
}
model {
// the action!
}
generated quantities {
// stuff you want computed also along the way
}
We’ve flipped a coin 10 times and got 6 Heads. We think the coin is close to fair, so we put a \(\Beta(20,20)\) prior on its probability of heads, and want the posterior distribution.
\[\begin{aligned} Z &\sim \Binom(10, \theta) \\ \theta &\sim \Beta(20, 20) \end{aligned}\]
What’s our best guess at \(\theta\)?
data {
// stuff you input
}
parameters {
// stuff you want to learn
// the posterior distribution of
}
model {
// the action!
}
First, fill in the data block (the number of flips and the number of heads):
data {
int N; // number of flips
int Z; // number of heads
}
parameters {
// stuff you want to learn
// the posterior distribution of
}
model {
// the action!
}
Next, declare the parameter we want the posterior distribution of:
data {
int N; // number of flips
int Z; // number of heads
}
parameters {
// probability of heads
real<lower=0,upper=1> theta;
}
model {
// the action!
}
Finally, write the model itself:
data {
int N; // number of flips
int Z; // number of heads
}
parameters {
// probability of heads
real<lower=0,upper=1> theta;
}
model {
Z ~ binomial(N, theta);
theta ~ beta(20, 20);
}
With a uniform prior, the “maximum posterior” parameter values are also the maximum likelihood values.
> help(optimizing)
optimizing package:rstan R Documentation
Obtain a point estimate by maximizing the joint posterior
Description:
Obtain a point estimate by maximizing the joint posterior from the
model defined by class 'stanmodel'.
Usage:
## S4 method for signature 'stanmodel'
optimizing(object, data = list(),
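The output below came from compiling this model and maximizing the posterior; a call along these lines (the file name is our assumption) produces it:

library(rstan)
coin_model <- stan_model("coin.stan")   # the Stan program above, saved to a file
fit <- optimizing(coin_model, data=list(N=10, Z=6))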
## Initial log joint probability = -42.3369
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
## $par
## theta
## 0.5208333
##
## $value
## [1] -33.22939
##
## $return_code
## [1] 0
The maximum a posteriori probability estimate (MAP) is 0.5208333. (Check: the posterior is \(\Beta(20+6, 20+4) = \Beta(26, 24)\), whose mode is \((26-1)/(26+24-2) = 25/48 = 0.5208333\).)
# unnormalized posterior: prior times likelihood
post_fun <- function (p) {
    n <- 10; z <- 6     # data: 6 heads in 10 flips
    a <- b <- 20        # the Beta(20,20) prior
    lik <- dbinom(z, size=n, prob=p)
    prior <- dbeta(p, a, b)
    return( prior * lik )
}
curve(post_fun, 0, 1, xlab=expression(theta), ylab='posterior (unnormalized)')
points(fit$par, post_fun(fit$par), col='red', pch=20)   # the MAP found above
Note: if the prior were uniform, then this would also be the maximum likelihood estimate (MLE).
posterior \(=\) prior \(\times\) likelihood
“updating prior ideas based on (more) data”
Stan lets us find the best-fitting parameters for models.
Next time: uncertainty, with Stan.
1. Simulate \(10^6\) coin probabilities, called \(\theta\), from Beta(5,5) (rbeta).
2. For each coin, simulate 10 flips (rbinom).
3. Make a histogram of the true probabilities (values of \(\theta\)) of only those coins having 3 of 10 heads.
4. Compare the distribution to Beta(\(a\),\(b\)) – with what \(a\) and \(b\)? (dbeta)
5. Explain what happened.
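A sketch of the comparison step (by the conjugacy result above, the coins with 3 heads in 10 flips should follow \(\Beta(5+3, 5+7) = \Beta(8, 12)\)):

theta <- rbeta(1e6, 5, 5)
Z <- rbinom(1e6, size=10, prob=theta)
hist(theta[Z == 3], breaks=50, freq=FALSE, xlab=expression(theta),
     main='coins with 3 of 10 heads')
curve(dbeta(x, 5 + 3, 5 + 10 - 3), add=TRUE, col='red', lwd=2)   # Beta(8, 12)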