Peter Ralph
7 January 2019 – Advanced Biological Statistics
Course schedule
Modeling
Bayesian statistics, Stan, and MCMC
Reminder: standard linear regression gives a maximum likelihood estimate for \(\vec b\) (and \(\sigma\)) under the following model: \[\begin{aligned} Y_i &\sim \Normal(\mu_i, \sigma) \\ \mu_i &= b_0 + b_1 X_{i1} + \cdots + b_k X_{ik} . \end{aligned}\]
The key model component here is the linear predictor \(\mu\).
If the predictor, \(X\), is discrete then we are doing a \(t\)-test, or ANOVA, or something.
Simulate data - two groups with true means of 1.0 and 3.0 (a difference in means of 2.0):
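The simulation code isn't included in this excerpt; a minimal sketch consistent with the output below (the values of truth and n are assumptions inferred from the fits) is:

truth <- list(b=c(1, 3), sigma=0.5)    # assumed true group means and residual SD
n <- 200                               # assumed sample size (the t-test has about 193 df)
x <- sample(c(1, 2), size=n, replace=TRUE)           # group labels
y <- truth$b[x] + rnorm(n, mean=0, sd=truth$sigma)   # group mean plus Normal noise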
The \(t\)-test
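(The output below presumably comes from timing and then printing a Welch \(t\)-test:)

system.time( tt <- t.test(y ~ x) )
tt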
## user system elapsed
## 0.004 0.000 0.005
##
## Welch Two Sample t-test
##
## data: y by x
## t = -29.587, df = 193.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.164550 -1.894004
## sample estimates:
## mean in group 1 mean in group 2
## 0.9611205 2.9903971
with Stan
stt_block <- "
data {
    int N;
    int x[N]; // group index: 1's and 2's
    vector[N] y;
}
parameters {
    vector[2] b;
    real<lower=0> sigma;
}
model {
    y ~ normal(b[x], sigma);
}"
stt_model <- stan_model(model_code=stt_block)
system.time( stantt <- sampling(stt_model,
                    data=list(N=length(x), x=x, y=y), iter=1e3) )
## user system elapsed
## 0.861 0.211 1.615
## Inference for Stan model: f448e40a81bc1be5fd2fa806fefc48eb.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## b[1] 0.96 0.00 0.05 0.86 0.93 0.96 0.99 1.06 1902 1
## b[2] 2.99 0.00 0.05 2.90 2.96 2.99 3.02 3.09 1885 1
## sigma 0.49 0.00 0.02 0.44 0.47 0.49 0.50 0.54 1940 1
## lp__ 44.08 0.04 1.21 41.02 43.51 44.37 44.97 45.47 1128 1
##
## Samples were drawn using NUTS(diag_e) at Sat Jan 5 12:48:47 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Let’s do both methods many times, and see how well they estimate the difference in means.
do_rep <- function (n, b, sigma) {
    x <- sample(c(1, 2), size=n, replace=TRUE)
    y <- b[x] + rnorm(n, mean=0, sd=sigma)
    tt <- t.test(y ~ x)
    stantt <- sampling(stt_model,
                data=list(N=length(x), x=x, y=y), iter=1e3, refresh=0, open_progress=FALSE)
    bb <- extract(stantt)$b
    out <- c(tt$estimate, tt$conf.int,
             colMeans(bb),
             quantile(bb[,1] - bb[,2], prob=c(0.025, 0.975)))
    names(out) <- c('t_b1', 't_b2', 't_lower', 't_upper',
                    's_b1', 's_b2', 's_lower', 's_upper')
    return(out)
}
reps <- replicate(100, do_rep(n, truth$b, truth$sigma))
## Warning: There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
Estimates of differences in means across 100 datasets:
What if the noise was not Normal?
Substituting rnorm -> rcauchy:
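For instance (a sketch; the data-generating code isn't shown here):

y <- truth$b[x] + rcauchy(n, location=0, scale=truth$sigma)   # heavy-tailed Cauchy noise instead of Normal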
The \(t\)-test
## user system elapsed
## 0.002 0.000 0.002
##
## Welch Two Sample t-test
##
## data: y by x
## t = -5.5741, df = 186.48, p-value = 8.618e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.039750 -1.450554
## sample estimates:
## mean in group 1 mean in group 2
## 0.8088953 3.0540473
Let’s do both methods many times, and see how well they estimate the difference in means.
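The Stan fit below uses rtt_model, a robust (Cauchy-likelihood) version of the two-group model. Its definition isn't included in this excerpt; a minimal sketch would be:

rtt_block <- "
data {
    int N;
    int x[N]; // group index: 1's and 2's
    vector[N] y;
}
parameters {
    vector[2] b;
    real<lower=0> sigma;
}
model {
    y ~ cauchy(b[x], sigma);
}"
rtt_model <- stan_model(model_code=rtt_block)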
do_rep_cauchy <- function (n, b, sigma) {
    x <- sample(c(1, 2), size=n, replace=TRUE)
    y <- b[x] + rcauchy(n, location=0, scale=sigma)
    tt <- t.test(y ~ x)
    stantt <- sampling(rtt_model,
                data=list(N=length(x), x=x, y=y), iter=1e3, refresh=0, open_progress=FALSE)
    bb <- extract(stantt)$b
    out <- c(tt$estimate, tt$conf.int,
             colMeans(bb),
             quantile(bb[,1] - bb[,2], prob=c(0.025, 0.975)))
    names(out) <- c('t_b1', 't_b2', 't_lower', 't_upper',
                    's_b1', 's_b2', 's_lower', 's_upper')
    return(out)
}
cauchy_reps <- replicate(100, do_rep_cauchy(n, truth$b, truth$sigma))
Estimates of differences in means across 100 datasets:
Estimates of differences in means across 100 datasets (zoomed):
Why?
To simulate from: \[\begin{aligned} \mu_i &= b_0 + b_1 X_{i1} + \cdots + b_k X_{ik} \\ Y_i &\sim \Normal(\mu_i, \sigma) . \end{aligned}\]
To simulate from: \[\begin{aligned} \mu_i &= b_0 + b_1 X_{i1} + \cdots + b_k X_{ik} \\ Y_i &\sim \Normal(\mu_i, \sigma) . \end{aligned}\]
In R, %*% is matrix multiplication: if \(X\) is an \(n \times k\) matrix and \(b\) is a vector of length \(k\), then X %*% b (or \(Xb\) in math notation) is shorthand for the \(n\)-vector \[ (Xb)_i = \sum_{j=1}^k X_{ij} b_j . \]
In Stan, matrix multiplication is written *.
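For example (a small sketch with made-up dimensions and coefficients):

n <- 100; k <- 3
X <- cbind(1, matrix(rnorm(n * (k - 1)), nrow=n))   # a column of 1's carries the intercept
b <- c(2.0, -0.5, 1.5)
mu <- X %*% b                      # the linear predictor, an n-vector
y <- rnorm(n, mean=mu, sd=0.7)     # simulated responses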
Implement standard multiple linear regression in Stan, and compare (roughly) to lm().
Try doing the same thing with a for loop and with *.
Do Cauchy regression.
With a for loop:
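The code for this isn't included in this excerpt; a sketch of the model written with a for loop (with the matrix-multiplication version shown in a comment for comparison) might be:

lr_loop_block <- "
data {
    int N;
    int k;
    matrix[N, k] X;
    vector[N] y;
}
parameters {
    vector[k] b;
    real<lower=0> sigma;
}
model {
    for (i in 1:N) {
        y[i] ~ normal(X[i] * b, sigma);   // X[i] is row i; row_vector * vector gives a real
    }
    // equivalently, without the loop:  y ~ normal(X * b, sigma);
}"
lr_loop_model <- stan_model(model_code=lr_loop_block)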
Using a model that doesn’t describe the data well often results in poor estimates.
But how do we know this is happening?
It depends, but here’s one common problem.
A \(t\)-test is misled by Cauchy noise because it tries too hard to fit the extreme outliers.
Taking the noise in the data too seriously is known as “overfitting” (or, “overgeneralizing”).
Happily, it’s easy to spot (if you have independent observations).
Maybe your model fits, but is it any good?
Divide your data randomly into 5 pieces.
Fit your model on 4/5ths, and see how well it predicts the remaining 1/5th.
Do this for each of the 5 pieces.
Then, compare the mean crossvalidation accuracy between methods.
Example: for the t-test.
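The code that builds the five train/test splits and computes the t-test's predictions isn't shown here; a sketch (using the root median squared prediction error, as in the Stan version below, and the assumed name t_xvals) is:

fold <- sample(rep(1:5, length.out=length(y)))    # assign each observation to one of 5 folds
datasets <- lapply(1:5, function (k) {
    df <- data.frame(x=x, y=y)
    list(train=df[fold != k, ], test=df[fold == k, ])
})
(t_xvals <- sapply(datasets, function (data) {
    # the "t-test" prediction for each test observation is its group's training-sample mean
    means <- tapply(data$train$y, data$train$x, mean)
    pred <- means[as.character(data$test$x)]
    return(sqrt(median( (data$test$y - pred)^2 )))
}))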
## [1] 1.0096039 0.8462026 1.2437974 0.7625182 0.6861961
(stan_xvals <- sapply(datasets, function (data) {
    stantt <- sampling(rtt_model,   # cauchy model
                data=list(N=nrow(data$train),
                          x=data$train$x,
                          y=data$train$y),
                iter=1e3)
    post_samples <- extract(stantt)
    post_means <- colMeans(post_samples$b)
    pred <- post_means[data$test$x]
    return(sqrt(median( (data$test$y - pred)^2 )))
} ))
## [1] 0.6229829 0.7243239 0.8316605 0.7068205 0.3082926
Stan does substantially better: its crossvalidation error is about 30% smaller (equivalently, the \(t\)-test's error is about 40% larger):
## mean rep 1 rep 2 rep 3 rep 4 rep 5
## t test 0.9096637 1.0096039 0.8462026 1.2437974 0.7625182 0.6861961
## stan 0.6388161 0.6229829 0.7243239 0.8316605 0.7068205 0.3082926
It turns out that if
\[\begin{aligned} \beta &\sim \Normal(0, 1/\sqrt{\lambda}) \\ \lambda &\sim \Gam(1/2, 1/2) \end{aligned}\]
then
\[\begin{aligned} \beta &\sim \Cauchy(0, 1). \end{aligned}\]
What black magic is this??
It says so here.
If you like to do integrals, you can check mathematically.
Or, you can check with simulation.
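For instance (a quick sanity check by simulation):

lambda <- rgamma(1e5, shape=1/2, rate=1/2)
beta <- rnorm(1e5, mean=0, sd=1/sqrt(lambda))
# the central quantiles should match those of a standard Cauchy
round(rbind(simulated=quantile(beta, c(0.1, 0.25, 0.5, 0.75, 0.9)),
            cauchy=qcauchy(c(0.1, 0.25, 0.5, 0.75, 0.9))), 2)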
If \[\begin{aligned} \beta &\sim \Normal(0, 1/\sqrt{\lambda}) \\ \lambda &\sim \Gam(\nu/2, \nu/2) \end{aligned}\] then \(\beta \sim t(\text{df}=\nu)\), i.e., has “Student’s \(t\) distribution with \(\nu\) degrees of freedom”, which has density \[ \P\{ \beta = t \} \propto \left(1 + \frac{t^2}{\nu} \right)^{-\frac{\nu+1}{2}} .\]
Facts: \(t(\text{df}=1)\) is the \(\Cauchy(0, 1)\) distribution, and as \(\nu \to \infty\), the \(t(\text{df}=\nu)\) distribution converges to \(\Normal(0, 1)\).
How does leaf anthocyanin concentration increase with time in the sun, in ten different inbred lines with different baseline concentrations?
leaf_model <- stan_model(model_code=
"
data {
    int N;
    vector[N] x;  // time
    vector[N] y;  // anthocyanin
    int geno[N];  // genotype index
    int ng;       // number of genotypes
}
parameters {
    real b;               // slope
    vector[ng] c;         // genotype intercept
    real<lower=0> sigma;  // scale
    real<lower=0> nu;     // hyperprior SD
}
model {
    y ~ cauchy(b * x + c[geno], sigma);
    b ~ normal(0, nu);
    c ~ normal(0, nu);
    sigma ~ normal(0, 10);
    nu ~ gamma(0.5, 0.5);
}
")
## Inference for Stan model: ee17257244291f76913790aca9fdaba7.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
## b 1.47 0.00 0.04 1.38 1.44 1.47 1.50 1.56 1474
## c[1] 1.68 0.01 0.50 0.72 1.35 1.68 2.00 2.69 2577
## c[2] 16.20 0.01 0.50 15.21 15.86 16.20 16.54 17.18 3628
## c[3] 7.25 0.01 0.48 6.31 6.93 7.25 7.58 8.17 3182
## c[4] 13.65 0.01 0.56 12.50 13.30 13.65 14.01 14.76 3547
## c[5] 6.31 0.01 0.39 5.55 6.06 6.31 6.56 7.08 3662
## c[6] 2.28 0.01 0.42 1.45 2.01 2.29 2.56 3.10 2641
## c[7] 7.09 0.01 0.63 5.91 6.64 7.08 7.51 8.34 3111
## c[8] 16.00 0.01 0.40 15.21 15.73 15.98 16.26 16.81 2361
## c[9] 0.47 0.01 0.34 -0.20 0.24 0.46 0.70 1.13 2661
## c[10] 2.22 0.01 0.46 1.28 1.92 2.23 2.53 3.09 2439
## sigma 1.47 0.00 0.12 1.25 1.39 1.47 1.55 1.70 4905
## nu 8.07 0.02 1.44 5.84 7.01 7.90 8.91 11.47 4105
## lp__ -487.56 0.06 2.61 -493.56 -489.13 -487.22 -485.66 -483.45 1743
## Rhat
## b 1
## c[1] 1
## c[2] 1
## c[3] 1
## c[4] 1
## c[5] 1
## c[6] 1
## c[7] 1
## c[8] 1
## c[9] 1
## c[10] 1
## sigma 1
## nu 1
## lp__ 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jan 10 11:24:49 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
(Posterior interval plots of the first 10 parameters, with 80% inner and 95% outer intervals.)
A single \(k\)-fold crossvalidation step should:
Randomly extract \(1/k\) of the data as “test”, and leave the rest as “training”.
Fit the model on the training data.
Use the result to predict the values in the test data.
Report a measure of prediction error.
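A sketch of one such step using lm() on the leaf data (the data frame leaf_data, its columns, and the use of root mean squared error are assumptions here):

xval_step <- function (data, k=5) {
    test_rows <- sample.int(nrow(data), size=round(nrow(data) / k))
    train <- data[-test_rows, ]    # (k-1)/k of the data
    test <- data[test_rows, ]      # the held-out 1/k
    fit <- lm(y ~ x + factor(geno), data=train)   # fit on the training data
    pred <- predict(fit, newdata=test)            # predict the test data
    return(sqrt(mean( (test$y - pred)^2 )))       # root mean squared prediction error
}
xval_step(leaf_data)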
(my results:)
> summary(crossvals)
lm stan stan_interactions stan_df
Min. : 5.637 Min. : 2.381 Min. : 2.462 Min. :2.497
1st Qu.: 6.109 1st Qu.: 2.771 1st Qu.: 3.030 1st Qu.:3.237
Median : 6.544 Median : 3.106 Median : 3.597 Median :3.591
Mean : 7.979 Mean : 4.028 Mean : 4.190 Mean :4.120
3rd Qu.: 6.837 3rd Qu.: 3.932 3rd Qu.: 4.217 3rd Qu.:3.836
Max. :14.808 Max. :14.441 Max. :14.265 Max. :9.173
\[ \text{data} \sim \text{fit} + \text{residual} \]
The three ingredients of a generalized linear model:
A probability distribution (the “family”; describes the output).
A linear predictor (connects input to output).
A link function (connects linear predictor to probability distribution).
Linear regression: \[ Y \sim \Normal(X \beta, \sigma) .\]
Logistic regression: \[ Y \sim \Binom(n, \logistic(X\beta)) .\]
Gamma regression: \[ Y \sim \Gam(\text{scale}=\exp(X\beta)/k, \text{shape}=k) .\]
glm(y ~ x, family=binomial)
family {stats}
Family objects provide a convenient way to specify the details of the models used by functions such as glm.
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
Arguments
link: a specification for the model link function. This can be a name/expression, a literal character string, a length-one character vector, or an object of class "link-glm" (such as generated by make.link) provided it is not specified via one of the standard names given next.
The gaussian family accepts the links (as names) identity, log and inverse; the binomial family the links logit, probit, cauchit, (corresponding to logistic, normal and Cauchy CDFs respectively) log and cloglog (complementary log-log); the Gamma family the links inverse, identity and log; the poisson family the links log, identity, and sqrt; and the inverse.gaussian family the links 1/mu^2, inverse, identity and log.
100 trials, where probability of success depends on \(x\):
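(The simulation code isn't shown; a sketch with assumed true values, roughly consistent with the fits below:)

n <- 100
alpha <- -7.5; beta <- 1.3                 # assumed true intercept and slope
x <- runif(n, 0, 10)                       # assumed range of the predictor
p <- 1 / (1 + exp(-(alpha + beta * x)))    # the logistic function
zz <- rbinom(n, size=1, prob=p)            # one Bernoulli trial per value of x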
glm()
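(Reconstructed from the Call line in the summary below:)

glm_fit <- glm(zz ~ x, family="binomial")
summary(glm_fit)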
##
## Call:
## glm(formula = zz ~ x, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.50439 -0.25097 -0.05869 0.35005 2.35424
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.4293 1.6038 -4.632 3.62e-06 ***
## x 1.2879 0.2683 4.800 1.59e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 134.60 on 99 degrees of freedom
## Residual deviance: 52.53 on 98 degrees of freedom
## AIC: 56.53
##
## Number of Fisher Scoring iterations: 6
\[\begin{aligned} Z &\sim \Binom(1, P) \\ P &= \logistic(\alpha + \beta X) \end{aligned}\]
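The Stan code for this model isn't included in this excerpt; a minimal version would be:

logit_block <- "
data {
    int N;
    vector[N] x;
    int<lower=0, upper=1> z[N];
}
parameters {
    real alpha;
    real beta;
}
model {
    z ~ bernoulli_logit(alpha + beta * x);
}"
logit_model <- stan_model(model_code=logit_block)
fit <- sampling(logit_model, data=list(N=length(x), x=x, z=zz))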
## $summary
## mean se_mean sd 2.5% 25% 50%
## alpha -8.062960 0.07100136 1.7142965 -11.7011825 -9.184292 -7.931243
## beta 1.388078 0.01215117 0.2865985 0.8817608 1.179606 1.370420
## lp__ -27.578616 0.02894954 0.9692251 -30.1689775 -27.997974 -27.294339
## 75% 97.5% n_eff Rhat
## alpha -6.835255 -5.095609 582.9597 1.006586
## beta 1.579790 2.005420 556.3031 1.007479
## lp__ -26.868064 -26.584681 1120.8976 1.006472
##
## $c_summary
## , , chains = chain:1
##
## stats
## parameter mean sd 2.5% 25% 50%
## alpha -8.020095 1.7351854 -11.9725743 -9.183053 -7.915507
## beta 1.380041 0.2890422 0.8603793 1.169197 1.362963
## lp__ -27.562393 0.9639240 -30.1381199 -27.951706 -27.270999
## stats
## parameter 75% 97.5%
## alpha -6.804995 -4.904487
## beta 1.572540 2.010255
## lp__ -26.882973 -26.593221
##
## , , chains = chain:2
##
## stats
## parameter mean sd 2.5% 25% 50%
## alpha -7.976819 1.6188518 -11.4078137 -9.038962 -7.899671
## beta 1.373324 0.2700576 0.9073788 1.169358 1.362537
## lp__ -27.494653 0.8989894 -30.0493349 -27.848116 -27.238508
## stats
## parameter 75% 97.5%
## alpha -6.768938 -5.297587
## beta 1.550171 1.948912
## lp__ -26.843572 -26.577144
##
## , , chains = chain:3
##
## stats
## parameter mean sd 2.5% 25% 50%
## alpha -8.250431 1.7856021 -12.0756699 -9.367484 -8.050983
## beta 1.420788 0.2997694 0.9192965 1.210781 1.391901
## lp__ -27.693730 1.0237408 -30.2012787 -28.219400 -27.408870
## stats
## parameter 75% 97.5%
## alpha -6.995168 -5.275996
## beta 1.617131 2.063693
## lp__ -26.927294 -26.592515
##
## , , chains = chain:4
##
## stats
## parameter mean sd 2.5% 25% 50%
## alpha -8.004497 1.7018467 -11.4531795 -9.167022 -7.899202
## beta 1.378157 0.2846246 0.8413456 1.178552 1.360691
## lp__ -27.563688 0.9768450 -30.1898946 -27.991265 -27.275294
## stats
## parameter 75% 97.5%
## alpha -6.823629 -4.848098
## beta 1.580438 1.940882
## lp__ -26.832823 -26.581075
samples <- extract(fit)
layout(t(1:2))
hist(samples$alpha, main=expression(alpha))
abline(v=alpha, col='red'); abline(v=coef(glm_fit)[1], col='green')
hist(samples$beta, main=expression(beta))
abline(v=beta, col='red'); abline(v=coef(glm_fit)[2], col='green')
legend("topright", lty=1, col=c('red', 'green'), legend=c('truth', 'glm'))