Shrinkage and Sharing power

Peter Ralph

23 November 2020 – Advanced Biological Statistics

Simulation

Wouldn’t it be nice if we knew the truth?

Let’s write down a procedure to simulate data that looks like the baseball data.

batting$post_mean <- colMeans(param_samples$theta)
batting$fake_hits <- rbinom(nrow(batting), 
                            size=batting$AtBats, 
                            prob=batting$post_mean)

Option 1: By simulating draws from the posterior mean on \(\theta\), we can check that our model is accurately describing the data. Here is the real data against data simulated under the posterior mean \(\theta\) values. They look similar, which is good.

plot of chunk r plot_simfakedata

But if we want to know if we’re accurately estimating \(\mu\) and \(\kappa\), then we have to start with them, and simulate \(\theta\).

post_mean_mu <- colMeans(param_samples$mu)
post_mean_kappa <- colMeans(param_samples$kappa)
names(post_mean_mu) <- names(post_mean_kappa) <- levels(batting$PriPos)
batting$sim_theta <- rbeta(nrow(batting),
                           shape1=post_mean_mu[batting$PriPos] *
                                   post_mean_kappa[batting$PriPos],
                           shape2=(1-post_mean_mu[batting$PriPos]) * 
                                   post_mean_kappa[batting$PriPos])
batting$sim_theta_hits <- rbinom(nrow(batting), 
                                 size=batting$AtBats, 
                                 prob=batting$sim_theta)

Fit the model to the simulated data:

sim_fit <- sampling(pos_model, chains=3, iter=1000, control=list(max_treedepth=13),
                data=list(N=nrow(batting),
                          hits=batting$sim_theta_hits,
                          at_bats=batting$AtBats,
                          npos=nlevels(batting$PriPos),
                          position=as.numeric(batting$PriPos)))

## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess

Can we estimate \(\mu\) and \(\kappa\)?

sim_samples <- rstan::extract(sim_fit)
layout(t(1:2))
boxplot(sim_samples$mu, main="posterior distribution of mu")
points(1:9, post_mean_mu, col='red', pch=20, cex=2)
boxplot(sim_samples$kappa, main="posterior distribution of kappa")
points(1:9, post_mean_kappa, col='red', pch=20, cex=2)

plot of chunk r check_sim_mu

General questions with simulated data

Does my statistical inference method work?

Do the credible intervals contain the true value?

(i.e., Is the method “well-calibrated”?)

They usually should.

How wide are credible intervals, typically?

This is (one kind of) statistical power.

Example

Suppose that I have a large pot of coins that are all similar to each other. I flip each one ten times, and record the number of Heads. What is each coin’s probability of Heads?

Treated separately, we would be very uncertain about each coin.
Together, they should tell us very accurately what are likely values of \(\theta\).
This information can improve the estimate of each separate \(\theta\).
The more similar the coins are, the more information we gain.

By shrinking each estimate towards their common mean, we hope to gain power.

Shrinkage and baseball

Some players were at bat very few times. How does the information about their position affect our inference about their batting averages?

batting$post_med <- colMedians(param_samples$theta)
batting$post_Q1 <- colQuantiles(param_samples$theta, probs=0.25)
batting$post_Q3 <- colQuantiles(param_samples$theta, probs=0.75)
pos_means <- colMeans(param_samples$mu)
names(pos_means) <- levels(batting$PriPos)
pos_means

##     1st Base     2nd Base     3rd Base      Catcher Center Field   Left Field      Pitcher  Right Field    Shortstop 
##    0.2526893    0.2489027    0.2553305    0.2388856    0.2579088    0.2495208    0.1289218    0.2586869    0.2478177

Pitchers had posterior mean \(\mu\) of 0.1289218

with(subset(batting[order(batting$post_med),], PriPos=="Pitcher"), {
     plot(Hits / AtBats, main="Pitchers", xaxt='n', ylim=c(0, 0.4),
         xlab='player', ylab="posterior median theta");
     segments(x0=seq_along(Hits), y0=post_Q1, y1=post_Q3,
         col=adjustcolor('red',0.5));
     points(post_med, pch=20) })
abline(h=pos_means["Pitcher"], lwd=2, col=adjustcolor("blue", 0.5))
legend("topleft", pch=c(1,20), legend=c("observed", "estimate"))

plot of chunk r pitchers

Catchers had posterior mean \(\mu\) of 0.2388856:

with(subset(batting[order(batting$post_med),], PriPos=="Catcher"), {
     plot(Hits / AtBats, main="Catchers", xaxt='n', ylim=c(0, 0.4),
         xlab='player', ylab="posterior median theta");
     segments(x0=seq_along(Hits), y0=post_Q1, y1=post_Q3,
         col=adjustcolor('red',0.5));
     points(post_med, pch=20) })
abline(h=pos_means["Catcher"], lwd=2, col=adjustcolor("blue", 0.5))
legend("topleft", pch=c(1,20), legend=c("observed", "estimate"))

plot of chunk r catchers

Is shrinkage here a good idea?

With simulated data, compare median absolute error for

posterior mean \(\theta_i\)
empirical batting average

How’d we do? Let’s look at the true values of \(\theta\) (we know because we generated them) against the posterior means. Red lines are 95% credible intervals. plot of chunk r check_sim_result

Did we do better?

The mean absolute difference between the truth and

posterior mean: 0.0253747
empirical batting average (hits/at bats): 0.0866087

Using information about groups reduced our error by a factor of 4!