Peter Ralph
Advanced Biological Statistics
We have a dataset of batting averages of baseball players, having
batting <- read.csv("data/BattingAveragePlus.csv", header=TRUE, stringsAsFactors=TRUE)
head(batting)
## Player PriPos Hits AtBats PlayerNumber PriPosNumber playerID weight height bats birthYear debut finalGame
## 1 Aaron Cunningham Left Field 17 97 198 7 cunniaa01 215 71 R 1986 2008-08-31 2012-07-24
## 2 Aaron Harang Pitcher 4 56 360 1 haranaa01 260 79 R 1978 2002-05-25 2015-10-03
## 3 Aaron Hill 2nd Base 184 609 394 4 hillaa01 200 71 R 1982 2005-05-20 2016-10-01
## 4 Aaron Laffey Pitcher 0 1 482 1 laffeaa01 190 71 L 1985 2007-08-04 2015-07-31
## 5 Aaron Loup Pitcher 0 1 522 1 loupaa01 210 71 L 1987 2012-07-14 2016-09-30
## 6 Adam Dunn 1st Base 110 539 242 3 dunnad01 285 78 L 1979 2001-07-20 2014-09-28
The overall batting average of the 886 players is 0.254768.
Here is the average by position.
batting %>% group_by(PriPos) %>%
summarise(num=n(), BatAvg=sum(Hits)/sum(AtBats)) %>%
select(PriPos, num, BatAvg)
## # A tibble: 9 × 3
## PriPos num BatAvg
## <fct> <int> <dbl>
## 1 1st Base 77 0.260
## 2 2nd Base 69 0.256
## 3 3rd Base 73 0.265
## 4 Catcher 95 0.247
## 5 Center Field 61 0.265
## 6 Left Field 96 0.258
## 7 Pitcher 305 0.130
## 8 Right Field 53 0.263
## 9 Shortstop 57 0.255
What’s the overall batting average?
Do some positions tend to be better batters?
How much variation in batting average is there between players of the same position?
Do height or weight predict batting average?
Exercise:
Come up with some quick-and-dirty statistics to answer these questions.
Motivation:
which is \[\begin{aligned} \frac{\text{Hits}_i}{\text{AtBats}_i} &\sim \Normal(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_{\text{PriPos}_i} + \beta_w w_i + \beta_h h_i \end{aligned}\]
But: that’s not right - Hits/AtBats
is between 0 and 1!
And, 2/5 is very different than 2000/5000!
\[\begin{aligned} \text{Hits}_i &\sim \Binom(\text{AtBats}_i, \mu_i) \\ \mu_i &= \beta_0 + \beta_{\text{PriPos}_i} + \beta_w w_i + \beta_h h_i \end{aligned}\]
But: what’s to keep \(\mu_i\) between 0 and 1?
\[\begin{aligned} \logistic(x) = \frac{1}{1 + e^{-x}} . \end{aligned}\]
\[\begin{aligned} \logit(x) = \log\left(\frac{x}{1-x}\right) . \end{aligned}\]
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \end{aligned}\]
For the \(i\)th player:
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_0 \end{aligned}\]
For the \(i\)th player:
and
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} \end{aligned}\]
For the \(i\)th player:
and
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} \end{aligned}\] and \[\begin{aligned} \beta_p &\sim \Normal(0, 5) \end{aligned}\]
For the \(i\)th player:
and
We’ll see more about brms soon!
fm <- brm(
Hits | trials(AtBats) ~ 0 + PriPos,
data=batting,
family='binomial',
prior = c(prior(normal(0, 5), class = b)),
iter = 2000, chains = 3
)
## Compiling Stan program...
## Start sampling
## Family: binomial
## Links: mu = logit
## Formula: Hits | trials(AtBats) ~ 0 + PriPos
## Data: batting (Number of observations: 886)
## Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 3000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## PriPos1stBase -1.05 0.02 -1.08 -1.02 1.00 6014 2477
## PriPos2ndBase -1.07 0.02 -1.10 -1.04 1.00 5628 2444
## PriPos3rdBase -1.02 0.02 -1.05 -0.99 1.00 6218 2219
## PriPosCatcher -1.11 0.02 -1.15 -1.08 1.00 5343 2524
## PriPosCenterField -1.02 0.02 -1.05 -0.99 1.00 6430 2572
## PriPosLeftField -1.06 0.02 -1.09 -1.02 1.00 6452 2285
## PriPosPitcher -1.90 0.04 -1.99 -1.82 1.00 5152 2275
## PriPosRightField -1.03 0.02 -1.07 -1.00 1.00 7158 2297
## PriPosShortstop -1.07 0.02 -1.11 -1.04 1.00 5510 2356
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
predict( )
pos_df <- data.frame(
PriPos=levels(batting$PriPos),
AtBats=10000)
fm_post <- predict(fm, newdata=pos_df)
cbind(pos_df, fm_post)
## PriPos AtBats Estimate Est.Error Q2.5 Q97.5
## 1 1st Base 10000 2598.963 53.28684 2498.000 2703.000
## 2 2nd Base 10000 2555.330 53.98679 2449.000 2659.000
## 3 3rd Base 10000 2652.037 53.13323 2547.000 2756.000
## 4 Catcher 10000 2469.425 52.68526 2367.000 2574.025
## 5 Center Field 10000 2648.111 54.53149 2539.975 2755.000
## 6 Left Field 10000 2581.515 54.56530 2472.000 2686.025
## 7 Pitcher 10000 1298.653 59.40278 1180.000 1416.025
## 8 Right Field 10000 2631.101 56.34839 2520.000 2740.025
## 9 Shortstop 10000 2549.997 54.76665 2443.000 2656.025
par(mar=c(5,7,1,1)+.1)
plot(fm_post[,"Estimate"] / pos_df$AtBats, 1:nrow(pos_df), xlim=c(0.1, 0.3), xlab='batting avg', ylab='', yaxt='n', type='n')
segments(x0=fm_post[,"Q2.5"]/pos_df$AtBats, x1=fm_post[,"Q97.5"]/pos_df$AtBats, y0=1:nrow(pos_df), col='red')
points(fm_post[,"Estimate"] / pos_df$AtBats, 1:nrow(pos_df), pch=20)
axis(2, at=1:nrow(pos_df), labels=pos_df$PriPos, las=2)
Let’s get the posterior distribution of the difference between the two batting averages.
Is this a good model for the data?
Simulate from the model:
rbinom( )
.Then, fit the model! (time remaining)
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \frac{1}{1 + \exp(-\mu_i)} \\ \mu_i &= \beta_{p_i} + \epsilon_i \\ \epsilon_i &\sim \Normal(0, 0.05) . \end{aligned}\]
with:
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} + \epsilon_i \\ \epsilon_i &\sim \Normal(0, \sigma) \end{aligned}\]
For the \(i\)th player:
and
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} + \beta_h h_i + \beta_w w_i + \epsilon_i \\ \epsilon_i &\sim \Normal(0, \sigma) \end{aligned}\]
For the \(i\)th player:
and
Scaling variables can improve convergence:
library(brms)
batting$scaled_height <- (batting$height - mean(batting$height))/sd(batting$height)
batting$scaled_weight <- (batting$weight - mean(batting$weight))/sd(batting$weight)
logistic_fit <- brm(
Hits | trials(AtBats) ~ 0 + scaled_weight + scaled_height + PriPos + (1 | PriPos:Player),
data = batting,
family = "binomial",
prior = c(prior(normal(0, 5), class = b),
prior(normal(0, 5), class = sd)),
iter = 2000, chains = 3
)
## Compiling Stan program...
## Start sampling
## Family: binomial
## Links: mu = logit
## Formula: Hits | trials(AtBats) ~ 0 + scaled_weight + scaled_height + PriPos + (1 | PriPos:Player)
## Data: batting (Number of observations: 886)
## Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 3000
##
## Group-Level Effects:
## ~PriPos:Player (Number of levels: 886)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.14 0.01 0.12 0.16 1.00 1119 1824
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## scaled_weight 0.02 0.01 -0.01 0.04 1.00 1052 1611
## scaled_height -0.01 0.01 -0.03 0.02 1.00 1318 2052
## PriPos1stBase -1.09 0.02 -1.13 -1.04 1.00 1248 1624
## PriPos2ndBase -1.09 0.03 -1.15 -1.04 1.00 1061 1475
## PriPos3rdBase -1.06 0.02 -1.10 -1.01 1.00 1335 1680
## PriPosCatcher -1.16 0.03 -1.21 -1.11 1.00 1403 1945
## PriPosCenterField -1.05 0.03 -1.10 -0.99 1.00 1092 1499
## PriPosLeftField -1.10 0.02 -1.14 -1.05 1.00 1449 2084
## PriPosPitcher -1.91 0.05 -2.00 -1.83 1.00 2872 2286
## PriPosRightField -1.05 0.03 -1.11 -1.00 1.00 1154 1888
## PriPosShortstop -1.10 0.03 -1.16 -1.04 1.00 1194 1744
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
posterior_epred()
gives posterior samples of expected responses:
players <- c("Mark Buehrle", "Matt Harvey", "Mike Leake")
player_data <- subset(batting, Player %in% players)
player_preds <- posterior_epred(logistic_fit, newdata=player_data)
player_preds <- player_preds / player_data$AtBats[col(player_preds)]
head(player_preds)
## [,1] [,2] [,3]
## [1,] 0.12867787 0.1294308 0.1846368
## [2,] 0.11611176 0.1344739 0.1437486
## [3,] 0.11453307 0.1547196 0.1460944
## [4,] 0.10216918 0.1308674 0.1539287
## [5,] 0.09822368 0.1409306 0.1597499
## [6,] 0.12774940 0.1335366 0.1364718
What are some questions would you like to answer using this model?
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} \end{aligned}\] and \[\begin{aligned} \beta_p &\sim \Normal(0, 5) \end{aligned}\]
first_code <- "
data {
int N; // number of players
int hits[N];
int at_bats[N];
int npos; // number of positions
int position[N];
}
parameters {
vector[npos] beta;
real alpha_w;
real alpha_h;
vector[N] epsilon;
real<lower=0> sigma;
}
model {
hits ~ binomial(
at_bats,
inv_logit(beta[position]));
beta ~ normal(0, 5);
} "
first_model <- stan_model(model_code=first_code)
logistic_code <- "
data {
int N; // number of players
int hits[N];
int at_bats[N];
int npos; // number of positions
int position[N];
vector[N] height;
vector[N] weight;
}
parameters {
vector[npos] gamma;
real alpha_w;
real alpha_h;
vector[N] epsilon;
vector<lower=0> sigma;
}
transformed parameters {
vector<lower=0, upper=1>[N] theta;
theta = inv_logit(
gamma[position] + alpha_h * height + alpha_w * weight + epsilon
);
}
model {
hits ~ binomial(at_bats, theta);
epsilon ~ normal(0, sigma);
gamma ~ normal(0, 5);
alpha_h ~ normal(0, 5);
alpha_w ~ normal(0, 5);
sigma ~ normal(0, 5);
} "
Next steps:
logistic_model <- stan_model(model_code=logistic_code)
scaled_height <- (batting$height - mean(batting$height))/sd(batting$height)
scaled_weight <- (batting$weight - mean(batting$weight))/sd(batting$weight)
logistic_fit <- sampling(logistic_model, chains=3, iter=1000,
data=list(N=nrow(batting),
hits=batting$Hits,
at_bats=batting$AtBats,
npos=nlevels(batting$PriPos),
position=as.numeric(batting$PriPos),
height=scaled_height,
weight=scaled_weight) )