Peter Ralph
Advanced Biological Statistics
We have a dataset of batting averages of baseball players:
batting <- read.csv("data/BattingAveragePlus.csv", header=TRUE, stringsAsFactors=TRUE)
head(batting)
## Player PriPos Hits AtBats PlayerNumber PriPosNumber playerID weight height bats birthYear debut finalGame
## 1 Aaron Cunningham Left Field 17 97 198 7 cunniaa01 215 71 R 1986 2008-08-31 2012-07-24
## 2 Aaron Harang Pitcher 4 56 360 1 haranaa01 260 79 R 1978 2002-05-25 2015-10-03
## 3 Aaron Hill 2nd Base 184 609 394 4 hillaa01 200 71 R 1982 2005-05-20 2016-10-01
## 4 Aaron Laffey Pitcher 0 1 482 1 laffeaa01 190 71 L 1985 2007-08-04 2015-07-31
## 5 Aaron Loup Pitcher 0 1 522 1 loupaa01 210 71 L 1987 2012-07-14 2016-09-30
## 6 Adam Dunn 1st Base 110 539 242 3 dunnad01 285 78 L 1979 2001-07-20 2014-09-28
The overall batting average of the 886 players is 0.254768.
Here is the average by position.
library(dplyr)
batting %>% group_by(PriPos) %>%
summarise(num=n(), BatAvg=sum(Hits)/sum(AtBats)) %>%
select(PriPos, num, BatAvg)
## # A tibble: 9 × 3
## PriPos num BatAvg
## <fct> <int> <dbl>
## 1 1st Base 77 0.260
## 2 2nd Base 69 0.256
## 3 3rd Base 73 0.265
## 4 Catcher 95 0.247
## 5 Center Field 61 0.265
## 6 Left Field 96 0.258
## 7 Pitcher 305 0.130
## 8 Right Field 53 0.263
## 9 Shortstop 57 0.255
What’s the overall batting average?
Do some positions tend to be better batters?
How much variation in batting average is there between players of the same position?
Do height or weight predict batting average?
Exercise:
Come up with some quick-and-dirty statistics to answer these questions.
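One quick-and-dirty approach, sketched here on a tiny made-up data frame (with the real data, substitute `batting` for `toy`): pool Hits and AtBats within each position to get a position-level average, and look at the spread of per-player averages within a position.

```r
# Quick-and-dirty summaries on a tiny made-up data frame
# (with the real data, replace `toy` by `batting`).
toy <- data.frame(
  PriPos = c("Pitcher", "Pitcher", "Catcher", "Catcher"),
  Hits   = c(5, 10, 30, 40),
  AtBats = c(50, 60, 120, 150)
)
# pooled batting average by position
pos_avg <- with(toy, tapply(Hits, PriPos, sum) / tapply(AtBats, PriPos, sum))
# spread of per-player averages within each position
toy$avg <- toy$Hits / toy$AtBats
pos_sd  <- with(toy, tapply(avg, PriPos, sd))
round(pos_avg, 3)
round(pos_sd, 3)
```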
Motivation: we might first try an ordinary linear model,
\[\begin{aligned} \frac{\text{Hits}_i}{\text{AtBats}_i} &\sim \Normal(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_{\text{PriPos}_i} + \beta_w w_i + \beta_h h_i \end{aligned}\]
But that's not right: Hits/AtBats is between 0 and 1! And, 2/5 is very different than 2000/5000!
\[\begin{aligned} \text{Hits}_i &\sim \Binom(\text{AtBats}_i, \mu_i) \\ \mu_i &= \beta_0 + \beta_{\text{PriPos}_i} + \beta_w w_i + \beta_h h_i \end{aligned}\]
But: what’s to keep \(\mu_i\) between 0 and 1?
\[\begin{aligned} \logistic(x) = \frac{1}{1 + e^{-x}} . \end{aligned}\]
\[\begin{aligned} \logit(x) = \log\left(\frac{x}{1-x}\right) . \end{aligned}\]
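These two functions are inverses of each other; a quick check in R (using the built-in `plogis()`, which computes the same logistic function):

```r
# logistic() and logit() are inverse functions;
# logistic() agrees with R's built-in plogis().
logistic <- function(x) 1 / (1 + exp(-x))
logit    <- function(p) log(p / (1 - p))
x <- seq(-3, 3, length.out = 7)
all.equal(logit(logistic(x)), x)   # TRUE
all.equal(logistic(x), plogis(x))  # TRUE
```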
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \end{aligned}\]
For the \(i\)th player:
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_0 \end{aligned}\]
For the \(i\)th player:
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} \end{aligned}\]
For the \(i\)th player:
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} \end{aligned}\] and \[\begin{aligned} \beta_p &\sim \Normal(0, 5) . \end{aligned}\]
We’ll see more about brms soon!
library(brms)
fm <- brm(
Hits | trials(AtBats) ~ 0 + PriPos,
data=batting,
family='binomial',
prior = c(prior(normal(0, 5), class = b)),
iter = 2000, chains = 3
)
## Compiling Stan program...
## Start sampling
## Family: binomial
## Links: mu = logit
## Formula: Hits | trials(AtBats) ~ 0 + PriPos
## Data: batting (Number of observations: 886)
## Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 3000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## PriPos1stBase -1.05 0.02 -1.08 -1.02 1.00 4554 2544
## PriPos2ndBase -1.07 0.02 -1.10 -1.04 1.00 4371 2428
## PriPos3rdBase -1.02 0.02 -1.05 -0.99 1.00 4270 2285
## PriPosCatcher -1.12 0.02 -1.15 -1.08 1.00 4080 2153
## PriPosCenterField -1.02 0.02 -1.05 -0.99 1.00 4799 1857
## PriPosLeftField -1.06 0.02 -1.09 -1.02 1.00 4726 2545
## PriPosPitcher -1.90 0.04 -1.99 -1.82 1.00 4558 2304
## PriPosRightField -1.03 0.02 -1.06 -1.00 1.00 4315 2241
## PriPosShortstop -1.07 0.02 -1.11 -1.03 1.00 3849 2180
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
We can get posterior predictions with predict():
pos_df <- data.frame(
PriPos=levels(batting$PriPos),
AtBats=10000)
fm_post <- predict(fm, newdata=pos_df)
cbind(pos_df, fm_post)
## PriPos AtBats Estimate Est.Error Q2.5 Q97.5
## 1 1st Base 10000 2599.869 52.70881 2495.000 2707.000
## 2 2nd Base 10000 2554.882 53.91151 2450.000 2663.000
## 3 3rd Base 10000 2651.301 54.53799 2545.000 2755.025
## 4 Catcher 10000 2470.508 53.82029 2366.000 2577.000
## 5 Center Field 10000 2648.643 55.46119 2539.000 2758.025
## 6 Left Field 10000 2583.180 52.42603 2483.000 2685.000
## 7 Pitcher 10000 1300.044 59.46869 1187.000 1416.000
## 8 Right Field 10000 2631.686 55.95883 2519.975 2740.000
## 9 Shortstop 10000 2550.090 57.14899 2441.000 2664.025
par(mar=c(5,7,1,1)+.1)
plot(fm_post[,"Estimate"] / pos_df$AtBats, 1:nrow(pos_df), xlim=c(0.1, 0.3), xlab='batting avg', ylab='', yaxt='n', type='n')
segments(x0=fm_post[,"Q2.5"]/pos_df$AtBats, x1=fm_post[,"Q97.5"]/pos_df$AtBats, y0=1:nrow(pos_df), col='red')
points(fm_post[,"Estimate"] / pos_df$AtBats, 1:nrow(pos_df), pch=20)
axis(2, at=1:nrow(pos_df), labels=pos_df$PriPos, las=2)
Let’s get the posterior distribution of the difference between the two batting averages.
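For instance, to compare Catchers and Pitchers: transform posterior draws of the two coefficients to the probability scale and subtract. The real draws would come from the fitted model (e.g. via `as_draws_df(fm)`); the sketch below uses simulated stand-in draws matching the estimates above.

```r
set.seed(1)
# Stand-in posterior draws on the logit scale (real draws would come
# from the fitted model, e.g. as_draws_df(fm)); means and sds match
# the Catcher and Pitcher estimates above.
draws_catcher <- rnorm(3000, mean = -1.12, sd = 0.02)
draws_pitcher <- rnorm(3000, mean = -1.90, sd = 0.04)
# difference on the batting-average (probability) scale
diff_avg <- plogis(draws_catcher) - plogis(draws_pitcher)
quantile(diff_avg, c(0.025, 0.5, 0.975))
mean(diff_avg > 0)  # posterior probability that catchers bat better
```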
Is this a good model for the data?
Simulate from the model using rbinom(). Then, fit the model! (time remaining)
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \frac{1}{1 + \exp(-\mu_i)} \\ \mu_i &= \beta_{p_i} + \epsilon_i \\ \epsilon_i &\sim \Normal(0, 0.05) . \end{aligned}\]
with:
df <- data.frame(
name=sample(LETTERS, 20),
PriPos=sample(unique(batting$PriPos), 20, replace=TRUE),
AtBats=rpois(20, 30)
)
beta <- rep(qlogis(0.24), 9)  # position effects on the logit scale
names(beta) <- unique(batting$PriPos)
beta['Pitcher'] <- qlogis(0.15)
df$mu <- beta[as.character(df$PriPos)] + rnorm(20, sd=0.05)
df$theta <- 1/(1 + exp(-df$mu))
df$Z <- rbinom(nrow(df), size=df$AtBats, prob=df$theta)
For the \(i\)th player:
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} + \epsilon_i \\ \epsilon_i &\sim \Normal(0, \sigma) \end{aligned}\]
Adding height and weight, for the \(i\)th player:
\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} + \beta_h h_i + \beta_w w_i + \epsilon_i \\ \epsilon_i &\sim \Normal(0, \sigma) \end{aligned}\]
Scaling variables can improve convergence:
library(brms)
batting$scaled_height <- (batting$height - mean(batting$height))/sd(batting$height)
batting$scaled_weight <- (batting$weight - mean(batting$weight))/sd(batting$weight)
logistic_fit <- brm(
Hits | trials(AtBats) ~ 0 + scaled_weight + scaled_height + PriPos + (1 | PriPos:Player),
data = batting,
family = "binomial",
prior = c(prior(normal(0, 5), class = b),
prior(normal(0, 5), class = sd)),
iter = 2000, chains = 3
)
## Compiling Stan program...
## Start sampling
## Family: binomial
## Links: mu = logit
## Formula: Hits | trials(AtBats) ~ 0 + scaled_weight + scaled_height + PriPos + (1 | PriPos:Player)
## Data: batting (Number of observations: 886)
## Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 3000
##
## Group-Level Effects:
## ~PriPos:Player (Number of levels: 886)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.14 0.01 0.12 0.16 1.00 1100 1712
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## scaled_weight 0.02 0.01 -0.01 0.04 1.00 1391 1765
## scaled_height -0.01 0.01 -0.03 0.02 1.00 1515 1700
## PriPos1stBase -1.08 0.03 -1.14 -1.03 1.00 1675 1999
## PriPos2ndBase -1.09 0.03 -1.15 -1.04 1.00 1391 2012
## PriPos3rdBase -1.06 0.02 -1.11 -1.01 1.00 1686 2098
## PriPosCatcher -1.16 0.03 -1.21 -1.11 1.00 1510 1792
## PriPosCenterField -1.05 0.03 -1.10 -0.99 1.00 1604 1971
## PriPosLeftField -1.10 0.02 -1.14 -1.05 1.00 1672 1816
## PriPosPitcher -1.92 0.05 -2.01 -1.83 1.00 4370 2223
## PriPosRightField -1.05 0.03 -1.11 -1.00 1.00 1487 1623
## PriPosShortstop -1.10 0.03 -1.16 -1.04 1.00 1593 1880
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
posterior_epred() gives posterior samples of expected responses:
players <- c("Mark Buehrle", "Matt Harvey", "Mike Leake")
(player_data <- subset(batting, Player %in% players))
## Player PriPos Hits AtBats PlayerNumber PriPosNumber playerID weight height bats birthYear debut finalGame scaled_height scaled_weight
## 592 Mark Buehrle Pitcher 3 67 102 1 buehrma01 240 74 L 1979 2000-07-16 2015-10-04 0.2568063 1.3951531
## 617 Matt Harvey Pitcher 6 18 369 1 harvema01 215 76 R 1989 2012-07-26 2016-07-04 1.1219410 0.1723515
## 651 Mike Leake Pitcher 18 61 494 1 leakemi01 170 70 R 1987 2010-04-11 2016-09-28 -1.4734630 -2.0286912
player_preds <- posterior_epred(logistic_fit, newdata=player_data)
colnames(player_preds) <- players
player_preds <- player_preds / player_data$AtBats[col(player_preds)]
head(player_preds)
## Mark Buehrle Matt Harvey Mike Leake
## [1,] 0.09942549 0.1245864 0.1421047
## [2,] 0.11107516 0.1217278 0.1673386
## [3,] 0.10992811 0.1252273 0.1388137
## [4,] 0.11775287 0.1740435 0.1717359
## [5,] 0.11167978 0.1089932 0.1302543
## [6,] 0.12496966 0.1673440 0.1560620
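From a matrix like `player_preds`, posterior comparisons between players are just row-wise operations. A sketch, with simulated Beta draws standing in for the real `posterior_epred()` output:

```r
set.seed(2)
# Simulated stand-in for the player_preds matrix above; real values
# come from posterior_epred(logistic_fit, newdata=player_data).
player_preds <- cbind(
  "Mark Buehrle" = rbeta(3000, 11, 89),  # centered near 0.11
  "Mike Leake"   = rbeta(3000, 16, 84)   # centered near 0.16
)
# posterior probability that Leake's expected average exceeds Buehrle's
p_leake_better <- mean(player_preds[, "Mike Leake"] > player_preds[, "Mark Buehrle"])
p_leake_better
```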
What are some questions you would like to answer using this model?