Baseball data

Peter Ralph

Advanced Biological Statistics

Baseball

We have a dataset of batting averages of baseball players, having

name
position
number of at bats
number of hits
height
weight
handedness

batting <- read.csv("data/BattingAveragePlus.csv", header=TRUE, stringsAsFactors=TRUE)
head(batting)

##             Player     PriPos Hits AtBats PlayerNumber PriPosNumber  playerID weight height bats birthYear      debut  finalGame
## 1 Aaron Cunningham Left Field   17     97          198            7 cunniaa01    215     71    R      1986 2008-08-31 2012-07-24
## 2     Aaron Harang    Pitcher    4     56          360            1 haranaa01    260     79    R      1978 2002-05-25 2015-10-03
## 3       Aaron Hill   2nd Base  184    609          394            4  hillaa01    200     71    R      1982 2005-05-20 2016-10-01
## 4     Aaron Laffey    Pitcher    0      1          482            1 laffeaa01    190     71    L      1985 2007-08-04 2015-07-31
## 5       Aaron Loup    Pitcher    0      1          522            1  loupaa01    210     71    L      1987 2012-07-14 2016-09-30
## 6        Adam Dunn   1st Base  110    539          242            3  dunnad01    285     78    L      1979 2001-07-20 2014-09-28

The overall batting average of the 886 players is 0.254768.

Here is the average by position.

batting %>% group_by(PriPos) %>% 
    summarise(num=n(), BatAvg=sum(Hits)/sum(AtBats)) %>% 
    select(PriPos, num, BatAvg)

## # A tibble: 9 × 3
##   PriPos         num BatAvg
##   <fct>        <int>  <dbl>
## 1 1st Base        77  0.260
## 2 2nd Base        69  0.256
## 3 3rd Base        73  0.265
## 4 Catcher         95  0.247
## 5 Center Field    61  0.265
## 6 Left Field      96  0.258
## 7 Pitcher        305  0.130
## 8 Right Field     53  0.263
## 9 Shortstop       57  0.255

Questions?

What’s the overall batting average?
Do some positions tend to be better batters?
How much variation in batting average is there between players of the same position?
Do height or weight predict batting average?

Exercise:

Come up with some quick-and-dirty statistics to answer these questions.

Predicting probabilities

Motivation:

lm( Hits/AtBats ~ PriPos + weight + height, data=batting )

which is \[\begin{aligned} \frac{\text{Hits}_i}{\text{AtBats}_i} &\sim \Normal(\mu_i, \sigma) \\ \mu_i &= \beta_0 + \beta_{\text{PriPos}_i} + \beta_w w_i + \beta_h h_i \end{aligned}\]

But: that’s not right - Hits/AtBats is between 0 and 1!

And, 2/5 is very different than 2000/5000!

A better Binomial model

\[\begin{aligned} \text{Hits}_i &\sim \Binom(\text{AtBats}_i, \mu_i) \\ \mu_i &= \beta_0 + \beta_{\text{PriPos}_i} + \beta_w w_i + \beta_h h_i \end{aligned}\]

But: what’s to keep \(\mu_i\) between 0 and 1?

The logistic function

\[\begin{aligned} \logistic(x) = \frac{1}{1 + e^{-x}} . \end{aligned}\]

plot of chunk logistic_intro

… is the inverse of the logit function

\[\begin{aligned} \logit(x) = \log\left(\frac{x}{1-x}\right) . \end{aligned}\]

plot of chunk logit

Let’s build a model

Binomial response with logistic predictor

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \end{aligned}\]

For the \(i\)th player:

\(N_i\): number of at-bats
\(Z_i\): number of hits
\(\theta_i\): “true” batting average

Everyone is the same

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_0 \end{aligned}\]

For the \(i\)th player:

\(N_i\): number of at-bats
\(Z_i\): number of hits
\(\theta_i\): “true” batting average

and

\(\beta_0\): overall mean logit batting avg

All pitchers are the same

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} \end{aligned}\]

For the \(i\)th player:

\(N_i\): number of at-bats
\(Z_i\): number of hits
\(p_i\): position

and

\(\beta_p\): mean logit batting avg of position \(p\)

First model, with brms

First, we need priors!

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} \end{aligned}\] and \[\begin{aligned} \beta_p &\sim \Normal(0, 5) \end{aligned}\]

For the \(i\)th player:

\(N_i\): number of at-bats
\(Z_i\): number of hits
\(p_i\): position

and

\(\beta_p\): mean logit batting avg of position \(p\)

Fit the model

We’ll see more about brms soon!

fm <- brm(
      Hits  | trials(AtBats) ~ 0 + PriPos,
      data=batting,
      family='binomial',
      prior = c(prior(normal(0, 5), class = b)),
      iter = 2000, chains = 3
)

## Compiling Stan program...

## Start sampling

Check convergence

print(fm)

##  Family: binomial 
##   Links: mu = logit 
## Formula: Hits | trials(AtBats) ~ 0 + PriPos 
##    Data: batting (Number of observations: 886) 
##   Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 3000
## 
## Population-Level Effects: 
##                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## PriPos1stBase        -1.05      0.02    -1.08    -1.02 1.00     6014     2477
## PriPos2ndBase        -1.07      0.02    -1.10    -1.04 1.00     5628     2444
## PriPos3rdBase        -1.02      0.02    -1.05    -0.99 1.00     6218     2219
## PriPosCatcher        -1.11      0.02    -1.15    -1.08 1.00     5343     2524
## PriPosCenterField    -1.02      0.02    -1.05    -0.99 1.00     6430     2572
## PriPosLeftField      -1.06      0.02    -1.09    -1.02 1.00     6452     2285
## PriPosPitcher        -1.90      0.04    -1.99    -1.82 1.00     5152     2275
## PriPosRightField     -1.03      0.02    -1.07    -1.00 1.00     7158     2297
## PriPosShortstop      -1.07      0.02    -1.11    -1.04 1.00     5510     2356
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

mcmc_trace(fm)

plot of chunk first_plot

Results: on a logit scale

mcmc_intervals(fm, regex_pars="b_PriPos.*") +
    scale_y_discrete(labels=levels(batting$PriPos)) +
    xlab("coefficient (logit scale)")

plot of chunk first_results

Results: with `predict( )`

pos_df <- data.frame(
            PriPos=levels(batting$PriPos),
            AtBats=10000)
fm_post <- predict(fm, newdata=pos_df)
cbind(pos_df, fm_post)

##         PriPos AtBats Estimate Est.Error     Q2.5    Q97.5
## 1     1st Base  10000 2598.963  53.28684 2498.000 2703.000
## 2     2nd Base  10000 2555.330  53.98679 2449.000 2659.000
## 3     3rd Base  10000 2652.037  53.13323 2547.000 2756.000
## 4      Catcher  10000 2469.425  52.68526 2367.000 2574.025
## 5 Center Field  10000 2648.111  54.53149 2539.975 2755.000
## 6   Left Field  10000 2581.515  54.56530 2472.000 2686.025
## 7      Pitcher  10000 1298.653  59.40278 1180.000 1416.025
## 8  Right Field  10000 2631.101  56.34839 2520.000 2740.025
## 9    Shortstop  10000 2549.997  54.76665 2443.000 2656.025

par(mar=c(5,7,1,1)+.1)
plot(fm_post[,"Estimate"] / pos_df$AtBats, 1:nrow(pos_df), xlim=c(0.1, 0.3), xlab='batting avg', ylab='', yaxt='n', type='n')
segments(x0=fm_post[,"Q2.5"]/pos_df$AtBats, x1=fm_post[,"Q97.5"]/pos_df$AtBats, y0=1:nrow(pos_df), col='red')
points(fm_post[,"Estimate"] / pos_df$AtBats, 1:nrow(pos_df), pch=20)
axis(2, at=1:nrow(pos_df), labels=pos_df$PriPos, las=2)

plot of chunk plot_first_pred

Are right fielders better than catchers?

Let’s get the posterior distribution of the difference between the two batting averages.

fm_samps <- posterior_predict(fm, newdata=pos_df)
right_fielders <- fm_samps[,match("Right Field", pos_df$PriPos)]
catchers <- fm_samps[,match("Catcher", pos_df$PriPos)]
post_diff <- right_fielders - catchers

plot of chunk first_diff

Goodness-of-fit

Is this a good model for the data?

fm_pp <- predict(fm)
head(fm_pp)

##         Estimate  Est.Error    Q2.5 Q97.5
## [1,]  25.0263333  4.2955547  17.000    34
## [2,]   7.3333333  2.5731490   3.000    13
## [3,] 155.5816667 10.8231716 135.000   177
## [4,]   0.1296667  0.3359923   0.000     1
## [5,]   0.1360000  0.3428457   0.000     1
## [6,] 139.6756667 10.1927640 120.975   160

plot of chunk plot_first_predall

Exercise

In groups

Simulate from the model:

Make up 20 players’ names, positions, and at-bats,
simulate their true batting averages as on the right,
and draw their Hits using rbinom( ).

Then, fit the model! (time remaining)

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \frac{1}{1 + \exp(-\mu_i)} \\ \mu_i &= \beta_{p_i} + \epsilon_i \\ \epsilon_i &\sim \Normal(0, 0.05) . \end{aligned}\]

with:

\(\beta_\text{pitcher} = 0.15\)
and all other \(\beta_p = 0.24\)

Back to the model

Everyone is different, but pitchers are similar

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} + \epsilon_i \\ \epsilon_i &\sim \Normal(0, \sigma) \end{aligned}\]

For the \(i\)th player:

\(N_i\): number of at-bats
\(Z_i\): number of hits
\(p_i\): position
\(\epsilon_i\): difference to typical for position

and

\(\beta_p\): batting avg for position \(p\)
\(\sigma\): size of differences between players

Add in height and weight

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} + \beta_h h_i + \beta_w w_i + \epsilon_i \\ \epsilon_i &\sim \Normal(0, \sigma) \end{aligned}\]

For the \(i\)th player:

\(N_i\): number of at-bats
\(Z_i\): number of hits
\(p_i\): position
\(\epsilon_i\): difference to typical for position
\(h_i\), \(w_i\): height and weight

and

\(\beta_p\): batting avg for position \(p\)
\(\sigma\): size of differences between players
\(\beta_h\), \(\beta_w\): coefficients on height and weight

Fit the model

Scaling variables can improve convergence:

library(brms)
batting$scaled_height <- (batting$height - mean(batting$height))/sd(batting$height)
batting$scaled_weight <- (batting$weight - mean(batting$weight))/sd(batting$weight)
logistic_fit <- brm(
      Hits  | trials(AtBats) ~ 0 + scaled_weight + scaled_height + PriPos + (1 | PriPos:Player),
      data = batting,
      family = "binomial",
      prior = c(prior(normal(0, 5), class = b),
                prior(normal(0, 5), class = sd)),
      iter = 2000, chains = 3
)

## Compiling Stan program...

## Start sampling

Check convergence

print(logistic_fit)

##  Family: binomial 
##   Links: mu = logit 
## Formula: Hits | trials(AtBats) ~ 0 + scaled_weight + scaled_height + PriPos + (1 | PriPos:Player) 
##    Data: batting (Number of observations: 886) 
##   Draws: 3 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 3000
## 
## Group-Level Effects: 
## ~PriPos:Player (Number of levels: 886) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.14      0.01     0.12     0.16 1.00     1119     1824
## 
## Population-Level Effects: 
##                   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## scaled_weight         0.02      0.01    -0.01     0.04 1.00     1052     1611
## scaled_height        -0.01      0.01    -0.03     0.02 1.00     1318     2052
## PriPos1stBase        -1.09      0.02    -1.13    -1.04 1.00     1248     1624
## PriPos2ndBase        -1.09      0.03    -1.15    -1.04 1.00     1061     1475
## PriPos3rdBase        -1.06      0.02    -1.10    -1.01 1.00     1335     1680
## PriPosCatcher        -1.16      0.03    -1.21    -1.11 1.00     1403     1945
## PriPosCenterField    -1.05      0.03    -1.10    -0.99 1.00     1092     1499
## PriPosLeftField      -1.10      0.02    -1.14    -1.05 1.00     1449     2084
## PriPosPitcher        -1.91      0.05    -2.00    -1.83 1.00     2872     2286
## PriPosRightField     -1.05      0.03    -1.11    -1.00 1.00     1154     1888
## PriPosShortstop      -1.10      0.03    -1.16    -1.04 1.00     1194     1744
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

mcmc_trace(logistic_fit, regex_pars="b_PriPos.*")

plot of chunk secondtrace

Look at results

samps <- as_draws_array(logistic_fit)
mcmc_intervals(samps, regex_pars=c("b_PriPos.*")) +
    scale_y_discrete(labels=levels(batting$PriPos)) +
    xlab("coefficient (logit scale)")

plot of chunk viz

Let’s look at some players?

posterior_epred() gives posterior samples of expected responses:

players <- c("Mark Buehrle", "Matt Harvey", "Mike Leake")
player_data <- subset(batting, Player %in% players)
player_preds <- posterior_epred(logistic_fit, newdata=player_data)
player_preds <- player_preds / player_data$AtBats[col(player_preds)]
head(player_preds)

##            [,1]      [,2]      [,3]
## [1,] 0.12867787 0.1294308 0.1846368
## [2,] 0.11611176 0.1344739 0.1437486
## [3,] 0.11453307 0.1547196 0.1460944
## [4,] 0.10216918 0.1308674 0.1539287
## [5,] 0.09822368 0.1409306 0.1597499
## [6,] 0.12774940 0.1335366 0.1364718

layout(seq_along(players))
for (j in seq_along(players)) {
    hist(player_preds[,j], xlim=range(player_preds), main=players[j], xlab='batting average')
}

plot of chunk plotviz2

Exercise:

What are some questions would you like to answer using this model?

With Stan instead

First model

\[\begin{aligned} Z_i &\sim \Binom(N_i, \theta_i) \\ \theta_i &= \logistic(\mu_i) \\ \mu_i &= \beta_{p_i} \end{aligned}\] and \[\begin{aligned} \beta_p &\sim \Normal(0, 5) \end{aligned}\]

first_code <- "
data {
    int N;    // number of players
    int hits[N];
    int at_bats[N];
    int npos; // number of positions
    int position[N];
}
parameters {
    vector[npos] beta;
    real alpha_w;
    real alpha_h;
    vector[N] epsilon;
    real<lower=0> sigma;
}
model {
    hits ~ binomial(
            at_bats,
            inv_logit(beta[position]));
    beta ~ normal(0, 5);
} "
first_model <- stan_model(model_code=first_code)

Second model

logistic_code <- "
data {
    int N;    // number of players
    int hits[N];
    int at_bats[N];
    int npos; // number of positions
    int position[N];
    vector[N] height;
    vector[N] weight;
}
parameters {
    vector[npos] gamma;
    real alpha_w;
    real alpha_h;
    vector[N] epsilon;
    vector<lower=0> sigma;
}
transformed parameters {
    vector<lower=0, upper=1>[N] theta;
    theta = inv_logit(
        gamma[position] + alpha_h * height + alpha_w * weight + epsilon
    );
}
model {
    hits ~ binomial(at_bats, theta);
    epsilon ~ normal(0, sigma);
    gamma ~ normal(0, 5);
    alpha_h ~ normal(0, 5);
    alpha_w ~ normal(0, 5);
    sigma ~ normal(0, 5);
} "

Next steps:

logistic_model <- stan_model(model_code=logistic_code)
scaled_height <- (batting$height - mean(batting$height))/sd(batting$height)
scaled_weight <- (batting$weight - mean(batting$weight))/sd(batting$weight)
logistic_fit <- sampling(logistic_model, chains=3, iter=1000,
                     data=list(N=nrow(batting),
                               hits=batting$Hits,
                               at_bats=batting$AtBats,
                               npos=nlevels(batting$PriPos),
                               position=as.numeric(batting$PriPos),
                               height=scaled_height,
                               weight=scaled_weight) )

Baseball data

Baseball

Baseball

Questions?

Predicting probabilities

A better Binomial model

The logistic function

The logistic function

… is the inverse of the logit function

Let’s build a model

Binomial response with logistic predictor

Everyone is the same

All pitchers are the same

First model, with brms

First, we need priors!

Fit the model

Check convergence

Results: on a logit scale

Results: with predict( )

Are right fielders better than catchers?

Goodness-of-fit

Exercise

In groups

Back to the model

Everyone is different, but pitchers are similar

Add in height and weight

Fit the model

Check convergence

Look at results

Let’s look at some players?

Exercise:

With Stan instead

First model

Second model

Results: with `predict( )`