(Generalized) Linear models: robustness and crossvalidation

Peter Ralph

7 January 2018 – Advanced Biological Statistics

Reminders

Course schedule
Modeling
Bayesian statistics, Stan, and MCMC

Linear models

Reminder: standard linear regression gives a maximum likelihood estimate for \(\vec b\) (and \(\sigma\)) under the following model: \[\begin{aligned} Y_i &\sim \Normal(\mu_i, \sigma) \\ \mu_i &= b_0 + b_1 X_1 + \cdots + b_k X_k . \end{aligned}\]

The key model component here is the linear predictor \(\mu\).

Power and model fit

Comparison of means

If the predictor, \(X\), is discrete then we are doing a \(t\)-test, or ANOVA, or something.

Simulate data - difference in means of 3.0:

truth <- list(b=c(1.0, 3.0), sigma=0.5)
n <- 200
x <- sample(c(1, 2), size=n, replace=TRUE)
y <- truth$b[x] + rnorm(n, mean=0, sd=truth$sigma)

The \(t\)-test

system.time( tt <- t.test(y ~ x) )

##    user  system elapsed 
##   0.004   0.000   0.005

tt

## 
##  Welch Two Sample t-test
## 
## data:  y by x
## t = -29.587, df = 193.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.164550 -1.894004
## sample estimates:
## mean in group 1 mean in group 2 
##       0.9611205       2.9903971

with Stan

stt_block <- "
data { 
    int N;
    int x[N]; // group index: 1's and 2's
    vector[N] y;
}
parameters {
    vector[2] b;
    real<lower=0> sigma;
}
model {
    y ~ normal(b[x], sigma);
}"
stt_model <- stan_model(model_code=stt_block)
system.time( stantt <- sampling(stt_model,
                                data=list(N=length(x), x=x, y=y), iter=1e3) )

##    user  system elapsed 
##   0.861   0.211   1.615

print(stantt)

## Inference for Stan model: f448e40a81bc1be5fd2fa806fefc48eb.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## b[1]   0.96    0.00 0.05  0.86  0.93  0.96  0.99  1.06  1902    1
## b[2]   2.99    0.00 0.05  2.90  2.96  2.99  3.02  3.09  1885    1
## sigma  0.49    0.00 0.02  0.44  0.47  0.49  0.50  0.54  1940    1
## lp__  44.08    0.04 1.21 41.02 43.51 44.37 44.97 45.47  1128    1
## 
## Samples were drawn using NUTS(diag_e) at Sat Jan  5 12:48:47 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Power comparison

Let’s do both methods many times, and see how well they estimate the difference in means.

do_rep <- function (n, b, sigma) {
    x <- sample(c(1, 2), size=n, replace=TRUE)
    y <- b[x] + rnorm(n, mean=0, sd=sigma)
    tt <- t.test(y ~ x)
    stantt <- sampling(stt_model,
                   data=list(N=length(x), x=x, y=y), iter=1e3, refresh=0, open_progress=FALSE)
    bb <- extract(stantt)$b
    out <- c(tt$estimate, tt$conf.int,
             colMeans(bb),
             quantile(bb[,1] - bb[,2], prob=c(0.025, 0.975)))
    names(out) <- c('t_b1', 't_b2', 't_lower', 't_upper',
                    's_b1', 's_b2', 's_lower', 's_upper')
    return(out)
}
reps <- replicate(100, do_rep(n, truth$b, truth$sigma))

## Warning: There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded

## Warning: Examine the pairs() plot to diagnose sampling problems

## Warning: There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded

## Warning: Examine the pairs() plot to diagnose sampling problems

## Warning: There were 1 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded

## Warning: Examine the pairs() plot to diagnose sampling problems

Estimates of differences in means across 100 datasets:

plot of chunk plot_power

Robustification

What if the noise was not Normal?

Substituting rnorm -> rcauchy:

x <- sample(c(1, 2), size=n, replace=TRUE)
y <- truth$b[x] + rcauchy(n, location=0, scale=truth$sigma)

The \(t\)-test

system.time( tt <- t.test(y ~ x) )

##    user  system elapsed 
##   0.002   0.000   0.002

tt

## 
##  Welch Two Sample t-test
## 
## data:  y by x
## t = -5.5741, df = 186.48, p-value = 8.618e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.039750 -1.450554
## sample estimates:
## mean in group 1 mean in group 2 
##       0.8088953       3.0540473

Use the correct model, in Stan

rtt_block <- "
data { 
    int N;
    int x[N]; // group index: 1's and 2's
    vector[N] y;
}
parameters {
    vector[2] b;
    real<lower=0> sigma;
}
model {
    y ~ cauchy(b[x], sigma);
}"
rtt_model <- stan_model(model_code=rtt_block)

Power, again

Let’s do both methods many times, and see how well they estimate the difference in means.

do_rep_cauchy <- function (n, b, sigma) {
    x <- sample(c(1, 2), size=n, replace=TRUE)
    y <- b[x] + rcauchy(n, location=0, scale=sigma)
    tt <- t.test(y ~ x)
    stantt <- sampling(rtt_model,
                   data=list(N=length(x), x=x, y=y), iter=1e3, refresh=0, open_progress=FALSE)
    bb <- extract(stantt)$b
    out <- c(tt$estimate, tt$conf.int,
             colMeans(bb),
             quantile(bb[,1] - bb[,2], prob=c(0.025, 0.975)))
    names(out) <- c('t_b1', 't_b2', 't_lower', 't_upper',
                    's_b1', 's_b2', 's_lower', 's_upper')
    return(out)
}
cauchy_reps <- replicate(100, do_rep_cauchy(n, truth$b, truth$sigma))

Estimates of differences in means across 100 datasets:

plot of chunk plot_power2

Estimates of differences in means across 100 datasets (zoomed):

plot of chunk plot_power3

Model fit improves power.

Why?

Math minute: matrix multiplication

Either

To simulate from: \[\begin{aligned} \mu_i &= b_0 + b_1 X_{i1} + \cdots + b_k X_{i1} \\ Y_i &\sim \Normal(\mu_i, \sigma) . \end{aligned}\]

coefs <- list(b0=1.0, b=c(3.0, -1.0, 0.0, 0.0),
              sigma=0.5)
n <- 200
X <- matrix(rnorm(4*n, mean=0, sd=3), ncol=4)
Y <- coefs$b0 
for (k in 1:ncol(X)) {
    Y <- Y + coefs$b[k] * X[,k]
}
Y <- Y + rnorm(n, mean=0, sd=coefs$sigma)

Or

To simulate from: \[\begin{aligned} \mu_i &= b_0 + b_1 X_{i1} + \cdots + b_k X_{i1} \\ Y_i &\sim \Normal(\mu_i, \sigma) . \end{aligned}\]

coefs <- list(b0=1.0, b=c(3.0, -1.0, 0.0, 0.0),
              sigma=0.5)
n <- 200
X <- matrix(rnorm(4*n, mean=0, sd=3), ncol=4)
Y <- coefs$b0 + X %*% coefs$b + rnorm(n, mean=0, sd=coefs$sigma)

Because

In R, %*% is matrix multiplication: if

\(b\) is a \(k\)-vector
\(X\) is an \(n \times k\) matrix

then X %*% b (or, \(X_b\) in math notation) is shorthand for the \(n\)-vector \[ (Xb)_i = \sum_{j=1}^k X_{ij} b_j . \]

In Stan, matrix multiplication is *.

Multiple linear regression

Your turn

Implement standard multiple linear regression in Stan, and compare (roughly) to lm().

Try doing the same thing with a for loop and with *.

Do Cauchy regression.

solution:

With a for loop:

mlm_model_for <- stan_model( model_code="
data { 
    int N;
    matrix[N, 4] x;
    vector[N] y;
}
parameters {
    real b0;
    vector[4] b;
    real<lower=0> sigma;
}
model {
    vector[N] mu;
    mu = rep_vector(b0, N);
    for (k in 1:4) {
        mu = mu + x[,k] * b[k];
    }
    y ~ normal(mu, sigma);
}")

solution:

mlm_model <- stan_model( model_code="
data { 
    int N;
    matrix[N, 4] x;
    vector[N] y;
}
parameters {
    real b0;
    vector[4] b;
    real<lower=0> sigma;
}
model {
    y ~ normal(b0 + x * b, sigma);
}")

Crossvalidation

What if we don’t know the truth?

Using a model that doesn’t describe the data well often results in poor estimates.

But how do we know this is happening?

It depends, but here’s one common problem.

Overfitting

A \(t\)-test is misled by Cauchy noise because it tries too hard to fit the extreme outliers.

Taking the noise in the data too seriously is known as “overfitting” (or, “overgeneralizing”).

Happily, it’s easy to spot (if you have independent observations).

Crossvalidation

Maybe your model fits, but is it any good?

Divide your data randomly into 5 pieces.
Fit your model on 4/5ths, and see how well it predicts the remaining 1/5th.
Do this for each of the 5 pieces.

Then, compare the mean crossvalidation accuracy between methods.

Example: for the t-test.

1. Divide your data randomly into 5 pieces.

random_label <- sample(rep(1:5, each=n/5))
datasets <- lapply(1:5, function (k) {
            list(test = data.frame(x = x[random_label == k],
                                 y = y[random_label == k]),
                 train = data.frame(x = x[random_label != k],
                                 y = y[random_label != k]))
          } )

2. Fit your model on 4/5ths, and see how well it predicts the remaining 1/5th.

t_xvals <- sapply(datasets, function (data) {
                fit <- t.test(y ~ x, data=data$train)
                pred <- fit$estimate[data$test$x]
                return(sqrt(median( (data$test$y - pred)^2 )))
           } )

3. Do this for each of the 5 pieces.

t_xvals

## [1] 1.0096039 0.8462026 1.2437974 0.7625182 0.6861961

Same thing for Stan:

(stan_xvals <- sapply(datasets, function (data) {
                stantt <- sampling(rtt_model, # cauchy model
                                   data=list(N=nrow(data$train), 
                                             x=data$train$x,
                                             y=data$train$y),
                                   iter=1e3)
                post_samples <- extract(stantt)
                post_means <- colMeans(post_samples$b)
                pred <- post_means[data$test$x]
                return(sqrt(median( (data$test$y - pred)^2 )))
          } ))

## [1] 0.6229829 0.7243239 0.8316605 0.7068205 0.3082926

Stan has about 40% better crossvalidation accuracy:

##             mean     rep 1     rep 2     rep 3     rep 4     rep 5
## t test 0.9096637 1.0096039 0.8462026 1.2437974 0.7625182 0.6861961
## stan   0.6388161 0.6229829 0.7243239 0.8316605 0.7068205 0.3082926

Stochastic minute

The Cauchy as a scale mixture

It turns out that if

\[\begin{aligned} \beta &\sim \Normal(0, 1/\sqrt{\lambda}) \\ \lambda &\sim \Gam(1/2, 1/2) \end{aligned}\]

then

\[\begin{aligned} \beta &\sim \Cauchy(0, 1). \end{aligned}\]

What black magic is this??

It says so here.
If you like to do integrals, you can check mathematically.
Or, you can check with simulation.

More generally: Student’s \(t\) distribution

If \[\begin{aligned} \beta &\sim \Normal(0, 1/\sqrt{\lambda}) \\ \lambda &\sim \Gam(\nu/2, \nu/2) \end{aligned}\] then \(\beta \sim t(\text{df}=\nu)\), i.e., has “Student’s \(t\) distribution with \(\nu\) degrees of freedom”, which has density \[ \P\{ \beta = t \} \propto \left(1 + \frac{t^2}{\nu} \right)^{-\frac{\nu+1}{2}} .\]

Facts:

A Cauchy is \(t(\text{df}=1)\).
A Normal is \(t(\text{df}=\infty)\).
If \(X_1, \ldots, X_n\) are independent, mean-zero Normal, and \(\hat \mu\) and \(\hat \sigma\) are their empirical mean and SD, then \(\hat \mu / (\hat \sigma / \sqrt{n}) \sim t(\text{df} = \nu)\).

Crossvalidation for regression

A model

How does leaf anthocyanin concentration increase with time in the sun, in ten different inbred lines with different baseline concentrations?

antho <- read.table("leaf_anthocyanin.tsv", header=TRUE)
plot(y ~ x, col=geno, pch=20, data=antho)

plot of chunk sim_antho_data

leaf_model <- stan_model(model_code=
"
data {
    int N;
    vector[N] x; // time
    vector[N] y; // anthocyanin
    int geno[N]; // genotype index
    int ng; // number of genotypes
}
parameters {
    real b; // slope
    vector[ng] c; // genotype intercept
    real<lower=0> sigma; // scale
    real<lower=0> nu; // hyperprior SD
}
model {
    y ~ cauchy(b * x + c[geno], sigma);
    b ~ normal(0, nu);
    c ~ normal(0, nu);
    sigma ~ normal(0, 10);
    nu ~ gamma(0.5, 0.5);
}
")

leaf_fit <- sampling(leaf_model,
                     data=list(N=nrow(antho),
                               x=antho$x,
                               y=antho$y,
                               geno=antho$geno,
                               ng=length(unique(antho$geno))))

print(leaf_fit)

## Inference for Stan model: ee17257244291f76913790aca9fdaba7.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##          mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff
## b        1.47    0.00 0.04    1.38    1.44    1.47    1.50    1.56  1474
## c[1]     1.68    0.01 0.50    0.72    1.35    1.68    2.00    2.69  2577
## c[2]    16.20    0.01 0.50   15.21   15.86   16.20   16.54   17.18  3628
## c[3]     7.25    0.01 0.48    6.31    6.93    7.25    7.58    8.17  3182
## c[4]    13.65    0.01 0.56   12.50   13.30   13.65   14.01   14.76  3547
## c[5]     6.31    0.01 0.39    5.55    6.06    6.31    6.56    7.08  3662
## c[6]     2.28    0.01 0.42    1.45    2.01    2.29    2.56    3.10  2641
## c[7]     7.09    0.01 0.63    5.91    6.64    7.08    7.51    8.34  3111
## c[8]    16.00    0.01 0.40   15.21   15.73   15.98   16.26   16.81  2361
## c[9]     0.47    0.01 0.34   -0.20    0.24    0.46    0.70    1.13  2661
## c[10]    2.22    0.01 0.46    1.28    1.92    2.23    2.53    3.09  2439
## sigma    1.47    0.00 0.12    1.25    1.39    1.47    1.55    1.70  4905
## nu       8.07    0.02 1.44    5.84    7.01    7.90    8.91   11.47  4105
## lp__  -487.56    0.06 2.61 -493.56 -489.13 -487.22 -485.66 -483.45  1743
##       Rhat
## b        1
## c[1]     1
## c[2]     1
## c[3]     1
## c[4]     1
## c[5]     1
## c[6]     1
## c[7]     1
## c[8]     1
## c[9]     1
## c[10]    1
## sigma    1
## nu       1
## lp__     1
## 
## Samples were drawn using NUTS(diag_e) at Thu Jan 10 11:24:49 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

stan_trace(leaf_fit)

## 'pars' not specified. Showing first 10 parameters by default.

plot of chunk show_leaf_fit2

plot(leaf_fit)

## 'pars' not specified. Showing first 10 parameters by default.

## ci_level: 0.8 (80% intervals)

## outer_level: 0.95 (95% intervals)

plot of chunk show_leaf_fit3

A single \(k\)-fold crossvalidation step should:

Randomly extract \(1/k\) of the data as “test”, and leave the rest as “training”.
Fit the model on the training data.
Use the result to predict the values in the test data.
Report a measure of prediction error.

###
# HELP ME WRITE THESE
###

ttest_crossvalidation <- function (k) {
}

stan_crossvalidation <- function (k) {
}

(my results:)

> summary(crossvals)
       lm            stan        stan_interactions    stan_df
 Min.   : 5.637   Min.   : 2.381   Min.   : 2.462   Min.   :2.497  
 1st Qu.: 6.109   1st Qu.: 2.771   1st Qu.: 3.030   1st Qu.:3.237  
 Median : 6.544   Median : 3.106   Median : 3.597   Median :3.591  
 Mean   : 7.979   Mean   : 4.028   Mean   : 4.190   Mean   :4.120  
 3rd Qu.: 6.837   3rd Qu.: 3.932   3rd Qu.: 4.217   3rd Qu.:3.836  
 Max.   :14.808   Max.   :14.441   Max.   :14.265   Max.   :9.173

Summary

\[ \text{data} \sim \text{fit} + \text{residual} \]

The model of noise determines how to weight deviations from the fit.
An appropriate model leads to more accurate predictions.
Overfitting is a common pitfall,
that crossvalidation can help avoid.

Generalized Linear Models

Ingredients of a GLM:

A probability distribution, \(Y\).

(the “family”; describes the output)

A linear predictor, \(X \beta\).

(connects input to output)

A link function, \(\E[Y] = h(X\beta)\).

(connects linear predictor to probability distribution)

Common choices:

Linear regression: \[ Y \sim \Normal(X \beta, \sigma) .\]
Logistic regression: \[ Y \sim \Binom(n, \logistic(X\beta)) .\]
Gamma regression: \[ Y \sim \Gam(\text{scale}=\exp(X\beta)/k, \text{shape}=k) .\]

Logistic regression, in R

glm(y ~ x, family=binomial)

family {stats}

Family objects provide a convenient way to specify the details of the models used by functions such as glm.

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")

Arguments

link: a specification for the model link function. This can be a name/expression, a literal character string, a length-one character vector, or an object of class "link-glm" (such as generated by make.link) provided it is not specified via one of the standard names given next.

The gaussian family accepts the links (as names) identity, log and inverse; the binomial family the links logit, probit, cauchit, (corresponding to logistic, normal and Cauchy CDFs respectively) log and cloglog (complementary log-log); the Gamma family the links inverse, identity and log; the poisson family the links log, identity, and sqrt; and the inverse.gaussian family the links 1/mu^2, inverse, identity and log.

Unpacking logistic regression

Simulate data

100 trials, where probability of success depends on \(x\):

alpha <- -7; beta <- 1.2
x <- runif(100, 0, 10)
y <- alpha + beta * x
p <- 1 / (1 + exp(-y))
z <- rbinom(100, size=1, prob=p)
plot(z ~ x)
curve(1/(1+exp(-(-7 + 1.2 *x))), 0, 10, col='red', add=TRUE)

plot of chunk sim_logistic

`glm()`

zz <- cbind(z, 1-z)
summary(glm_fit <- glm(zz ~ x, family='binomial'))

## 
## Call:
## glm(formula = zz ~ x, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.50439  -0.25097  -0.05869   0.35005   2.35424  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.4293     1.6038  -4.632 3.62e-06 ***
## x             1.2879     0.2683   4.800 1.59e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 134.60  on 99  degrees of freedom
## Residual deviance:  52.53  on 98  degrees of freedom
## AIC: 56.53
## 
## Number of Fisher Scoring iterations: 6

Stan

\[\begin{aligned} Z &\sim \Binom(1, P) \\ P &= \logistic(\alpha + \beta X) \end{aligned}\]

logistic_block <- "
data {
    int N;
    vector[N] X;
    int<lower=0> Z[N];
}
parameters {
    real alpha;
    real beta;
}
model {
    vector[N] p;
    p = inv_logit(alpha + beta * X);
    Z ~ binomial(1, p);
}"

Stan: fit the model

fit <- stan(model_code=logistic_block,
            data=list(N=100, X=x, Z=z))
rstan::summary(fit)

## $summary
##             mean    se_mean        sd        2.5%        25%        50%
## alpha  -8.062960 0.07100136 1.7142965 -11.7011825  -9.184292  -7.931243
## beta    1.388078 0.01215117 0.2865985   0.8817608   1.179606   1.370420
## lp__  -27.578616 0.02894954 0.9692251 -30.1689775 -27.997974 -27.294339
##              75%      97.5%     n_eff     Rhat
## alpha  -6.835255  -5.095609  582.9597 1.006586
## beta    1.579790   2.005420  556.3031 1.007479
## lp__  -26.868064 -26.584681 1120.8976 1.006472
## 
## $c_summary
## , , chains = chain:1
## 
##          stats
## parameter       mean        sd        2.5%        25%        50%
##     alpha  -8.020095 1.7351854 -11.9725743  -9.183053  -7.915507
##     beta    1.380041 0.2890422   0.8603793   1.169197   1.362963
##     lp__  -27.562393 0.9639240 -30.1381199 -27.951706 -27.270999
##          stats
## parameter        75%      97.5%
##     alpha  -6.804995  -4.904487
##     beta    1.572540   2.010255
##     lp__  -26.882973 -26.593221
## 
## , , chains = chain:2
## 
##          stats
## parameter       mean        sd        2.5%        25%        50%
##     alpha  -7.976819 1.6188518 -11.4078137  -9.038962  -7.899671
##     beta    1.373324 0.2700576   0.9073788   1.169358   1.362537
##     lp__  -27.494653 0.8989894 -30.0493349 -27.848116 -27.238508
##          stats
## parameter        75%      97.5%
##     alpha  -6.768938  -5.297587
##     beta    1.550171   1.948912
##     lp__  -26.843572 -26.577144
## 
## , , chains = chain:3
## 
##          stats
## parameter       mean        sd        2.5%        25%        50%
##     alpha  -8.250431 1.7856021 -12.0756699  -9.367484  -8.050983
##     beta    1.420788 0.2997694   0.9192965   1.210781   1.391901
##     lp__  -27.693730 1.0237408 -30.2012787 -28.219400 -27.408870
##          stats
## parameter        75%      97.5%
##     alpha  -6.995168  -5.275996
##     beta    1.617131   2.063693
##     lp__  -26.927294 -26.592515
## 
## , , chains = chain:4
## 
##          stats
## parameter       mean        sd        2.5%        25%        50%
##     alpha  -8.004497 1.7018467 -11.4531795  -9.167022  -7.899202
##     beta    1.378157 0.2846246   0.8413456   1.178552   1.360691
##     lp__  -27.563688 0.9768450 -30.1898946 -27.991265 -27.275294
##          stats
## parameter        75%      97.5%
##     alpha  -6.823629  -4.848098
##     beta    1.580438   1.940882
##     lp__  -26.832823 -26.581075

Stan: posterior distributions

samples <- extract(fit)
layout(t(1:2))
hist(samples$alpha, main=expression(alpha))
abline(v=alpha, col='red'); abline(v=coef(glm_fit)[1], col='green')
hist(samples$beta, main=expression(beta))
abline(v=beta, col='red'); abline(v=coef(glm_fit)[2], col='green')
legend("topright", lty=1, col=c('red', 'green'), legend=c('truth', 'glm'))

plot of chunk plot_stan_logistic