\[
% Macros included in pdf and html output:
\newcommand{\R}{\mathbb{R}} % reals
\newcommand{\E}{\mathbb{E}} % expectation
\renewcommand{\P}{\mathbb{P}} % probability
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\logistic}{logistic}
\DeclareMathOperator{\SE}{SE}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\cor}{cor}
\DeclareMathOperator{\Normal}{Normal}
\DeclareMathOperator{\LogNormal}{logNormal}
\DeclareMathOperator{\Poisson}{Poisson}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Binom}{Binomial}
\DeclareMathOperator{\Gam}{Gamma}
\DeclareMathOperator{\Exp}{Exponential}
\DeclareMathOperator{\Cauchy}{Cauchy}
\DeclareMathOperator{\Unif}{Unif}
\DeclareMathOperator{\Dirichlet}{Dirichlet}
\DeclareMathOperator{\Wishart}{Wishart}
\DeclareMathOperator{\StudentsT}{StudentsT}
\DeclareMathOperator{\Weibull}{Weibull}
\newcommand{\given}{\;\vert\;}
\]

Robust linear models

Peter Ralph

1 December 2020 – Advanced Biological Statistics

Robustness

Statistical robustness:

(Excerpt from Huber, Robust Statistics, p. 1.)

Methods for dealing with “outliers”:

  1. (Look at them carefully and) remove them (if appropriate).

  2. Use methods that are robust to the presence of outliers (as in the sketch below).
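
As a quick illustration of the second strategy (a minimal sketch, not from the slides): the sample median is robust while the sample mean is not, so a single wild value can drag the mean far off while barely moving the median.

z <- c(rnorm(99, mean=10, sd=1), 1e6)  # 99 well-behaved points plus one huge outlier
mean(z)    # pulled far away from 10 by the single outlier
median(z)  # still close to 10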

Fitting linear models, robustly

Standard “least-squares” fits

\[\begin{aligned} \hat y_i &= b_0 + b_1 x_i \\ y_i &\sim \Normal(\hat y_i, \sigma^2) . \end{aligned}\]

Simulate data:

truth <- list(b0=1.0, b1=2.0, sigma=0.5)  # true intercept, slope, and noise SD
n <- 200
x <- rnorm(n, mean=0, sd=1)
y <- ( truth$b0 + truth$b1 * x 
        + rnorm(n, mean=0, sd=truth$sigma) )  # Normal noise

(Figure: scatterplot of the simulated data.)

Least-squares fit:

system.time( slr <- lm(y ~ x) )
##    user  system elapsed 
##   0.005   0.000   0.005
summary(slr)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43120 -0.29747 -0.02625  0.31638  1.32033 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.98628    0.03459   28.51   <2e-16 ***
## x            2.03262    0.03719   54.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4892 on 198 degrees of freedom
## Multiple R-squared:  0.9378, Adjusted R-squared:  0.9375 
## F-statistic:  2987 on 1 and 198 DF,  p-value: < 2.2e-16
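
Plotting the fit (a sketch, not from the original slides): overlay the least-squares line and the true line on the data.

plot(x, y, pch=20, col=adjustcolor("black", 0.5))
abline(slr, col="red", lwd=2)                         # least-squares fit
abline(truth$b0, truth$b1, col="blue", lty=2, lwd=2)  # true line
legend("topleft", lty=c(1, 2), col=c("red", "blue"),
       legend=c("lm fit", "truth"))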

with Stan

slr_block <- "
data {
    int N;
    vector[N] x;
    vector[N] y;
}
parameters {
    real b0;
    real b1;
    real<lower=0> sigma;
}
model {
    y ~ normal(b0 + b1*x, sigma);
}"
library(rstan)  # provides stan()
system.time(
    stanlr <- stan(model_code=slr_block,
                   data=list(N=length(x), x=x, y=y), iter=1e3))
##    user  system elapsed 
##  63.874   2.942  69.498
print(stanlr)
## Inference for Stan model: 3b05fdd2691a18718e9286fd7f1e1ac1.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## b0     1.01    0.00 0.04  0.94  0.99  1.01  1.04  1.09  2424    1
## b1     1.99    0.00 0.04  1.91  1.96  1.99  2.01  2.06  2119    1
## sigma  0.52    0.00 0.03  0.47  0.50  0.52  0.54  0.57  1552    1
## lp__  31.77    0.04 1.20 28.71 31.23 32.04 32.65 33.21   989    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Nov 23 17:19:09 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
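
The posterior draws behind this table can be pulled out with rstan's extract() (a minimal sketch):

post <- rstan::extract(stanlr)       # list of posterior draws for b0, b1, sigma
mean(post$b1)                        # posterior mean of the slope
quantile(post$b1, c(0.025, 0.975))   # 95% credible interval for the slope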

Cauchy noise?

Relative axon growth for neurons after \(x\) hours:

truth <- list(b0=1.0, b1=2.0, sigma=0.5)  # true intercept, slope, and noise scale
n <- 200
x <- rnorm(n, mean=0, sd=1)
y <- ( truth$b0 + truth$b1 * x 
        + rcauchy(n, location=0, 
                  scale=truth$sigma) )  # heavy-tailed Cauchy noise

(Figure: the simulated data with Cauchy noise.)

Compare:

(Figure: the Normal-noise and Cauchy-noise data, compared.)

Least-squares fit:

summary(slr2 <- lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.973  -0.877   0.120   1.312  66.074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.8918     0.5740   1.554    0.122
## x             0.7230     0.6172   1.171    0.243
## 
## Residual standard error: 8.117 on 198 degrees of freedom
## Multiple R-squared:  0.006883,   Adjusted R-squared:  0.001868 
## F-statistic: 1.372 on 1 and 198 DF,  p-value: 0.2428
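
The least-squares estimates are badly degraded: neither coefficient is anywhere near its true value, and neither is "significant". (An aside, not from the slides: classical robust regression is also available via MASS::rlm(), which fits by Huber-type M-estimation and likewise downweights extreme residuals.)

library(MASS)
summary(rlm(y ~ x))  # robust M-estimation fit, for comparison with lm() and Stan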

with Stan

srr_block <- "
data { 
    int N;
    vector[N] x;
    vector[N] y;
}
parameters {
    real b0;
    real b1;
    real<lower=0> sigma;
}
model {
    y ~ cauchy(b0 + b1*x, sigma);
}"
system.time(
    stanrr <- stan(model_code=srr_block,
                   data=list(N=length(x), x=x, y=y), iter=1e3))
##    user  system elapsed 
##  57.301   2.464  62.265
print(stanrr)
## Inference for Stan model: cd6e07d98bffcf77ad2b924fcd0f1723.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##          mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
## b0       1.03    0.00 0.06    0.91    0.99    1.03    1.06    1.14  1950    1
## b1       2.03    0.00 0.07    1.89    1.98    2.03    2.08    2.16  1933    1
## sigma    0.58    0.00 0.06    0.47    0.54    0.57    0.61    0.70  2253    1
## lp__  -173.78    0.04 1.21 -176.86 -174.32 -173.48 -172.90 -172.38  1004    1
## 
## Samples were drawn using NUTS(diag_e) at Mon Nov 23 17:20:15 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
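
One caveat (not in the original code): since the model block declares no priors, Stan uses flat, improper priors over each parameter's declared support. A hedged sketch of the same model with weakly informative priors (the particular choices here are illustrative assumptions):

srr_priors_block <- "
data { 
    int N;
    vector[N] x;
    vector[N] y;
}
parameters {
    real b0;
    real b1;
    real<lower=0> sigma;
}
model {
    b0 ~ normal(0, 10);      // weakly informative priors,
    b1 ~ normal(0, 10);      // not part of the original model
    sigma ~ exponential(1);
    y ~ cauchy(b0 + b1*x, sigma);
}"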

Inferred lines, with uncertainty

(Figure: the data with a sample of posterior fitted lines, showing uncertainty.)
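
A sketch of how such a figure can be drawn, by overplotting lines for a sample of posterior draws:

post <- rstan::extract(stanrr)
plot(x, y, pch=20, col=adjustcolor("black", 0.5))
for (k in sample(length(post$b1), 100)) {   # 100 random posterior draws
    abline(post$b0[k], post$b1[k], col=adjustcolor("red", 0.05))
}
abline(truth$b0, truth$b1, col="blue", lwd=2, lty=2)  # true line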

Same, zoomed in

(Figure: the same plot, zoomed in.)

Why does it work?

The Cauchy distribution has “heavy tails”: extreme values are much more likely than under the Normal. So, a point that’s far from the line is less surprising, and the line with the highest likelihood isn’t as strongly affected by outliers.
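
To see this concretely (a minimal sketch): compare the log-densities the two distributions assign to a residual five scale units from the line.

dnorm(5, mean=0, sd=1, log=TRUE)           # about -13.4: very surprising
dcauchy(5, location=0, scale=1, log=TRUE)  # about -4.4: only mildly surprising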

(Figure: Normal and Cauchy densities compared.)
