Peter Ralph
1 December 2020 – Advanced Biological Statistics
Statistical robustness:
Methods for dealing with “outliers”:
(Look at them carefully and) remove them (if appropriate).
Use methods that are robust to the presence of outliers.
Least-squares fit:
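(The fitting code isn't shown on the slide; presumably it was something like the following, where the object name slr is a guess and x and y are the simulated data, producing the timing and summary below.)
system.time( slr <- lm(y ~ x) )
summary(slr)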
## user system elapsed
## 0.005 0.000 0.005
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43120 -0.29747 -0.02625  0.31638  1.32033
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.98628    0.03459   28.51   <2e-16 ***
## x            2.03262    0.03719   54.65   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4892 on 198 degrees of freedom
## Multiple R-squared: 0.9378, Adjusted R-squared: 0.9375
## F-statistic: 2987 on 1 and 198 DF, p-value: < 2.2e-16
with Stan
slr_block <- "
data {
    int N;           // number of observations
    vector[N] x;     // predictor
    vector[N] y;     // response
}
parameters {
    real b0;               // intercept
    real b1;               // slope
    real<lower=0> sigma;   // residual standard deviation
}
model {
    // Normal errors, i.e., the usual least-squares setup
    // (no explicit priors, so Stan uses flat priors on b0, b1, and sigma > 0)
    y ~ normal(b0 + b1*x, sigma);
}"
system.time(
    stanlr <- stan(model_code=slr_block,
                   data=list(N=length(x), x=x, y=y), iter=1e3))
## user system elapsed
## 63.874 2.942 69.498
## Inference for Stan model: 3b05fdd2691a18718e9286fd7f1e1ac1.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## b0     1.01    0.00 0.04  0.94  0.99  1.01  1.04  1.09  2424    1
## b1     1.99    0.00 0.04  1.91  1.96  1.99  2.01  2.06  2119    1
## sigma  0.52    0.00 0.03  0.47  0.50  0.52  0.54  0.57  1552    1
## lp__  31.77    0.04 1.20 28.71 31.23 32.04 32.65 33.21   989    1
##
## Samples were drawn using NUTS(diag_e) at Mon Nov 23 17:19:09 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
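(Not on the slide, but as a sketch of how to work with the fit: rstan's extract() returns the post-warmup draws, which can be summarized directly.)
post <- rstan::extract(stanlr)       # list with elements b0, b1, sigma, lp__
quantile(post$b1, c(0.025, 0.975))   # 95% credible interval for the slope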
Least-squares fit, now to data with outliers:
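(Again the fitting code isn't shown; presumably something like this, where rlr is a hypothetical object name.)
rlr <- lm(y ~ x)
summary(rlr)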
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.973  -0.877   0.120   1.312  66.074
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.8918     0.5740   1.554    0.122
## x             0.7230     0.6172   1.171    0.243
##
## Residual standard error: 8.117 on 198 degrees of freedom
## Multiple R-squared: 0.006883, Adjusted R-squared: 0.001868
## F-statistic: 1.372 on 1 and 198 DF, p-value: 0.2428
with Stan
srr_block <- "
data {
    int N;           // number of observations
    vector[N] x;     // predictor
    vector[N] y;     // response
}
parameters {
    real b0;               // intercept
    real b1;               // slope
    real<lower=0> sigma;   // scale of the Cauchy noise
}
model {
    // Cauchy errors: the only change from the model above,
    // but the heavy tails make the fit robust to outliers
    y ~ cauchy(b0 + b1*x, sigma);
}"
system.time(
    stanrr <- stan(model_code=srr_block,
                   data=list(N=length(x), x=x, y=y), iter=1e3))
## user system elapsed
## 57.301 2.464 62.265
## Inference for Stan model: cd6e07d98bffcf77ad2b924fcd0f1723.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
##          mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
## b0       1.03    0.00 0.06    0.91    0.99    1.03    1.06    1.14  1950    1
## b1       2.03    0.00 0.07    1.89    1.98    2.03    2.08    2.16  1933    1
## sigma    0.58    0.00 0.06    0.47    0.54    0.57    0.61    0.70  2253    1
## lp__  -173.78    0.04 1.21 -176.86 -174.32 -173.48 -172.90 -172.38  1004    1
##
## Samples were drawn using NUTS(diag_e) at Mon Nov 23 17:20:15 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
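(A sketch, not from the slide: to put the two fits side by side, using the hypothetical lm object rlr from above and rstan's summary matrix.)
rbind(least_squares = coef(rlr),
      cauchy_stan   = summary(stanrr)$summary[c("b0", "b1"), "mean"])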
The Cauchy distribution has “heavy tails”: extreme values are much more likely than under the Normal.
So, a point that’s far from the line is less surprising,
and the line with the highest likelihood isn’t as strongly affected by outliers.
You can see this above: least squares is dragged around by the outliers (estimated slope 0.72 ± 0.62, not distinguishable from zero), while the Cauchy fit recovers a slope of about 2.03, essentially the same as on the first dataset.
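How much heavier are the tails? An illustrative check in R (not from the slide): an observation more than 5 units from the center is essentially impossible under a standard Normal, but happens about 6% of the time under a standard Cauchy.
pnorm(5, lower.tail=FALSE)    # ~ 2.9e-07
pcauchy(5, lower.tail=FALSE)  # ~ 0.063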