Peter Ralph
Advanced Biological Statistics
Statistical robustness:
Methods for dealing with “outliers”:
(Look at them carefully and) remove them (if appropriate).
Use methods that are robust to the presence of outliers.
Least-squares fit:
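The data-generating code isn't shown on these slides; here is a minimal sketch consistent with the output below (n = 200, true intercept 1, slope 2, and Gaussian noise with sd = 0.5 are assumptions inferred from the estimates):

```r
# Assumed simulation: 200 points around the line y = 1 + 2x, Gaussian noise
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

# Time the least-squares fit, then look at the usual summary
system.time(lm_fit <- lm(y ~ x))
summary(lm_fit)
```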
## user system elapsed
## 0.005 0.000 0.005
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43120 -0.29747 -0.02625 0.31638 1.32033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.98628 0.03459 28.51 <2e-16 ***
## x 2.03262 0.03719 54.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4892 on 198 degrees of freedom
## Multiple R-squared: 0.9378, Adjusted R-squared: 0.9375
## F-statistic: 2987 on 1 and 198 DF, p-value: < 2.2e-16
with brms
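A sketch of the corresponding brms call (the object name and the `system.time()` wrapper are assumptions; `gaussian()` is the `brm()` default family, matching the output below):

```r
library(brms)

# Same linear model, fit by MCMC; the family defaults to gaussian()
system.time(
  brms_fit <- brm(y ~ x, data = data.frame(x = x, y = y))
)
summary(brms_fit)
```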
## Compiling Stan program...
## Start sampling
## user system elapsed
## 41.596 3.932 49.135
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: y ~ x
## Data: data.frame(x = x, y = y) (Number of observations: 200)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.99 0.03 0.92 1.06 1.00 4178 2650
## x 2.03 0.04 1.96 2.11 1.00 3816 2728
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.49 0.02 0.45 0.55 1.00 3224 2428
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Least-squares fit, now on data with outliers:
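Again the simulation isn't shown; a sketch under the assumption that the noise is now Cauchy (consistent with the wild residuals below, and with the estimate of nu close to 1 in the robust fit further down):

```r
# Assumed simulation: same line, but heavy-tailed (Cauchy) noise
y <- 1 + 2 * x + rcauchy(n, scale = 0.5)

lm_fit2 <- lm(y ~ x)
summary(lm_fit2)
```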
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.973 -0.877 0.120 1.312 66.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8918 0.5740 1.554 0.122
## x 0.7230 0.6172 1.171 0.243
##
## Residual standard error: 8.117 on 198 degrees of freedom
## Multiple R-squared: 0.006883, Adjusted R-squared: 0.001868
## F-statistic: 1.372 on 1 and 198 DF, p-value: 0.2428
with brms (recall: the Cauchy is Student's t with df = 1)
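A sketch of the robust fit: Student's t errors, with the degrees of freedom nu estimated along with the other parameters (the object name is an assumption):

```r
# Robust regression: Student's t errors instead of Gaussian;
# brms estimates nu along with the intercept, slope, and scale
system.time(
  robust_fit <- brm(y ~ x, data = data.frame(x = x, y = y),
                    family = student())
)
summary(robust_fit)
```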
## Compiling Stan program...
## Start sampling
## user system elapsed
## 42.218 2.756 47.414
## Family: student
## Links: mu = identity; sigma = identity; nu = identity
## Formula: y ~ x
## Data: data.frame(x = x, y = y) (Number of observations: 200)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 1.03 0.06 0.91 1.14 1.00 3847 2377
## x 2.03 0.07 1.89 2.15 1.00 3910 2827
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.60 0.06 0.49 0.74 1.00 3530 2682
## nu 1.09 0.08 1.00 1.29 1.00 2246 1586
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The Cauchy distribution has “heavy tails”: extreme values are much more likely than under a Gaussian.
So, a point that’s far from the line is less surprising,
and the maximum-likelihood line isn’t as strongly pulled around by outliers.
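For instance (the cutoff of 5 scale units is arbitrary, chosen for illustration):

```r
# Chance of falling more than 5 scale units below center:
pnorm(-5)    # Gaussian: about 2.9e-07
pcauchy(-5)  # Cauchy: about 0.063 -- over five orders of magnitude larger
```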
pp_check
How would we have known that outliers were causing a problem?
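A posterior predictive check would have shown it. A sketch, assuming a Gaussian brms fit to the outlier data (such a fit isn't shown above, so the object here is hypothetical):

```r
# Hypothetical: a Gaussian fit to the data with outliers
gauss_fit2 <- brm(y ~ x, data = data.frame(x = x, y = y))

# Histograms of replicated datasets vs. the observed data;
# type = "hist" uses 10 posterior draws by default
pp_check(gauss_fit2, type = "hist")
```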
## Using 10 posterior draws for ppc type 'hist' by default.
(figure: posterior predictive histograms from the Gaussian fit)
Now, after switching to the robust fit:
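A sketch, reusing the Student's t fit from above:

```r
pp_check(robust_fit, type = "hist")
```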
## Using 10 posterior draws for ppc type 'hist' by default.
(figure: posterior predictive histograms from the robust, Student’s t fit)