Peter Ralph
Advanced Biological Statistics
Statistical robustness:
Methods for dealing with “outliers”:
1. (Look at them carefully and) remove them (if appropriate).
2. Use methods that are robust to the presence of outliers.
Least-squares fit:
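The code isn't echoed in these slides; a minimal sketch that would produce output of this shape (the simulation is a guess, reverse-engineered from the estimates: true line y = 1 + 2x with Gaussian noise):

set.seed(123)                        # hypothetical seed
n <- 200                             # matches the 200 observations below
x <- rnorm(n)                        # hypothetical predictor
y <- 1 + 2 * x + rnorm(n, sd = 0.5)  # sd guessed from the residual SE of 0.49
system.time(ols_fit <- lm(y ~ x))    # ols_fit is our name, not the slides'
summary(ols_fit)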
## user system elapsed
## 0.004 0.000 0.005
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43120 -0.29747 -0.02625 0.31638 1.32033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.98628 0.03459 28.51 <2e-16 ***
## x 2.03262 0.03719 54.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4892 on 198 degrees of freedom
## Multiple R-squared: 0.9378, Adjusted R-squared: 0.9375
## F-statistic: 2987 on 1 and 198 DF, p-value: < 2.2e-16
with brms
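A sketch of the call (the formula and data are read off the summary below; gaussian() is brm()'s default family):

library(brms)
system.time(
  gauss_fit <- brm(y ~ x, data = data.frame(x = x, y = y), family = gaussian())
)
summary(gauss_fit)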
## Compiling Stan program...
## Start sampling
## user system elapsed
## 37.839 3.551 43.768
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: y ~ x
## Data: data.frame(x = x, y = y) (Number of observations: 200)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.99 0.04 0.92 1.05 1.00 3987 2893
## x 2.03 0.04 1.96 2.10 1.00 4533 3401
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.49 0.03 0.45 0.54 1.00 3787 2765
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Least-squares fit (now on data with outliers):
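Here the response is re-simulated with heavy-tailed noise (again a guess at the simulation, consistent with the Cauchy discussion below and with the estimate of nu near 1):

y <- 1 + 2 * x + rcauchy(n, scale = 0.5)  # Cauchy noise; the scale is a guess
outlier_fit <- lm(y ~ x)
summary(outlier_fit)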
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.973 -0.877 0.120 1.312 66.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8918 0.5740 1.554 0.122
## x 0.7230 0.6172 1.171 0.243
##
## Residual standard error: 8.117 on 198 degrees of freedom
## Multiple R-squared: 0.006883, Adjusted R-squared: 0.001868
## F-statistic: 1.372 on 1 and 198 DF, p-value: 0.2428
with brms (recall that the Cauchy distribution is Student’s t with df = 1)
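A sketch of the robust fit (family = student() is read off the summary below):

system.time(
  robust_fit <- brm(y ~ x, data = data.frame(x = x, y = y), family = student())
)
summary(robust_fit)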
## Compiling Stan program...
## Start sampling
## user system elapsed
## 38.229 2.421 45.150
## Family: student
## Links: mu = identity; sigma = identity; nu = identity
## Formula: y ~ x
## Data: data.frame(x = x, y = y) (Number of observations: 200)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 1.03 0.06 0.90 1.14 1.00 3363 2513
## x 2.02 0.07 1.88 2.16 1.00 3465 1956
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.60 0.06 0.49 0.74 1.00 2991 2240
## nu 1.09 0.08 1.00 1.30 1.00 1673 1556
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The Cauchy distribution has “heavy tails”: extreme values are much more likely than under the Gaussian. So, a point that’s far from the line is less surprising, and the maximum-likelihood line isn’t as strongly affected by outliers.
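For a concrete sense of “heavy tails”, compare tail probabilities under the standard Gaussian and the standard Cauchy:

2 * pnorm(-10)    # P(|X| > 10), standard Gaussian: about 1.5e-23
2 * pcauchy(-10)  # P(|X| > 10), standard Cauchy: about 0.06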
pp_check
How would we have known that outliers were causing a problem?
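For instance, by looking at posterior predictive histograms from the non-robust (Gaussian) fit; a sketch, using our name gauss_fit from above:

pp_check(gauss_fit, type = "hist")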
## Using 10 posterior draws for ppc type 'hist' by default.
Now, after switching to the robust fit:
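Again a sketch, using our name robust_fit from above:

pp_check(robust_fit, type = "hist")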
## Using 10 posterior draws for ppc type 'hist' by default.
with Stan
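To see the Stan program that brms builds for the robust model, one option is to ask brms to print it (a sketch mirroring the brm() call above):

brms::make_stancode(y ~ x, data = data.frame(x = x, y = y), family = student())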