Peter Ralph
Advanced Biological Statistics
Statistical robustness:
Huber, Robust Statistics.
Methods for dealing with “outliers”:
(Look at them carefully and) remove them (if appropriate).
Use methods that are robust to the presence of outliers.
Least-squares fit:
##    user  system elapsed 
##   0.005   0.000   0.005
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43120 -0.29747 -0.02625  0.31638  1.32033 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.98628    0.03459   28.51   <2e-16 ***
## x            2.03262    0.03719   54.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4892 on 198 degrees of freedom
## Multiple R-squared:  0.9378, Adjusted R-squared:  0.9375 
## F-statistic:  2987 on 1 and 198 DF,  p-value: < 2.2e-16
with brms (recall Cauchy is Student’s T with df=1)
## Compiling Stan program...
## Start sampling
##    user  system elapsed 
##  41.596   3.932  49.135
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: y ~ x 
##    Data: data.frame(x = x, y = y) (Number of observations: 200) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.99      0.03     0.92     1.06 1.00     4178     2650
## x             2.03      0.04     1.96     2.11 1.00     3816     2728
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.49      0.02     0.45     0.55 1.00     3224     2428
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Least-squares fit:
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.973  -0.877   0.120   1.312  66.074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.8918     0.5740   1.554    0.122
## x             0.7230     0.6172   1.171    0.243
## 
## Residual standard error: 8.117 on 198 degrees of freedom
## Multiple R-squared:  0.006883,   Adjusted R-squared:  0.001868 
## F-statistic: 1.372 on 1 and 198 DF,  p-value: 0.2428
with brms
## Compiling Stan program...
## Start sampling
##    user  system elapsed 
##  42.218   2.756  47.414
##  Family: student 
##   Links: mu = identity; sigma = identity; nu = identity 
## Formula: y ~ x 
##    Data: data.frame(x = x, y = y) (Number of observations: 200) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     1.03      0.06     0.91     1.14 1.00     3847     2377
## x             2.03      0.07     1.89     2.15 1.00     3910     2827
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     0.60      0.06     0.49     0.74 1.00     3530     2682
## nu        1.09      0.08     1.00     1.29 1.00     2246     1586
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The Cauchy distribution has “heavy tails”: extreme values are much more likely.
So, a point that’s far from the line is less surprising,
and the line with the highest likelihood isn’t as strongly affected by outliers.
pp_checkHow would we have known that outliers were causing a problem?
## Using 10 posterior draws for ppc type 'hist' by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, after switching to the robust fit:
## Using 10 posterior draws for ppc type 'hist' by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.