Peter Ralph
4 February 2020 – Advanced Biological Statistics
diabetes package:lars R Documentation
Blood and other measurements in diabetics
Description:
The ‘diabetes’ data frame has 442 rows and 3 columns. These are
the data used in the Efron et al "Least Angle Regression" paper.
Format:
This data frame contains the following columns:
x a matrix with 10 columns
y a numeric vector
x2 a matrix with 64 columns
The dataset has the ten baseline measurements (x), an expanded matrix (x2) that adds interaction terms such as age:ldl and squared terms such as age^2, and the response y.
## age sex bmi map tc ldl hdl tch ltg glu y
## age 1.000 0.174 0.185 0.34 0.260 0.22 -0.075 0.20 0.27 0.30 0.188
## sex 0.174 1.000 0.088 0.24 0.035 0.14 -0.379 0.33 0.15 0.21 0.043
## bmi 0.185 0.088 1.000 0.40 0.250 0.26 -0.367 0.41 0.45 0.39 0.586
## map 0.335 0.241 0.395 1.00 0.242 0.19 -0.179 0.26 0.39 0.39 0.441
## tc 0.260 0.035 0.250 0.24 1.000 0.90 0.052 0.54 0.52 0.33 0.212
## ldl 0.219 0.143 0.261 0.19 0.897 1.00 -0.196 0.66 0.32 0.29 0.174
## hdl -0.075 -0.379 -0.367 -0.18 0.052 -0.20 1.000 -0.74 -0.40 -0.27 -0.395
## tch 0.204 0.332 0.414 0.26 0.542 0.66 -0.738 1.00 0.62 0.42 0.430
## ltg 0.271 0.150 0.446 0.39 0.516 0.32 -0.399 0.62 1.00 0.46 0.566
## glu 0.302 0.208 0.389 0.39 0.326 0.29 -0.274 0.42 0.46 1.00 0.382
## y 0.188 0.043 0.586 0.44 0.212 0.17 -0.395 0.43 0.57 0.38 1.000
Put aside 20% of the data for testing.
Refit the model.
Predict the test data; compute \[\begin{aligned} S = \sqrt{\frac{1}{M} \sum_{k=1}^M (\hat y_k - y_k)^2} , \end{aligned}\] where \(M\) is the number of test observations.
Repeat for the other four 20%s.
Compare.
First let’s split the data into testing and training just once:
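The code that does the split and the fit isn't shown in this excerpt; a minimal sketch, assuming the response and the 64 expanded predictors are combined into one data frame, and using the name training_d that appears in the output below (the seed is arbitrary):
library(lars)
data(diabetes)
# response plus the 64 expanded predictors (main effects, squares, interactions)
df <- cbind(data.frame(y = diabetes$y), diabetes$x2)
set.seed(123)   # for reproducibility
test_rows <- sample(nrow(df), size = floor(0.2 * nrow(df)))   # hold out 20%
test_d <- df[test_rows, ]
training_d <- df[-test_rows, ]
the_lm <- lm(y ~ ., data = training_d)   # ordinary least squares on the training set
summary(the_lm)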
##
## Call:
## lm(formula = y ~ ., data = training_d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.3 -32.5 -1.1 30.8 150.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.688 2.952 51.04 < 2e-16 ***
## age 84.949 76.868 1.11 0.27004
## sex -269.634 74.608 -3.61 0.00036 ***
## bmi 472.782 95.168 4.97 1.2e-06 ***
## map 360.868 83.425 4.33 2.1e-05 ***
## tc -5344.136 61836.313 -0.09 0.93119
## ldl 4723.000 54348.535 0.09 0.93081
## hdl 1680.538 23107.326 0.07 0.94207
## tch -85.184 310.280 -0.27 0.78387
## ltg 2350.565 20327.550 0.12 0.90802
## glu 89.653 82.363 1.09 0.27730
## `age^2` 66.327 81.718 0.81 0.41767
## `bmi^2` -15.807 98.076 -0.16 0.87207
## `map^2` -52.390 81.903 -0.64 0.52291
## `tc^2` 4501.315 7881.979 0.57 0.56839
## `ldl^2` 1315.348 5909.806 0.22 0.82403
## `hdl^2` 1030.418 1782.583 0.58 0.56369
## `tch^2` 1153.506 714.697 1.61 0.10764
## `ltg^2` 1092.080 1792.281 0.61 0.54280
## `glu^2` 128.336 105.649 1.21 0.22547
## `age:sex` 148.196 90.659 1.63 0.10323
## `age:bmi` 0.261 91.854 0.00 0.99773
## `age:map` 20.594 92.257 0.22 0.82352
## `age:tc` -381.435 724.342 -0.53 0.59889
## `age:ldl` 210.218 572.335 0.37 0.71367
## `age:hdl` 200.905 332.316 0.60 0.54595
## `age:tch` 61.393 261.202 0.24 0.81435
## `age:ltg` 226.963 253.924 0.89 0.37217
## `age:glu` 123.565 97.013 1.27 0.20381
## `sex:bmi` 151.251 90.586 1.67 0.09608 .
## `sex:map` 34.898 92.578 0.38 0.70649
## `sex:tc` 710.474 742.200 0.96 0.33925
## `sex:ldl` -583.090 593.504 -0.98 0.32671
## `sex:hdl` -89.042 339.134 -0.26 0.79308
## `sex:tch` -61.988 232.199 -0.27 0.78969
## `sex:ltg` -210.102 273.395 -0.77 0.44283
## `sex:glu` 2.214 83.695 0.03 0.97891
## `bmi:map` 232.783 105.268 2.21 0.02781 *
## `bmi:tc` -449.811 783.837 -0.57 0.56652
## `bmi:ldl` 449.714 655.611 0.69 0.49331
## `bmi:hdl` 123.457 381.237 0.32 0.74630
## `bmi:tch` -132.984 266.229 -0.50 0.61781
## `bmi:ltg` 132.106 300.483 0.44 0.66053
## `bmi:glu` 88.775 100.372 0.88 0.37720
## `map:tc` 164.889 829.983 0.20 0.84267
## `map:ldl` -35.365 692.990 -0.05 0.95934
## `map:hdl` -84.927 384.110 -0.22 0.82517
## `map:tch` -114.154 239.770 -0.48 0.63437
## `map:ltg` 3.840 326.807 0.01 0.99063
## `map:glu` -244.481 107.705 -2.27 0.02396 *
## `tc:ldl` -4837.758 13111.399 -0.37 0.71242
## `tc:hdl` -2183.968 4297.196 -0.51 0.61169
## `tc:tch` -2109.486 1982.692 -1.06 0.28825
## `tc:ltg` -2127.476 13625.947 -0.16 0.87604
## `tc:glu` 950.939 944.060 1.01 0.31465
## `ldl:hdl` 750.673 3596.170 0.21 0.83480
## `ldl:tch` 685.721 1687.047 0.41 0.68471
## `ldl:ltg` 1301.031 11332.178 0.11 0.90868
## `ldl:glu` -997.177 828.415 -1.20 0.22970
## `hdl:tch` 1423.428 1141.877 1.25 0.21358
## `hdl:ltg` 579.689 4796.146 0.12 0.90388
## `hdl:glu` -207.188 418.504 -0.50 0.62094
## `tch:ltg` 231.583 710.805 0.33 0.74481
## `tch:glu` 195.247 265.217 0.74 0.46223
## `ltg:glu` -262.969 369.010 -0.71 0.47666
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53 on 284 degrees of freedom
## Multiple R-squared: 0.627, Adjusted R-squared: 0.543
## F-statistic: 7.47 on 64 and 284 DF, p-value: <2e-16
With ordinary linear regression, we got a root-mean-square-prediction-error of 61.35 (on the test data), compared to a root-mean-square-error of 47.91 for the training data.
This suggests there’s some overfitting going on.
We have a lot of predictors: 64 of them. A good guess is that only a few are really useful. So, we can put a sparsifying prior on the coefficients, i.e., the \(\beta\)s in \[\begin{aligned} y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \epsilon . \end{aligned}\]
y ~ a + b x[1] + c x[2], and fit a linear model.

Here is a K-fold cross-validation function for the exercise above:

kfold <- function (K, df) {
    N <- nrow(df)
    # randomly assign each row to one of K roughly equal-sized folds
    Kfold <- sample(rep(1:K, length.out=N))
    do_xval <- function (k) {
        # fit on all rows except fold k
        the_lm <- lm(y ~ ., data=df, subset=(Kfold != k))
        train_error <- sqrt(mean(resid(the_lm)^2))
        # root-mean-square prediction error on the held-out fold
        test_y <- df$y[Kfold == k]
        test_error <- sqrt(mean( (test_y - predict(the_lm, newdata=df[Kfold == k, ]))^2 ))
        return(c('test'=test_error, 'train'=train_error))
    }
    results <- sapply(1:K, do_xval)
    return(results)
}
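For example (a hypothetical call, using the combined data frame constructed above):
xval <- kfold(5, df)   # 5-fold cross-validation of the ordinary linear model
xval                   # one column per fold, rows give test and training RMSE
rowMeans(xval)         # average test vs. training error across folds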
Suppose we do regression with a large number of predictor variables.
The resulting coefficients are sparse if most are zero.
The idea is to “encourage” all the coefficients to be zero, unless they really want to be nonzero, in which case we let them be whatever they want.
This tends to discourage overfitting.
To do this, we want a prior which is very peaked at zero but flat away from zero (“spike-and-slab”).
Compare the Normal
\[\begin{aligned} X \sim \Normal(0,1) \end{aligned}\]
to the “exponential scale mixture of Normals”,
\[\begin{aligned} X &\sim \Normal(0,\sigma) \\ \sigma &\sim \Exp(1) . \end{aligned}\]
Lets the data choose the appropriate scale of variation.
Weakly encourages \(\sigma\) to be small: so, as much variation as possible is explained by signal instead of noise.
Gets you a prior that is more peaked at zero and flatter otherwise.
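A quick simulation makes the difference concrete (a sketch, not from the original notes):
# compare a standard Normal to the exponential scale mixture of Normals
n <- 1e5
x_normal <- rnorm(n)
sigma <- rexp(n, rate=1)              # sigma ~ Exp(1)
x_mix <- rnorm(n, mean=0, sd=sigma)   # X | sigma ~ Normal(0, sigma)
# the mixture is more peaked at zero and has heavier tails
plot(density(x_mix), xlim=c(-5, 5), xlab="x",
     main="Normal(0,1) vs exponential scale mixture")
lines(density(x_normal), lty=2)
legend("topright", lty=c(1, 2), legend=c("scale mixture", "Normal(0,1)"))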
Note that
\[\begin{aligned} \beta &\sim \Normal(0,\sigma) \\ \sigma &\sim \Exp(1) . \end{aligned}\]
is equivalent to
\[\begin{aligned} \beta &= \sigma \gamma \\ \gamma &\sim \Normal(0,1) \\ \sigma &\sim \Exp(1) . \end{aligned}\]
parameters {
real beta;
real<lower=0> sigma;
}
model {
beta ~ normal(0, sigma);
sigma ~ exponential(1);
}
is equivalent to
parameters {
real gamma;
real<lower=0> sigma;
}
transformed parameters {
real beta;
beta = gamma * sigma;
}
model {
gamma ~ normal(0, 1);
sigma ~ exponential(1);
}
The second version is better for Stan.
Why is it better?
parameters {
real beta;
real<lower=0> sigma;
}
model {
beta ~ normal(0, sigma);
sigma ~ exponential(1);
}
In the first, the optimal step size for the sampler depends on sigma; in the second (shown again below), gamma is Normal(0,1) regardless of sigma, so a single step size works well everywhere.
parameters {
real gamma;
real<lower=0> sigma;
}
transformed parameters {
real beta;
beta = gamma * sigma;
}
model {
gamma ~ normal(0, 1);
sigma ~ exponential(1);
}
The “horseshoe”:
\[\begin{aligned} \beta_j &\sim \Normal(0, \lambda_j) \\ \lambda_j &\sim \Cauchy(0, \tau) \\ \tau &\sim \Unif(0, 1) \end{aligned}\]
parameters {
vector[p] d_beta;
vector[p] d_lambda;
real<lower=0, upper=1> tau;
}
transformed parameters {
vector[p] beta;
beta = d_beta .* d_lambda * tau;
}
model {
d_beta ~ normal(0, 1);
d_lambda ~ cauchy(0, 1);
// tau ~ uniform(0, 1); // uniform
}
Recall that if
\[\begin{aligned} \beta &\sim \Normal(0, 1/\sqrt{\lambda}) \\ \lambda &\sim \Gam(1/2, 1/2) \end{aligned}\]
then
\[\begin{aligned} \beta &\sim \Cauchy(0, 1). \end{aligned}\]
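This scale-mixture fact is easy to check by simulation (a sketch, not from the original notes):
# Cauchy as a Gamma scale mixture of Normals
n <- 1e4
lambda <- rgamma(n, shape=1/2, rate=1/2)      # lambda ~ Gamma(1/2, 1/2)
beta <- rnorm(n, mean=0, sd=1/sqrt(lambda))   # beta | lambda ~ Normal(0, 1/sqrt(lambda))
# compare to a true Cauchy(0,1): the Q-Q plot should follow the 1:1 line
qqplot(qcauchy(ppoints(n)), beta,
       xlab="Cauchy(0,1) quantiles", ylab="sample quantiles",
       xlim=c(-50, 50), ylim=c(-50, 50))
abline(0, 1, col="red")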
xy <- cbind(data.frame(y=diabetes$y), diabetes$x2)
names(xy) <- gsub("[:^]", "_", names(xy))
# ols
blm <- brm(y ~ ., data=xy, family=gaussian(link='identity'))
## Compiling the C++ model
## Start sampling
## Warning: There were 4000 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.32, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
bhs <- brm(y ~ ., data=xy, family=gaussian(link='identity'),
prior=c(set_prior(horseshoe(), class="b")))
## Compiling the C++ model
## Start sampling
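The call that produced the five-fold cross-validation results below isn't shown; presumably it was brms's kfold() method applied to each fit, along these lines (a guess at the original call):
# 5-fold cross-validation of each brms fit
# (this is brms's kfold() method for brmsfit objects, not the hand-rolled kfold() above)
kfold(blm, K=5)
kfold(bhs, K=5)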
## Fitting model 1 out of 5
## Fitting model 2 out of 5
## Fitting model 3 out of 5
## Fitting model 4 out of 5
## Fitting model 5 out of 5
## Start sampling
## Start sampling
## Start sampling
## Start sampling
## Start sampling
##
## Based on 5-fold cross-validation
##
## Estimate SE
## elpd_kfold -2427.0 14.2
## p_kfold NA NA
## kfoldic 4854.1 28.4
## Fitting model 1 out of 5
## Fitting model 2 out of 5
## Fitting model 3 out of 5
## Fitting model 4 out of 5
## Fitting model 5 out of 5
## Start sampling
## Start sampling
## Start sampling
## Start sampling
## Start sampling
##
## Based on 5-fold cross-validation
##
## Estimate SE
## elpd_kfold -2399.6 13.6
## p_kfold NA NA
## kfoldic 4799.3 27.2
Use the data to try to reproduce their model:
BIR = 14,195.35 - 141.75 (sand%) - 142.10 (silt%) - 142.56 (clay%)
They’re not wrong! What’s going on?
The idea is to plot the quantiles of each distribution against each other.
If these are datasets, this means just plotting their sorted values against each other.
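For example, with two datasets (hypothetical x1 and x2):
# quantile-quantile plot of two datasets: sorted values against each other
x1 <- rnorm(500)
x2 <- rnorm(500, mean=0, sd=1.5)
plot(sort(x1), sort(x2), xlab="sorted values of x1", ylab="sorted values of x2")
abline(0, 1, lty=2)   # departures from the 1:1 line show how the distributions differ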
Uses a reparameterization of the Cauchy as a scale mixture of normals.
horseshoe_block <- "
data {
int N;
int p;
vector[N] y;
matrix[N,p] x;
}
parameters {
real b0;
vector[p] d_beta;
vector[p] d_a;
vector<lower=0>[p] d_b;
real<lower=0, upper=1> tau;
real<lower=0> sigma;
}
transformed parameters {
vector[p] beta;
vector[N] f;
beta = d_beta .* d_a .* sqrt(d_b) * tau;
f = b0 + x * beta;
}
model {
y ~ normal(f, sigma);
// HORSESHOE PRIOR:
d_beta ~ normal(0, 1);
d_a ~ normal(0, 1);
d_b ~ inv_gamma(0.5, 0.5);
// tau ~ uniform(0, 1); // uniform
// priors on noise distribution:
sigma ~ normal(0, 10);
}"
Note the data have already been normalized, with the exception of \(y\):
## y age sex bmi map tc ldl hdl tch ltg glu age^2
## Min. : 25 Min. :-0.107 Min. :-0.045 Min. :-0.089 Min. :-0.109 Min. :-0.109 Min. :-0.116 Min. :-0.102 Min. :-0.076 Min. :-0.126 Min. :-0.138 Min. :-0.041
## 1st Qu.: 84 1st Qu.:-0.038 1st Qu.:-0.045 1st Qu.:-0.034 1st Qu.:-0.033 1st Qu.:-0.033 1st Qu.:-0.029 1st Qu.:-0.036 1st Qu.:-0.039 1st Qu.:-0.035 1st Qu.:-0.034 1st Qu.:-0.037
## Median :141 Median : 0.005 Median :-0.045 Median :-0.007 Median :-0.006 Median :-0.004 Median :-0.004 Median :-0.007 Median :-0.003 Median :-0.004 Median : 0.003 Median :-0.020
## Mean :152 Mean :-0.001 Mean : 0.000 Mean : 0.001 Mean :-0.001 Mean : 0.000 Mean : 0.001 Mean :-0.001 Mean : 0.002 Mean : 0.000 Mean : 0.001 Mean :-0.001
## 3rd Qu.:214 3rd Qu.: 0.034 3rd Qu.: 0.051 3rd Qu.: 0.034 3rd Qu.: 0.032 3rd Qu.: 0.027 3rd Qu.: 0.031 3rd Qu.: 0.027 3rd Qu.: 0.034 3rd Qu.: 0.034 3rd Qu.: 0.032 3rd Qu.: 0.017
## Max. :341 Max. : 0.111 Max. : 0.051 Max. : 0.171 Max. : 0.132 Max. : 0.154 Max. : 0.199 Max. : 0.181 Max. : 0.185 Max. : 0.134 Max. : 0.136 Max. : 0.183
## bmi^2 map^2 tc^2 ldl^2 hdl^2 tch^2 ltg^2 glu^2 age:sex age:bmi age:map age:tc
## Min. :-0.03 Min. :-0.039 Min. :-0.032 Min. :-0.03 Min. :-0.03 Min. :-0.03 Min. :-0.035 Min. :-0.032 Min. :-0.121 Min. :-0.147 Min. :-0.166 Min. :-0.136
## 1st Qu.:-0.03 1st Qu.:-0.035 1st Qu.:-0.029 1st Qu.:-0.03 1st Qu.:-0.02 1st Qu.:-0.03 1st Qu.:-0.030 1st Qu.:-0.029 1st Qu.:-0.031 1st Qu.:-0.022 1st Qu.:-0.020 1st Qu.:-0.021
## Median :-0.02 Median :-0.020 Median :-0.018 Median :-0.02 Median :-0.01 Median :-0.01 Median :-0.017 Median :-0.017 Median : 0.001 Median :-0.007 Median :-0.009 Median :-0.010
## Mean : 0.00 Mean :-0.002 Mean :-0.001 Mean : 0.00 Mean : 0.00 Mean : 0.00 Mean : 0.001 Mean : 0.001 Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean :-0.003
## 3rd Qu.: 0.02 3rd Qu.: 0.018 3rd Qu.: 0.005 3rd Qu.: 0.01 3rd Qu.: 0.01 3rd Qu.:-0.01 3rd Qu.: 0.017 3rd Qu.: 0.011 3rd Qu.: 0.035 3rd Qu.: 0.019 3rd Qu.: 0.021 3rd Qu.: 0.015
## Max. : 0.39 Max. : 0.264 Max. : 0.303 Max. : 0.49 Max. : 0.37 Max. : 0.43 Max. : 0.241 Max. : 0.236 Max. : 0.112 Max. : 0.170 Max. : 0.228 Max. : 0.185
## age:ldl age:hdl age:tch age:ltg age:glu sex:bmi sex:map sex:tc sex:ldl sex:hdl sex:tch sex:ltg
## Min. :-0.155 Min. :-0.180 Min. :-0.201 Min. :-0.160 Min. :-0.120 Min. :-0.126 Min. :-0.140 Min. :-0.121 Min. :-0.130 Min. :-0.166 Min. :-0.160 Min. :-0.143
## 1st Qu.:-0.021 1st Qu.:-0.025 1st Qu.:-0.018 1st Qu.:-0.022 1st Qu.:-0.021 1st Qu.:-0.036 1st Qu.:-0.033 1st Qu.:-0.032 1st Qu.:-0.032 1st Qu.:-0.034 1st Qu.:-0.020 1st Qu.:-0.034
## Median :-0.008 Median : 0.001 Median :-0.009 Median :-0.008 Median :-0.010 Median :-0.002 Median : 0.001 Median :-0.001 Median :-0.001 Median :-0.004 Median : 0.008 Median : 0.004
## Mean :-0.002 Mean :-0.003 Mean : 0.000 Mean : 0.000 Mean : 0.002 Mean : 0.000 Mean :-0.001 Mean :-0.002 Mean :-0.001 Mean :-0.001 Mean : 0.001 Mean :-0.001
## 3rd Qu.: 0.014 3rd Qu.: 0.017 3rd Qu.: 0.021 3rd Qu.: 0.020 3rd Qu.: 0.017 3rd Qu.: 0.034 3rd Qu.: 0.035 3rd Qu.: 0.031 3rd Qu.: 0.029 3rd Qu.: 0.030 3rd Qu.: 0.022 3rd Qu.: 0.032
## Max. : 0.206 Max. : 0.180 Max. : 0.284 Max. : 0.187 Max. : 0.184 Max. : 0.179 Max. : 0.104 Max. : 0.162 Max. : 0.206 Max. : 0.162 Max. : 0.191 Max. : 0.136
## sex:glu bmi:map bmi:tc bmi:ldl bmi:hdl bmi:tch bmi:ltg bmi:glu map:tc map:ldl map:hdl map:tch
## Min. :-0.140 Min. :-0.125 Min. :-0.293 Min. :-0.287 Min. :-0.268 Min. :-0.132 Min. :-0.185 Min. :-0.154 Min. :-0.202 Min. :-0.204 Min. :-0.283 Min. :-0.145
## 1st Qu.:-0.033 1st Qu.:-0.021 1st Qu.:-0.019 1st Qu.:-0.022 1st Qu.:-0.019 1st Qu.:-0.024 1st Qu.:-0.024 1st Qu.:-0.024 1st Qu.:-0.020 1st Qu.:-0.020 1st Qu.:-0.020 1st Qu.:-0.018
## Median :-0.005 Median :-0.011 Median :-0.008 Median :-0.008 Median : 0.011 Median :-0.016 Median :-0.012 Median :-0.015 Median :-0.009 Median :-0.007 Median : 0.005 Median :-0.010
## Mean :-0.002 Mean : 0.002 Mean : 0.000 Mean : 0.001 Mean : 0.000 Mean : 0.001 Mean : 0.000 Mean : 0.000 Mean :-0.002 Mean :-0.001 Mean :-0.001 Mean : 0.000
## 3rd Qu.: 0.034 3rd Qu.: 0.017 3rd Qu.: 0.020 3rd Qu.: 0.016 3rd Qu.: 0.026 3rd Qu.: 0.021 3rd Qu.: 0.022 3rd Qu.: 0.019 3rd Qu.: 0.018 3rd Qu.: 0.018 3rd Qu.: 0.020 3rd Qu.: 0.021
## Max. : 0.133 Max. : 0.228 Max. : 0.227 Max. : 0.232 Max. : 0.146 Max. : 0.276 Max. : 0.219 Max. : 0.225 Max. : 0.190 Max. : 0.192 Max. : 0.141 Max. : 0.229
## map:ltg map:glu tc:ldl tc:hdl tc:tch tc:ltg tc:glu ldl:hdl ldl:tch ldl:ltg ldl:glu hdl:tch
## Min. :-0.146 Min. :-0.143 Min. :-0.04 Min. :-0.21 Min. :-0.12 Min. :-0.105 Min. :-0.121 Min. :-0.256 Min. :-0.11 Min. :-0.183 Min. :-0.123 Min. :-0.235
## 1st Qu.:-0.024 1st Qu.:-0.023 1st Qu.:-0.03 1st Qu.:-0.02 1st Qu.:-0.02 1st Qu.:-0.026 1st Qu.:-0.021 1st Qu.:-0.017 1st Qu.:-0.02 1st Qu.:-0.021 1st Qu.:-0.020 1st Qu.:-0.019
## Median :-0.012 Median :-0.013 Median :-0.02 Median : 0.00 Median :-0.02 Median :-0.015 Median :-0.011 Median : 0.005 Median :-0.01 Median :-0.007 Median :-0.010 Median : 0.014
## Mean :-0.001 Mean :-0.001 Mean : 0.00 Mean : 0.00 Mean : 0.00 Mean : 0.000 Mean : 0.002 Mean :-0.002 Mean : 0.00 Mean : 0.001 Mean : 0.002 Mean :-0.002
## 3rd Qu.: 0.024 3rd Qu.: 0.011 3rd Qu.: 0.01 3rd Qu.: 0.01 3rd Qu.: 0.01 3rd Qu.: 0.017 3rd Qu.: 0.014 3rd Qu.: 0.017 3rd Qu.: 0.01 3rd Qu.: 0.025 3rd Qu.: 0.015 3rd Qu.: 0.031
## Max. : 0.187 Max. : 0.222 Max. : 0.40 Max. : 0.32 Max. : 0.47 Max. : 0.223 Max. : 0.237 Max. : 0.161 Max. : 0.56 Max. : 0.203 Max. : 0.299 Max. : 0.061
## hdl:ltg hdl:glu tch:ltg tch:glu ltg:glu
## Min. :-0.255 Min. :-0.223 Min. :-0.16 Min. :-0.13 Min. :-0.092
## 1st Qu.:-0.019 1st Qu.:-0.015 1st Qu.:-0.03 1st Qu.:-0.02 1st Qu.:-0.023
## Median : 0.007 Median : 0.009 Median :-0.01 Median :-0.02 Median :-0.013
## Mean :-0.002 Mean : 0.001 Mean : 0.00 Mean : 0.00 Mean : 0.000
## 3rd Qu.: 0.021 3rd Qu.: 0.023 3rd Qu.: 0.02 3rd Qu.: 0.02 3rd Qu.: 0.014
## Max. : 0.163 Max. : 0.210 Max. : 0.38 Max. : 0.32 Max. : 0.195
horseshoe_fit <- stan(model_code=horseshoe_block,
data=list(N=nrow(training_d),
p=ncol(training_d)-1,
y=(training_d$y
- median(training_d$y))
/mad(training_d$y),
x=as.matrix(training_d[,-1])),
iter=1000,
control=list(adapt_delta=0.999,
max_treedepth=15))
## Warning: There were 27 divergent transitions after warmup. Increasing adapt_delta above 0.999 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.53, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## b0 0.11847 0.00054 0.031 0.058 0.09724 0.11828 0.141 0.18 3347 1
## sigma 0.59494 0.00048 0.024 0.549 0.57915 0.59438 0.610 0.64 2477 1
## beta[1] 0.18468 0.01081 0.441 -0.439 -0.04236 0.05639 0.330 1.36 1669 1
## beta[2] -1.27036 0.03230 1.000 -3.355 -1.98366 -1.22500 -0.383 0.12 959 1
## beta[3] 5.97824 0.02349 0.928 4.142 5.36459 5.99472 6.606 7.83 1561 1
## beta[4] 3.43216 0.03138 0.990 1.345 2.83589 3.46677 4.079 5.29 996 1
## beta[5] -0.23122 0.02070 0.669 -1.934 -0.34336 -0.04547 0.047 0.65 1045 1
## beta[6] -0.00017 0.01544 0.560 -1.000 -0.14021 -0.00303 0.109 0.96 1317 1
## beta[7] -1.31586 0.03931 1.197 -3.845 -2.21610 -1.14696 -0.159 0.18 928 1
## beta[8] 0.30767 0.02202 0.755 -0.590 -0.03962 0.05539 0.410 2.67 1177 1
## beta[9] 5.71389 0.02644 1.018 3.725 5.03216 5.68509 6.420 7.72 1484 1
## beta[10] 0.26253 0.01448 0.518 -0.414 -0.02317 0.07275 0.423 1.73 1278 1
## beta[11] 0.23320 0.01261 0.482 -0.416 -0.02111 0.07704 0.393 1.58 1463 1
## beta[12] 0.14319 0.01199 0.442 -0.587 -0.04693 0.03456 0.267 1.36 1361 1
## beta[13] 0.03119 0.01080 0.339 -0.672 -0.08882 0.00737 0.129 0.85 987 1
## beta[14] -0.02838 0.01354 0.486 -1.202 -0.16016 -0.00246 0.121 1.02 1288 1
## beta[15] -0.08398 0.01429 0.535 -1.337 -0.17772 -0.01292 0.084 0.91 1400 1
## beta[16] 0.03915 0.01208 0.441 -0.915 -0.09980 0.00799 0.166 1.05 1333 1
## beta[17] 0.07360 0.01198 0.432 -0.698 -0.08954 0.01054 0.178 1.18 1302 1
## beta[18] -0.28898 0.01467 0.545 -1.751 -0.46633 -0.09520 0.014 0.39 1382 1
## beta[19] 0.78015 0.02871 0.871 -0.278 0.04895 0.51543 1.368 2.74 920 1
## beta[20] 1.47129 0.02600 0.938 -0.032 0.72236 1.48715 2.159 3.26 1301 1
## beta[21] 0.20772 0.01309 0.481 -0.499 -0.03633 0.06031 0.352 1.52 1351 1
## beta[22] 0.26289 0.01393 0.512 -0.417 -0.01699 0.09774 0.455 1.67 1349 1
## beta[23] 0.00158 0.01182 0.417 -0.879 -0.11609 0.00167 0.134 0.84 1245 1
## beta[24] -0.10350 0.01260 0.462 -1.316 -0.21664 -0.01530 0.069 0.68 1344 1
## beta[25] 0.03007 0.01111 0.385 -0.725 -0.10381 0.00717 0.153 0.92 1199 1
## beta[26] 0.04793 0.01247 0.434 -0.792 -0.10275 0.00745 0.172 1.06 1213 1
## beta[27] 0.35088 0.01561 0.605 -0.384 -0.00555 0.11684 0.572 1.98 1502 1
## beta[28] 0.28770 0.01490 0.555 -0.429 -0.01956 0.09428 0.492 1.84 1388 1
## beta[29] 0.55050 0.01892 0.715 -0.249 0.00938 0.29048 0.930 2.34 1426 1
## beta[30] 0.16897 0.01170 0.444 -0.483 -0.03571 0.04440 0.289 1.40 1440 1
## beta[31] 0.07436 0.01259 0.429 -0.668 -0.09090 0.01118 0.166 1.21 1162 1
## beta[32] -0.13863 0.01324 0.446 -1.328 -0.25271 -0.03182 0.059 0.59 1134 1
## beta[33] 0.41573 0.01688 0.645 -0.315 -0.00077 0.16688 0.666 2.14 1461 1
## beta[34] -0.13366 0.01228 0.452 -1.323 -0.27521 -0.03291 0.057 0.59 1354 1
## beta[35] 0.06844 0.01060 0.375 -0.656 -0.07767 0.00983 0.189 0.99 1251 1
## beta[36] 0.00597 0.00986 0.351 -0.773 -0.11192 0.00097 0.135 0.76 1267 1
## beta[37] 0.84159 0.02197 0.878 -0.203 0.08052 0.59820 1.447 2.82 1598 1
## beta[38] 0.00356 0.01001 0.393 -0.814 -0.11802 0.00027 0.123 0.87 1539 1
## beta[39] 0.03318 0.01143 0.408 -0.838 -0.09997 0.00638 0.154 0.97 1272 1
## beta[40] -0.03956 0.01084 0.406 -0.994 -0.16742 -0.00808 0.101 0.80 1402 1
## beta[41] 0.04430 0.01212 0.428 -0.801 -0.11201 0.00518 0.164 1.06 1248 1
## beta[42] 0.03220 0.01094 0.357 -0.746 -0.09465 0.00515 0.152 0.86 1068 1
## beta[43] 0.32835 0.01665 0.595 -0.426 -0.01796 0.12077 0.522 1.98 1278 1
## beta[44] 0.08403 0.01295 0.431 -0.689 -0.08137 0.01713 0.202 1.13 1109 1
## beta[45] 0.08795 0.01288 0.438 -0.625 -0.07965 0.01571 0.176 1.20 1155 1
## beta[46] 0.20045 0.01362 0.523 -0.593 -0.03954 0.04958 0.323 1.64 1474 1
## beta[47] -0.12439 0.01312 0.468 -1.346 -0.23290 -0.02234 0.066 0.66 1275 1
## beta[48] 0.04445 0.01014 0.390 -0.703 -0.10447 0.00717 0.151 1.00 1479 1
## beta[49] -0.17629 0.01495 0.499 -1.577 -0.28276 -0.03913 0.057 0.52 1112 1
## beta[50] -0.01323 0.01286 0.501 -1.080 -0.15172 -0.00465 0.112 0.98 1518 1
## beta[51] 0.07131 0.01251 0.448 -0.667 -0.09582 0.01067 0.167 1.19 1282 1
## beta[52] -0.37967 0.02132 0.758 -2.461 -0.54911 -0.09195 0.024 0.46 1263 1
## beta[53] -0.13073 0.01312 0.498 -1.500 -0.22152 -0.01822 0.072 0.63 1438 1
## beta[54] 0.05582 0.01172 0.435 -0.805 -0.08886 0.01069 0.173 1.10 1377 1
## beta[55] -0.07641 0.01428 0.466 -1.315 -0.17775 -0.00895 0.094 0.71 1063 1
## beta[56] -0.03597 0.01354 0.468 -1.096 -0.16718 -0.00506 0.109 0.96 1196 1
## beta[57] 0.23560 0.01706 0.606 -0.483 -0.04384 0.04846 0.352 2.00 1262 1
## beta[58] 0.02036 0.01282 0.432 -0.871 -0.12112 0.00130 0.143 0.96 1136 1
## beta[59] -0.15576 0.01334 0.488 -1.542 -0.25669 -0.02821 0.057 0.61 1340 1
## beta[60] 0.10674 0.01163 0.417 -0.574 -0.06224 0.02681 0.212 1.20 1284 1
## beta[61] 0.01456 0.01035 0.364 -0.768 -0.11944 0.00029 0.129 0.86 1233 1
## beta[62] -0.14660 0.01325 0.492 -1.361 -0.26495 -0.02706 0.060 0.70 1379 1
## beta[63] 0.12335 0.01578 0.496 -0.704 -0.07143 0.02268 0.233 1.49 989 1
## beta[64] 0.15840 0.01482 0.490 -0.603 -0.05885 0.03289 0.306 1.50 1092 1
First, compare the resulting regression parameters to the OLS values.
The coefficient estimates from OLS are weird:
## ols stan brms
## (Intercept) 150.69 151.697 152.12
## tc -5344.14 -4.112 -39.32
## `tc:ldl` -4837.76 -0.420 2.57
## ldl 4723.00 -0.274 -15.40
## `tc^2` 4501.31 -0.222 0.41
## ltg 2350.56 514.151 503.02
## `tc:hdl` -2183.97 0.965 15.63
## `tc:ltg` -2127.48 -1.647 -9.43
## `tc:tch` -2109.49 -8.316 -37.94
## hdl 1680.54 -103.729 -198.07
## `hdl:tch` 1423.43 -2.551 -15.64
## `ldl^2` 1315.35 -1.168 -8.08
## `ldl:ltg` 1301.03 4.383 31.38
## `tch^2` 1153.51 0.953 9.84
## `ltg^2` 1092.08 -8.610 -16.80
## `hdl^2` 1030.42 0.723 4.75
## `ldl:glu` -997.18 0.118 14.30
## `tc:glu` 950.94 0.967 7.18
## `ldl:hdl` 750.67 -0.809 -6.66
## `sex:tc` 710.47 1.011 1.77
## `ldl:tch` 685.72 -0.458 2.79
## `sex:ldl` -583.09 -2.878 -14.72
## `hdl:ltg` 579.69 2.424 16.42
## bmi 472.78 542.154 510.40
## `bmi:tc` -449.81 0.024 -6.41
## `bmi:ldl` 449.71 0.577 -2.40
## `age:tc` -381.44 0.151 -4.40
## map 360.87 313.530 290.61
## sex -269.63 -110.788 -160.69
## `ltg:glu` -262.97 2.975 7.05
## `map:glu` -244.48 -3.539 -11.83
## `bmi:map` 232.78 54.101 77.03
## `tch:ltg` 231.58 -2.447 -24.62
## `age:ltg` 226.96 10.567 32.23
## `age:ldl` 210.22 -1.384 -20.88
## `sex:ltg` -210.10 0.889 2.05
## `hdl:glu` -207.19 0.027 3.76
## `age:hdl` 200.91 0.648 12.25
## `tch:glu` 195.25 2.051 22.71
## `map:tc` 164.89 1.549 11.64
## `sex:bmi` 151.25 26.270 23.66
## `age:sex` 148.20 134.496 120.77
## `bmi:tch` -132.98 0.469 6.28
## `bmi:ltg` 132.11 0.466 10.37
## `glu^2` 128.34 46.615 49.61
## `age:glu` 123.57 8.527 25.26
## `bmi:hdl` 123.46 -0.731 -2.65
## `map:tch` -114.15 -2.020 -9.89
## glu 89.65 6.579 26.70
## `sex:hdl` -89.04 15.092 24.71
## `bmi:glu` 88.77 10.922 16.41
## tch -85.18 5.010 34.45
## age 84.95 5.100 7.73
## `map:hdl` -84.93 4.484 16.25
## `age^2` 66.33 6.968 28.84
## `sex:tch` -61.99 -2.976 -15.37
## `age:tch` 61.39 0.673 0.95
## `map^2` -52.39 0.667 4.76
## `map:ldl` -35.37 1.421 7.81
## `sex:map` 34.90 4.016 28.94
## `age:map` 20.59 8.839 22.15
## `bmi^2` -15.81 3.125 29.56
## `map:ltg` 3.84 0.648 7.32
## `sex:glu` 2.21 0.088 8.23
## `age:bmi` 0.26 5.454 3.13
And they are quite different from what Stan gets:
## ols stan brms
## (Intercept) 150.69 151.697 152.12
## bmi 472.78 542.154 510.40
## ltg 2350.56 514.151 503.02
## map 360.87 313.530 290.61
## `age:sex` 148.20 134.496 120.77
## sex -269.63 -110.788 -160.69
## hdl 1680.54 -103.729 -198.07
## `bmi:map` 232.78 54.101 77.03
## `glu^2` 128.34 46.615 49.61
## `sex:bmi` 151.25 26.270 23.66
## `sex:hdl` -89.04 15.092 24.71
## `bmi:glu` 88.77 10.922 16.41
## `age:ltg` 226.96 10.567 32.23
## `age:map` 20.59 8.839 22.15
## `ltg^2` 1092.08 -8.610 -16.80
## `age:glu` 123.57 8.527 25.26
## `tc:tch` -2109.49 -8.316 -37.94
## `age^2` 66.33 6.968 28.84
## glu 89.65 6.579 26.70
## `age:bmi` 0.26 5.454 3.13
## age 84.95 5.100 7.73
## tch -85.18 5.010 34.45
## `map:hdl` -84.93 4.484 16.25
## `ldl:ltg` 1301.03 4.383 31.38
## tc -5344.14 -4.112 -39.32
## `sex:map` 34.90 4.016 28.94
## `map:glu` -244.48 -3.539 -11.83
## `bmi^2` -15.81 3.125 29.56
## `sex:tch` -61.99 -2.976 -15.37
## `ltg:glu` -262.97 2.975 7.05
## `sex:ldl` -583.09 -2.878 -14.72
## `hdl:tch` 1423.43 -2.551 -15.64
## `tch:ltg` 231.58 -2.447 -24.62
## `hdl:ltg` 579.69 2.424 16.42
## `tch:glu` 195.25 2.051 22.71
## `map:tch` -114.15 -2.020 -9.89
## `tc:ltg` -2127.48 -1.647 -9.43
## `map:tc` 164.89 1.549 11.64
## `map:ldl` -35.37 1.421 7.81
## `age:ldl` 210.22 -1.384 -20.88
## `ldl^2` 1315.35 -1.168 -8.08
## `sex:tc` 710.47 1.011 1.77
## `tc:glu` 950.94 0.967 7.18
## `tc:hdl` -2183.97 0.965 15.63
## `tch^2` 1153.51 0.953 9.84
## `sex:ltg` -210.10 0.889 2.05
## `ldl:hdl` 750.67 -0.809 -6.66
## `bmi:hdl` 123.46 -0.731 -2.65
## `hdl^2` 1030.42 0.723 4.75
## `age:tch` 61.39 0.673 0.95
## `map^2` -52.39 0.667 4.76
## `age:hdl` 200.91 0.648 12.25
## `map:ltg` 3.84 0.648 7.32
## `bmi:ldl` 449.71 0.577 -2.40
## `bmi:tch` -132.98 0.469 6.28
## `bmi:ltg` 132.11 0.466 10.37
## `ldl:tch` 685.72 -0.458 2.79
## `tc:ldl` -4837.76 -0.420 2.57
## ldl 4723.00 -0.274 -15.40
## `tc^2` 4501.31 -0.222 0.41
## `age:tc` -381.44 0.151 -4.40
## `ldl:glu` -997.18 0.118 14.30
## `sex:glu` 2.21 0.088 8.23
## `hdl:glu` -207.19 0.027 3.76
## `bmi:tc` -449.81 0.024 -6.41
Now let’s look at out-of-sample prediction error, using the posterior median coefficient estimates:
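The objects post_med_intercept and post_med_slopes used below aren't defined in this excerpt; a sketch of how they might be computed from the Stan fit (remembering that the model was fit to y centered by its median and scaled by its MAD, so the estimates must be rescaled):
# assumed reconstruction: posterior medians, back on the original scale of y
post <- rstan::extract(horseshoe_fit)
y_scale <- mad(training_d$y)
y_center <- median(training_d$y)
post_med_slopes <- apply(post$beta, 2, median) * y_scale
post_med_intercept <- median(post$b0) * y_scale + y_center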
pred_stan <- function (x) {
post_med_intercept + as.matrix(x) %*% post_med_slopes
}
pred_y <- pred_stan(test_d[,-1])
stan_pred_error <- sqrt(mean((test_d$y - pred_y)^2))
stan_mse_resid <- sqrt(mean((training_d$y - pred_stan(training_d[,-1]))^2))
plot(test_d$y, pred_y, xlab="true values", ylab="predicted values", main="test data")
abline(0,1)
Our “sparse” model is certainly more sparse, and arguably more interpretable.
It has a root-mean-square prediction error of 52.42 on the test data, and 52.92 on the training data.
This is substantially better than ordinary linear regression, which had a root-mean-square prediction error of 61.35 on the test data, and a root-mean-square-error of 47.91 on the training data.
The sparse model is more interpretable, and more generalizable.
Steps in data analysis:
Care, or at least think, about the data.
Look at the data.
Query the data.
Sanity check.
Communicate.
Often “statistics” focuses on querying. Doing that effectively requires all the other steps, too.
A statistic is a numerical description of a dataset.
A parameter is a numerical attribute of a model of reality.
Often, statistics are used to estimate parameters.
Common statistical tasks:
estimating parameters, with uncertainty (confidence or credible intervals)
evaluating (in-)consistency with a particular situation (\(p\)-values)
prediction, with uncertainty
model adequacy and fit (simulation, crossvalidation)
Parameters are numbers. With units.
So, they need to mean something concrete and precise.
In other words, they need to be part of a model.
To separate information from noise, we need to understand the noise.
How much of these differences are due to measurement error? or, unpredictable variation? How much is telling me something about the world?
A full model also describes the statistical noise in the data.
(It’s a full model if you can simulate from it.)
Fitting a model: Finding parameter values that make the data look plausible.
There may be a wide range of plausible parameter values.
Priors help get answers.
What variables?
Response distribution(s).
Relationship of predictors to the response(s).
Relationships between parameters.
Every statistical method has a model lurking somewhere behind it.
The basic tool is likelihood:
how likely are the data, using this set of parameters?
Some methods:
Hill climbing (maximum likelihood).
Hill wandering (Monte Carlo, to explore posterior distributions).
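As a toy illustration of hill climbing on a likelihood surface (a sketch, not from the original notes), we can fit a Normal mean and standard deviation by maximizing the log-likelihood numerically:
# how likely are the data, using this set of parameters?
set.seed(1)
z <- rnorm(100, mean=3, sd=2)
negloglik <- function (par) {
    # par = c(mean, log(sd)); the log scale keeps sd positive
    -sum(dnorm(z, mean=par[1], sd=exp(par[2]), log=TRUE))
}
fit <- optim(c(0, 0), negloglik)          # hill climbing (on the negative log-likelihood)
c(mean=fit$par[1], sd=exp(fit$par[2]))    # close to mean(z) and sd(z)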