Overfitting in the wild

Peter Ralph

Advanced Biological Statistics

Soil infiltration

Estimation of infiltration rate from soil properties using regression model for cultivated land

EIR = 14,195.35 - 141.75 (sand%) - 142.10 (silt%) - 142.56 (clay%)

Use the data to try to reproduce their model:

BIR = 14,195.35 - 141.75 (sand%) - 142.10 (silt%) - 142.56 (clay%)

They’re not wrong! What’s up with those coefficients, though?

IN CLASS

Read in the data:

x <- read.table("data/infiltration_data.tsv", header=TRUE)

range(rowSums(x[, c("Sand", "Clay", "Silt")]))

## [1] 100 100

Fit the model:

the_lm <- lm( BIR ~ Sand + Silt + Clay, data=x)
summary(the_lm)

## 
## Call:
## lm(formula = BIR ~ Sand + Silt + Clay, data = x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7.39  -1.52   0.63   1.99   4.27 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -62.112     14.617   -4.25  0.00033 ***
## Sand           0.831      0.160    5.21  3.2e-05 ***
## Silt           0.473      0.234    2.02  0.05577 .  
## Clay              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.9 on 22 degrees of freedom
## Multiple R-squared:  0.627,  Adjusted R-squared:  0.593 
## F-statistic: 18.5 on 2 and 22 DF,  p-value: 1.96e-05

There is a small difference between our predictions and theirs - due to rounding, it will turn out.

preds <- predict(the_lm)
ours <- -62.1123 +  0.8315 * x$Sand + 0.4733 * x$Silt
theirs <- 14195.35 - 141.75 * x$Sand - 142.10 * x$Silt - 142.56 * x$Clay
layout(1:2)
plot(preds, ours)
abline(0, 1)
plot(ours, theirs)
abline(0, 1)

plot of chunk r predictions

Ok, so Sand + Silt + Clay is always equal to 100, so for any number a, it’s always true that a * Sand + a * Silt + a * Clay - a * 100 = 0. So, we can add this to our formula and get exactly the same thing. Our model has no coefficient for Clay, so to make their model look like ours we can do this with a equal to their coefficient for Clay:

clay_coef <- 142.56
# this is the same as
theirs_noclay <- 14195.35 - clay_coef * 100 + (clay_coef - 141.75) * x$Sand + (clay_coef - 142.10) * x$Silt + (clay_coef - 142.56) * x$Clay
plot(theirs, theirs_noclay)
abline(0, 1)

plot of chunk r new

So, what are these numbers?

The new coefficients that are equivalent to theirs are nearly the same as ours - the difference is because of rounding error!

coefs <- as.data.frame(coef(the_lm))
names(coefs)[[1]] <- "lm"
coefs$theirs <- c(14195.35, - 141.75, - 142.10, - 142.56)
coefs$theirs_noclay <- c(
    14195.35 - clay_coef * 100,
    (clay_coef - 141.75),
    (clay_coef - 142.10),
    (clay_coef - 142.56)
)
coefs

##                 lm theirs theirs_noclay
## (Intercept) -62.11  14195        -60.65
## Sand          0.83   -142          0.81
## Silt          0.47   -142          0.46
## Clay            NA   -143          0.00