\[
%%
% Add your macros here; they'll be included in pdf and html output.
%%
\newcommand{\R}{\mathbb{R}} % reals
\newcommand{\E}{\mathbb{E}} % expectation
\renewcommand{\P}{\mathbb{P}} % probability
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\logistic}{logistic}
\DeclareMathOperator{\SE}{SE}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\cor}{cor}
\DeclareMathOperator{\Normal}{Normal}
\DeclareMathOperator{\MVN}{MVN}
\DeclareMathOperator{\LogNormal}{logNormal}
\DeclareMathOperator{\Poisson}{Poisson}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Binom}{Binomial}
\DeclareMathOperator{\Gam}{Gamma}
\DeclareMathOperator{\Exp}{Exponential}
\DeclareMathOperator{\Cauchy}{Cauchy}
\DeclareMathOperator{\Unif}{Unif}
\DeclareMathOperator{\Dirichlet}{Dirichlet}
\DeclareMathOperator{\Wishart}{Wishart}
\DeclareMathOperator{\StudentsT}{StudentsT}
\DeclareMathOperator{\Weibull}{Weibull}
\newcommand{\given}{\;\vert\;}
\]

Linear models and least squares

Peter Ralph

Advanced Biological Statistics

Linear models

Parent-offspring “regression”

(plot of chunk `plot_galton`)

“Regression”???

This resulted in Galton’s formulation of the Law of Ancestral Heredity, which states that the two parents of an offspring jointly contribute one half of an offspring’s heritage, while more-removed ancestors constitute a smaller proportion of the offspring’s heritage. Galton viewed reversion as a spring that, when stretched, would return the distribution of traits back to the normal distribution. When Mendel’s principles were rediscovered in 1900, this resulted in a fierce battle between the followers of Galton’s Law of Ancestral Heredity, the biometricians, and those who advocated Mendel’s principles.

Covariance and correlation

“Pearson’s product-moment correlation coefficient” (by Bravais): \[ r = \frac{\sum_{i=1}^n (x_i - \bar x) (y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2} \sqrt{\sum_{i=1}^n (y_i - \bar y)^2}} \]
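The formula can be checked numerically against R’s built-in `cor()`; the data below are made up for illustration:

```r
# Pearson correlation "by hand", compared to R's cor()
# (illustrative data, not from the slides)
x <- c(1.2, 2.4, 3.1, 4.8, 5.0)
y <- c(1.0, 2.0, 2.9, 5.2, 4.7)
r_by_hand <- sum((x - mean(x)) * (y - mean(y))) /
    (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
all.equal(r_by_hand, cor(x, y))  # TRUE
```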

> help(cor)

Correlation, Variance and Covariance (Matrices)

Description:

     ‘var’, ‘cov’ and ‘cor’ compute the variance of ‘x’ and the
     covariance or correlation of ‘x’ and ‘y’ if these are vectors.  If
     ‘x’ and ‘y’ are matrices then the covariances (or correlations)
     between the columns of ‘x’ and the columns of ‘y’ are computed.

     ‘cov2cor’ scales a covariance matrix into the corresponding
     correlation matrix _efficiently_.

Usage:

     var(x, y = NULL, na.rm = FALSE, use)
     
     cov(x, y = NULL, use = "everything",
         method = c("pearson", "kendall", "spearman"))
     
     cor(x, y = NULL, use = "everything",
         method = c("pearson", "kendall", "spearman"))
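The `use` argument matters in practice: with the default `use = "everything"`, any missing value makes the result `NA`. A small sketch (made-up data):

```r
# cor() and missing data: the default propagates NAs,
# while use = "complete.obs" drops incomplete pairs
# (illustrative data, not from the slides)
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 5, 7, NA)
cor(x, y)                        # NA, since NAs propagate by default
cor(x, y, use = "complete.obs")  # correlation of the three complete pairs
```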

Covariance and correlation

(plot of chunk `plot_cors`)

Anscombe’s quartet

(plot of chunk `plot_ansc`)

Linear models

\[ \text{(response)} = \text{(intercept)} + \text{(explanatory variables)} + \text{("error")} \] in the general form: \[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i , \] where \(\beta_0, \beta_1, \ldots, \beta_k\) are the parameters of the linear model.

Goal: find \(b_0, \ldots, b_k\) to best fit the model: \[ y_i = b_0 + b_1 x_{i1} + \cdots + b_k x_{ik} + e_i, \] so that \(b_i\) is an estimate of \(\beta_i\) and \(e_i\) is the residual, an estimate of \(\epsilon_i\).

Least-squares fitting of a linear model

Define the predicted values: \[ \hat y_i = b_0 + b_1 x_{i1} + \cdots + b_k x_{ik}, \] and find \(b_0, \ldots, b_k\) to minimize the sum of squared residuals, or \[ \sum_i \left(y_i - \hat y_i\right)^2 . \]

Amazing fact: if \(k=1\) then the slope is \(r\) (scaled): \[b_1 = r \frac{\text{sd}(y)}{\text{sd}(x)} .\]
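This identity is easy to verify numerically; here is a sketch with simulated data (not from the slides):

```r
# Check that with one predictor the least-squares slope
# equals r * sd(y) / sd(x)  (simulated data for illustration)
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
b1 <- cor(x, y) * sd(y) / sd(x)
fit <- lm(y ~ x)
all.equal(b1, unname(coef(fit)["x"]))  # TRUE
```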

Why least squares?

With predicted values \[ \hat y_i = b_0 + b_1 x_{i1} + \cdots + b_k x_{ik}, \] find \(b_0, \ldots, b_k\) to minimize the sum of squared residuals, or \[ \sum_i \left(y_i - \hat y_i\right)^2 . \]

Relationship to likelihood: the Normal distribution.
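To see that "least squares" really is an optimization problem (and the one `lm()` solves), we can minimize the sum of squared residuals directly; a minimal sketch with simulated data (not from the slides):

```r
# Minimizing the sum of squared residuals by brute force,
# compared to lm()  (simulated data for illustration)
set.seed(2)
x <- rnorm(50)
y <- 1 + 3 * x + rnorm(50)
ssq <- function(b) sum((y - (b[1] + b[2] * x))^2)
opt <- optim(c(0, 0), ssq)
opt$par           # (b0, b1) that minimize the sum of squares
coef(lm(y ~ x))   # matches opt$par up to optimizer tolerance
```

Under Normal errors, minimizing this sum of squares is the same as maximizing the likelihood, which is the connection previewed above.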

Examples

Let’s fit some least-squares lines!

source("../Datasets/PanTHERIA/read_pantheria.R")
pantheria <- read_pantheria("../Datasets/PanTHERIA/")
the_lm <- lm(LitterSize ~ TeatNumber, data=pantheria)
summary(the_lm)
## 
## Call:
## lm(formula = LitterSize ~ TeatNumber, data = pantheria)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3575 -0.9730 -0.0226  0.6055  4.8095 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.06443    0.11898  -0.542    0.588    
## TeatNumber   0.53349    0.01897  28.118   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.255 on 521 degrees of freedom
##   (4893 observations deleted due to missingness)
## Multiple R-squared:  0.6028, Adjusted R-squared:  0.602 
## F-statistic: 790.6 on 1 and 521 DF,  p-value: < 2.2e-16
plot(LitterSize ~ TeatNumber, data=pantheria, pch=20, asp=1)
abline(coef(the_lm), col='red')

(plot of chunk `plot_lm`)

Your turn: do some `lm()`! Plot some lines!