Peter Ralph
Advanced Biological Statistics
This resulted in Galton’s formulation of the Law of Ancestral Heredity, which states that the two parents of an offspring jointly contribute one half of the offspring’s heritage, while more-removed ancestors contribute successively smaller proportions. Galton viewed reversion as a spring that, when stretched, would pull the distribution of traits back toward the normal distribution. When Mendel’s principles were rediscovered in 1900, a fierce battle ensued between the followers of Galton’s Law of Ancestral Heredity, the biometricians, and those who advocated Mendel’s principles.
“Pearson’s product-moment correlation coefficient” (by Bravais): \[ r = \frac{\sum_{i=1}^n (x_i - \bar x) (y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2} \sqrt{\sum_{i=1}^n (y_i - \bar y)^2}} \]
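As a sanity check on the formula, here is a minimal R sketch (with made-up vectors x and y) that computes \(r\) directly from the definition and compares it to cor():

x <- c(1.2, 2.3, 3.1, 4.8, 5.0)   # made-up example data
y <- c(2.0, 2.9, 3.9, 5.2, 5.1)

# correlation computed directly from the definition above
r_by_hand <- sum((x - mean(x)) * (y - mean(y))) /
    (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

r_by_hand
cor(x, y)   # should agree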
> help(cor)
Correlation, Variance and Covariance (Matrices)
Description:
‘var’, ‘cov’ and ‘cor’ compute the variance of ‘x’ and the
covariance or correlation of ‘x’ and ‘y’ if these are vectors. If
‘x’ and ‘y’ are matrices then the covariances (or correlations)
between the columns of ‘x’ and the columns of ‘y’ are computed.
‘cov2cor’ scales a covariance matrix into the corresponding
correlation matrix _efficiently_.
Usage:
var(x, y = NULL, na.rm = FALSE, use)
cov(x, y = NULL, use = "everything",
method = c("pearson", "kendall", "spearman"))
cor(x, y = NULL, use = "everything",
method = c("pearson", "kendall", "spearman"))
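For example (a small sketch with a made-up matrix containing a missing value), the use and method arguments control how NAs are handled and which type of correlation is computed:

m <- cbind(a = c(1, 2, NA, 4, 5),    # made-up data with one missing value
           b = c(2, 4, 5, 8, 10))

cor(m, use = "complete.obs")                      # drop incomplete rows
cor(m[, "a"], m[, "b"], use = "complete.obs",
    method = "spearman")                          # rank-based correlation
cov2cor(cov(m, use = "complete.obs"))             # covariance -> correlation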
\[ \text{(response)} = \text{(intercept)} + \text{(explanatory variables)} + \text{("error")} \] in the general form: \[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i , \] where \(\beta_0, \beta_1, \ldots, \beta_k\) are the parameters of the linear model.
Goal: find \(b_0, \ldots, b_k\) to best fit the model: \[ y_i = b_0 + b_1 x_{i1} + \cdots + b_k x_{ik} + e_i, \] so that \(b_i\) is an estimate of \(\beta_i\) and \(e_i\) is the residual, an estimate of \(\epsilon_i\).
Define the predicted values: \[ \hat y_i = b_0 + b_1 x_{i1} + \cdots + b_k x_{ik}, \] and find \(b_0, \ldots, b_k\) to minimize the sum of squared residuals, or \[ \sum_i \left(y_i - \hat y_i\right)^2 . \]
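As a sketch of what this minimization means computationally (using simulated data with invented true coefficients), one can minimize the sum of squared residuals numerically and compare to lm():

set.seed(1)
x1 <- rnorm(100)                 # simulated predictor
y <- 2 + 3 * x1 + rnorm(100)     # invented true model: beta0 = 2, beta1 = 3

# sum of squared residuals as a function of b = (b0, b1)
ssr <- function(b) sum((y - (b[1] + b[2] * x1))^2)

optim(c(0, 0), ssr)$par          # numerical minimizer
coef(lm(y ~ x1))                 # least-squares fit: should (nearly) agree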
Amazing fact: if \(k=1\) then the slope is \(r\) (scaled): \[b_1 = r \frac{\text{sd}(y)}{\text{sd}(x)} .\]
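This is easy to check in R, for instance on simulated (invented) data:

set.seed(2)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)     # invented example relationship

coef(lm(y ~ x))[["x"]]           # the fitted slope
cor(x, y) * sd(y) / sd(x)        # r, rescaled: should match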
Relationship to likelihood: the Normal distribution.
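To spell out the connection: if the errors are independent and Normal, \(\epsilon_i \sim N(0, \sigma^2)\), then the log-likelihood of the data is \[ \log L(\beta, \sigma) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_k x_{ik} \right)^2 , \] so for any fixed \(\sigma\), maximizing the likelihood over the \(\beta\)s is the same as minimizing the sum of squared residuals.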
source("../Datasets/PanTHERIA/read_pantheria.R")
pantheria <- read_pantheria("../Datasets/PanTHERIA/")
the_lm <- lm(LitterSize ~ TeatNumber, data=pantheria)
summary(the_lm)
##
## Call:
## lm(formula = LitterSize ~ TeatNumber, data = pantheria)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3575 -0.9730 -0.0226 0.6055 4.8095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.06443 0.11898 -0.542 0.588
## TeatNumber 0.53349 0.01897 28.118 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.255 on 521 degrees of freedom
## (4893 observations deleted due to missingness)
## Multiple R-squared: 0.6028, Adjusted R-squared: 0.602
## F-statistic: 790.6 on 1 and 521 DF, p-value: < 2.2e-16
Your turn: do some lm()!
Plot some lines!
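One way to start (a minimal sketch, reusing the pantheria data and the_lm fit from above):

plot(LitterSize ~ TeatNumber, data = pantheria,
     pch = 20, col = adjustcolor("black", 0.3))   # scatterplot of the data
abline(the_lm, col = "red", lwd = 2)              # add the fitted line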