Peter Ralph
22 October – Advanced Biological Statistics
Linear models…
Say we have \(n\) observations coming from combinations of two factors, so that the \(k\)th observation in the \(i\)th group of factor \(A\) and the \(j\)th group of factor \(B\) is \[\begin{equation} X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk} , \end{equation}\] where \(\mu\) is the overall mean, \(\alpha_i\) is the effect of the \(i\)th group of factor \(A\), \(\beta_j\) is the effect of the \(j\)th group of factor \(B\), \(\gamma_{ij}\) is the interaction effect of that combination of groups, and \(\epsilon_{ijk}\) is the residual ("noise").
In words, \[\begin{equation} \begin{split} \text{(value)} &= \text{(overall mean)} + \text{(A group mean)} \\ &\qquad {} + \text{(B group mean)} + \text{(AB group mean)} + \text{(residual)} \end{split}\end{equation}\]
We’re looking at how mean pumpkin weight depends on both fertilizer level and late watering.
So, we draw the pictures.
Ignore any “plant” and “plot” effects
(e.g., suppose there is only one pumpkin per vine and one plot per combination of conditions).
Say that \(i=1, 2, 3\) indexes fertilizer levels (low to high), and \(j=1, 2\) indexes late watering (no or yes), and \[\begin{equation}\begin{split} X_{ijk} &= \text{(weight of $k$th pumpkin in plot with conditions $i$, $j$)} \\ &= \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk} , \end{split}\end{equation}\] where \(\mu\) is the overall mean weight, \(\alpha_i\) is the effect of fertilizer level \(i\), \(\beta_j\) is the effect of watering treatment \(j\), \(\gamma_{ij}\) is their interaction, and \(\epsilon_{ijk}\) is the residual.
A good way to get a concrete understanding of something is to simulate it:
write code that generates a (random) dataset designed to look, more or less, like what you expect the real data to look like.
This lets you explore statistical power, choose sample sizes, etcetera… but it also makes you notice things you hadn’t thought of before.
head( expand.grid(
        fertilizer=c("low", "medium", "high"),
        water=c("no water", "water"),
        plot=1:4,
        plant=1:5,
        weight=NA) )
## fertilizer water plot plant weight
## 1 low no water 1 1 NA
## 2 medium no water 1 1 NA
## 3 high no water 1 1 NA
## 4 low water 1 1 NA
## 5 medium water 1 1 NA
## 6 high water 1 1 NA
IN CLASS:
# true values for means
mu <- 10                              # overall mean weight, in pounds
alpha <- c(high=5, medium=0, low=-5)  # fertilizer effects
beta <- c("no water"=-3, water=3)     # watering effects
gamma <- c("low.no water" = 0,        # interaction effects
           "medium.no water" = 0,
           "high.no water" = 0,
           "low.water" = 0,
           "medium.water" = -3,
           "high.water" = -6)
sigma <- 0.5                          # residual standard deviation

# one row per pumpkin: 3 fertilizer levels x 2 watering treatments x 4 plots x 5 plants
pumpkins <- expand.grid(
        fertilizer=c("low", "medium", "high"),
        water=c("no water", "water"),
        plot=1:4,
        plant=1:5,
        weight=NA)

# each pumpkin's expected weight under the model
pumpkins$mean_weight <- (mu + alpha[as.character(pumpkins$fertilizer)]
                         + beta[as.character(pumpkins$water)]
                         + gamma[paste(as.character(pumpkins$fertilizer),
                                       as.character(pumpkins$water), sep=".")])

# simulate observed weights: Normal about the mean, with sd sigma
pumpkins$weight <- rnorm(nrow(pumpkins),
                         mean=pumpkins$mean_weight,
                         sd=sigma)

write.table(pumpkins, file="data/pumpkins.tsv")
IN CLASS:
The simulated dataset is available at data/pumpkins.tsv.
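To follow along, something like the following reads the simulated table back in and fits the additive model summarized below (the read.table call is an assumption; the lm formula matches the Call: line in the output):

pumpkins <- read.table("data/pumpkins.tsv", header=TRUE)  # path from above
summary(lm(weight ~ fertilizer + water, data=pumpkins))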
##
## Call:
## lm(formula = weight ~ fertilizer + water, data = pumpkins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77906 -1.15970 -0.01605 1.17450 2.32947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4415 0.2455 14.02 <2e-16 ***
## fertilizermedium 3.5873 0.3007 11.93 <2e-16 ***
## fertilizerhigh 7.0027 0.3007 23.29 <2e-16 ***
## waterwater 3.1258 0.2455 12.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.345 on 116 degrees of freedom
## Multiple R-squared: 0.8586, Adjusted R-squared: 0.855
## F-statistic: 234.9 on 3 and 116 DF, p-value: < 2.2e-16
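The ANOVA table below was presumably produced by summarizing an aov() fit of the same additive model:

summary(aov(weight ~ fertilizer + water, data=pumpkins))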
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 981.0 490.5 271.3 <2e-16 ***
## water 1 293.1 293.1 162.1 <2e-16 ***
## Residuals 116 209.8 1.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Or equivalently,
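presumably via anova() on the lm() fit:

anova(lm(weight ~ fertilizer + water, data=pumpkins))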
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 980.95 490.48 271.25 < 2.2e-16 ***
## water 1 293.11 293.11 162.10 < 2.2e-16 ***
## Residuals 116 209.75 1.81
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
John Tukey has a method for that.
> ?TukeyHSD
TukeyHSD package:stats R Documentation
Compute Tukey Honest Significant Differences
Description:
Create a set of confidence intervals on the differences between
the means of the levels of a factor with the specified family-wise
probability of coverage. The intervals are based on the
Studentized range statistic, Tukey's ‘Honest Significant
Difference’ method.
...
When comparing the means for the levels of a factor in an analysis
of variance, a simple comparison using t-tests will inflate the
probability of declaring a significant difference when it is not
in fact present. This because the intervals are calculated with a
given coverage probability for each interval but the
interpretation of the coverage is usually with respect to the
entire family of intervals.
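The Fit: line in the output below indicates it came from something like:

TukeyHSD(aov(weight ~ fertilizer + water, data=pumpkins))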
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight ~ fertilizer + water, data = pumpkins)
##
## $fertilizer
## diff lwr upr p adj
## medium-low 3.587309 2.873438 4.301180 0
## high-low 7.002706 6.288835 7.716576 0
## high-medium 3.415396 2.701525 4.129267 0
##
## $water
## diff lwr upr p adj
## water-no water 3.125756 2.639501 3.612011 0
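Next the model is refit with an interaction term; the ANOVA table below presumably comes from:

summary(aov(weight ~ fertilizer * water, data=pumpkins))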
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 981.0 490.5 2195.0 <2e-16 ***
## water 1 293.1 293.1 1311.7 <2e-16 ***
## fertilizer:water 2 184.3 92.1 412.3 <2e-16 ***
## Residuals 114 25.5 0.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
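and the coefficient table below from summary() of the corresponding lm() fit (as its Call: line shows):

summary(lm(weight ~ fertilizer * water, data=pumpkins))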
##
## Call:
## lm(formula = weight ~ fertilizer * water, data = pumpkins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.23696 -0.26951 0.00257 0.32312 0.85524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9672 0.1057 18.61 <2e-16 ***
## fertilizermedium 4.9780 0.1495 33.30 <2e-16 ***
## fertilizerhigh 10.0347 0.1495 67.13 <2e-16 ***
## waterwater 6.0742 0.1495 40.63 <2e-16 ***
## fertilizermedium:waterwater -2.7814 0.2114 -13.16 <2e-16 ***
## fertilizerhigh:waterwater -6.0640 0.2114 -28.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4727 on 114 degrees of freedom
## Multiple R-squared: 0.9828, Adjusted R-squared: 0.9821
## F-statistic: 1305 on 5 and 114 DF, p-value: < 2.2e-16
Or equivalently,
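presumably via:

anova(lm(weight ~ fertilizer * water, data=pumpkins))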
## Analysis of Variance Table
##
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 980.95 490.48 2194.98 < 2.2e-16 ***
## water 1 293.11 293.11 1311.73 < 2.2e-16 ***
## fertilizer:water 2 184.28 92.14 412.34 < 2.2e-16 ***
## Residuals 114 25.47 0.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Me: Hey, I made our model more complicated, and look, it fits better!
You: Yeah, of course it does. How much better?
Me: How can we tell?
You: Well, does it reduce the residual variance more than you’d expect by chance?
To compare two models, \[\begin{aligned} F &= \frac{\text{(explained variance)}}{\text{(residual variance)}} \\ &= \frac{\text{(mean square model)}}{\text{(mean square residual)}} \\ &= \frac{\frac{\text{RSS}_1 - \text{RSS}_2}{p_2-p_1}}{\frac{\text{RSS}_2}{n-p_2}} \end{aligned}\]
anova(
lm(weight ~ water, data=pumpkins),
lm(weight ~ fertilizer + water, data=pumpkins),
lm(weight ~ fertilizer * water, data=pumpkins)
)
## Analysis of Variance Table
##
## Model 1: weight ~ water
## Model 2: weight ~ fertilizer + water
## Model 3: weight ~ fertilizer * water
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 118 1190.70
## 2 116 209.75 2 980.95 2194.98 < 2.2e-16 ***
## 3 114 25.47 2 184.28 412.34 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Do a stepwise model comparison of nested linear models, including plant and plot in the analysis. Think about what order to do the comparisons in, and make sure the models are nested! (A sketch of the mechanics follows the data link below.)
Data: data/pumpkins.tsv
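A minimal sketch of the mechanics only (choosing and ordering the models is the point of the exercise); plot and plant are stored as numbers in the simulated data, so they are converted to factors first, and the particular nesting shown is just one illustrative possibility:

pumpkins$plot <- factor(pumpkins$plot)    # treat plot as a grouping, not a number
pumpkins$plant <- factor(pumpkins$plant)  # likewise for plant
anova(
    lm(weight ~ fertilizer * water, data=pumpkins),
    lm(weight ~ fertilizer * water + plot, data=pumpkins)  # adds plot: still nested
)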
weight ~ fertilizer + water
is read something like
mean weight is determined by additive effects of fertilizer and of water
\[\begin{equation}\begin{split} y &\sim x \qquad \text{means} \\ y &= a + b x + \text{(mean-zero noise)} . \end{split}\end{equation}\]
The intercept is included implicitly, so these are equivalent:
weight ~ fertilizer + water
weight ~ 1 + fertilizer + water
… so if you don’t want an intercept, do
weight ~ 0 + fertilizer + water
To assign an effect to each element of a crossed design, use :, e.g.
weight ~ fertilizer + water + fertilizer:water
which is the same as
weight ~ fertilizer * water
since lower-order effects are included implicitly.
~ : depends on
+ : and also, independently
: : in combination with
* : and also
I(x+y) : actually x plus y
I(x^2) : actually x squared

Trickier things:

1 : an intercept
0 : but not an intercept
- : but not
. : all columns not otherwise in the formula
x/y : x, and y nested within x (same as x + x:y)

If you want to know what a formula is really doing, look at its model.matrix( ), whose columns correspond to the coefficients of the resulting model.
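For instance, the first matrix below (an intercept plus two fertilizer dummy columns) is what something like this produces:

model.matrix(weight ~ fertilizer, data=pumpkins)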
## (Intercept) fertilizermedium fertilizerhigh
## 1 1 0 0
## 2 1 1 0
## 3 1 0 1
## 4 1 0 0
## 5 1 1 0
## 6 1 0 1
## 7 1 0 0
## 8 1 1 0
## 9 1 0 1
## 10 1 0 0
## 11 1 1 0
## 12 1 0 1
## 13 1 0 0
## 14 1 1 0
## 15 1 0 1
## 16 1 0 0
## 17 1 1 0
## 18 1 0 1
## 19 1 0 0
## 20 1 1 0
## 21 1 0 1
## 22 1 0 0
## 23 1 1 0
## 24 1 0 1
## 25 1 0 0
## 26 1 1 0
## 27 1 0 1
## 28 1 0 0
## 29 1 1 0
## 30 1 0 1
## 31 1 0 0
## 32 1 1 0
## 33 1 0 1
## 34 1 0 0
## 35 1 1 0
## 36 1 0 1
## 37 1 0 0
## 38 1 1 0
## 39 1 0 1
## 40 1 0 0
## 41 1 1 0
## 42 1 0 1
## 43 1 0 0
## 44 1 1 0
## 45 1 0 1
## 46 1 0 0
## 47 1 1 0
## 48 1 0 1
## 49 1 0 0
## 50 1 1 0
## 51 1 0 1
## 52 1 0 0
## 53 1 1 0
## 54 1 0 1
## 55 1 0 0
## 56 1 1 0
## 57 1 0 1
## 58 1 0 0
## 59 1 1 0
## 60 1 0 1
## 61 1 0 0
## 62 1 1 0
## 63 1 0 1
## 64 1 0 0
## 65 1 1 0
## 66 1 0 1
## 67 1 0 0
## 68 1 1 0
## 69 1 0 1
## 70 1 0 0
## 71 1 1 0
## 72 1 0 1
## 73 1 0 0
## 74 1 1 0
## 75 1 0 1
## 76 1 0 0
## 77 1 1 0
## 78 1 0 1
## 79 1 0 0
## 80 1 1 0
## 81 1 0 1
## 82 1 0 0
## 83 1 1 0
## 84 1 0 1
## 85 1 0 0
## 86 1 1 0
## 87 1 0 1
## 88 1 0 0
## 89 1 1 0
## 90 1 0 1
## 91 1 0 0
## 92 1 1 0
## 93 1 0 1
## 94 1 0 0
## 95 1 1 0
## 96 1 0 1
## 97 1 0 0
## 98 1 1 0
## 99 1 0 1
## 100 1 0 0
## 101 1 1 0
## 102 1 0 1
## 103 1 0 0
## 104 1 1 0
## 105 1 0 1
## 106 1 0 0
## 107 1 1 0
## 108 1 0 1
## 109 1 0 0
## 110 1 1 0
## 111 1 0 1
## 112 1 0 0
## 113 1 1 0
## 114 1 0 1
## 115 1 0 0
## 116 1 1 0
## 117 1 0 1
## 118 1 0 0
## 119 1 1 0
## 120 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$fertilizer
## [1] "contr.treatment"
## (Intercept) fertilizermedium fertilizerhigh waterwater
## 1 1 0 0 0
## 2 1 1 0 0
## 3 1 0 1 0
## 4 1 0 0 1
## 5 1 1 0 1
## 6 1 0 1 1
## 7 1 0 0 0
## 8 1 1 0 0
## 9 1 0 1 0
## 10 1 0 0 1
## 11 1 1 0 1
## 12 1 0 1 1
## 13 1 0 0 0
## 14 1 1 0 0
## 15 1 0 1 0
## 16 1 0 0 1
## 17 1 1 0 1
## 18 1 0 1 1
## 19 1 0 0 0
## 20 1 1 0 0
## 21 1 0 1 0
## 22 1 0 0 1
## 23 1 1 0 1
## 24 1 0 1 1
## 25 1 0 0 0
## 26 1 1 0 0
## 27 1 0 1 0
## 28 1 0 0 1
## 29 1 1 0 1
## 30 1 0 1 1
## 31 1 0 0 0
## 32 1 1 0 0
## 33 1 0 1 0
## 34 1 0 0 1
## 35 1 1 0 1
## 36 1 0 1 1
## 37 1 0 0 0
## 38 1 1 0 0
## 39 1 0 1 0
## 40 1 0 0 1
## 41 1 1 0 1
## 42 1 0 1 1
## 43 1 0 0 0
## 44 1 1 0 0
## 45 1 0 1 0
## 46 1 0 0 1
## 47 1 1 0 1
## 48 1 0 1 1
## 49 1 0 0 0
## 50 1 1 0 0
## 51 1 0 1 0
## 52 1 0 0 1
## 53 1 1 0 1
## 54 1 0 1 1
## 55 1 0 0 0
## 56 1 1 0 0
## 57 1 0 1 0
## 58 1 0 0 1
## 59 1 1 0 1
## 60 1 0 1 1
## 61 1 0 0 0
## 62 1 1 0 0
## 63 1 0 1 0
## 64 1 0 0 1
## 65 1 1 0 1
## 66 1 0 1 1
## 67 1 0 0 0
## 68 1 1 0 0
## 69 1 0 1 0
## 70 1 0 0 1
## 71 1 1 0 1
## 72 1 0 1 1
## 73 1 0 0 0
## 74 1 1 0 0
## 75 1 0 1 0
## 76 1 0 0 1
## 77 1 1 0 1
## 78 1 0 1 1
## 79 1 0 0 0
## 80 1 1 0 0
## 81 1 0 1 0
## 82 1 0 0 1
## 83 1 1 0 1
## 84 1 0 1 1
## 85 1 0 0 0
## 86 1 1 0 0
## 87 1 0 1 0
## 88 1 0 0 1
## 89 1 1 0 1
## 90 1 0 1 1
## 91 1 0 0 0
## 92 1 1 0 0
## 93 1 0 1 0
## 94 1 0 0 1
## 95 1 1 0 1
## 96 1 0 1 1
## 97 1 0 0 0
## 98 1 1 0 0
## 99 1 0 1 0
## 100 1 0 0 1
## 101 1 1 0 1
## 102 1 0 1 1
## 103 1 0 0 0
## 104 1 1 0 0
## 105 1 0 1 0
## 106 1 0 0 1
## 107 1 1 0 1
## 108 1 0 1 1
## 109 1 0 0 0
## 110 1 1 0 0
## 111 1 0 1 0
## 112 1 0 0 1
## 113 1 1 0 1
## 114 1 0 1 1
## 115 1 0 0 0
## 116 1 1 0 0
## 117 1 0 1 0
## 118 1 0 0 1
## 119 1 1 0 1
## 120 1 0 1 1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$fertilizer
## [1] "contr.treatment"
##
## attr(,"contrasts")$water
## [1] "contr.treatment"
## fertilizerlow fertilizermedium fertilizerhigh waterwater fertilizermedium:waterwater fertilizerhigh:waterwater
## 1 1 0 0 0 0 0
## 2 0 1 0 0 0 0
## 3 0 0 1 0 0 0
## 4 1 0 0 1 0 0
## 5 0 1 0 1 1 0
## 6 0 0 1 1 0 1
## 7 1 0 0 0 0 0
## 8 0 1 0 0 0 0
## 9 0 0 1 0 0 0
## 10 1 0 0 1 0 0
## 11 0 1 0 1 1 0
## 12 0 0 1 1 0 1
## 13 1 0 0 0 0 0
## 14 0 1 0 0 0 0
## 15 0 0 1 0 0 0
## 16 1 0 0 1 0 0
## 17 0 1 0 1 1 0
## 18 0 0 1 1 0 1
## 19 1 0 0 0 0 0
## 20 0 1 0 0 0 0
## 21 0 0 1 0 0 0
## 22 1 0 0 1 0 0
## 23 0 1 0 1 1 0
## 24 0 0 1 1 0 1
## 25 1 0 0 0 0 0
## 26 0 1 0 0 0 0
## 27 0 0 1 0 0 0
## 28 1 0 0 1 0 0
## 29 0 1 0 1 1 0
## 30 0 0 1 1 0 1
## 31 1 0 0 0 0 0
## 32 0 1 0 0 0 0
## 33 0 0 1 0 0 0
## 34 1 0 0 1 0 0
## 35 0 1 0 1 1 0
## 36 0 0 1 1 0 1
## 37 1 0 0 0 0 0
## 38 0 1 0 0 0 0
## 39 0 0 1 0 0 0
## 40 1 0 0 1 0 0
## 41 0 1 0 1 1 0
## 42 0 0 1 1 0 1
## 43 1 0 0 0 0 0
## 44 0 1 0 0 0 0
## 45 0 0 1 0 0 0
## 46 1 0 0 1 0 0
## 47 0 1 0 1 1 0
## 48 0 0 1 1 0 1
## 49 1 0 0 0 0 0
## 50 0 1 0 0 0 0
## 51 0 0 1 0 0 0
## 52 1 0 0 1 0 0
## 53 0 1 0 1 1 0
## 54 0 0 1 1 0 1
## 55 1 0 0 0 0 0
## 56 0 1 0 0 0 0
## 57 0 0 1 0 0 0
## 58 1 0 0 1 0 0
## 59 0 1 0 1 1 0
## 60 0 0 1 1 0 1
## 61 1 0 0 0 0 0
## 62 0 1 0 0 0 0
## 63 0 0 1 0 0 0
## 64 1 0 0 1 0 0
## 65 0 1 0 1 1 0
## 66 0 0 1 1 0 1
## 67 1 0 0 0 0 0
## 68 0 1 0 0 0 0
## 69 0 0 1 0 0 0
## 70 1 0 0 1 0 0
## 71 0 1 0 1 1 0
## 72 0 0 1 1 0 1
## 73 1 0 0 0 0 0
## 74 0 1 0 0 0 0
## 75 0 0 1 0 0 0
## 76 1 0 0 1 0 0
## 77 0 1 0 1 1 0
## 78 0 0 1 1 0 1
## 79 1 0 0 0 0 0
## 80 0 1 0 0 0 0
## 81 0 0 1 0 0 0
## 82 1 0 0 1 0 0
## 83 0 1 0 1 1 0
## 84 0 0 1 1 0 1
## 85 1 0 0 0 0 0
## 86 0 1 0 0 0 0
## 87 0 0 1 0 0 0
## 88 1 0 0 1 0 0
## 89 0 1 0 1 1 0
## 90 0 0 1 1 0 1
## 91 1 0 0 0 0 0
## 92 0 1 0 0 0 0
## 93 0 0 1 0 0 0
## 94 1 0 0 1 0 0
## 95 0 1 0 1 1 0
## 96 0 0 1 1 0 1
## 97 1 0 0 0 0 0
## 98 0 1 0 0 0 0
## 99 0 0 1 0 0 0
## 100 1 0 0 1 0 0
## 101 0 1 0 1 1 0
## 102 0 0 1 1 0 1
## 103 1 0 0 0 0 0
## 104 0 1 0 0 0 0
## 105 0 0 1 0 0 0
## 106 1 0 0 1 0 0
## 107 0 1 0 1 1 0
## 108 0 0 1 1 0 1
## 109 1 0 0 0 0 0
## 110 0 1 0 0 0 0
## 111 0 0 1 0 0 0
## 112 1 0 0 1 0 0
## 113 0 1 0 1 1 0
## 114 0 0 1 1 0 1
## 115 1 0 0 0 0 0
## 116 0 1 0 0 0 0
## 117 0 0 1 0 0 0
## 118 1 0 0 1 0 0
## 119 0 1 0 1 1 0
## 120 0 0 1 1 0 1
## attr(,"assign")
## [1] 1 1 1 2 3 3
## attr(,"contrasts")
## attr(,"contrasts")$fertilizer
## [1] "contr.treatment"
##
## attr(,"contrasts")$water
## [1] "contr.treatment"
For fine control of factors in linear models, look into setting the contrasts.

Make formulas that give you estimates of:
A global mean (\(\mu\)), two fertilizer effects (\(\alpha_\text{medium}\) and \(\alpha_\text{high}\)), and one water effect (\(\beta_\text{water}\)).
Three fertilizer effects (\(\alpha_\text{low}\), \(\alpha_\text{medium}\) and \(\alpha_\text{high}\)), and one water effect (\(\beta_\text{water}\)).
Two fertilizer effects (\(\alpha_\text{medium}\) and \(\alpha_\text{high}\)), and two water effects (\(\beta_\text{no water}\) and \(\beta_\text{water}\)).
A single mean for each of the six conditions (\(\gamma_\text{high, water}\), \(\gamma_\text{medium, water}\), \(\gamma_\text{low, water}\), \(\gamma_\text{high, no water}\), \(\gamma_\text{medium, no water}\), \(\gamma_\text{low, no water}\)).
Example:
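The output below matches the first of these (a global mean, two fertilizer effects, and one water effect), as its Call: line confirms:

summary(lm(weight ~ fertilizer + water, data=pumpkins))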
##
## Call:
## lm(formula = weight ~ fertilizer + water, data = pumpkins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77906 -1.15970 -0.01605 1.17450 2.32947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4415 0.2455 14.02 <2e-16 ***
## fertilizermedium 3.5873 0.3007 11.93 <2e-16 ***
## fertilizerhigh 7.0027 0.3007 23.29 <2e-16 ***
## waterwater 3.1258 0.2455 12.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.345 on 116 degrees of freedom
## Multiple R-squared: 0.8586, Adjusted R-squared: 0.855
## F-statistic: 234.9 on 3 and 116 DF, p-value: < 2.2e-16
This resulted in Galton’s formulation of the Law of Ancestral Heredity, which states that the two parents of an offspring jointly contribute one half of an offspring’s heritage, while more-removed ancestors constitute a smaller proportion of the offspring’s heritage. Galton viewed reversion as a spring, that when stretched, would return the distribution of traits back to the normal distribution. When Mendel’s principles were rediscovered in 1900, this resulted in a fierce battle between the followers of Galton’s Law of Ancestral Heredity, the biometricians, and those who advocated Mendel’s principles.
Pearson’s product-moment correlation coefficient: \[ r = \frac{\sum_{i=1}^n (x_i - \bar x) (y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2} \sqrt{\sum_{i=1}^n (y_i - \bar y)^2}} \]
> help(cor)
Correlation, Variance and Covariance (Matrices)
Description:
‘var’, ‘cov’ and ‘cor’ compute the variance of ‘x’ and the
covariance or correlation of ‘x’ and ‘y’ if these are vectors. If
‘x’ and ‘y’ are matrices then the covariances (or correlations)
between the columns of ‘x’ and the columns of ‘y’ are computed.
‘cov2cor’ scales a covariance matrix into the corresponding
correlation matrix _efficiently_.
Usage:
var(x, y = NULL, na.rm = FALSE, use)
cov(x, y = NULL, use = "everything",
method = c("pearson", "kendall", "spearman"))
cor(x, y = NULL, use = "everything",
method = c("pearson", "kendall", "spearman"))
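To connect cor() to the formula above, here is a small check with made-up vectors (x and y are placeholders):

x <- c(1, 3, 4, 6, 8)
y <- c(2, 3, 7, 8, 12)
# r computed directly from the formula
sum((x - mean(x)) * (y - mean(y))) /
    (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
cor(x, y)  # should agree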
\[ \text{(response)} = \text{(intercept)} + \text{(explanatory variables)} + \text{("error")} \] in the general form: \[ y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i , \] where \(\beta_0, \beta_1, \ldots, \beta_k\) are the parameters of the linear model.
Goal: find \(b_0, \ldots, b_k\) to best fit the model: \[ y_i = b_0 + b_1 x_{i1} + \cdots + b_k x_{ik} + e_i, \] so that \(b_i\) is an estimate of \(\beta_i\) and \(e_i\) is the residual, an estimate of \(\epsilon_i\).
Define the predicted values: \[ \hat y_i = b_0 + b_1 x_{i1} + \cdots + b_k x_{ik}, \] and find \(b_0, \ldots, b_k\) to minimize the sum of squared residuals, or \[ \sum_i \left(y_i - \hat y_i\right)^2 . \]
Amazing fact: if \(k=1\) then \[b_1 = r \frac{\text{sd}(y)}{\text{sd}(x)} .\]
Relationship to likelihood: the Normal distribution.
Let’s recreate Galton’s classic analysis: midparent height, adjusted for gender, is a good predictor of child height. (How good?)
Link to the data.
## family father mother gender height kids male female
## 1 1 78.5 67.0 M 73.2 4 1 0
## 2 1 78.5 67.0 F 69.2 4 0 1
## 3 1 78.5 67.0 F 69.0 4 0 1
## 4 1 78.5 67.0 F 69.0 4 0 1
## 5 2 75.5 66.5 M 73.5 4 1 0
## 6 2 75.5 66.5 M 72.5 4 1 0
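As a quick check of the “amazing fact” above on these data (assuming the data frame printed above is called galton), the one-predictor regression slope should equal \(r \, \text{sd}(y)/\text{sd}(x)\):

# slope of child height on father's height
coef(lm(height ~ father, data=galton))["father"]
# r * sd(y) / sd(x), which should match
with(galton, cor(father, height) * sd(height) / sd(father))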