Let’s recreate Galton’s classic analysis: midparent height, adjusted for gender, is a good predictor of child height. (How good?)
Link to the data.
galton <- read.table("../Datasets/galton/galton-all.tsv", header=TRUE)
head(galton)
## family father mother gender height kids male female
## 1 1 78.5 67.0 M 73.2 4 1 0
## 2 1 78.5 67.0 F 69.2 4 0 1
## 3 1 78.5 67.0 F 69.0 4 0 1
## 4 1 78.5 67.0 F 69.0 4 0 1
## 5 2 75.5 66.5 M 73.5 4 1 0
## 6 2 75.5 66.5 M 72.5 4 1 0
Here are the heights of children:
hist(galton$height, breaks=30, xlab='height', main='')
Now let’s look at heights by sex:
layout((1:2))
hist(galton$height[galton$gender=="M"], breaks=20, xlab='height', main='', xlim=range(galton$height))
hist(galton$height[galton$gender=="F"], breaks=20, xlab='height', main='', xlim=range(galton$height))
Also, we should look at parent and child height relationships: red is male, black is female; circles are fathers, squares are mothers.
plot(father ~ height, data=galton, col=gender, ylim=range(father, mother), ylab='parent heights', pch=20, xlab='child height', asp=1)
points(mother ~ height, data=galton, col=gender, pch=22)
Is there a correlation between parents’ heights?
plot(jitter(father) ~ jitter(mother), data=galton, asp=1, col=adjustcolor("black", 0.5), pch=20)
Not obviously.
To adjust by sex, we need to, separately for parents and children,
parent_means <- c('father' = mean(tapply(galton$father, galton$family, unique)),
'mother' = mean(tapply(galton$mother, galton$family, unique)))
child_means <- c('F' = mean(subset(galton, gender=="F")$height),
'M' = mean(subset(galton, gender=="M")$height))
galton$adj_father <- galton$father + diff(parent_means)
galton$adj_height <- galton$height - ifelse(galton$gender == "F", 0, abs(diff(child_means)))
Now the histograms line up:
layout((1:2))
hist(galton$adj_height[galton$gender=="M"], breaks=20, xlab='height', main='', xlim=range(galton$adj_height))
hist(galton$adj_height[galton$gender=="F"], breaks=20, xlab='height', main='', xlim=range(galton$adj_height))
The midparent height is the average of the adjusted father height and the mother height.
galton$midparent <- (galton$adj_father + galton$mother)/2
the_lm <- lm(adj_height ~ midparent, data=galton)
summary(the_lm)
##
## Call:
## lm(formula = adj_height ~ midparent, data = galton)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4785 -1.4561 0.0934 1.4471 9.1557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.30909 2.63234 6.576 8.22e-11 ***
## midparent 0.73154 0.04113 17.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.157 on 896 degrees of freedom
## Multiple R-squared: 0.2609, Adjusted R-squared: 0.2601
## F-statistic: 316.3 on 1 and 896 DF, p-value: < 2.2e-16
Diagnostics: first, residuals versus fitted:
plot(fitted(the_lm), resid(the_lm), xlab='fitted values',
ylab='residuals')
abline(h=0, lty=3)
Look at residual distribution:
qqnorm(resid(the_lm))
qqline(resid(the_lm))
It doesn’t look like we’re missing anything obvious:
summary(lm(adj_height ~ midparent + I(father - mother) + kids + gender, data=galton))
##
## Call:
## lm(formula = adj_height ~ midparent + I(father - mother) + kids +
## gender, data = galton)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4748 -1.4500 0.0889 1.4716 9.1656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.11712 2.68299 6.753 2.61e-11 ***
## midparent 0.71926 0.04149 17.335 < 2e-16 ***
## I(father - mother) 0.03867 0.02225 1.738 0.0825 .
## kids -0.04382 0.02718 -1.612 0.1073
## genderM 0.09129 0.14422 0.633 0.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.152 on 893 degrees of freedom
## Multiple R-squared: 0.2665, Adjusted R-squared: 0.2632
## F-statistic: 81.13 on 4 and 893 DF, p-value: < 2.2e-16
anova(lm(adj_height ~ midparent, data=galton),
lm(adj_height ~ midparent + I(father - mother) + kids + gender, data=galton))
## Analysis of Variance Table
##
## Model 1: adj_height ~ midparent
## Model 2: adj_height ~ midparent + I(father - mother) + kids + gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 896 4168.7
## 2 893 4137.1 3 31.581 2.2722 0.07874 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1