Let’s recreate Galton’s classic analysis: midparent height, adjusted for gender, is a good predictor of child height. (How good?)

Link to the data.

galton <- read.table("../Datasets/galton/galton-all.tsv", header=TRUE)
head(galton)
##   family father mother gender height kids male female
## 1      1   78.5   67.0      M   73.2    4    1      0
## 2      1   78.5   67.0      F   69.2    4    0      1
## 3      1   78.5   67.0      F   69.0    4    0      1
## 4      1   78.5   67.0      F   69.0    4    0      1
## 5      2   75.5   66.5      M   73.5    4    1      0
## 6      2   75.5   66.5      M   72.5    4    1      0

Look at the data

Here are the heights of children:

hist(galton$height, breaks=30, xlab='height', main='')

Now let’s look at heights by sex:

layout((1:2))
hist(galton$height[galton$gender=="M"], breaks=20, xlab='height', main='', xlim=range(galton$height))
hist(galton$height[galton$gender=="F"], breaks=20, xlab='height', main='', xlim=range(galton$height))

Also, we should look at parent and child height relationships: red is male, black is female; circles are fathers, squares are mothers.

plot(father ~ height, data=galton, col=gender, ylim=range(father, mother), ylab='parent heights', pch=20, xlab='child height', asp=1)
points(mother ~ height, data=galton, col=gender, pch=22)

Is there a correlation between parents’ heights?

plot(jitter(father) ~ jitter(mother), data=galton, asp=1, col=adjustcolor("black", 0.5), pch=20)

Not obviously.

Adjust by gender

To adjust by sex, we need to, separately for parents and children,

parent_means <- c('father' = mean(tapply(galton$father, galton$family, unique)),
                  'mother' = mean(tapply(galton$mother, galton$family, unique)))
child_means <- c('F' = mean(subset(galton, gender=="F")$height),
                 'M' = mean(subset(galton, gender=="M")$height))
galton$adj_father <- galton$father + diff(parent_means)
galton$adj_height <- galton$height - ifelse(galton$gender == "F", 0, abs(diff(child_means)))

Now the histograms line up:

layout((1:2))
hist(galton$adj_height[galton$gender=="M"], breaks=20, xlab='height', main='', xlim=range(galton$adj_height))
hist(galton$adj_height[galton$gender=="F"], breaks=20, xlab='height', main='', xlim=range(galton$adj_height))

Compute midparent height

The midparent height is the average of the adjusted father height and the mother height.

galton$midparent <- (galton$adj_father + galton$mother)/2

Fit a model

the_lm <- lm(adj_height ~ midparent, data=galton)
summary(the_lm)
## 
## Call:
## lm(formula = adj_height ~ midparent, data = galton)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4785 -1.4561  0.0934  1.4471  9.1557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.30909    2.63234   6.576 8.22e-11 ***
## midparent    0.73154    0.04113  17.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.157 on 896 degrees of freedom
## Multiple R-squared:  0.2609, Adjusted R-squared:  0.2601 
## F-statistic: 316.3 on 1 and 896 DF,  p-value: < 2.2e-16

Diagnostics: first, residuals versus fitted:

plot(fitted(the_lm), resid(the_lm), xlab='fitted values',
     ylab='residuals')
abline(h=0, lty=3)

Look at residual distribution:

qqnorm(resid(the_lm))
qqline(resid(the_lm))

Look for residual signal

It doesn’t look like we’re missing anything obvious:

summary(lm(adj_height ~ midparent + I(father - mother) + kids + gender, data=galton))
## 
## Call:
## lm(formula = adj_height ~ midparent + I(father - mother) + kids + 
##     gender, data = galton)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4748 -1.4500  0.0889  1.4716  9.1656 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        18.11712    2.68299   6.753 2.61e-11 ***
## midparent           0.71926    0.04149  17.335  < 2e-16 ***
## I(father - mother)  0.03867    0.02225   1.738   0.0825 .  
## kids               -0.04382    0.02718  -1.612   0.1073    
## genderM             0.09129    0.14422   0.633   0.5269    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.152 on 893 degrees of freedom
## Multiple R-squared:  0.2665, Adjusted R-squared:  0.2632 
## F-statistic: 81.13 on 4 and 893 DF,  p-value: < 2.2e-16
anova(lm(adj_height ~ midparent, data=galton),
      lm(adj_height ~ midparent + I(father - mother) + kids + gender, data=galton))
## Analysis of Variance Table
## 
## Model 1: adj_height ~ midparent
## Model 2: adj_height ~ midparent + I(father - mother) + kids + gender
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    896 4168.7                              
## 2    893 4137.1  3    31.581 2.2722 0.07874 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1