Peter Ralph
Advanced Biological Statistics
HairEyeColor package:datasets R Documentation
Hair and Eye Color of Statistics Students
Description:
Distribution of hair and eye color and sex in 592 statistics
students.
Usage:
HairEyeColor
Format:
A 3-dimensional array resulting from cross-tabulating 592
observations on 3 variables. The variables and their levels are
as follows:
  No  Name  Levels
   1  Hair  Black, Brown, Red, Blond
   2  Eye   Brown, Blue, Hazel, Green
   3  Sex   Male, Female
Details:
The Hair x Eye table comes from a survey of students at the
University of Delaware reported by Snee (1974). The split by
‘Sex’ was added by Friendly (1992a) for didactic purposes.
This data set is useful for illustrating various techniques for
the analysis of contingency tables, such as the standard
chi-squared test or, more generally, log-linear modelling, and
graphical methods such as mosaic plots, sieve diagrams or
association plots.
Source:
<URL:
http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas>
Snee (1974) gives the two-way table aggregated over ‘Sex’. The
‘Sex’ split of the ‘Brown hair, Brown eye’ cell was changed to
agree with that used by Friendly (2000).
References:
Snee, R. D. (1974). Graphical display of two-way contingency
tables. _The American Statistician_, *28*, 9-12. doi:
10.2307/2683520 (URL: http://doi.org/10.2307/2683520).
## , , Sex = Male
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
##
## , , Sex = Female
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
haireye <- as.data.frame(HairEyeColor)
names(haireye) <- tolower(names(haireye))
names(haireye)[names(haireye) == "freq"] <- "number"
haireye
## hair eye sex number
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
## 7 Red Blue Male 10
## 8 Blond Blue Male 30
## 9 Black Hazel Male 10
## 10 Brown Hazel Male 25
## 11 Red Hazel Male 7
## 12 Blond Hazel Male 5
## 13 Black Green Male 3
## 14 Brown Green Male 15
## 15 Red Green Male 7
## 16 Blond Green Male 8
## 17 Black Brown Female 36
## 18 Brown Brown Female 66
## 19 Red Brown Female 16
## 20 Blond Brown Female 4
## 21 Black Blue Female 9
## 22 Brown Blue Female 34
## 23 Red Blue Female 7
## 24 Blond Blue Female 64
## 25 Black Hazel Female 5
## 26 Brown Hazel Female 29
## 27 Red Hazel Female 7
## 28 Blond Hazel Female 5
## 29 Black Green Female 2
## 30 Brown Green Female 14
## 31 Red Green Female 7
## 32 Blond Green Female 8
Questions:
If hair and eye color are independent, then probabilities of combinations are multiplicative:
\[\begin{aligned} &\P\{\text{black hair and blue eyes}\} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\given\text{black hair}\} \\ \end{aligned}\]
which, if hair and eye color are independent, is \[\begin{aligned} &\hphantom{\P\{\text{black hair and blue eyes}\}} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\} \end{aligned}\]
A model of independence will have a multiplicative form: \[ p_{ab} = p_a \times p_b . \]
Let’s start by looking at just hair and eye color, summing over sex:
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16
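The code that produced this table is not shown. One way to get it, as a sketch: sum the built-in `HairEyeColor` array over its third (`Sex`) dimension. The name `haireye_2d` is the one that appears in later output, but the `apply()` call is an assumption about how it was built, and `hair_props` and `eye_props` are names made up here for illustration.

```r
# Sum the 4 x 4 x 2 array over Sex, keeping the Hair (1) and Eye (2) dimensions.
# Assumption: haireye_2d was built this way; margin.table(HairEyeColor, c(1, 2))
# gives the same counts.
haireye_2d <- apply(HairEyeColor, c(1, 2), sum)
haireye_2d

# Marginal proportions of each hair color and each eye color
# (hypothetical helper names, not from the original slides)
hair_props <- rowSums(haireye_2d) / sum(haireye_2d)
eye_props <- colSums(haireye_2d) / sum(haireye_2d)
round(hair_props, 3)
round(eye_props, 3)
```

The two vectors of marginal proportions are what a model of independence would multiply together to predict each cell.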
In this dataset…
Let \[\begin{aligned} n_{ij} &= (\text{observed}_{ij}) \\ &=(\text{observed number with hair $i$ and eye $j$}) \\ E_{ij} &= (\text{expected}_{ij}) \\ &=(\text{total number}) \times(\text{proportion with hair $i$}) \\ &\qquad \times (\text{proportion with eye $j$}) \\ &= n \times \left(\frac{n_{i\cdot}}{n}\right) \times \left(\frac{n_{\cdot j}}{n}\right) . \end{aligned}\]
Here \(n_{i \cdot}\) and \(n_{\cdot j}\) are the row and column sums.
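As a worked example of this formula for a single cell, here is a small sketch for black hair and brown eyes. The variable names are made up for illustration, and the table of counts is rebuilt with `apply()`, which is an assumption about how the hair-by-eye table was made.

```r
# Hair-by-eye counts summed over sex (assumed construction of the 2D table)
counts <- apply(HairEyeColor, c(1, 2), sum)
n <- sum(counts)                       # total number of students
n_black <- rowSums(counts)["Black"]    # row sum: students with black hair
n_brown <- colSums(counts)["Brown"]    # column sum: students with brown eyes

# Expected count under independence, next to the observed count
expected_black_brown <- n * (n_black / n) * (n_brown / n)
observed_black_brown <- counts["Black", "Brown"]
c(observed = observed_black_brown, expected = unname(expected_black_brown))
```

The row sum is 108 and the column sum is 220, so the expected count is 108 × 220 / 592, or about 40.1, against an observed count of 68.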
We want to quantify how different the observed and expected are, inversely weighted by their noisiness: \[\begin{aligned} \sum_{ij} \left( \frac{ (\text{observed})_{ij} - (\text{expected})_{ij} }{ \SE[\text{observed}_{ij}] } \right)^2 \end{aligned}\]
So, what is \(\SE[\text{observed}_{ij}]\)?
Under the model of independence, \[\begin{aligned} n_{ij} &\sim \Binom(n, p_i q_j) , \\ \text{where}\quad p_i &= (\text{prob of hair color $i$}) \\ q_j &= (\text{prob of eye color $j$}) . \end{aligned}\]
So, \[\begin{aligned} \sd[n_{ij}] = \sqrt{ n p_i q_j (1 - p_i q_j) } , \end{aligned}\]
… and so, if \(p_i q_j\) is small (so that \(1 - p_i q_j \approx 1\)), how about this \[\begin{aligned} \SE[n_{ij}] &\approx \sqrt{ n p_i q_j } \\ &= \sqrt{(\text{expected}_{ij})} \qquad \ldots? \end{aligned}\]
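To get a feel for how rough this approximation is, here is a quick check for one cell (black hair, brown eyes). It is only a sketch: it plugs in the observed marginal proportions (108/592 and 220/592) as if they were the true \(p_i\) and \(q_j\), which is an assumption.

```r
n <- 592
# p_i * q_j for black hair and brown eyes, using observed proportions
# (assumption: these stand in for the true probabilities)
p <- (108 / n) * (220 / n)

exact_sd <- sqrt(n * p * (1 - p))   # binomial standard deviation
approx_sd <- sqrt(n * p)            # square root of the expected count

c(exact = exact_sd, approx = approx_sd, ratio = approx_sd / exact_sd)
```

The approximation always overestimates the binomial standard deviation by a factor of \(1/\sqrt{1 - p_i q_j}\), which is close to 1 when the cell probability is small.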
\[\begin{aligned} \chi^2 &= \sum_{ij} \frac{ \left((\text{observed})_{ij} - (\text{expected})_{ij} \right)^2 }{ (\text{expected})_{ij} } . \end{aligned}\]
i.e., “observed minus expected, squared, divided by expected”, summed over all cells.
This gives us a number. What does it mean?
##
## Pearson's Chi-squared test
##
## data: haireye_2d
## X-squared = 138, df = 9, p-value <2e-16
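The call that produced this output is not shown, but it has the form printed by R's `chisq.test()`. As a sketch (assuming `haireye_2d` is the hair-by-eye table summed over sex), the following runs the test and then recomputes the statistic by hand from the formula above. Under independence the statistic is approximately chi-squared distributed with (rows − 1) × (columns − 1) degrees of freedom, which is where `df = 9` comes from.

```r
# Assumed construction of the 2D table of counts
haireye_2d <- apply(HairEyeColor, c(1, 2), sum)

# Pearson's chi-squared test of independence
result <- chisq.test(haireye_2d)
result

# The same statistic by hand: sum of (observed - expected)^2 / expected
expected <- outer(rowSums(haireye_2d), colSums(haireye_2d)) / sum(haireye_2d)
chi2 <- sum((haireye_2d - expected)^2 / expected)

# Degrees of freedom and upper-tail p-value
df <- (nrow(haireye_2d) - 1) * (ncol(haireye_2d) - 1)
p_value <- pchisq(chi2, df = df, lower.tail = FALSE)
c(chi2 = chi2, df = df, p_value = p_value)
```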
Um, ok? Hair and eye color are not independent?
Let’s actually look at “observed minus expected”:
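# Expected counts under independence: (row sum) x (column sum) / (total)
haireye_exp <- 0 * haireye_2d  # same shape and dimnames as the observed table
haireye_exp[] <- ( rowSums(haireye_2d)[row(haireye_exp)]
                   * colSums(haireye_2d)[col(haireye_exp)]
                   / sum(haireye_2d) )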
##        Eye
## Hair    Brown  Blue Hazel Green
##   Black  40.1  39.2  17.0  11.7
##   Brown 106.3 103.9  44.9  30.9
##   Red    26.4  25.8  11.2   7.7
##   Blond  47.2  46.1  20.0  13.7
Observed minus expected:
##        Eye
## Hair     Brown   Blue  Hazel  Green
##   Black  27.86 -19.22  -1.97  -6.68
##   Brown  12.72 -19.87   9.07  -1.92
##   Red    -0.39  -8.79   2.85   6.32
##   Blond -40.20  47.88  -9.95   2.27
Normalized by \(\sqrt{\text{expected}}\):
##        Eye
## Hair     Brown   Blue  Hazel  Green
##   Black  4.398 -3.069 -0.477 -1.954
##   Brown  1.233 -1.949  1.353 -0.345
##   Red   -0.075 -1.730  0.852  2.283
##   Blond -5.851  7.050 -2.228  0.613
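These normalized differences are the Pearson residuals, and `chisq.test()` returns them directly. The sketch below (again assuming `haireye_2d` is built by summing over sex, and using made-up variable names) computes them both ways and checks that their squares add up to the chi-squared statistic.

```r
# Assumed construction of the 2D table of counts
haireye_2d <- apply(HairEyeColor, c(1, 2), sum)
fit <- chisq.test(haireye_2d)

# Observed minus expected, then normalized by sqrt(expected)
obs_minus_exp <- fit$observed - fit$expected
pearson_resid <- obs_minus_exp / sqrt(fit$expected)
round(pearson_resid, 3)

# chisq.test() stores the same quantities in fit$residuals
max(abs(pearson_resid - fit$residuals))

# The squared residuals sum to the test statistic
sum(pearson_resid^2)
```

Cells with large positive or negative residuals are the ones contributing most to the statistic.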
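No! (Unsurprisingly.) But now we also know how they are nonindependent: for example, there are many more blond-haired, blue-eyed students, and many fewer blond-haired, brown-eyed students, than independence would predict.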