Peter Ralph
19 January 2021 – Advanced Biological Statistics
HairEyeColor package:datasets R Documentation
Hair and Eye Color of Statistics Students
Description:
Distribution of hair and eye color and sex in 592 statistics
students.
Usage:
HairEyeColor
Format:
A 3-dimensional array resulting from cross-tabulating 592
observations on 3 variables. The variables and their levels are
as follows:
No Name Levels
1 Hair Black, Brown, Red, Blond
2 Eye Brown, Blue, Hazel, Green
3 Sex Male, Female
Details:
The Hair x Eye table comes from a survey of students at the
University of Delaware reported by Snee (1974). The split by
‘Sex’ was added by Friendly (1992a) for didactic purposes.
This data set is useful for illustrating various techniques for
the analysis of contingency tables, such as the standard
chi-squared test or, more generally, log-linear modelling, and
graphical methods such as mosaic plots, sieve diagrams or
association plots.
Source:
<URL:
http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas>
Snee (1974) gives the two-way table aggregated over ‘Sex’. The
‘Sex’ split of the ‘Brown hair, Brown eye’ cell was changed to
agree with that used by Friendly (2000).
References:
Snee, R. D. (1974). Graphical display of two-way contingency
tables. _The American Statistician_, *28*, 9-12. doi:
10.2307/2683520 (URL: http://doi.org/10.2307/2683520).
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
haireye <- as.data.frame(HairEyeColor)
names(haireye) <- tolower(names(haireye))
names(haireye)[names(haireye) == "freq"] <- "number"
haireye
## hair eye sex number
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
## 7 Red Blue Male 10
## 8 Blond Blue Male 30
## 9 Black Hazel Male 10
## 10 Brown Hazel Male 25
## 11 Red Hazel Male 7
## 12 Blond Hazel Male 5
## 13 Black Green Male 3
## 14 Brown Green Male 15
## 15 Red Green Male 7
## 16 Blond Green Male 8
## 17 Black Brown Female 36
## 18 Brown Brown Female 66
## 19 Red Brown Female 16
## 20 Blond Brown Female 4
## 21 Black Blue Female 9
## 22 Brown Blue Female 34
## 23 Red Blue Female 7
## 24 Blond Blue Female 64
## 25 Black Hazel Female 5
## 26 Brown Hazel Female 29
## 27 Red Hazel Female 7
## 28 Blond Hazel Female 5
## 29 Black Green Female 2
## 30 Brown Green Female 14
## 31 Red Green Female 7
## 32 Blond Green Female 8
Questions:
We have for data \[\begin{aligned} n_{ij} = \text{( number of observations in categories $i$ and $j$ )} . \end{aligned}\] and \[\begin{aligned} n_{i \cdot} &= \text{( total number of observations in category $i$ )} , \\ n_{\cdot j} &= \text{( total number of observations in category $j$ )} , \\ n_{\cdot \cdot} &= \text{( total number of observations )} . \end{aligned}\]
Question: Is \(n_{i \cdot}\) fixed? What about \(n_{\cdot j}\)? Or \(n_{\cdot \cdot}\)?
Answer: Happily, it will turn out not to matter (for the model).
If we have Poisson-many things of two categories: \[\begin{aligned} A &\sim \Poisson(a) \\ B &\sim \Poisson(b) \end{aligned}\] then the total number of things is also Poisson: \[ A + B \sim \Poisson(a + b) \] and each thing chooses its type independently: \[ A \given A + B \sim \Binom\left(A+B, \frac{a}{a+b}\right) . \]
So, if we fit a model where
\[\begin{aligned} n_{ij} &\sim \Poisson(\lambda_{ij}) \end{aligned}\]
then \[\begin{aligned} \frac{\lambda_{ij}}{\sum_{k\ell} \lambda_{k\ell}} &= (\text{proportion in category $ij$}) \\ \frac{\lambda_{ij}}{\sum_{k} \lambda_{kj}} &= (\text{proportion of row $j$ in column $i$}) \\ \frac{\lambda_{ij}}{\lambda_{ij} + \lambda_{k\ell}} &= (\text{proportion of those in $k\ell$ or $ij$ that are $ij$}) \end{aligned}\]
For example, the proportion of black-haired people who have blue eyes is \[\begin{aligned} & \P\left(\text{ eye blue } \given \text{ hair black } \right) \\ &\qquad = \frac{ \lambda_{\text{blue},\text{black}} }{ \lambda_{\cdot,\text{black}} } \\ &\qquad = \frac{ \lambda_{\text{blue},\text{black}} }{ \lambda_{\text{brown},\text{black}} + \lambda_{\text{blue},\text{black}} + \lambda_{\text{hazel},\text{black}} + \lambda_{\text{green},\text{black}} } \end{aligned}\]
Question: What’s \[\begin{aligned} \P\left(\text{ male } \given \text{ hair black } \right) = ? \end{aligned}\]
If hair and eye color are independent, then probabilities of combinations are multiplicative:
\[\begin{aligned} &\P\{\text{black hair and blue eyes}\} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\given\text{black hair}\} \\ \end{aligned}\]
which if independent is \[\begin{aligned} &\hphantom{\P\{\text{black hair and blue eyes}\}} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\} \end{aligned}\]
In shorthand, \[\begin{aligned} p_{ij} = p_i \times p_j . \end{aligned}\]
If we use a log link function, additivity corresponds to independence: \[\begin{aligned} p_{ij} &= \frac{ \lambda_{ij} }{ \sum_{ab} \lambda_{ab} } \\ &= \frac{ e^{\beta_i + \gamma_j} }{ \sum_{ab} e^{\beta_a + \gamma_b} } \\ &= \left( \frac{ e^{\beta_i} }{ \sum_{a} e^{\beta_a} } \right) \times \left( \frac{ e^{\gamma_j} }{ \sum_{b} e^{\gamma_b} } \right) \\ &= p_i \times p_j \end{aligned}\]
## prior class coef group resp dpar nlpar bound source
## (flat) b default
## (flat) b eyeBlue (vectorized)
## (flat) b eyeGreen (vectorized)
## (flat) b eyeHazel (vectorized)
## (flat) b hairBlond (vectorized)
## (flat) b hairBlond:eyeBlue (vectorized)
## (flat) b hairBlond:eyeGreen (vectorized)
## (flat) b hairBlond:eyeHazel (vectorized)
## (flat) b hairBrown (vectorized)
## (flat) b hairBrown:eyeBlue (vectorized)
## (flat) b hairBrown:eyeGreen (vectorized)
## (flat) b hairBrown:eyeHazel (vectorized)
## (flat) b hairRed (vectorized)
## (flat) b hairRed:eyeBlue (vectorized)
## (flat) b hairRed:eyeGreen (vectorized)
## (flat) b hairRed:eyeHazel (vectorized)
## (flat) b sexFemale (vectorized)
## (flat) b sexFemale:eyeBlue (vectorized)
## (flat) b sexFemale:eyeGreen (vectorized)
## (flat) b sexFemale:eyeHazel (vectorized)
## (flat) b sexFemale:hairBlond (vectorized)
## (flat) b sexFemale:hairBrown (vectorized)
## (flat) b sexFemale:hairRed (vectorized)
## student_t(3, 2.3, 2.5) Intercept default
## Warning: partial match of 'dpar' to 'dpars'
## Compiling Stan program...
## Start sampling
## Family: poisson
## Links: mu = log
## Formula: number ~ sex + hair * eye + (hair + eye) * sex
## Data: haireye (Number of observations: 32)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 3.43 0.16 3.10 3.74 1.00 2530 2801
## sexFemale 0.09 0.21 -0.31 0.50 1.00 2835 3128
## hairBrown 0.53 0.19 0.16 0.91 1.00 2523 2640
## hairRed -1.05 0.29 -1.62 -0.50 1.00 2589 2746
## hairBlond -2.66 0.41 -3.49 -1.89 1.00 2562 2667
## eyeBlue -0.97 0.27 -1.49 -0.45 1.00 2907 2882
## eyeHazel -1.31 0.31 -1.94 -0.73 1.00 2949 2919
## eyeGreen -2.27 0.44 -3.18 -1.48 1.00 2402 2466
## hairBrown:eyeBlue 0.83 0.28 0.28 1.40 1.00 2689 2600
## hairRed:eyeBlue 0.73 0.40 -0.08 1.51 1.00 3026 2953
## hairBlond:eyeBlue 3.73 0.44 2.92 4.64 1.00 2548 2240
## hairBrown:eyeHazel 0.67 0.33 0.05 1.33 1.00 2990 3269
## hairRed:eyeHazel 0.82 0.43 -0.02 1.69 1.00 3314 3239
## hairBlond:eyeHazel 1.70 0.56 0.64 2.81 1.00 2937 2369
## hairBrown:eyeGreen 1.07 0.46 0.21 2.04 1.00 2467 2804
## hairRed:eyeGreen 1.84 0.54 0.80 2.91 1.00 2557 2597
## hairBlond:eyeGreen 3.24 0.59 2.12 4.45 1.00 2282 2773
## sexFemale:hairBrown 0.13 0.23 -0.31 0.59 1.00 2512 2906
## sexFemale:hairRed 0.24 0.31 -0.37 0.83 1.00 2658 2739
## sexFemale:hairBlond 0.86 0.30 0.27 1.44 1.00 2547 2683
## sexFemale:eyeBlue -0.41 0.22 -0.85 0.03 1.00 3983 3291
## sexFemale:eyeHazel -0.32 0.25 -0.82 0.16 1.00 4442 3327
## sexFemale:eyeGreen -0.48 0.29 -1.06 0.09 1.00 4119 3035
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
But, that was numbers: here’s the credible intervals for the proportions of each eye color for each hair color:
Recall the “reference” levels are hair = Black and eye = Brown
Questions:
(What do we mean: given gender? averaged across gender?)
Recall the “reference” levels are hair = Black and eye = Brown
Eye color proportions by hair type:
Hair color proportions by eye type:
What we’ve done so far:
Question: What would we have seen if hair and eye color were independent?
Let’s simulate some test data, under independence.
sim_data <- function () {
n <- sum(haireye$number)
p_hair <- tapply(haireye$number, haireye$hair, sum) / n
p_eye <- tapply(haireye$number, haireye$eye, sum) / n
p_sex <- tapply(haireye$number, haireye$sex, sum) / n
p <- (p_hair[haireye$hair] *
p_eye[haireye$eye] *
p_sex[haireye$sex])
stopifnot(sum(p) == 1)
number <- rmultinom(1, size=n, prob=p)
return(number)
}
sim_fits <- lapply(1:5, function (k) {
# replaces the data only locally
haireye$number <- sim_data()
update(he_fit, newdata=haireye) })
## Start sampling
## Start sampling
## Start sampling
## Start sampling
## Start sampling