Peter Ralph
Advanced Biological Statistics
HairEyeColor package:datasets R Documentation
Hair and Eye Color of Statistics Students
Description:
Distribution of hair and eye color and sex in 592 statistics
students.
Usage:
HairEyeColor
Format:
A 3-dimensional array resulting from cross-tabulating 592
observations on 3 variables. The variables and their levels are
as follows:
No Name Levels
1 Hair Black, Brown, Red, Blond
2 Eye Brown, Blue, Hazel, Green
3 Sex Male, Female
Details:
The Hair x Eye table comes from a survey of students at the
University of Delaware reported by Snee (1974). The split by
‘Sex’ was added by Friendly (1992a) for didactic purposes.
This data set is useful for illustrating various techniques for
the analysis of contingency tables, such as the standard
chi-squared test or, more generally, log-linear modelling, and
graphical methods such as mosaic plots, sieve diagrams or
association plots.
Source:
<URL:
http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas>
Snee (1974) gives the two-way table aggregated over ‘Sex’. The
‘Sex’ split of the ‘Brown hair, Brown eye’ cell was changed to
agree with that used by Friendly (2000).
References:
Snee, R. D. (1974). Graphical display of two-way contingency
tables. _The American Statistician_, *28*, 9-12. doi:
10.2307/2683520 (URL: http://doi.org/10.2307/2683520).
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
haireye <- as.data.frame(HairEyeColor)
names(haireye) <- tolower(names(haireye))
names(haireye)[names(haireye) == "freq"] <- "number"
haireye
## hair eye sex number
## 1 Black Brown Male 32
## 2 Brown Brown Male 53
## 3 Red Brown Male 10
## 4 Blond Brown Male 3
## 5 Black Blue Male 11
## 6 Brown Blue Male 50
## 7 Red Blue Male 10
## 8 Blond Blue Male 30
## 9 Black Hazel Male 10
## 10 Brown Hazel Male 25
## 11 Red Hazel Male 7
## 12 Blond Hazel Male 5
## 13 Black Green Male 3
## 14 Brown Green Male 15
## 15 Red Green Male 7
## 16 Blond Green Male 8
## 17 Black Brown Female 36
## 18 Brown Brown Female 66
## 19 Red Brown Female 16
## 20 Blond Brown Female 4
## 21 Black Blue Female 9
## 22 Brown Blue Female 34
## 23 Red Blue Female 7
## 24 Blond Blue Female 64
## 25 Black Hazel Female 5
## 26 Brown Hazel Female 29
## 27 Red Hazel Female 7
## 28 Blond Hazel Female 5
## 29 Black Green Female 2
## 30 Brown Green Female 14
## 31 Red Green Female 7
## 32 Blond Green Female 8
Questions:
We have for data \[\begin{aligned} n_{ij} = \text{( number of observations in categories $i$ and $j$ )} . \end{aligned}\] and \[\begin{aligned} n_{i \cdot} &= \text{( total number of observations in category $i$ )} , \\ n_{\cdot j} &= \text{( total number of observations in category $j$ )} , \\ n_{\cdot \cdot} &= \text{( total number of observations )} . \end{aligned}\]
Question: Is \(n_{i \cdot}\) fixed? What about \(n_{\cdot j}\)? Or \(n_{\cdot \cdot}\)?
Answer: Happily, it will turn out not to matter (for the model).
If we have Poisson-many things of two categories: \[\begin{aligned} A &\sim \Poisson(a) \\ B &\sim \Poisson(b) \end{aligned}\] then the total number of things is also Poisson: \[ A + B \sim \Poisson(a + b) \] and each thing chooses its type independently: \[ A \given A + B \sim \Binom\left(A+B, \frac{a}{a+b}\right) . \]
So, if we fit a model where
\[\begin{aligned} n_{ij} &\sim \Poisson(\lambda_{ij}) \end{aligned}\]
then \[\begin{aligned} \frac{\lambda_{ij}}{\sum_{k\ell} \lambda_{k\ell}} &= (\text{proportion in category $ij$}) \\ \frac{\lambda_{ij}}{\sum_{k} \lambda_{kj}} &= (\text{proportion of row $j$ in column $i$}) \\ \frac{\lambda_{ij}}{\lambda_{ij} + \lambda_{k\ell}} &= (\text{proportion of those in $k\ell$ or $ij$ that are $ij$}) \end{aligned}\]
For example, the proportion of black-haired people who have blue eyes is \[\begin{aligned} & \P\left(\text{ eye blue } \given \text{ hair black } \right) \\ &\qquad = \frac{ \lambda_{\text{blue},\text{black}} }{ \lambda_{\cdot,\text{black}} } \\ &\qquad = \frac{ \lambda_{\text{blue},\text{black}} }{ \lambda_{\text{brown},\text{black}} + \lambda_{\text{blue},\text{black}} + \lambda_{\text{hazel},\text{black}} + \lambda_{\text{green},\text{black}} } \end{aligned}\]
Question: What’s \[\begin{aligned} \P\left(\text{ male } \given \text{ hair black } \right) = ? \end{aligned}\]
If hair and eye color are independent, then probabilities of combinations are multiplicative:
\[\begin{aligned} &\P\{\text{black hair and blue eyes}\} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\given\text{black hair}\} \\ \end{aligned}\]
which if independent is \[\begin{aligned} &\hphantom{\P\{\text{black hair and blue eyes}\}} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\} \end{aligned}\]
In shorthand, \[\begin{aligned} p_{ij} = p_i \times p_j . \end{aligned}\]
If we use a log link function, additivity corresponds to independence: \[\begin{aligned} p_{ij} &= \frac{ \lambda_{ij} }{ \sum_{ab} \lambda_{ab} } \\ &= \frac{ e^{\beta_i + \gamma_j} }{ \sum_{ab} e^{\beta_a + \gamma_b} } \\ &= \left( \frac{ e^{\beta_i} }{ \sum_{a} e^{\beta_a} } \right) \times \left( \frac{ e^{\gamma_j} }{ \sum_{b} e^{\gamma_b} } \right) \\ &= p_i \times p_j \end{aligned}\]
## prior class coef group resp dpar nlpar bound source
## (flat) b default
## (flat) b eyeBlue (vectorized)
## (flat) b eyeGreen (vectorized)
## (flat) b eyeHazel (vectorized)
## (flat) b hairBlond (vectorized)
## (flat) b hairBlond:eyeBlue (vectorized)
## (flat) b hairBlond:eyeGreen (vectorized)
## (flat) b hairBlond:eyeHazel (vectorized)
## (flat) b hairBrown (vectorized)
## (flat) b hairBrown:eyeBlue (vectorized)
## (flat) b hairBrown:eyeGreen (vectorized)
## (flat) b hairBrown:eyeHazel (vectorized)
## (flat) b hairRed (vectorized)
## (flat) b hairRed:eyeBlue (vectorized)
## (flat) b hairRed:eyeGreen (vectorized)
## (flat) b hairRed:eyeHazel (vectorized)
## (flat) b sexFemale (vectorized)
## (flat) b sexFemale:eyeBlue (vectorized)
## (flat) b sexFemale:eyeGreen (vectorized)
## (flat) b sexFemale:eyeHazel (vectorized)
## (flat) b sexFemale:hairBlond (vectorized)
## (flat) b sexFemale:hairBrown (vectorized)
## (flat) b sexFemale:hairRed (vectorized)
## student_t(3, 2.3, 2.5) Intercept default
## Compiling Stan program...
## Start sampling
## Family: poisson
## Links: mu = log
## Formula: number ~ sex + hair * eye + (hair + eye) * sex
## Data: haireye (Number of observations: 32)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 3.43 0.16 3.10 3.73 1.00 2622 2571
## sexFemale 0.09 0.21 -0.31 0.51 1.00 2839 2813
## hairBrown 0.53 0.19 0.16 0.92 1.00 2763 2771
## hairRed -1.05 0.29 -1.61 -0.49 1.00 2979 2692
## hairBlond -2.66 0.41 -3.50 -1.89 1.00 2640 2622
## eyeBlue -0.98 0.26 -1.49 -0.48 1.00 2715 2792
## eyeHazel -1.31 0.30 -1.92 -0.73 1.00 2838 2461
## eyeGreen -2.27 0.45 -3.22 -1.44 1.00 2857 2518
## hairBrown:eyeBlue 0.83 0.28 0.29 1.37 1.00 2705 2685
## hairRed:eyeBlue 0.74 0.39 -0.05 1.50 1.00 3150 2995
## hairBlond:eyeBlue 3.75 0.44 2.93 4.62 1.00 2235 2571
## hairBrown:eyeHazel 0.67 0.33 0.04 1.32 1.00 2898 2679
## hairRed:eyeHazel 0.82 0.43 -0.04 1.65 1.00 3124 2989
## hairBlond:eyeHazel 1.71 0.53 0.66 2.75 1.00 3181 3042
## hairBrown:eyeGreen 1.07 0.47 0.22 2.06 1.00 2940 2437
## hairRed:eyeGreen 1.84 0.54 0.83 2.95 1.00 3156 3031
## hairBlond:eyeGreen 3.24 0.61 2.08 4.48 1.00 2626 2792
## sexFemale:hairBrown 0.14 0.23 -0.32 0.59 1.00 2581 2875
## sexFemale:hairRed 0.24 0.32 -0.39 0.87 1.00 3021 2822
## sexFemale:hairBlond 0.86 0.30 0.28 1.44 1.00 2652 2858
## sexFemale:eyeBlue -0.41 0.22 -0.85 0.01 1.00 4118 3467
## sexFemale:eyeHazel -0.32 0.25 -0.81 0.14 1.00 5029 3364
## sexFemale:eyeGreen -0.48 0.30 -1.08 0.11 1.00 4669 2986
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
But, that was numbers: here’s the credible intervals for the proportions of each eye color for each hair color:
Recall the “reference” levels are hair = Black and eye = Brown
Questions:
(What do we mean: given gender? averaged across gender?)
Recall the “reference” levels are hair = Black and eye = Brown
Eye color proportions by hair type:
Hair color proportions by eye type:
What we’ve done so far:
Question: What would we have seen if hair and eye color were independent?
Let’s simulate some test data, under independence.
sim_data <- function () {
n <- sum(haireye$number)
p_hair <- tapply(haireye$number, haireye$hair, sum) / n
p_eye <- tapply(haireye$number, haireye$eye, sum) / n
p_sex <- tapply(haireye$number, haireye$sex, sum) / n
p <- (p_hair[haireye$hair] *
p_eye[haireye$eye] *
p_sex[haireye$sex])
stopifnot(sum(p) == 1)
number <- rmultinom(1, size=n, prob=p)
return(number)
}
sim_fits <- lapply(1:5, function (k) {
# replaces the data only locally
haireye$number <- sim_data()
update(he_fit, newdata=haireye) })
## Start sampling
## Start sampling
## Start sampling
## Start sampling
## Start sampling