Categorical data and the chi-squared test

Peter Ralph

21 January 2020 – Advanced Biological Statistics

Categorical data

Hair and Eye color

data(HairEyeColor)

HairEyeColor             package:datasets              R Documentation

Hair and Eye Color of Statistics Students

Description:

     Distribution of hair and eye color and sex in 592 statistics
     students.

Usage:

     HairEyeColor
     
Format:

     A 3-dimensional array resulting from cross-tabulating 592
     observations on 3 variables.  The variables and their levels are
     as follows:

       No  Name  Levels                    
        1  Hair  Black, Brown, Red, Blond  
        2  Eye   Brown, Blue, Hazel, Green 
        3  Sex   Male, Female              
      
Details:

     The Hair x Eye table comes from a survey of students at the
     University of Delaware reported by Snee (1974).  The split by
     ‘Sex’ was added by Friendly (1992a) for didactic purposes.

     This data set is useful for illustrating various techniques for
     the analysis of contingency tables, such as the standard
     chi-squared test or, more generally, log-linear modelling, and
     graphical methods such as mosaic plots, sieve diagrams or
     association plots.

Source:

     <URL:
     http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas>

     Snee (1974) gives the two-way table aggregated over ‘Sex’.  The
     ‘Sex’ split of the ‘Brown hair, Brown eye’ cell was changed to
     agree with that used by Friendly (2000).

References:

     Snee, R. D. (1974).  Graphical display of two-way contingency
     tables.  _The American Statistician_, *28*, 9-12.  doi:
     10.2307/2683520 (URL: http://doi.org/10.2307/2683520).

## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

haireye <- as.data.frame(HairEyeColor)
names(haireye) <- tolower(names(haireye))
names(haireye)[names(haireye) == "freq"] <- "number"
haireye

##     hair   eye    sex number
## 1  Black Brown   Male     32
## 2  Brown Brown   Male     53
## 3    Red Brown   Male     10
## 4  Blond Brown   Male      3
## 5  Black  Blue   Male     11
## 6  Brown  Blue   Male     50
## 7    Red  Blue   Male     10
## 8  Blond  Blue   Male     30
## 9  Black Hazel   Male     10
## 10 Brown Hazel   Male     25
## 11   Red Hazel   Male      7
## 12 Blond Hazel   Male      5
## 13 Black Green   Male      3
## 14 Brown Green   Male     15
## 15   Red Green   Male      7
## 16 Blond Green   Male      8
## 17 Black Brown Female     36
## 18 Brown Brown Female     66
## 19   Red Brown Female     16
## 20 Blond Brown Female      4
## 21 Black  Blue Female      9
## 22 Brown  Blue Female     34
## 23   Red  Blue Female      7
## 24 Blond  Blue Female     64
## 25 Black Hazel Female      5
## 26 Brown Hazel Female     29
## 27   Red Hazel Female      7
## 28 Blond Hazel Female      5
## 29 Black Green Female      2
## 30 Brown Green Female     14
## 31   Red Green Female      7
## 32 Blond Green Female      8

Questions:

Are hair and eye color independent in this sample?
Do hair and eye color proportions differ by sex?

Independence and multiplicativity

If hair and eye color are independent, then probabilities of combinations are multiplicative:

\[\begin{aligned} &\P\{\text{black hair and blue eyes}\} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\given\text{black hair}\} \\ \end{aligned}\]

which, if hair and eye color are independent, is \[\begin{aligned} &\hphantom{\P\{\text{black hair and blue eyes}\}} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\} \end{aligned}\]

Multiplicativity

A model of independence will have a multiplicative form: \[ p_{ab} = p_a \times p_b . \]
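To make the multiplicative form concrete, here is a minimal sketch that builds a joint table from two sets of marginal probabilities. The values in `p_hair` and `p_eye` are made up for illustration and are not estimates from the data.

```r
# Hypothetical marginal probabilities (illustration only, not from the data).
p_hair <- c(Black = 0.2, Brown = 0.5, Red = 0.1, Blond = 0.2)
p_eye <- c(Brown = 0.4, Blue = 0.3, Hazel = 0.2, Green = 0.1)

# Under independence, each joint probability is the product of its marginals:
# p_joint[a, b] = p_hair[a] * p_eye[b].
p_joint <- outer(p_hair, p_eye)
p_joint

# The joint table sums to one, and its margins recover p_hair and p_eye.
sum(p_joint)
rowSums(p_joint)
colSums(p_joint)
```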

The chi-squared statistic

Let’s start by looking at just hair and eye color, summing over sex:

(haireye_2d <- HairEyeColor[,,"Male"] + HairEyeColor[,,"Female"])

##        Eye
## Hair    Brown Blue Hazel Green
##   Black    68   20    15     5
##   Brown   119   84    54    29
##   Red      26   17    14    14
##   Blond     7   94    10    16

Some questions

In this dataset…

What proportion have blond hair?
What proportion have blue eyes?
If hair and eye color assort independently, what proportion do you expect to have both blond hair and blue eyes? How many people would this be?
How many actually have both? Is this difference surprising?
Do the same for black hair and green eyes.
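One way to work through these questions in R, as a sketch: the variable names (`p_blond`, `p_blue`, and so on) are introduced here for illustration, and the table is rebuilt so the block runs on its own.

```r
# Hair-by-eye table, summed over sex.
haireye_2d <- HairEyeColor[, , "Male"] + HairEyeColor[, , "Female"]
n <- sum(haireye_2d)  # total number of students

# Marginal proportions.
p_blond <- sum(haireye_2d["Blond", ]) / n
p_blue <- sum(haireye_2d[, "Blue"]) / n

# Expected count under independence, next to the observed count.
expected_blond_blue <- n * p_blond * p_blue
observed_blond_blue <- haireye_2d["Blond", "Blue"]
c(expected = expected_blond_blue, observed = observed_blond_blue)
```

Swapping in `"Black"` and `"Green"` gives the corresponding comparison for black hair and green eyes.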

“Expected” counts

Let \[\begin{aligned} n_{ij} &= (\text{observed}_{ij}) \\ &=(\text{observed number with hair $i$ and eye $j$}) \\ E_{ij} &= (\text{expected}_{ij}) \\ &=(\text{total number}) \times(\text{proportion with hair $i$}) \\ &\qquad \times (\text{proportion with eye $j$}) \\ &= n \times \left(\frac{n_{i\cdot}}{n}\right) \times \left(\frac{n_{\cdot j}}{n}\right) . \end{aligned}\]

Here $n_{i \cdot}$ and $n_{\cdot j}$ are the row and column sums.
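The formula for $E_{ij}$ is an outer product of the row and column sums divided by $n$. A short sketch, cross-checked against the `expected` component that `chisq.test()` returns:

```r
haireye_2d <- HairEyeColor[, , "Male"] + HairEyeColor[, , "Female"]
n <- sum(haireye_2d)

# E_ij = n * (n_i. / n) * (n_.j / n) = n_i. * n_.j / n
expected <- outer(rowSums(haireye_2d), colSums(haireye_2d)) / n
round(expected, 1)

# chisq.test() computes the same table of expected counts.
all.equal(expected, chisq.test(haireye_2d)$expected, check.attributes = FALSE)
```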

We want to quantify how different the observed and expected are, inversely weighted by their noisiness: \[\begin{aligned} \sum_{ij} \left( \frac{ (\text{observed})_{ij} - (\text{expected})_{ij} }{ \SE[\text{observed}_{ij}] } \right)^2 \end{aligned}\]

So, what is $\SE[\text{observed}_{ij}]$?

What is $\SE[\text{observed}_{ij}]$?

Under the model of independence, \[\begin{aligned} n_{ij} &\sim \Binom(n, p_i q_j) , \\ \text{where}\quad p_i &= (\text{prob of hair color $i$}) \\ q_j &= (\text{prob of eye color $j$}) . \end{aligned}\]

So, \[\begin{aligned} \sd[n_{ij}] = \sqrt{ n p_i q_j (1 - p_i q_j) } , \end{aligned}\]

… and so how about this \[\begin{aligned} \SE[n_{ij}] &\approx \sqrt{ n p_i q_j } \\ &= \sqrt{(\text{expected}_{ij})} \qquad \ldots? \end{aligned}\]
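The approximation drops the factor $\sqrt{1 - p_i q_j}$, which is close to 1 when each cell probability is small. A rough check for this table, assuming we plug in the observed margins as estimates of $p_i$ and $q_j$:

```r
haireye_2d <- HairEyeColor[, , "Male"] + HairEyeColor[, , "Female"]
n <- sum(haireye_2d)

# Plug-in estimates of the marginal probabilities (an assumption of this sketch).
p <- rowSums(haireye_2d) / n
q <- colSums(haireye_2d) / n
pq <- outer(p, q)

sd_exact <- sqrt(n * pq * (1 - pq))  # binomial standard deviation
sd_approx <- sqrt(n * pq)            # sqrt(expected)

# Ratio of exact to approximate: equal to sqrt(1 - p_i q_j), always below 1.
sd_ratio <- sd_exact / sd_approx
round(sd_ratio, 2)
```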

The chi-squared statistic

\[\begin{aligned} \chi^2 &= \sum_{ij} \frac{ \left((\text{observed})_{ij} - (\text{expected})_{ij} \right)^2 }{ (\text{expected})_{ij} } . \end{aligned}\]

i.e., “observed minus expected squared, divided by expected”.

This gives us a number. What does it mean?
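Before interpreting it, the number itself can be computed directly from the formula. A minimal sketch, compared with the statistic reported by `chisq.test()`:

```r
haireye_2d <- HairEyeColor[, , "Male"] + HairEyeColor[, , "Female"]

# Expected counts under independence.
expected <- outer(rowSums(haireye_2d), colSums(haireye_2d)) / sum(haireye_2d)

# "Observed minus expected squared, divided by expected", summed over cells.
chi2 <- sum((haireye_2d - expected)^2 / expected)
chi2

# Same value as the X-squared statistic from chisq.test().
unname(chisq.test(haireye_2d)$statistic)
```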

Chi-squared test for independence

A chi-squared test

chisq.test(haireye_2d)

## 
##  Pearson's Chi-squared test
## 
## data:  haireye_2d
## X-squared = 138, df = 9, p-value <2e-16

Um, ok? Hair and eye color are not independent?
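To see where `df = 9` and the p-value come from: for an $r \times c$ table the reference distribution is chi-squared with $(r-1)(c-1)$ degrees of freedom, and the p-value is its upper tail probability at the observed statistic. A sketch that reproduces both:

```r
haireye_2d <- HairEyeColor[, , "Male"] + HairEyeColor[, , "Female"]
test <- chisq.test(haireye_2d)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(haireye_2d) - 1) * (ncol(haireye_2d) - 1)

# Upper tail probability of the chi-squared distribution at the statistic.
p_value <- pchisq(unname(test$statistic), df = df, lower.tail = FALSE)
c(df = df, p_value = p_value)
```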

More context

Let’s actually look at “observed minus expected”:

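# Expected counts under independence: (row total) * (column total) / (grand total)
haireye_exp <- 0 * haireye_2d
haireye_exp[] <- ( rowSums(haireye_2d)[row(haireye_exp)]
                  * colSums(haireye_2d)[col(haireye_exp)]
                  / sum(haireye_2d) )
round(haireye_exp, 1)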

##        Eye
## Hair    Brown  Blue Hazel Green
##   Black  40.1  39.2  17.0  11.7
##   Brown 106.3 103.9  44.9  30.9
##   Red    26.4  25.8  11.2   7.7
##   Blond  47.2  46.1  20.0  13.7

Observed minus expected:

##        Eye
## Hair     Brown   Blue  Hazel  Green
##   Black  27.86 -19.22  -1.97  -6.68
##   Brown  12.72 -19.87   9.07  -1.92
##   Red    -0.39  -8.79   2.85   6.32
##   Blond -40.20  47.88  -9.95   2.27
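A table like the one above can be produced from the `observed` and `expected` components of the `chisq.test()` result. This is a sketch of one way to do it, not necessarily the code used for the slide.

```r
haireye_2d <- HairEyeColor[, , "Male"] + HairEyeColor[, , "Female"]
test <- chisq.test(haireye_2d)

# Observed minus expected, cell by cell.
obs_minus_exp <- test$observed - test$expected
round(obs_minus_exp, 2)
```

The differences sum to zero within every row and every column, since observed and expected counts share the same margins.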

Normalized by $\sqrt{\text{expected}}$:

##        Eye
## Hair     Brown   Blue  Hazel  Green
##   Black  4.398 -3.069 -0.477 -1.954
##   Brown  1.233 -1.949  1.353 -0.345
##   Red   -0.075 -1.730  0.852  2.283
##   Blond -5.851  7.050 -2.228  0.613
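These normalized differences are the Pearson residuals, which `chisq.test()` returns as `residuals`. A short sketch that computes them by hand and confirms that their squares add up to the chi-squared statistic:

```r
haireye_2d <- HairEyeColor[, , "Male"] + HairEyeColor[, , "Female"]
test <- chisq.test(haireye_2d)

# (observed - expected) / sqrt(expected), cell by cell.
resid_by_hand <- (test$observed - test$expected) / sqrt(test$expected)
round(resid_by_hand, 3)

# Matches the residuals component, and the squares sum to X-squared.
all.equal(resid_by_hand, test$residuals, check.attributes = FALSE)
sum(resid_by_hand^2)
```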

Conclusions?

What about by sex?

Compute the chi-squared statistic with chisq.test():

chisq.test(HairEyeColor[,,"Female"])

## Warning in chisq.test(HairEyeColor[, , "Female"]): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  HairEyeColor[, , "Female"]
## X-squared = 107, df = 9, p-value <2e-16
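The warning is worth a look: `chisq.test()` issues it when some expected counts are small (below 5), where the chi-squared approximation is less reliable. A sketch that finds the small cells and, as one possible alternative, gets a Monte Carlo p-value with `simulate.p.value = TRUE`; the seed and the number of replicates `B` are arbitrary choices made here.

```r
female <- HairEyeColor[, , "Female"]

# Expected counts for the female-only table (warning suppressed on purpose).
expected_f <- suppressWarnings(chisq.test(female))$expected
round(expected_f, 1)
sum(expected_f < 5)  # number of cells with small expected counts

# Monte Carlo p-value: simulates tables with the same margins under independence.
set.seed(1)
sim_test <- chisq.test(female, simulate.p.value = TRUE, B = 10000)
sim_test
```

The same steps applied to `HairEyeColor[, , "Male"]` give the male-only test.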
