\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Categorical data with brms

Peter Ralph

19 January 2021 – Advanced Biological Statistics

Categorical data

Hair and Eye color

data(HairEyeColor)
HairEyeColor             package:datasets              R Documentation

Hair and Eye Color of Statistics Students

Description:

     Distribution of hair and eye color and sex in 592 statistics
     students.

Usage:

     HairEyeColor
     
Format:

     A 3-dimensional array resulting from cross-tabulating 592
     observations on 3 variables.  The variables and their levels are
     as follows:

       No  Name  Levels                    
        1  Hair  Black, Brown, Red, Blond  
        2  Eye   Brown, Blue, Hazel, Green 
        3  Sex   Male, Female              
      
Details:

     The Hair x Eye table comes from a survey of students at the
     University of Delaware reported by Snee (1974).  The split by
     ‘Sex’ was added by Friendly (1992a) for didactic purposes.

     This data set is useful for illustrating various techniques for
     the analysis of contingency tables, such as the standard
     chi-squared test or, more generally, log-linear modelling, and
     graphical methods such as mosaic plots, sieve diagrams or
     association plots.

Source:

     <URL:
     http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas>

     Snee (1974) gives the two-way table aggregated over ‘Sex’.  The
     ‘Sex’ split of the ‘Brown hair, Brown eye’ cell was changed to
     agree with that used by Friendly (2000).

References:

     Snee, R. D. (1974).  Graphical display of two-way contingency
     tables.  _The American Statistician_, *28*, 9-12.  doi:
     10.2307/2683520 (URL: http://doi.org/10.2307/2683520).
## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
haireye <- as.data.frame(HairEyeColor)
names(haireye) <- tolower(names(haireye))
names(haireye)[names(haireye) == "freq"] <- "number"
haireye
##     hair   eye    sex number
## 1  Black Brown   Male     32
## 2  Brown Brown   Male     53
## 3    Red Brown   Male     10
## 4  Blond Brown   Male      3
## 5  Black  Blue   Male     11
## 6  Brown  Blue   Male     50
## 7    Red  Blue   Male     10
## 8  Blond  Blue   Male     30
## 9  Black Hazel   Male     10
## 10 Brown Hazel   Male     25
## 11   Red Hazel   Male      7
## 12 Blond Hazel   Male      5
## 13 Black Green   Male      3
## 14 Brown Green   Male     15
## 15   Red Green   Male      7
## 16 Blond Green   Male      8
## 17 Black Brown Female     36
## 18 Brown Brown Female     66
## 19   Red Brown Female     16
## 20 Blond Brown Female      4
## 21 Black  Blue Female      9
## 22 Brown  Blue Female     34
## 23   Red  Blue Female      7
## 24 Blond  Blue Female     64
## 25 Black Hazel Female      5
## 26 Brown Hazel Female     29
## 27   Red Hazel Female      7
## 28 Blond Hazel Female      5
## 29 Black Green Female      2
## 30 Brown Green Female     14
## 31   Red Green Female      7
## 32 Blond Green Female      8

Questions:

  1. Are hair and eye color independent in this sample?
  2. Do hair and eye color proportions differ by gender?

Poisson models for count data

We have for data \[\begin{aligned} n_{ij} = \text{( number of observations in categories $i$ and $j$ )} . \end{aligned}\] and \[\begin{aligned} n_{i \cdot} &= \text{( total number of observations in category $i$ )} , \\ n_{\cdot j} &= \text{( total number of observations in category $j$ )} , \\ n_{\cdot \cdot} &= \text{( total number of observations )} . \end{aligned}\]

Question: Is \(n_{i \cdot}\) fixed? What about \(n_{\cdot j}\)? Or \(n_{\cdot \cdot}\)?

Answer: Happily, it will turn out not to matter (for the model).

Poisson additivity

If we have Poisson-many things of two categories: \[\begin{aligned} A &\sim \Poisson(a) \\ B &\sim \Poisson(b) \end{aligned}\] then the total number of things is also Poisson: \[ A + B \sim \Poisson(a + b) \] and each thing chooses its type independently: \[ A \given A + B \sim \Binom\left(A+B, \frac{a}{a+b}\right) . \]

Poisson models for contingency tables

So, if we fit a model where

\[\begin{aligned} n_{ij} &\sim \Poisson(\lambda_{ij}) \end{aligned}\]

then \[\begin{aligned} \frac{\lambda_{ij}}{\sum_{k\ell} \lambda_{k\ell}} &= (\text{proportion in category $ij$}) \\ \frac{\lambda_{ij}}{\sum_{k} \lambda_{kj}} &= (\text{proportion of row $j$ in column $i$}) \\ \frac{\lambda_{ij}}{\lambda_{ij} + \lambda_{k\ell}} &= (\text{proportion of those in $k\ell$ or $ij$ that are $ij$}) \end{aligned}\]

For example, the proportion of black-haired people who have blue eyes is \[\begin{aligned} & \P\left(\text{ eye blue } \given \text{ hair black } \right) \\ &\qquad = \frac{ \lambda_{\text{blue},\text{black}} }{ \lambda_{\cdot,\text{black}} } \\ &\qquad = \frac{ \lambda_{\text{blue},\text{black}} }{ \lambda_{\text{brown},\text{black}} + \lambda_{\text{blue},\text{black}} + \lambda_{\text{hazel},\text{black}} + \lambda_{\text{green},\text{black}} } \end{aligned}\]

Question: What’s \[\begin{aligned} \P\left(\text{ male } \given \text{ hair black } \right) = ? \end{aligned}\]

Independence and multiplicativity

If hair and eye color are independent, then probabilities of combinations are multiplicative:

\[\begin{aligned} &\P\{\text{black hair and blue eyes}\} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\given\text{black hair}\} \\ \end{aligned}\]

which if independent is \[\begin{aligned} &\hphantom{\P\{\text{black hair and blue eyes}\}} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\} \end{aligned}\]

In shorthand, \[\begin{aligned} p_{ij} = p_i \times p_j . \end{aligned}\]

Additivity and independence

If we use a log link function, additivity corresponds to independence: \[\begin{aligned} p_{ij} &= \frac{ \lambda_{ij} }{ \sum_{ab} \lambda_{ab} } \\ &= \frac{ e^{\beta_i + \gamma_j} }{ \sum_{ab} e^{\beta_a + \gamma_b} } \\ &= \left( \frac{ e^{\beta_i} }{ \sum_{a} e^{\beta_a} } \right) \times \left( \frac{ e^{\gamma_j} }{ \sum_{b} e^{\gamma_b} } \right) \\ &= p_i \times p_j \end{aligned}\]

hair and eye colors, with brms

brms

he_formula <- brmsformula(number ~ sex + hair * eye + (hair + eye) * sex)

Flat priors

get_prior(he_formula, data=haireye, family='poisson')
##                   prior     class                coef group resp dpar nlpar bound       source
##                  (flat)         b                                                      default
##                  (flat)         b             eyeBlue                             (vectorized)
##                  (flat)         b            eyeGreen                             (vectorized)
##                  (flat)         b            eyeHazel                             (vectorized)
##                  (flat)         b           hairBlond                             (vectorized)
##                  (flat)         b   hairBlond:eyeBlue                             (vectorized)
##                  (flat)         b  hairBlond:eyeGreen                             (vectorized)
##                  (flat)         b  hairBlond:eyeHazel                             (vectorized)
##                  (flat)         b           hairBrown                             (vectorized)
##                  (flat)         b   hairBrown:eyeBlue                             (vectorized)
##                  (flat)         b  hairBrown:eyeGreen                             (vectorized)
##                  (flat)         b  hairBrown:eyeHazel                             (vectorized)
##                  (flat)         b             hairRed                             (vectorized)
##                  (flat)         b     hairRed:eyeBlue                             (vectorized)
##                  (flat)         b    hairRed:eyeGreen                             (vectorized)
##                  (flat)         b    hairRed:eyeHazel                             (vectorized)
##                  (flat)         b           sexFemale                             (vectorized)
##                  (flat)         b   sexFemale:eyeBlue                             (vectorized)
##                  (flat)         b  sexFemale:eyeGreen                             (vectorized)
##                  (flat)         b  sexFemale:eyeHazel                             (vectorized)
##                  (flat)         b sexFemale:hairBlond                             (vectorized)
##                  (flat)         b sexFemale:hairBrown                             (vectorized)
##                  (flat)         b   sexFemale:hairRed                             (vectorized)
##  student_t(3, 2.3, 2.5) Intercept                                                      default
he_priors <- c(prior('normal(0,3)', class='b'))

he_fit <- brm(he_formula,
              data=haireye,
              family=poisson(link='log'),
              prior=he_priors)
## Warning: partial match of 'dpar' to 'dpars'
## Compiling Stan program...
## Start sampling

summary(he_fit)
##  Family: poisson 
##   Links: mu = log 
## Formula: number ~ sex + hair * eye + (hair + eye) * sex 
##    Data: haireye (Number of observations: 32) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Population-Level Effects: 
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept               3.43      0.16     3.10     3.74 1.00     2530     2801
## sexFemale               0.09      0.21    -0.31     0.50 1.00     2835     3128
## hairBrown               0.53      0.19     0.16     0.91 1.00     2523     2640
## hairRed                -1.05      0.29    -1.62    -0.50 1.00     2589     2746
## hairBlond              -2.66      0.41    -3.49    -1.89 1.00     2562     2667
## eyeBlue                -0.97      0.27    -1.49    -0.45 1.00     2907     2882
## eyeHazel               -1.31      0.31    -1.94    -0.73 1.00     2949     2919
## eyeGreen               -2.27      0.44    -3.18    -1.48 1.00     2402     2466
## hairBrown:eyeBlue       0.83      0.28     0.28     1.40 1.00     2689     2600
## hairRed:eyeBlue         0.73      0.40    -0.08     1.51 1.00     3026     2953
## hairBlond:eyeBlue       3.73      0.44     2.92     4.64 1.00     2548     2240
## hairBrown:eyeHazel      0.67      0.33     0.05     1.33 1.00     2990     3269
## hairRed:eyeHazel        0.82      0.43    -0.02     1.69 1.00     3314     3239
## hairBlond:eyeHazel      1.70      0.56     0.64     2.81 1.00     2937     2369
## hairBrown:eyeGreen      1.07      0.46     0.21     2.04 1.00     2467     2804
## hairRed:eyeGreen        1.84      0.54     0.80     2.91 1.00     2557     2597
## hairBlond:eyeGreen      3.24      0.59     2.12     4.45 1.00     2282     2773
## sexFemale:hairBrown     0.13      0.23    -0.31     0.59 1.00     2512     2906
## sexFemale:hairRed       0.24      0.31    -0.37     0.83 1.00     2658     2739
## sexFemale:hairBlond     0.86      0.30     0.27     1.44 1.00     2547     2683
## sexFemale:eyeBlue      -0.41      0.22    -0.85     0.03 1.00     3983     3291
## sexFemale:eyeHazel     -0.32      0.25    -0.82     0.16 1.00     4442     3327
## sexFemale:eyeGreen     -0.48      0.29    -1.06     0.09 1.00     4119     3035
## 
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

conditional_effects(he_fit, effects="hair:eye")

plot of chunk r brm_results

But, that was numbers: here’s the credible intervals for the proportions of each eye color for each hair color: plot of chunk r brm_pred

Recall the “reference” levels are hair = Black and eye = Brown plot of chunk r coef_plot

Back to the questions

Questions:

  1. Are hair and eye color independent in this sample?
  2. Do hair and eye color proportions differ by gender?

Are hair and eye color independent in this sample?

(What do we mean: given gender? averaged across gender?)

Recall the “reference” levels are hair = Black and eye = Brown plot of chunk r bysexq

Eye color proportions by hair type: plot of chunk r brm_pred2

Hair color proportions by eye type: plot of chunk r brm_pred3

Do hair and eye color proportions differ by gender?

plot of chunk r bysex1

plot of chunk r bysex2

Testing our method

What we’ve done so far:

  1. Fit the model,
  2. observed CIs for interactions do not overlap zero
  3. concluded good statistical support for nonindependence.

Question: What would we have seen if hair and eye color were independent?

Testing

Let’s simulate some test data, under independence.

sim_data <- function () {
    n <- sum(haireye$number)
    p_hair <- tapply(haireye$number, haireye$hair, sum) / n
    p_eye <- tapply(haireye$number, haireye$eye, sum) / n
    p_sex <- tapply(haireye$number, haireye$sex, sum) / n
    p <- (p_hair[haireye$hair] * 
          p_eye[haireye$eye] * 
          p_sex[haireye$sex])
    stopifnot(sum(p) == 1)
    number <- rmultinom(1, size=n, prob=p)
    return(number)
}
sim_fits <- lapply(1:5, function (k) {
                       # replaces the data only locally
                       haireye$number <- sim_data()
                       update(he_fit, newdata=haireye) })
## Start sampling
## Start sampling
## Start sampling
## Start sampling
## Start sampling

plot of chunk r plot_sims

plot of chunk r plot_sims2

// reveal.js plugins