\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\MVN}{MVN} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Categorical data with brms

Peter Ralph

Advanced Biological Statistics

Categorical data

Hair and Eye color

data(HairEyeColor)
HairEyeColor             package:datasets              R Documentation

Hair and Eye Color of Statistics Students

Description:

     Distribution of hair and eye color and sex in 592 statistics
     students.

Usage:

     HairEyeColor
     
Format:

     A 3-dimensional array resulting from cross-tabulating 592
     observations on 3 variables.  The variables and their levels are
     as follows:

       No  Name  Levels                    
        1  Hair  Black, Brown, Red, Blond  
        2  Eye   Brown, Blue, Hazel, Green 
        3  Sex   Male, Female              
      
Details:

     The Hair x Eye table comes from a survey of students at the
     University of Delaware reported by Snee (1974).  The split by
     ‘Sex’ was added by Friendly (1992a) for didactic purposes.

     This data set is useful for illustrating various techniques for
     the analysis of contingency tables, such as the standard
     chi-squared test or, more generally, log-linear modelling, and
     graphical methods such as mosaic plots, sieve diagrams or
     association plots.

Source:

     <URL:
     http://euclid.psych.yorku.ca/ftp/sas/vcd/catdata/haireye.sas>

     Snee (1974) gives the two-way table aggregated over ‘Sex’.  The
     ‘Sex’ split of the ‘Brown hair, Brown eye’ cell was changed to
     agree with that used by Friendly (2000).

References:

     Snee, R. D. (1974).  Graphical display of two-way contingency
     tables.  _The American Statistician_, *28*, 9-12.  doi:
     10.2307/2683520 (URL: http://doi.org/10.2307/2683520).
## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
haireye <- as.data.frame(HairEyeColor)
names(haireye) <- tolower(names(haireye))
names(haireye)[names(haireye) == "freq"] <- "number"
haireye
##     hair   eye    sex number
## 1  Black Brown   Male     32
## 2  Brown Brown   Male     53
## 3    Red Brown   Male     10
## 4  Blond Brown   Male      3
## 5  Black  Blue   Male     11
## 6  Brown  Blue   Male     50
## 7    Red  Blue   Male     10
## 8  Blond  Blue   Male     30
## 9  Black Hazel   Male     10
## 10 Brown Hazel   Male     25
## 11   Red Hazel   Male      7
## 12 Blond Hazel   Male      5
## 13 Black Green   Male      3
## 14 Brown Green   Male     15
## 15   Red Green   Male      7
## 16 Blond Green   Male      8
## 17 Black Brown Female     36
## 18 Brown Brown Female     66
## 19   Red Brown Female     16
## 20 Blond Brown Female      4
## 21 Black  Blue Female      9
## 22 Brown  Blue Female     34
## 23   Red  Blue Female      7
## 24 Blond  Blue Female     64
## 25 Black Hazel Female      5
## 26 Brown Hazel Female     29
## 27   Red Hazel Female      7
## 28 Blond Hazel Female      5
## 29 Black Green Female      2
## 30 Brown Green Female     14
## 31   Red Green Female      7
## 32 Blond Green Female      8

Questions:

  1. Are hair and eye color independent in this sample?
  2. Do hair and eye color proportions differ by gender?

Poisson models for count data

We have for data \[\begin{aligned} n_{ij} = \text{( number of observations in categories $i$ and $j$ )} . \end{aligned}\] and \[\begin{aligned} n_{i \cdot} &= \text{( total number of observations in category $i$ )} , \\ n_{\cdot j} &= \text{( total number of observations in category $j$ )} , \\ n_{\cdot \cdot} &= \text{( total number of observations )} . \end{aligned}\]

Question: Is \(n_{i \cdot}\) fixed? What about \(n_{\cdot j}\)? Or \(n_{\cdot \cdot}\)?

Answer: Happily, it will turn out not to matter (for the model).

Poisson additivity

If we have Poisson-many things of two categories: \[\begin{aligned} A &\sim \Poisson(a) \\ B &\sim \Poisson(b) \end{aligned}\] then the total number of things is also Poisson: \[ A + B \sim \Poisson(a + b) \] and each thing chooses its type independently: \[ A \given A + B \sim \Binom\left(A+B, \frac{a}{a+b}\right) . \]

Poisson models for contingency tables

So, if we fit a model where

\[\begin{aligned} n_{ij} &\sim \Poisson(\lambda_{ij}) \end{aligned}\]

then \[\begin{aligned} \frac{\lambda_{ij}}{\sum_{k\ell} \lambda_{k\ell}} &= (\text{proportion in category $ij$}) \\ \frac{\lambda_{ij}}{\sum_{k} \lambda_{kj}} &= (\text{proportion of row $j$ in column $i$}) \\ \frac{\lambda_{ij}}{\lambda_{ij} + \lambda_{k\ell}} &= (\text{proportion of those in $k\ell$ or $ij$ that are $ij$}) \end{aligned}\]

For example, the proportion of black-haired people who have blue eyes is \[\begin{aligned} & \P\left(\text{ eye blue } \given \text{ hair black } \right) \\ &\qquad = \frac{ \lambda_{\text{blue},\text{black}} }{ \lambda_{\cdot,\text{black}} } \\ &\qquad = \frac{ \lambda_{\text{blue},\text{black}} }{ \lambda_{\text{brown},\text{black}} + \lambda_{\text{blue},\text{black}} + \lambda_{\text{hazel},\text{black}} + \lambda_{\text{green},\text{black}} } \end{aligned}\]

Question: What’s \[\begin{aligned} \P\left(\text{ male } \given \text{ hair black } \right) = ? \end{aligned}\]

Independence and multiplicativity

If hair and eye color are independent, then probabilities of combinations are multiplicative:

\[\begin{aligned} &\P\{\text{black hair and blue eyes}\} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\given\text{black hair}\} \\ \end{aligned}\]

which if independent is \[\begin{aligned} &\hphantom{\P\{\text{black hair and blue eyes}\}} \\ &\qquad = \P\{\text{black hair}\} \times \P\{\text{blue eyes}\} \end{aligned}\]

In shorthand, \[\begin{aligned} p_{ij} = p_i \times p_j . \end{aligned}\]

Additivity and independence

If we use a log link function, additivity corresponds to independence: \[\begin{aligned} p_{ij} &= \frac{ \lambda_{ij} }{ \sum_{ab} \lambda_{ab} } \\ &= \frac{ e^{\beta_i + \gamma_j} }{ \sum_{ab} e^{\beta_a + \gamma_b} } \\ &= \left( \frac{ e^{\beta_i} }{ \sum_{a} e^{\beta_a} } \right) \times \left( \frac{ e^{\gamma_j} }{ \sum_{b} e^{\gamma_b} } \right) \\ &= p_i \times p_j \end{aligned}\]

hair and eye colors, with brms

brms

he_formula <- brmsformula(number ~ sex + hair * eye + (hair + eye) * sex)

Flat priors

get_prior(he_formula, data=haireye, family='poisson')
##                   prior     class                coef group resp dpar nlpar bound       source
##                  (flat)         b                                                      default
##                  (flat)         b             eyeBlue                             (vectorized)
##                  (flat)         b            eyeGreen                             (vectorized)
##                  (flat)         b            eyeHazel                             (vectorized)
##                  (flat)         b           hairBlond                             (vectorized)
##                  (flat)         b   hairBlond:eyeBlue                             (vectorized)
##                  (flat)         b  hairBlond:eyeGreen                             (vectorized)
##                  (flat)         b  hairBlond:eyeHazel                             (vectorized)
##                  (flat)         b           hairBrown                             (vectorized)
##                  (flat)         b   hairBrown:eyeBlue                             (vectorized)
##                  (flat)         b  hairBrown:eyeGreen                             (vectorized)
##                  (flat)         b  hairBrown:eyeHazel                             (vectorized)
##                  (flat)         b             hairRed                             (vectorized)
##                  (flat)         b     hairRed:eyeBlue                             (vectorized)
##                  (flat)         b    hairRed:eyeGreen                             (vectorized)
##                  (flat)         b    hairRed:eyeHazel                             (vectorized)
##                  (flat)         b           sexFemale                             (vectorized)
##                  (flat)         b   sexFemale:eyeBlue                             (vectorized)
##                  (flat)         b  sexFemale:eyeGreen                             (vectorized)
##                  (flat)         b  sexFemale:eyeHazel                             (vectorized)
##                  (flat)         b sexFemale:hairBlond                             (vectorized)
##                  (flat)         b sexFemale:hairBrown                             (vectorized)
##                  (flat)         b   sexFemale:hairRed                             (vectorized)
##  student_t(3, 2.3, 2.5) Intercept                                                      default
he_priors <- c(prior('normal(0,3)', class='b'))

he_fit <- brm(he_formula,
              data=haireye,
              family=poisson(link='log'),
              prior=he_priors)
## Compiling Stan program...
## Start sampling

summary(he_fit)
##  Family: poisson 
##   Links: mu = log 
## Formula: number ~ sex + hair * eye + (hair + eye) * sex 
##    Data: haireye (Number of observations: 32) 
##   Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup draws = 4000
## 
## Population-Level Effects: 
##                     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept               3.43      0.16     3.10     3.73 1.00     2622     2571
## sexFemale               0.09      0.21    -0.31     0.51 1.00     2839     2813
## hairBrown               0.53      0.19     0.16     0.92 1.00     2763     2771
## hairRed                -1.05      0.29    -1.61    -0.49 1.00     2979     2692
## hairBlond              -2.66      0.41    -3.50    -1.89 1.00     2640     2622
## eyeBlue                -0.98      0.26    -1.49    -0.48 1.00     2715     2792
## eyeHazel               -1.31      0.30    -1.92    -0.73 1.00     2838     2461
## eyeGreen               -2.27      0.45    -3.22    -1.44 1.00     2857     2518
## hairBrown:eyeBlue       0.83      0.28     0.29     1.37 1.00     2705     2685
## hairRed:eyeBlue         0.74      0.39    -0.05     1.50 1.00     3150     2995
## hairBlond:eyeBlue       3.75      0.44     2.93     4.62 1.00     2235     2571
## hairBrown:eyeHazel      0.67      0.33     0.04     1.32 1.00     2898     2679
## hairRed:eyeHazel        0.82      0.43    -0.04     1.65 1.00     3124     2989
## hairBlond:eyeHazel      1.71      0.53     0.66     2.75 1.00     3181     3042
## hairBrown:eyeGreen      1.07      0.47     0.22     2.06 1.00     2940     2437
## hairRed:eyeGreen        1.84      0.54     0.83     2.95 1.00     3156     3031
## hairBlond:eyeGreen      3.24      0.61     2.08     4.48 1.00     2626     2792
## sexFemale:hairBrown     0.14      0.23    -0.32     0.59 1.00     2581     2875
## sexFemale:hairRed       0.24      0.32    -0.39     0.87 1.00     3021     2822
## sexFemale:hairBlond     0.86      0.30     0.28     1.44 1.00     2652     2858
## sexFemale:eyeBlue      -0.41      0.22    -0.85     0.01 1.00     4118     3467
## sexFemale:eyeHazel     -0.32      0.25    -0.81     0.14 1.00     5029     3364
## sexFemale:eyeGreen     -0.48      0.30    -1.08     0.11 1.00     4669     2986
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

conditional_effects(he_fit, effects="hair:eye")

plot of chunk r brm_results

But, that was numbers: here’s the credible intervals for the proportions of each eye color for each hair color: plot of chunk r brm_pred

Recall the “reference” levels are hair = Black and eye = Brown plot of chunk r coef_plot

Back to the questions

Questions:

  1. Are hair and eye color independent in this sample?
  2. Do hair and eye color proportions differ by gender?

Are hair and eye color independent in this sample?

(What do we mean: given gender? averaged across gender?)

Recall the “reference” levels are hair = Black and eye = Brown plot of chunk r bysexq

Eye color proportions by hair type: plot of chunk r brm_pred2

Hair color proportions by eye type: plot of chunk r brm_pred3

Do hair and eye color proportions differ by gender?

plot of chunk r bysex1

plot of chunk r bysex2

Testing our method

What we’ve done so far:

  1. Fit the model,
  2. observed CIs for interactions do not overlap zero
  3. concluded good statistical support for nonindependence.

Question: What would we have seen if hair and eye color were independent?

Testing

Let’s simulate some test data, under independence.

sim_data <- function () {
    n <- sum(haireye$number)
    p_hair <- tapply(haireye$number, haireye$hair, sum) / n
    p_eye <- tapply(haireye$number, haireye$eye, sum) / n
    p_sex <- tapply(haireye$number, haireye$sex, sum) / n
    p <- (p_hair[haireye$hair] * 
          p_eye[haireye$eye] * 
          p_sex[haireye$sex])
    stopifnot(sum(p) == 1)
    number <- rmultinom(1, size=n, prob=p)
    return(number)
}
sim_fits <- lapply(1:5, function (k) {
                       # replaces the data only locally
                       haireye$number <- sim_data()
                       update(he_fit, newdata=haireye) })
## Start sampling
## Start sampling
## Start sampling
## Start sampling
## Start sampling

plot of chunk r plot_sims

plot of chunk r plot_sims2

// reveal.js plugins