
Homework 6: Youth smoking rates

Assignment: Your task is to use Rmarkdown to write a short report, readable by a technically literate person. The code you used should not be visible in the final report (unless you have a good reason to show it).

Due: Submit your work via Canvas by the end of the day (midnight) on Tuesday, November 15th. Please submit both the Rmd file and the resulting html or pdf file. You can work with other members of the class, but I expect each of you to construct and run all of the scripts yourself.

The Problem

For this assignment, you’ll be using a subset of data from the CDC National Youth Tobacco Survey: yts.csv. These data describe the prevalence of cigarette use among middle and high school students, by state or US territory, from 2000 to 2017. The methodology and detailed descriptions of the data can be found here. The data were in fact collected using a two-stage stratified survey design, but for the purposes of this assignment, let’s assume (somewhat unrealistically) that the data were produced by taking a random sample of all students in the state in that year. For instance, the first row is

  YEAR   state Gender     Education Number_Yes Sample_Size
  2000 Alabama Female   High School        227         816

… and so assume that the CDC surveyed 816 randomly chosen female high school students in Alabama in 2000, of whom 227 reported that they currently smoke cigarettes.
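To make the random-sample assumption concrete, here is a minimal sketch in base R of what that first row implies on its own: an observed proportion and a rough binomial uncertainty. The variable names are illustrative, and the normal-approximation interval is just one simple choice, not part of the assignment.

```r
# First row of the table above: Alabama, female, high school, 2000.
number_yes  <- 227
sample_size <- 816

# Observed proportion of surveyed students reporting current cigarette use.
p_hat <- number_yes / sample_size

# Standard error of a binomial proportion. This is only appropriate under
# the simple-random-sample assumption, not the real two-stage design.
se <- sqrt(p_hat * (1 - p_hat) / sample_size)

# Approximate 95% interval using the normal approximation.
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se

print(round(c(estimate = p_hat, lower = ci[1], upper = ci[2]), 3))
```

Under the actual stratified design the uncertainty would generally differ from this, which is why the assumption is described as somewhat unrealistic.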

Please write your report to answer the following questions:

  1. How many surveys do we have, from what states and which years? Are there any strange or alarming features of the data?

  2. How much does cigarette use differ by state, education level, gender, and year? Is there evidence of a decline (or increase) in use over time? To answer these questions, fit a binomial model with brms:

     brm(Number_Yes | trials(Sample_Size) ~ Gender + Education + YEAR + (1 | state),
         data = d, family = binomial(link = 'logit'))

    Explain the model, and use the results to provide quantitative estimates of how much cigarette use differs by state, education level, gender, and year.

  3. Use the fitted model to predict the percent of female high school students that would report cigarette use in 2019 in Utah and in New Jersey. Provide estimates of uncertainty on your prediction.
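Because the model above uses a logit link, its coefficients are differences in log-odds, so they need to be converted before they can be reported as odds ratios or percentages. The following is a minimal sketch of that conversion in base R. The numbers are hypothetical placeholders, not estimates from yts.csv, and the variable names are illustrative only.

```r
# Hypothetical logit-scale values, NOT estimates from the fitted model.
baseline_logit <- qlogis(0.25)  # log-odds for a group with 25% prevalence
year_slope     <- -0.05         # made-up change in log-odds per year

# A logit-scale coefficient exponentiates to an odds ratio:
# here, the multiplicative change in the odds of smoking per year.
odds_ratio_per_year <- exp(year_slope)

# To get back to a probability, add effects on the logit scale first,
# then apply the inverse logit (plogis in base R).
p_after_10 <- plogis(baseline_logit + 10 * year_slope)

print(round(c(odds_ratio = odds_ratio_per_year,
              p_start    = plogis(baseline_logit),
              p_after_10 = p_after_10), 3))
```

The same idea applies to predictions: the state intercept, gender, education, and year terms are summed on the logit scale for each posterior draw and then transformed, so the uncertainty in the coefficients carries through to the predicted percentage.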

As always, please write a readable report, as might be read by someone working on youth smoking policy, for instance. Please do not include code, or (for instance) copy-paste the questions above into numbered headers.