
Homework, week 7: Make up some data.

Assignment: I would like you to make up some data - i.e., come up with a model and a story, simulate data from it, and provide the story, the data, and the question. Everyone should do this individually (but as usual, we encourage working and talking with others), and then the next step will be to meet as a group to choose one of your fake data sets that I will pass to another group to analyze.

For this homework, you should submit the following things:

  1. A short document (1-2 paragraphs) describing how the data were (hypothetically) collected, and posing the problem that you would like another group to solve.
  2. The dataset, as a csv file, with informative column names.
  3. An R script that exactly recreates the dataset (at the top, you should use set.seed( ) to make the randomness always the same), and fits the model you have in mind (to verify that the questions are answerable from the data provided).
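As a rough illustration of item 3, here is a minimal sketch of what such a script might look like. The ice cream story, the file name `icecream.csv`, the column names, and every parameter value are made up for this example; your own script will differ.

```r
# Hypothetical example: the story, names, and numbers are all invented.
set.seed(123)  # fix the random number stream so the data are the same on every run

n <- 200  # number of observations (the assignment asks for at least 100)

# Simulate the explanatory variables.
temperature_c <- round(runif(n, min = 10, max = 35), 1)       # daily high, degrees C
shop <- sample(c("north", "south", "east"), n, replace = TRUE)  # has no effect on sales

# Simulate the response from the explanatory variables:
# a count whose mean increases with temperature.
mean_sales <- exp(1.5 + 0.08 * temperature_c)
cones_sold <- rpois(n, lambda = mean_sales)

icecream <- data.frame(
  temperature_c = temperature_c,
  shop = shop,
  cones_sold = cones_sold
)

# Write the dataset with informative column names.
out_file <- "icecream.csv"
write.csv(icecream, out_file, row.names = FALSE)

# Fit the model you have in mind, to check the question is answerable.
fit <- glm(cones_sold ~ temperature_c + shop, family = poisson, data = icecream)
print(summary(fit)$coefficients)
```

Running the script twice should produce identical csv files; if it does not, check that `set.seed( )` comes before any call that uses random numbers.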

You do not need to simulate the data from a model we have used in class, but you should check that the model (probably, a GLM) you have in mind to analyse the data does what you expect.

Next week, others (the “analysis” team) will get (1) and (2), but (3) is only for the instructors.

Here are some further requirements.

  1. The question you pose should be solvable by fitting a model that we have learned in this class. There should be at least three variables in the dataset, and at least 100 observations.
  2. Try to make it engaging and/or fun - silly situations are ok!
  3. But, it should still be “realistic” - the “analysis” team will be encouraged to find any impossibilities, such as negative weights, or impossible measurements.
  4. Also, try to keep it simple - the question should be answered by doing one analysis, not multiple, dependent steps.
  5. To ensure that the question is answerable using the data provided (that is, there is enough data and it’s not too noisy), you should actually fit the model you have in mind.
  6. The description should not include any statistical details (e.g., don’t say that the response is Poisson distributed or that it is a linear function of the explanatory variables).
  7. Please include at least one red herring, such as a potentially explanatory variable that doesn’t affect the response at all or a few “extreme outlier” values (as from measurement error).
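Several of the requirements above can be checked in code. The following is only a sketch under made-up assumptions: the dog story, the variable names, the parameter values, and the choice to fake "measurement error" by multiplying three weights by 10 are all hypothetical.

```r
# Hypothetical example: the story, names, and numbers are all invented.
set.seed(2022)
n <- 120

dog_weight_kg <- rlnorm(n, meanlog = log(15), sdlog = 0.4)  # log-normal, so always positive
walks_per_day <- rpois(n, lambda = 2)
collar_colour <- sample(c("red", "blue", "green"), n, replace = TRUE)  # red herring: unused below

# The response depends on walks and weight, but not on collar colour.
naps_per_day <- rpois(n, lambda = exp(1.2 - 0.25 * walks_per_day + 0.02 * dog_weight_kg))

# A second red herring: a few extreme outliers, as if from a data entry slip.
bad <- sample(n, 3)
dog_weight_kg[bad] <- dog_weight_kg[bad] * 10

dogs <- data.frame(dog_weight_kg, walks_per_day, collar_colour, naps_per_day)

# Checks for size and for impossible values; stopifnot() halts with an error if any fails.
stopifnot(nrow(dogs) >= 100, ncol(dogs) >= 3)
stopifnot(all(dogs$dog_weight_kg > 0), all(dogs$naps_per_day >= 0))

# Fit the intended model to confirm the question is answerable despite the red herrings.
fit <- glm(naps_per_day ~ walks_per_day + dog_weight_kg + collar_colour,
           family = poisson, data = dogs)
print(summary(fit)$coefficients)
```

If the estimate for the effect you care about is swamped by noise, consider simulating more observations or reducing the noise before handing the data over.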

The model you simulate from should be a GLM, or similar to one: so, you should simulate some predictor variables (which can be correlated), then use the predictor variables to simulate the response variable, and then the question should have to do with how the predictor variables determine/affect/are correlated with the response. (However, your question should not use the phrases “response variable” and “predictor variable”: state the question in real-world terms, not in statistics jargon!)
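The recipe in the previous paragraph (simulate predictors, possibly correlated, then simulate the response from them) might be sketched as follows. The kite story, the variable names, and all parameter values are invented for illustration, and building the correlation from two shared standard normal draws is just one of several ways to do it.

```r
# Hypothetical example: the story, names, and numbers are all invented.
set.seed(42)
n <- 300
rho <- 0.6  # correlation between the two underlying normal variables

# Two correlated standard normals built from independent ones.
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Transform to realistic scales; exp() keeps wind speed positive.
temperature_c <- 20 + 5 * z1
wind_kph <- exp(2 + 0.5 * z2)

# Simulate the response from the predictors: whether the kite stayed up.
# Only wind matters here; temperature is correlated with wind but has no direct effect.
p_flew <- plogis(-2 + 0.25 * wind_kph)
flew <- rbinom(n, size = 1, prob = p_flew)

kites <- data.frame(temperature_c, wind_kph, flew)

# Check that the intended model recovers something sensible.
fit <- glm(flew ~ wind_kph + temperature_c, family = binomial, data = kites)
print(cor(kites$temperature_c, kites$wind_kph))
print(summary(fit)$coefficients)
```

A question for this dataset might then be phrased as "what makes a kite more likely to stay up?", with no mention of the model behind it.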

Due: Submit your work via Canvas by 5pm on Monday, November 21st (so I have time to read the submissions).