\[
%%
% Add your macros here; they'll be included in pdf and html output.
%%
\newcommand{\R}{\mathbb{R}} % reals
\newcommand{\E}{\mathbb{E}} % expectation
\renewcommand{\P}{\mathbb{P}} % probability
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\logistic}{logistic}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\cor}{cor}
\DeclareMathOperator{\Normal}{Normal}
\DeclareMathOperator{\LogNormal}{logNormal}
\DeclareMathOperator{\Poisson}{Poisson}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Binom}{Binomial}
\DeclareMathOperator{\Gam}{Gamma}
\DeclareMathOperator{\Exp}{Exponential}
\DeclareMathOperator{\Cauchy}{Cauchy}
\DeclareMathOperator{\Unif}{Unif}
\DeclareMathOperator{\Dirichlet}{Dirichlet}
\DeclareMathOperator{\Wishart}{Wishart}
\DeclareMathOperator{\StudentsT}{StudentsT}
\DeclareMathOperator{\Weibull}{Weibull}
\newcommand{\given}{\;\vert\;}
\]
Homework, week 7: Make up some data.
Assignment: I would like you to make up some
data: that is, come up with a model and a story, simulate data from
it, and provide the story, the data, and the question. Everyone should
do this individually (but, as usual, we encourage working and talking
with others). The next step will be to meet as a group
and choose one of your fake datasets, which I will pass to another group
to analyze.
For this homework, you should submit the following things:
- A short document (1-2 paragraphs) describing how the data were
(hypothetically) collected, and posing the problem that you would like
another group to solve.
- The dataset, as a csv file, with informative column names.
- An R script that exactly recreates the dataset (at the top, you
should use set.seed( ) to make the randomness always the
same), and fits the model you have in mind (to verify that the questions
are answerable from the data provided).
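As a rough illustration of the reproducibility requirement, here is a minimal sketch of how such a script might begin. Everything specific in it is an assumption made up for the example: the plant-growth story, the seed value, the column names, the file name plants.csv, and the helper make_data() are all hypothetical.

```r
# Hypothetical example: every name and number here is made up for illustration.
set.seed(123)  # fix the random number stream so the script is reproducible

make_data <- function(n = 150) {
  sunlight_hours <- runif(n, min = 2, max = 12)
  soil_ph <- rnorm(n, mean = 6.5, sd = 0.5)
  # height depends linearly on sunlight (not on soil pH), plus noise
  height_cm <- rnorm(n, mean = 10 + 3 * sunlight_hours, sd = 4)
  data.frame(sunlight_hours = sunlight_hours,
             soil_ph = soil_ph,
             height_cm = height_cm)
}

plants <- make_data()

# Re-seeding and re-running gives exactly the same data.
set.seed(123)
plants_again <- make_data()
stopifnot(identical(plants, plants_again))

# Save with informative column names and no extra row-name column.
write.csv(plants, "plants.csv", row.names = FALSE)
```

Calling set.seed( ) once at the top is enough, provided the random draws that follow always happen in the same order.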
You do not need to simulate the data from a model we have used in
class, but you should check that the model you have in mind for
analyzing the data (probably a GLM) does what you expect.
Next week, others (the “analysis” team) will get the description and
the dataset (the first two items above), but the R script (the third
item) is only for the instructors.
Here are some further requirements.
- The question you pose should be solvable by fitting a model
that we have learned in this class. There should be at least three
variables in the dataset, and at least 100 observations.
- Try to make it engaging and/or fun - silly situations are ok!
- But, it should still be “realistic” - the “analysis” team will be
encouraged to find any impossibilities, such as negative weights, or
impossible measurements.
- Also, try to keep it simple - the question should be answered by
doing one analysis, not multiple, dependent steps.
- To ensure that the question is answerable using the data provided
(that is, that there is enough data and it is not too noisy), you should
actually fit the model you have in mind.
- The description should not include any statistical details (e.g.,
don’t say that the response is Poisson distributed or that it is a
linear function of the explanatory variables).
- Please include at least one red herring, such as a
potentially explanatory variable that doesn’t affect the response at all
or a few “extreme outlier” values (as from measurement error).
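Several of the requirements above can be checked or built directly in the simulation script. The sketch below is one possible way to do it, under invented assumptions: the ice cream story, the variable names, and all numeric values are hypothetical. It adds a red herring variable that plays no part in generating the response, introduces a few data-entry style outliers, and then checks the dataset size and looks for impossible values.

```r
# Hypothetical example: all names and numbers are invented for illustration.
set.seed(42)
n <- 200

# Daily ice cream sales at a (made-up) beach stand.
temperature_c <- runif(n, min = 15, max = 35)
# Red herring: recorded alongside the rest, but never used to generate sales.
shirt_color <- sample(c("red", "blue", "green"), n, replace = TRUE)
cones_sold <- rpois(n, lambda = exp(1 + 0.08 * temperature_c))

# A few "measurement errors": an extra zero typed on three random days.
bad_days <- sample(n, 3)
cones_sold[bad_days] <- cones_sold[bad_days] * 10

sales <- data.frame(temperature_c, shirt_color, cones_sold)

# Sanity checks: enough data, and no impossible or missing values.
stopifnot(nrow(sales) >= 100, ncol(sales) >= 3)
stopifnot(all(sales$cones_sold >= 0), !anyNA(sales))
summary(sales)
```

Looking over summary( ) of the final data frame is a quick way to spot values the “analysis” team might flag as unrealistic.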
The model you simulate from should be a GLM, or similar to one: so,
you should simulate some predictor variables (which can be
correlated), then use the predictor variables to simulate the
response variable, and then the question should have to do with
how the predictor variables determine/affect/are correlated with the
response. (However, your question should not use the phrases
“response variable” and “predictor variable”: state the question in
real-world terms, not in statistics jargon!)
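The recipe in the paragraph above (simulate predictors, then the response, then fit) might look like the following sketch. It assumes a made-up story about exam results with a binary outcome and a logistic link; the variable names and the coefficients b0, b_study, and b_sleep are hypothetical choices, not part of the assignment.

```r
# Hypothetical example: the story, names, and coefficients are all made up.
set.seed(7)
n <- 300

# Two correlated predictors: students who study more tend to sleep less.
study_hours <- runif(n, min = 0, max = 10)
sleep_hours <- 8 - 0.3 * (study_hours - 5) + rnorm(n, sd = 1)

# Binary response (passed the exam or not), generated through a logistic link.
b0 <- -6
b_study <- 0.8
b_sleep <- 0.3
p_pass <- plogis(b0 + b_study * study_hours + b_sleep * sleep_hours)
passed <- rbinom(n, size = 1, prob = p_pass)

exam <- data.frame(study_hours, sleep_hours, passed)

# Fit the model you have in mind and compare the estimates to the truth.
fit <- glm(passed ~ study_hours + sleep_hours, family = binomial, data = exam)
print(cbind(truth = c(b0, b_study, b_sleep), estimate = coef(fit)))
print(confint.default(fit))  # Wald confidence intervals
```

If the intervals are very wide or miss the true values badly, the question is probably not answerable from the data, and a larger sample or less noise may be needed.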
Due: Submit your work via Canvas by
5pm (so I have time to read it) on Monday,
November 21st.