
Homework, week 14: Simulation challenge

Assignment: You should analyze one of the student-created datasets described below. As usual, your task is to use Rmarkdown to write a short report, readable by a technically literate person. The code you used should not be visible in the final report (unless you have a good reason to show it). You will have time to discuss this in groups (the same groups as before), but you should write up the report yourself (in your own words).

You can use the following function to find out which dataset to analyze, where g is your group number:

f <- function (g) {1 + (g %% 6)}

For instance, if you are in group 2, then f(2) = 3, so you should analyze the third dataset below.
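As a quick check of the rule above, the following sketch tabulates the assignment for every group, assuming group numbers run from 1 to 6 to match the six datasets listed below. The names `groups` and `assignment` are illustrative only.

```r
# Same function as given in the assignment text.
f <- function (g) {1 + (g %% 6)}

# Assumption: group numbers run from 1 to 6.
groups <- 1:6

# Map each group number to the dataset it should analyze.
assignment <- data.frame(group = groups, dataset = sapply(groups, f))
print(assignment)
```

Under that assumption, group 6 wraps around to the first dataset, and no group is assigned the dataset listed under its own number.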

Due: Submit your work via Canvas by the end of the day (midnight) on Tuesday, February 9th. Please submit both the Rmd file and the resulting html or pdf file. You will also give a brief presentation on the results in class on Thursday, February 11th.

Group 1: Coyote and roadrunner

Dataset: coyote_roadrunner_data.csv

A recent surge in accidental coyote deaths has been attributed to roadrunner hijinks in Arizona. Researchers are concerned about the possible impact this unprecedented decline in coyote population will have on the local ecosystem, and have turned to an unconventional source for insights: the 1949 cartoon “Roadrunner and Coyote”.

We are interested in modeling the survival time of Coyote in the cartoon “Roadrunner and Coyote”. We sampled 500 episodes and tracked the time it took for Coyote to die. We also took note of the number of cliffs present, the average Roadrunner speed, the number of traps set by Coyote, the presence of other Looney Tunes characters, and the presence of anvils. Each row corresponds to one observed episode. We want to know which factors are significantly correlated with Coyote’s survival time.

Group 2: Dunder Mifflin

Dataset: Dunder_Mifflin_Sales_Revenue.csv

Dunder Mifflin is a paper company with 7 branches in Scranton, Pennsylvania. David Wallace, the company’s CFO, is interested in evaluating each branch’s success by looking at its monthly revenue over the last five years. This information was collated by the Head of the Accounting Department, Angela, after the Scranton branches had undergone a number of management changes. In particular, David wants to figure out which branch manager had the highest total revenue for the last five years. In addition, David knows that all of the branches enjoy holding copious numbers of meetings with the entire branch, and he questions how much these meetings help the company increase profits. David also remembers that Dunder Mifflin has tried various sales initiatives to increase profit, but he can’t recall whether these were successful. Finally, David knows that certain months of the year may bring in more funds, particularly during the back-to-school rush that occurs at the end of summer.

This report should help inform David Wallace of the effect of different managers, sales schemes, months of the year, and hours spent in branch meetings on monthly revenue. The report should analyze the 420 observations that Angela gathered and determine which parameter combinations lead to the highest revenue possible. Each observation includes:

  • The manager on duty (name)
  • The branch number
  • The month (number from 1-12, where 1 corresponds to January)
  • Discount rate from the sales initiatives (a percentage that customers got a discount on their paper purchase)
  • Total hours spent in branch meetings per month (this value does not include any meetings that the manager may have with individual employees).
  • The monthly revenue from paper sales, in thousands of dollars (i.e., 53.128 = $53,128)
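The month and revenue encodings in the list above can be decoded as in the following sketch. The vectors `month` and `revenue_k` are hypothetical stand-ins rather than columns from the actual file, and apart from 53.128 (the example in the text) the values are made up for illustration.

```r
# Hypothetical values laid out like the variables listed above;
# only 53.128 comes from the assignment text.
month <- c(1, 8, 12)
revenue_k <- c(53.128, 61.5, 47.25)

# month.name is a built-in R constant holding the twelve month names,
# so a month number (1 = January) can index it directly.
month_name <- month.name[month]

# Revenue is stored in thousands of dollars; convert to dollars.
revenue_dollars <- revenue_k * 1000

print(data.frame(month_name, revenue_dollars))
```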

Group 3: COVID mortality

Dataset: group3-mortality.csv

We are investigating the influence of different risk factors on the binary outcome of living or dying for a group of 500 patients who have received a positive COVID-19 test. The data were collected by volunteers hired to evaluate the patients. Out of desperation for willing volunteers, we had to accept an overly enthusiastic volunteer who decided to ask for the patients’ horoscope signs. We recorded the following variables for each patient:

  • age (in years),
  • sex (0 = male, 1 = female),
  • horoscope sign (1 = Aries, 2 = Taurus, 3 = Gemini, 4 = Cancer, 5 = Leo, 6 = Virgo, 7 = Libra, 8 = Scorpio, 9 = Sagittarius, 10 = Capricorn, 11 = Aquarius, 12 = Pisces), and
  • whether or not the patient has a pre-existing condition (0 = no pre-existing conditions, 1 = pre-existing condition).
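Because sex, horoscope sign, and pre-existing condition are stored as integer codes, it may help to turn them into labeled factors before summarizing or modeling, so that R does not treat the sign as a number. The sketch below uses short hypothetical vectors (`sign_code`, `sex_code`) in place of the real columns, whose names are not given here.

```r
# Hypothetical integer codes, following the coding scheme listed above.
sign_code <- c(1, 12, 5, 5, 10)
sex_code <- c(0, 1, 1, 0, 1)

# Labels in the order given in the variable list (1 = Aries, ..., 12 = Pisces).
sign_labels <- c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
                 "Libra", "Scorpio", "Sagittarius", "Capricorn",
                 "Aquarius", "Pisces")

# Convert the codes to factors so they are treated as categories.
sign <- factor(sign_code, levels = 1:12, labels = sign_labels)
sex <- factor(sex_code, levels = c(0, 1), labels = c("male", "female"))

# Counts per sign, including signs that do not occur in this small example.
print(table(sign))
```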

We are interested in identifying which variables most significantly influence the likelihood of death for these patients.

Group 4: How many memes does a meme-r meme?

Dataset: memer-memes.csv

Memes are a well-loved part of our internet culture, and in 2020 the number of memes created online exploded. But what factors contribute to a person creating more memes? We collected data from 1000 people in Eugene on how many memes they created in 2020, along with some other factors that are thought to contribute to this number. Please build a model that fits the observed data, to determine which factors significantly contribute to the number of memes a person creates.

Variables:

  • Number of Memes Made - Response variable: number of memes created in 2020
  • Is a student - 1=yes, 0=no
  • Male or Female - Male, Female
  • Hours Spent on Social Media Per Day - Number of hours spent on social media per day
  • Hours Spent Reading The News Per Day - Number of hours spent reading the news per day

Group 5: In Which a Bear is Studied

Dataset: honey.csv

Dr. C. Robin and his undergraduate assistant K. Roo were wandering through a forest near their lab when they observed a bear climbing a tree to eat honey. After receiving a walloping number of stings to its face, the bear slid back to the ground and ran away. As an ethologist (an animal behavior scientist), Dr. Robin became curious about which parameters determine the volume of honey that a bear can eat in a single snacking session. To conduct their study, Dr. Robin and K. Roo tagged and followed a bear designated WTP001 over the course of five years. For each snacking attempt they recorded the bear’s hunger (high, medium, or low, based on loudness of tummy rumbles), the distance of the hive from the ground in meters, and the number of stings received by the bear during the attempt. After 1000 observations, they sat down to analyze the data to determine which of these were related to the amount of honey consumed by the bear (in liters). Of these four parameters, which did they find are significant in determining how much honey the bear consumes? And what effect do they have on the consumption (sign and magnitude)?

Group 6: UFO sightings

Dataset: ufos.csv

Researchers noticed there was a small town in New Mexico that had a surprising number of UFO sightings per year relative to other towns of a similar size in the United States. This prompted the researchers to conduct a survey over a 100-day period in this small town. The researchers were able to survey every person that lived full-time in this town, every day for the 100-day period. They collected six metrics for each day. One row in the data represents a single day of observation, with the following six variables:

  • UFO: The total number of UFOs seen by all the residents of the town for a given day
  • AQI: The Air Quality Index on the day (0 = good, 500 = very hazardous)
  • BAC: The mean Blood-Alcohol Content across all the individuals claiming to see UFOs on a given day
  • airplanes: The number of airplanes in the sky on that day
  • clouds: The % cloud cover on that day
  • xfiles: The mean number of hours that year that UFO spotters have been watching the TV series “X-files”

The researchers would like to know if any of the variables (AQI, BAC, airplanes, clouds, xfiles) have an impact on the number of UFOs seen by the townspeople on any given day. The researchers are also interested in whether there are any interactions between the variables.
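To get a sense of how many interactions are in play, the sketch below enumerates the main effects and all pairwise interactions among the five predictors using R's formula syntax. It only lists candidate terms. It does not choose a model family or suggest that all of these terms belong in a final model, and `ufo_formula` is an illustrative name.

```r
# Formula with all main effects and all two-way interactions.
# Variable names follow the list above; no data is needed to expand the terms.
ufo_formula <- UFO ~ (AQI + BAC + airplanes + clouds + xfiles)^2

# Extract the term labels that R would generate from this formula.
term_labels <- attr(terms(ufo_formula), "term.labels")

# Interaction terms are written with a colon, e.g. "AQI:BAC".
main_effects <- term_labels[!grepl(":", term_labels)]
interactions <- term_labels[grepl(":", term_labels)]

print(main_effects)
print(interactions)
```

With five predictors there are ten possible pairwise interactions, which is a lot to estimate from 100 days of data, so some selection among them is likely to be needed.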