Assignment: You should analyze one of the student-created datasets described below. As usual, your task is to use Rmarkdown to write a short report, readable by a technically literate person. The code you used should not be visible in the final report (unless you have a good reason to show it). You will have time to discuss this in groups (the same groups as before), but you should write up the report yourself (in your own words).
You can use the following function to find out which dataset to analyze, where g is your group number:
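f <- function (g) {
    # group numbers in use: 1 to 9 and 11 (there is no group 10)
    groups <- (1:11)[-10]
    # move five places along the list of groups, wrapping around at the end
    c(groups, groups)[match(g, groups) + 5]
}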
For instance, if you are in group 2, then f(2) = 7, so you should analyze the seventh dataset below.
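To see every group's assignment at once, the function can be evaluated over all group numbers. This is only a convenience sketch: the function is repeated so the block runs on its own, and the name `assignment` is introduced here for illustration.

```r
# Same function as above, repeated so this block is self-contained.
f <- function(g) {
    groups <- (1:11)[-10]
    c(groups, groups)[match(g, groups) + 5]
}

# Group numbers in use (1 to 9 and 11), and the value f returns for each.
groups <- (1:11)[-10]
assignment <- data.frame(group = groups, dataset = sapply(groups, f))
print(assignment)
```

The row for group 2 shows 7, matching the example above. The returned values are themselves group labels (1 to 9 and 11), so one group receives the value 11 even though ten datasets are listed; how that label maps onto the list is not stated here.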
Due: Submit your work via Canvas by the end of the day (midnight) on Thursday, February 3rd.
Previous research has shown that increased blood alcohol levels are actually correlated with increased coding ability (here for the article). To study this potential trend, data were collected from current students and alumni of the University of Oregon Bioinformatics and Genomics Masters Program (BGMP). Your team is tasked with investigating how various factors influence coding ability of people within the BGMP. Coding ability is measured by the number of bugs (syntax and/or runtime/logical errors) in code per coding task. Participants were given 4 coding tasks of the same difficulty level: 1 task to be administered at each blood alcohol content (BAC). For each of the 4 coding tasks, participants were given 5 minutes in a Zoom breakout room to write R code to study factors affecting capybara exuberance using VIM while sharing their screen over an unstable internet connection (actual internet speed is not a variable being studied here). In between tasks, study participants were fed increasing amounts of Everclear flavored with Mio™ and Jolly Rancher™ candies (the choice drink of BGMP students) and had their blood alcohol content measured via a military-grade Breathalyzer. Study participants self-reported other data.
Questions:
Dataset: drunk-coding.csv
Ovarian cancer has a 49.1% 5-year relative survival rate, and around 20,000 women are diagnosed each year. Researchers are concerned about the possible impact of biological variables, as well as an experimental treatment, on the survival rate of patients with ovarian cancer. These data were collected as part of a 5-year clinical trial testing the effectiveness of a standard chemotherapy drug against an experimental one. We are interested in modeling the survival time of patients and the effects of any significant covariates on survival. The researchers collected the following data for 500 patients:
id: A unique identifier for each patient
treatment: Treatment Type (0 = Standard of Care Group, 1 = Experimental Treatment Group)
celltype: Cell Type (1 = Epithelial ovarian carcinomas, 2 = Stromal cell tumors, 3 = Germ cell tumors)
diet: Vegetarian or Not Vegetarian
time: Time of death (status = 1) or last follow-up (status = 0) in days
status: Dead = 1, Patient Dropout = 0 (Switched treatments, Patient moved, etc.)
Note: Treatment, Cell Type, and Diet variables are not evenly distributed in our sample.
Questions: 1. Are any of these variables (diet, treatment, cell type) correlated with extending survival? 2. Is the experimental treatment more helpful than the standard of care?
Dataset: ovarian-cancer.csv
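Because `time` is censored whenever `status` is 0, ordinary regression on `time` would be misleading. One common option, not prescribed by the assignment, is a Cox proportional hazards model from the `survival` package. The sketch below uses simulated data as a stand-in for ovarian-cancer.csv: the column names follow the variable list above, but the simulated values, the group proportions, and the coding of `diet` as text labels are all assumptions.

```r
library(survival)

set.seed(1)
n <- 500

# Simulated stand-in for ovarian-cancer.csv (values are made up).
sim <- data.frame(
    id        = seq_len(n),
    treatment = rbinom(n, 1, 0.4),
    celltype  = factor(sample(1:3, n, replace = TRUE, prob = c(0.6, 0.25, 0.15))),
    diet      = sample(c("Vegetarian", "Not Vegetarian"), n,
                       replace = TRUE, prob = c(0.2, 0.8)),
    time      = ceiling(rexp(n, rate = 1 / 700)),  # days to death or last follow-up
    status    = rbinom(n, 1, 0.7)                  # 1 = dead, 0 = censored
)

# Surv() pairs each time with its censoring indicator.
fit <- coxph(Surv(time, status) ~ treatment + celltype + diet, data = sim)

# exp(coef) is a hazard ratio: values below 1 suggest longer survival.
summary(fit)
```

With the real data, `sim` would be replaced by the result of `read.csv("ovarian-cancer.csv")`, after checking how each column is actually coded.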
This data was collected in a cat shelter that takes in stray or abandoned cats. One hundred cats were randomly selected to be weighed on the same day. We want to know the average shelter cat’s weight and the range of weights after a given amount of time spent in the shelter, as well as how the current weight of the cat depends on time spent in the shelter since rescue, coat color, and sex of the cat. Time spent in the shelter is given in days, and weight is given in pounds.
Dataset: cat-weights.csv
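The description asks for two different things: the average weight after a given time in the shelter, and the range of weights at that time. In a linear model these correspond to a confidence interval for the mean and a prediction interval for an individual cat. The sketch below illustrates the difference on simulated data; the column names (`days`, `color`, `sex`, `weight`) and all values are hypothetical, since the layout of cat-weights.csv is not described here.

```r
set.seed(7)
n <- 100

# Simulated stand-in for cat-weights.csv (hypothetical columns and values).
cats <- data.frame(
    days  = round(runif(n, 1, 365)),
    color = sample(c("black", "orange", "tabby"), n, replace = TRUE),
    sex   = sample(c("F", "M"), n, replace = TRUE)
)
cats$weight <- 7 + 0.01 * cats$days + rnorm(n, sd = 1.5)  # pounds

fit <- lm(weight ~ days + color + sex, data = cats)

# A hypothetical cat that has spent 90 days in the shelter.
new_cat <- data.frame(days = 90, color = "tabby", sex = "F")

# Interval for the average weight of such cats.
ci <- predict(fit, new_cat, interval = "confidence")
# Wider interval for the weight of one such cat.
pi <- predict(fit, new_cat, interval = "prediction")
print(rbind(confidence = ci[1, ], prediction = pi[1, ]))
```

The prediction interval is always the wider of the two, because it includes cat-to-cat variation in addition to uncertainty about the mean.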
The dataset presented here consists of student responses to a survey distributed at the time of their final exam in a statistics course. There were three sections of the course, with 150 students assigned to each section. The final exam took place over the course of a week on either a Monday, Wednesday, or Friday. Upon completion of the exam, students were instructed to fill out a survey which gathers general information like their student ID and age, as well as information on a variety of variables such as the amount of sleep they got the night before, whether or not they ate breakfast, etc. The University would like to use this information to better understand which variables are the strongest predictors of student success on a statistics final exam. Here are the available variables:
ID: University-issued student ID. (int)
test_score: integer score on statistics final exam, on a scale of 0-100 (int)
hrs_of_sleep: float value representing number of hours student slept the night before the exam (float)
breakfast: Boolean of whether or not the student ate breakfast the morning of the exam. (boolean)
instructor: name of the instructor teaching the section (chr)
age: Age of student (int)
hours_of_exercise: int value representing number of hours spent exercising the week prior to the exam. (int)
have_friends: bool value indicating whether or not the student currently has friends (boolean)
eye_color: chr value indicating eye color (chr)
shoe_size: float value indicating student shoe size (float)
blood_type: chr value indicating blood type (A, B, AB, O) (chr)
vaccination_status: boolean of whether or not the student has received the covid vaccine (boolean)
TA_sessions: int value indicating the number of statistics TA sessions the student has attended during the semester (int)
test_day: chr value indicating day of the week the test was taken on. Section 1 took test on Monday. Sections 2 and 3 took test on Friday. (chr)
Dataset: stats-grades.csv
Kona is the cutest Shiba Inu on earth. The objective of this research was to understand what factors in his daily life contribute significantly to his happiness so that we can make him the happiest dog on earth. Kona was observed every day for one year and four variables were tracked: the number of walks he went on each day, the number of pats he received each day, the number of hours he slept throughout the day, and the number of dogs he met. The response variable, his happiness each day, was quantified as how many times he wagged his tail.
Dataset: tail-wags.csv
Daphnia is a genus of unique, cyclically parthenogenic “water fleas”. The presence of predators in the water can induce in developing Daphnia drastic morphological responses, including the growth of elaborate helmets and tail spines. This developmental plasticity means that even individuals of the same clonal lineage raised in different environments can look quite different. We are interested in the parameters relating to helmet length in the North American invader Daphnia lumholtzi. An increase in helmet (or tail spine) length is speculated to hinder the handling abilities of larger aquatic vertebrate predators, such as the threespine stickleback fish, and fitness experiments have uncovered an associated increase in the proportion of surviving Daphnia individuals with a defended phenotype compared to undefended individuals.
One of the measurements in our dataset is helmet size (in mm, as measured from the tip of the head to the tip of the head spine). Also recorded within the dataset is juvenile developmental instar, which represents the developmental stage (counted by number of molts). The transition between instars occurs when the animal absorbs water into the outer shell and undergoes a period of rapid growth before advancing to the next instar; we explored the first three developmental instars in this dataset, so this parameter represents a developmental stage, not a numerical value. We also include age (in days) at first reproduction, as well as average clutch size across the individual’s life. Finally, to simulate environmental predation, experimental organisms were housed in predation environments, through which water from tanks containing 19 sticklebacks per 100 liters was passed, whereas control animals were kept in predation-free environments (i.e., in clean, fish-free water). Ecological parameters and size measurements (through digitization of animals) were recorded throughout the lifespan of the animals.
Please describe the data, including any relationships between size and predation and any effects of the additional parameters, and describe a model to predict helmet size in Daphnia lumholtzi.
Dataset: daphnia.csv
As part of a larger restoration effort in Deschutes National Forest, botanists are hoping to seed a previously abundant wildflower, Abandra korcheskinii, that has declined greatly in the last few decades. Historical records found A. korcheskinii in a variety of ecotypes within Deschutes NF including sagebrush steppe, woodland meadows, low elevation meadows, and high elevation meadows. Extensive monitoring of the few remaining populations of A. korcheskinii across its range suggests it relies on both bees and flies as pollinators.
Throughout the forest, Forest Service botanists have been assessing their overall restoration efforts by monitoring one hundred 2 m\(^2\) plots across the four ecotypes. Monitoring includes recording species richness of vascular plants in the plot (species richness is defined as the number of species present) as well as the total number of times pollinators (bees and flies) visited the plot in one hour. In this study, researchers only noted the primary pollinator for each plot. Botanists want to seed A. korcheskinii in areas with high visitation rates of pollinators, hoping that an abundance of pollinator visitation leads to successful populations, but pollinator observations are time-consuming and time-sensitive to do across a large area.
The following dataset includes the measured frequency of pollination visits of all pollinators (“visits”), along with the pollinator genus that was most frequently observed at the plot (“pollinator”), ecotype (“ecotype”) and plant species richness (“richness”) for each of the plots. The researchers would like to know how frequency of pollination visits is related to genus of the most frequently observed visiting pollinators, ecotype the plot is in and plant species richness of the plot to help guide their restoration efforts. In particular, they would like to know whether most frequently observed pollinator genus, ecotype and plant richness have an effect on how often pollinators visit plants, and if so, how much?
Dataset: pollination.csv
We love slugs. Banana slugs are a group of often yellow slugs belonging to the genus Ariolimax. These slugs can be found throughout wet, low-elevation areas of the Pacific Northwest. Since we love slugs so much, we want to know the conditions under which we are likely to see the most slugs along the Oregon Coast Trail. We collected data on the number of slugs seen in one-mile increments over 100 miles of the Oregon Coast Trail, a 362-mile trail along Oregon’s coast from northern California to southern Washington. Trail sections were scored based on trail-use intensity (low, medium, high). We recorded percent humidity and the amount of mud that collected on our shoes (g) during each one-mile increment. We also measured rainfall (mm) in the 24 hours preceding each hike. How is the number of slugs we see in each one-mile section related to trail-use intensity, percent humidity, rainfall, and shoe mud?
Dataset: slugs.csv
This data set includes information about a group of randomly chosen individuals from Eugene. The data include information about what sort of diet the person eats (1 is vegan, 2 is vegetarian, 3 is omnivore), average number of days per week the person drinks alcohol, BMI (body mass index), shoe size, age, gender (1 for male, 0 for female), propensity for risky behavior (0 being risk averse, 10 being thrill-seeking), the number of haunted houses visited each year, and that person’s LDL cholesterol numbers (above 190 is considered very high and in need of medication). All information is self-reported.
Questions: Which variables contribute to the variation seen in LDL cholesterol? Do they improve or worsen LDL cholesterol, and by how much?
Dataset: cholesterol.csv
Attend any statistics lecture by Dr. X and you will surely hear the question “By a show of hands…?”… or will you?
We have observed the number of times this question has been asked, along with other variables, and we wish to find which of these variables best explain his change in behavior (e.g., if he consumes more coffee, does he ask for more shows of hands?).
The data use the variables:
coffees: Amount of caffeine consumed before class (in quantity of cups of coffee)
students: Number of students in the peanut gallery (in person/zoom)
zoom_students: Number of students participating on zoom.
smokes: Amount of nicotine consumed before class (in quantity of cigarettes)
Finally, asks is the number of observed times Dr. X asks his hallmark question.
Please determine which of the variables above have the strongest effect on the number of times the question is asked, and what the effect is.
Dataset: show-of-hands.csv
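Since `asks` is a count, one natural starting point (a suggestion, not a requirement of the assignment) is a Poisson generalized linear model, in which exponentiated coefficients are multiplicative effects on the expected number of asks. The sketch below fits such a model to simulated data; the column names come from the variable list above, while the simulated values and the assumed effect of `coffees` are made up for illustration.

```r
set.seed(42)
n <- 60  # hypothetical number of observed lectures

# Simulated stand-in for show-of-hands.csv (values are made up).
sim <- data.frame(
    coffees  = rpois(n, 2),
    students = rpois(n, 30),
    smokes   = rpois(n, 1)
)
sim$zoom_students <- rbinom(n, sim$students, 0.3)
# Assumed truth for the simulation: only coffee affects the rate of asking.
sim$asks <- rpois(n, exp(0.5 + 0.2 * sim$coffees))

fit <- glm(asks ~ coffees + students + zoom_students + smokes,
           family = poisson, data = sim)

# Multiplicative change in expected asks per one-unit increase in each variable.
rate_ratios <- exp(coef(fit))
print(round(rate_ratios, 2))
```

With real data it is worth checking for overdispersion before trusting the Poisson standard errors; `family = quasipoisson` is one simple alternative.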