Peter Ralph
13 October – Advanced Biological Statistics
images of lab notebooks pasted into an Excel document
breed | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
rhode island red | 5 | 6 | NA | NA | NA | NA |
white leghorn | 7 | 5 | 6 | 8 | NA | NA |
barred rock | 3 | 2 | 4 | 4 | 3 | 4 |
jersey giant | 5 | 2 | 8 | NA | NA | NA |
australorp | 4 | NA | NA | NA | NA | NA |
breed | num_eggs | |
---|---|---|
11 | rhode island red | 5 |
21 | rhode island red | 6 |
12 | white leghorn | 7 |
22 | white leghorn | 5 |
32 | white leghorn | 6 |
42 | white leghorn | 8 |
13 | barred rock | 3 |
23 | barred rock | 2 |
33 | barred rock | 4 |
43 | barred rock | 4 |
53 | barred rock | 3 |
63 | barred rock | 4 |
14 | jersey giant | 5 |
24 | jersey giant | 2 |
34 | jersey giant | 8 |
15 | australorp | 4 |
Design a tidy data format for the stickleback experiment: two strains of stickleback were made microbe free, placed in tanks and either innoculated with microbes or not, then had their gene expression measured with RNA-seq. Sex is recorded, also.
Tidying data is hard!
… and often requires expert input.
Many common data wrangling operations are made easier by the tidyverse.
packages that do many of the same things as base functions in R
designed to do them more “cleanly”
also includes ggplot
(for “Grammar of Graphics”)
filter()
.arrange()
.select()
.mutate()
.summarise()
.filter()
, arrange()
and select()
a1 <- select(airbnb, neighbourhood, price, host_id, beds, bathrooms)
a2 <- filter(a1, neighbourhood == "Richmond"
| neighbourhood == "Woodlawn"
| neighbourhood == "Downtown")
a3 <- arrange(a2, price, neighbourhood)
a3 <- (select(airbnb, neighbourhood, price, host_id, beds, bathrooms)
%>% filter(neighbourhood == "Richmond"
| neighbourhood == "Woodlawn"
| neighbourhood == "Downtown")
%>% arrange(price, neighbourhood))
Tidyverse:
a3 <- (select(airbnb, neighbourhood, price, host_id, beds, bathrooms)
%>% filter(neighbourhood == "Richmond"
| neighbourhood == "Woodlawn"
| neighbourhood == "Downtown")
%>% arrange(price, neighbourhood))
Base:
a1 <- airbnb[,c("neighbourhood", "price", "host_id", "beds", "bathrooms")]
a2 <- subset(a1, neighbourhood %in% c("Richmond", "Woodlawn", "Downtown"))
a3 <- a2[order(a2$price, a2$price), ]
mutate()
and transmutate()
Add new variables:
mutate(a3,
price_per_bed = price / beds,
price_per_bath = price / bathrooms)
Or, make an entirely new data frame:
transmute(airbnb,
price = price,
price_per_bed = price / beds,
price_per_bath = price / bathrooms)
group_by()
and summarize()
group_by()
aggregates data by category, e.g.:
by_hood <- group_by(a3, neighbourhood)
Now, you can calculate summaries of other variables within each group, e.g.:
summarise(by_hood, price = mean(price, na.rm = TRUE))
Make a data frame only including rooms in the top ten neighbourhoods. Then, using only these neighbourhoods…
Find the mean price
, cleaning_fee
, and ratio of cleaning fee to price, by neighbourhood.
Edit your code in (2) to add variables for the 25% and 75% quantile of price
(use quantile( )
).
Do as in (2) and (3) but splitting by both neighbourhood
and room_type
(e.g., finding the mean price of private rooms in Woodlawn).
Edit your code in (1) to add a new variable giving the number of characters in the house_rules
(use nchar( )
).
cleaning <- (sub_bnb
%>% group_by(neighbourhood)
%>% summarise(mean_price=mean(price, na.rm=TRUE),
mean_cleaning_fee=mean(cleaning_fee, na.rm=TRUE),
prop_cleaning=mean(cleaning_fee/price, na.rm=TRUE)))
cleaning
## # A tibble: 10 x 4
## neighbourhood mean_price mean_cleaning_fee prop_cleaning
## <fct> <dbl> <dbl> <dbl>
## 1 Boise-Eliot 118. 62.6 0.545
## 2 Buckman 129. 58.6 0.482
## 3 Concordia 113. 55.8 0.519
## 4 Downtown 237. 85.5 0.466
## 5 Hosford-Abernethy 133. 58.8 0.441
## 6 King 121. 60.8 0.612
## 7 Northwest District 142. 65.5 0.506
## 8 Overlook 105. 54.7 0.534
## 9 Richmond 118. 59.7 0.512
## 10 Sunnyside 114. 56.8 0.513
cleaning <- (sub_bnb
%>% group_by(neighbourhood)
%>% summarise(mean_price=mean(price, na.rm=TRUE),
first_quartile_price=quantile(price, probs=0.25, na.rm=TRUE),
third_quartile_price=quantile(price, probs=0.75, na.rm=TRUE),
mean_cleaning_fee=mean(cleaning_fee, na.rm=TRUE),
prop_cleaning=mean(cleaning_fee/price, na.rm=TRUE)))
cleaning
## # A tibble: 10 x 6
## neighbourhood mean_price first_quartile_price third_quartile_price mean_cleaning_fee prop_cleaning
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boise-Eliot 118. 75 135 62.6 0.545
## 2 Buckman 129. 84.5 146 58.6 0.482
## 3 Concordia 113. 70 130. 55.8 0.519
## 4 Downtown 237. 101 300 85.5 0.466
## 5 Hosford-Abernethy 133. 84 153 58.8 0.441
## 6 King 121. 69 138 60.8 0.612
## 7 Northwest District 142. 89 163 65.5 0.506
## 8 Overlook 105. 65.8 119. 54.7 0.534
## 9 Richmond 118. 75 129 59.7 0.512
## 10 Sunnyside 114. 73.2 134. 56.8 0.513
cleaning <- (sub_bnb
%>% group_by(neighbourhood, room_type)
%>% summarise(mean_price=mean(price, na.rm=TRUE),
first_quartile_price=quantile(price, probs=0.25, na.rm=TRUE),
third_quartile_price=quantile(price, probs=0.75, na.rm=TRUE),
mean_cleaning_fee=mean(cleaning_fee, na.rm=TRUE),
prop_cleaning=mean(cleaning_fee/price, na.rm=TRUE)))
cleaning
## # A tibble: 25 x 7
## # Groups: neighbourhood [10]
## neighbourhood room_type mean_price first_quartile_price third_quartile_price mean_cleaning_fee prop_cleaning
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boise-Eliot Entire home/apt 130. 89 150. 73.0 0.604
## 2 Boise-Eliot Private room 74.3 55 85 20.5 0.328
## 3 Boise-Eliot Shared room 81.8 36 75 11.2 0.130
## 4 Buckman Entire home/apt 136. 95 150. 66.1 0.514
## 5 Buckman Private room 106. 50 91.2 20.9 0.320
## 6 Buckman Shared room 25 25 25 NaN NaN
## 7 Concordia Entire home/apt 125. 85 145 64.9 0.548
## 8 Concordia Private room 70.6 49 77 22.4 0.412
## 9 Downtown Entire home/apt 258. 125 300 90.5 0.477
## 10 Downtown Private room 95.4 59 100 26.1 0.345
## # … with 15 more rows