Peter Ralph
Advanced Biological Statistics
Number of eggs laid by 16 chickens of five breeds:
breed | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
rhode island red | 5 | 6 | NA | NA | NA | NA |
white leghorn | 7 | 5 | 6 | 8 | NA | NA |
barred rock | 3 | 2 | 4 | 4 | 3 | 4 |
jersey giant | 5 | 2 | 8 | NA | NA | NA |
australorp | 4 | NA | NA | NA | NA | NA |
The same data, long:
breed | num_eggs |
---|---|
rhode island red | 5 |
rhode island red | 6 |
white leghorn | 7 |
white leghorn | 5 |
white leghorn | 6 |
white leghorn | 8 |
barred rock | 3 |
barred rock | 2 |
barred rock | 4 |
barred rock | 4 |
barred rock | 3 |
barred rock | 4 |
jersey giant | 5 |
jersey giant | 2 |
jersey giant | 8 |
australorp | 4 |
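The step from the wide table to the long one can be sketched in R as follows. This is a minimal illustration, assuming the `tidyr` and `dplyr` packages are installed; the object names `eggs_wide` and `eggs_long` and the intermediate `chicken` column are made up for this example.

```r
library(tidyr)
library(dplyr)

# Wide format: one row per breed, one column per chicken (as in the first table).
eggs_wide <- data.frame(
  breed = c("rhode island red", "white leghorn", "barred rock",
            "jersey giant", "australorp"),
  `1` = c(5, 7, 3, 5, 4),
  `2` = c(6, 5, 2, 2, NA),
  `3` = c(NA, 6, 4, 8, NA),
  `4` = c(NA, 8, 4, NA, NA),
  `5` = c(NA, NA, 3, NA, NA),
  `6` = c(NA, NA, 4, NA, NA),
  check.names = FALSE  # keep the column names "1" ... "6" as they are
)

# Long format: one row per chicken; the NA padding cells are dropped.
eggs_long <- pivot_longer(
  eggs_wide,
  cols = -breed,            # every column except breed holds egg counts
  names_to = "chicken",     # hypothetical name for the old column labels
  values_to = "num_eggs",
  values_drop_na = TRUE
)
eggs_long <- select(eggs_long, breed, num_eggs)

print(eggs_long, n = 16)
```

In the long format each row is one observation (one chicken), so no `NA` padding is needed for breeds with fewer chickens.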
Design a tidy data format for the stickleback experiment: two strains of stickleback were made microbe-free, placed in tanks, and either inoculated with microbes or not; their gene expression was then measured with RNA-seq. Sex was also recorded.
Tidying data is hard!
… and often requires expert input.
Many common data wrangling operations are made easier by the tidyverse.
packages that do many of the same things as base functions in R
designed to do them more “cleanly”
also includes ggplot2 (for “Grammar of Graphics”)
filter(), arrange(), select(), mutate(), summarise()
select(), filter(), and arrange()
a1 <- select(airbnb, neighbourhood, price, host_id, beds, bathrooms)
a2 <- filter(a1, neighbourhood == "Richmond"
                 | neighbourhood == "Woodlawn"
                 | neighbourhood == "Downtown")
a3 <- arrange(a2, price, neighbourhood)
a3 <- (airbnb %>% select(neighbourhood, price, host_id, beds, bathrooms)
       %>% filter(neighbourhood == "Richmond"
                  | neighbourhood == "Woodlawn"
                  | neighbourhood == "Downtown")
       %>% arrange(price, neighbourhood))
Tidyverse:
a3 <- (airbnb %>% select(neighbourhood, price, host_id, beds, bathrooms)
       %>% filter(neighbourhood == "Richmond"
                  | neighbourhood == "Woodlawn"
                  | neighbourhood == "Downtown")
       %>% arrange(price, neighbourhood))
Base:
a1 <- airbnb[,c("neighbourhood", "price", "host_id", "beds", "bathrooms")]
a2 <- subset(a1, neighbourhood %in% c("Richmond", "Woodlawn", "Downtown"))
a3 <- a2[order(a2$price, a2$neighbourhood), ]
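To see that the two approaches agree, here is a small self-contained sketch. The toy `airbnb` data frame below is a made-up stand-in (the real data is not reproduced here), and `%in%` is used in place of the chained `|` comparisons, which selects the same rows.

```r
library(dplyr)

# Hypothetical toy stand-in for the airbnb data (all values are made up).
airbnb <- data.frame(
  neighbourhood = c("Richmond", "Downtown", "Woodlawn",
                    "Sunnyside", "Downtown", "Richmond"),
  price = c(120, 200, 90, 110, 90, 75),
  host_id = 1:6,
  beds = c(2, 3, 1, 2, 1, 1),
  bathrooms = c(1, 2, 1, 1, 1, 1),
  room_type = c("Entire home", "Entire home", "Private room",
                "Private room", "Private room", "Shared room")
)
hoods <- c("Richmond", "Woodlawn", "Downtown")

# Tidyverse: select columns, keep three neighbourhoods, sort by price
# and then by neighbourhood to break ties.
tidy_version <- airbnb %>%
  select(neighbourhood, price, host_id, beds, bathrooms) %>%
  filter(neighbourhood %in% hoods) %>%
  arrange(price, neighbourhood)

# Base R: the same three steps.
b1 <- airbnb[, c("neighbourhood", "price", "host_id", "beds", "bathrooms")]
b2 <- subset(b1, neighbourhood %in% hoods)
base_version <- b2[order(b2$price, b2$neighbourhood), ]
rownames(base_version) <- NULL  # base R keeps the original row labels

tidy_version
base_version
```

Note that the base version keeps the original row names unless they are reset, while `filter()` and `arrange()` renumber the rows.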
mutate() and transmute()
Add new variables:
mutate(a3,
price_per_bed = price / beds,
price_per_bath = price / bathrooms)
Or, make an entirely new data frame:
transmute(airbnb,
price = price,
price_per_bed = price / beds,
price_per_bath = price / bathrooms)
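The difference between the two is easiest to see on a tiny example. This sketch uses a made-up data frame called `rooms` (hypothetical, not part of the airbnb data) and compares which columns each function returns.

```r
library(dplyr)

# Hypothetical toy data: three listings.
rooms <- data.frame(
  price = c(100, 150, 90),
  beds = c(2, 3, 1),
  bathrooms = c(1, 2, 1)
)

# mutate() keeps all existing columns and appends the new ones.
with_new <- mutate(rooms,
                   price_per_bed = price / beds,
                   price_per_bath = price / bathrooms)

# transmute() keeps only the columns named in the call.
only_new <- transmute(rooms,
                      price = price,
                      price_per_bed = price / beds,
                      price_per_bath = price / bathrooms)

names(with_new)
names(only_new)
```

Recent versions of dplyr document `transmute()` as superseded by `mutate(.keep = "none")`; it still works, but new code may prefer the `mutate()` form.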
group_by() and summarise()
group_by()
groups the rows of a data frame by category, without changing the data itself, e.g.:
by_hood <- group_by(a3, neighbourhood)
Now, you can calculate summaries of other variables within each group, e.g.:
summarise(by_hood, price = mean(price, na.rm = TRUE))
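A minimal sketch of how grouping and summarising fit together, using a made-up data frame `rooms` (hypothetical values). It shows that grouping alone leaves every row in place, and that `summarise()` then returns one row per group; `n()` counts the rows in each group.

```r
library(dplyr)

# Hypothetical toy data, including one missing price.
rooms <- data.frame(
  neighbourhood = c("Richmond", "Richmond", "Downtown", "Downtown", "Woodlawn"),
  price = c(100, 140, 200, NA, 90)
)

by_hood <- group_by(rooms, neighbourhood)
nrow(by_hood)  # still 5: grouping only records which rows belong together

# One row per neighbourhood; na.rm = TRUE skips the missing price.
hood_summary <- summarise(by_hood,
                          n_rooms = n(),
                          mean_price = mean(price, na.rm = TRUE))
hood_summary
```

Without `na.rm = TRUE`, the mean for any group containing a missing price would itself be `NA`.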
1. Make a data frame only including rooms in the top ten neighbourhoods (use n() to count them). Then, using only these neighbourhoods…
2. Find the mean price, cleaning_fee, and ratio of cleaning fee to price, by neighbourhood.
3. Edit your code in (2) to add variables for the 25% and 75% quantile of price (use quantile()).
4. Do as in (2) and (3) but splitting by both neighbourhood and room_type (e.g., finding the mean price of private rooms in Woodlawn).
5. Edit your code in (1) to add a new variable giving the number of characters in the house_rules (use nchar()).
To get you started:
# sub_bnb: the data frame from (1), restricted to the top ten neighbourhoods
(sub_bnb
    %>% group_by(neighbourhood)
    %>% summarise(
        mean_price=mean(price, na.rm=TRUE),
        mean_cleaning_fee=mean(cleaning_fee, na.rm=TRUE),
        price_q1=quantile(price, probs=0.25, na.rm=TRUE),
        price_q3=quantile(price, probs=0.75, na.rm=TRUE)
    )
    # note: this ratio is mean price divided by mean cleaning fee
    %>% mutate(ratio=mean_price/mean_cleaning_fee)
)
## # A tibble: 10 × 6
## neighbourhood mean_price mean_cleaning_fee price_q1 price_q3 ratio
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boise-Eliot 118. 62.6 75 135 1.89
## 2 Buckman 129. 58.6 84.5 146 2.20
## 3 Concordia 113. 55.8 70 130. 2.02
## 4 Downtown 237. 85.5 101 300 2.77
## 5 Hosford-Abernethy 133. 58.8 84 153 2.26
## 6 King 121. 60.8 69 138 1.99
## 7 Northwest District 142. 65.5 89 163 2.16
## 8 Overlook 105. 54.7 65.8 119. 1.91
## 9 Richmond 118. 59.7 75 129 1.98
## 10 Sunnyside 114. 56.8 73.2 134. 2.01
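The `sub_bnb` data frame used above is not defined in these notes. One way it might be built is sketched below on made-up data: count listings per neighbourhood with `n()`, keep the most common neighbourhoods, and filter to them. The toy `airbnb` data frame, the neighbourhood labels, and `top_k` are all hypothetical; with the real data `top_k` would be 10.

```r
library(dplyr)

# Hypothetical toy stand-in for the airbnb data (values are made up).
airbnb <- data.frame(
  neighbourhood = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D"),
  price = c(100, 120, 90, 80, 85, 200, 60, 70, 65, 75)
)
top_k <- 2  # would be 10 for the exercise

# Count listings per neighbourhood and keep the top_k largest.
top_hoods <- airbnb %>%
  group_by(neighbourhood) %>%
  summarise(n_rooms = n()) %>%
  arrange(desc(n_rooms)) %>%
  head(top_k)

# Keep only listings in those neighbourhoods.
sub_bnb <- filter(airbnb, neighbourhood %in% top_hoods$neighbourhood)
sub_bnb
```

With `head()`, ties at the cutoff are broken by row order, so a neighbourhood tied for last place may or may not be included.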