Tidy data

Peter Ralph

Advanced Biological Statistics

Tidy data

Checklist for data tidiness

Store a copy of data in a nonproprietary format, (e.g. plain ASCII text)
Leave an uncorrected file when doing analyses
Use descriptive names for your data files and variables
Include a header line with descriptive variable names
Maintain effective metadata about the data (a README)
Add new observations to a dataset by row
Add new variables to a dataset by column
A column of data should contain only one data type
All measurements of the same type should be in the same column

Wide data

Number of eggs laid by 16 chickens of five breeds:

breed	1	2	3	4	5	6
rhode island red	5	6	NA	NA	NA	NA
white leghorn	7	5	6	8	NA	NA
barred rock	3	2	4	4	3	4
jersey giant	5	2	8	NA	NA	NA
australorp	4	NA	NA	NA	NA	NA

The same data, long:

breed	num_eggs
rhode island red	5
rhode island red	6
white leghorn	7
white leghorn	5
white leghorn	6
white leghorn	8
barred rock	3
barred rock	2
barred rock	4
barred rock	4
barred rock	3
barred rock	4
jersey giant	5
jersey giant	2
jersey giant	8
australorp	4

Exercise

Design a tidy data format for the stickleback experiment: two strains of stickleback were made microbe free, placed in tanks and either innoculated with microbes or not, then had their gene expression measured with RNA-seq. Sex is recorded, also.

The tidyverse

Tools for tidy data

Tidying data is hard!

… and often requires expert input.

Many common data wrangling operations are made easier by the tidyverse.

The “tidyverse”

packages that do many of the same things as base functions in R
designed to do them more “cleanly”
also includes ggplot (for “Grammar of Graphics”)

A “tibble” is a data frame

Key functions in dplyr

Pick observations by their values with filter().
Reorder the rows with arrange().
Pick variables by their names with select().
Create new variables with functions of existing variables with mutate().
Collapse many values down to a single summary with summarise().

`select()`, `filter()`, and `arrange()`

a1 <- select(airbnb, neighbourhood, price, host_id, beds, bathrooms)

a2 <- filter(a1, neighbourhood == "Richmond" 
                 | neighbourhood == "Woodlawn"
                 | neighbourhood == "Downtown")

a3 <- arrange(a2, price, neighbourhood)

Also, the “pipe”

a3 <- (airbnb %>% select(neighbourhood, price, host_id, beds, bathrooms)
        %>% filter(neighbourhood == "Richmond" 
                     | neighbourhood == "Woodlawn"
                     | neighbourhood == "Downtown")
        %>% arrange(price, neighbourhood))

Tidyverse:

a3 <- (airbnb %>% select(neighbourhood, price, host_id, beds, bathrooms)
        %>% filter(neighbourhood == "Richmond" 
                     | neighbourhood == "Woodlawn"
                     | neighbourhood == "Downtown")
        %>% arrange(price, neighbourhood))

Base:

a1 <- airbnb[,c("neighbourhood", "price", "host_id", "beds", "bathrooms")]
a2 <- subset(a1, neighbourhood %in% c("Richmond", "Woodlawn", "Downtown"))
a3 <- a2[order(a2$price, a2$price), ]

`mutate()` and `transmutate()`

Add new variables:

mutate(a3,
    price_per_bed = price / beds,
    price_per_bath = price / bathrooms)

Or, make an entirely new data frame:

transmute(airbnb,
    price = price,
    price_per_bed = price / beds,
    price_per_bath = price / bathrooms)

`group_by()` and `summarize()`

group_by() aggregates data by category, e.g.:

by_hood <- group_by(a3, neighbourhood)

Now, you can calculate summaries of other variables within each group, e.g.:

summarise(by_hood, price = mean(price, na.rm = TRUE))

Your turn

Make a data frame only including rooms in the top ten neighbourhoods (use n( ) to count them). Then, using only these neighbourhoods…
Find the mean price, cleaning_fee, and ratio of cleaning fee to price, by neighbourhood.
Edit your code in (2) to add variables for the 25% and 75% quantile of price (use quantile( )).
Do as in (2) and (3) but splitting by both neighbourhood and room_type (e.g., finding the mean price of private rooms in Woodlawn).
Edit your code in (1) to add a new variable giving the number of characters in the house_rules (use nchar( )).

Only top ten neighbourhoods

To get you started:

neighbourhood_counts <- (airbnb
                         %>% group_by(neighbourhood) %>% summarize(count=n())
                         %>% arrange(count))
top_ten <- neighbourhood_counts$neighbourhood[nrow(neighbourhood_counts) - 0:9]
sub_bnb <- filter(airbnb, neighbourhood %in% top_ten) %>% droplevels

Summarise

(sub_bnb
 %>% group_by(neighbourhood)
 %>% summarise(
           mean_price=mean(price, na.rm=TRUE),
           mean_cleaning_fee=mean(cleaning_fee, na.rm=TRUE),
           price_q1=quantile(price, probs=0.25, na.rm=TRUE),
           price_q3=quantile(price, probs=0.75, na.rm=TRUE)
 )
 %>% mutate(ratio=mean_price/mean_cleaning_fee)
 )

## # A tibble: 10 × 6
##    neighbourhood      mean_price mean_cleaning_fee price_q1 price_q3 ratio
##    <fct>                   <dbl>             <dbl>    <dbl>    <dbl> <dbl>
##  1 Boise-Eliot              118.              62.6     75       135   1.89
##  2 Buckman                  129.              58.6     84.5     146   2.20
##  3 Concordia                113.              55.8     70       130.  2.02
##  4 Downtown                 237.              85.5    101       300   2.77
##  5 Hosford-Abernethy        133.              58.8     84       153   2.26
##  6 King                     121.              60.8     69       138   1.99
##  7 Northwest District       142.              65.5     89       163   2.16
##  8 Overlook                 105.              54.7     65.8     119.  1.91
##  9 Richmond                 118.              59.7     75       129   1.98
## 10 Sunnyside                114.              56.8     73.2     134.  2.01

Tidy data

Tidy data

Checklist for data tidiness

Wide data

Exercise

The tidyverse

Tools for tidy data

The “tidyverse”

A “tibble” is a data frame

A “tibble” is a data frame

Key functions in dplyr

select(), filter(), and arrange()

Also, the “pipe”

mutate() and transmutate()

group_by() and summarize()

Your turn

Only top ten neighbourhoods

Summarise

`select()`, `filter()`, and `arrange()`

`mutate()` and `transmutate()`

`group_by()` and `summarize()`