\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\MVN}{MVN} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Tidy data

Peter Ralph

Advanced Biological Statistics

Tidy data

Checklist for data tidiness

  • Store a copy of data in a nonproprietary format, (e.g. plain ASCII text)
  • Leave an uncorrected file when doing analyses
  • Use descriptive names for your data files and variables
  • Include a header line with descriptive variable names
  • Maintain effective metadata about the data (a README)
  • Add new observations to a dataset by row
  • Add new variables to a dataset by column
  • A column of data should contain only one data type
  • All measurements of the same type should be in the same column

Wide data

Number of eggs laid by 16 chickens of five breeds:

breed 1 2 3 4 5 6
rhode island red 5 6 NA NA NA NA
white leghorn 7 5 6 8 NA NA
barred rock 3 2 4 4 3 4
jersey giant 5 2 8 NA NA NA
australorp 4 NA NA NA NA NA

The same data, long:

breed num_eggs
rhode island red 5
rhode island red 6
white leghorn 7
white leghorn 5
white leghorn 6
white leghorn 8
barred rock 3
barred rock 2
barred rock 4
barred rock 4
barred rock 3
barred rock 4
jersey giant 5
jersey giant 2
jersey giant 8
australorp 4

Exercise

Design a tidy data format for the stickleback experiment: two strains of stickleback were made microbe free, placed in tanks and either innoculated with microbes or not, then had their gene expression measured with RNA-seq. Sex is recorded, also.

stickleback experiment

The tidyverse

Tools for tidy data

Tidying data is hard!

… and often requires expert input.

Many common data wrangling operations are made easier by the tidyverse.

The “tidyverse”

  • packages that do many of the same things as base functions in R

  • designed to do them more “cleanly”

  • also includes ggplot (for “Grammar of Graphics”)

A “tibble” is a data frame

A “tibble” is a data frame

Key functions in dplyr

  • Pick observations by their values with filter().
  • Reorder the rows with arrange().
  • Pick variables by their names with select().
  • Create new variables with functions of existing variables with mutate().
  • Collapse many values down to a single summary with summarise().

select(), filter(), and arrange()

a1 <- select(airbnb, neighbourhood, price, host_id, beds, bathrooms)
a2 <- filter(a1, neighbourhood == "Richmond" 
                 | neighbourhood == "Woodlawn"
                 | neighbourhood == "Downtown")
a3 <- arrange(a2, price, neighbourhood)

Also, the “pipe”

a3 <- (airbnb %>% select(neighbourhood, price, host_id, beds, bathrooms)
        %>% filter(neighbourhood == "Richmond" 
                     | neighbourhood == "Woodlawn"
                     | neighbourhood == "Downtown")
        %>% arrange(price, neighbourhood))

Tidyverse:

a3 <- (airbnb %>% select(neighbourhood, price, host_id, beds, bathrooms)
        %>% filter(neighbourhood == "Richmond" 
                     | neighbourhood == "Woodlawn"
                     | neighbourhood == "Downtown")
        %>% arrange(price, neighbourhood))

Base:

a1 <- airbnb[,c("neighbourhood", "price", "host_id", "beds", "bathrooms")]
a2 <- subset(a1, neighbourhood %in% c("Richmond", "Woodlawn", "Downtown"))
a3 <- a2[order(a2$price, a2$price), ]

mutate() and transmutate()

Add new variables:

mutate(a3,
    price_per_bed = price / beds,
    price_per_bath = price / bathrooms)

Or, make an entirely new data frame:

transmute(airbnb,
    price = price,
    price_per_bed = price / beds,
    price_per_bath = price / bathrooms)

group_by() and summarize()

group_by() aggregates data by category, e.g.:

by_hood <- group_by(a3, neighbourhood)

Now, you can calculate summaries of other variables within each group, e.g.:

summarise(by_hood, price = mean(price, na.rm = TRUE))

Your turn

  1. Make a data frame only including rooms in the top ten neighbourhoods (use n( ) to count them). Then, using only these neighbourhoods…

  2. Find the mean price, cleaning_fee, and ratio of cleaning fee to price, by neighbourhood.

  3. Edit your code in (2) to add variables for the 25% and 75% quantile of price (use quantile( )).

  4. Do as in (2) and (3) but splitting by both neighbourhood and room_type (e.g., finding the mean price of private rooms in Woodlawn).

  5. Edit your code in (1) to add a new variable giving the number of characters in the house_rules (use nchar( )).

Only top ten neighbourhoods

To get you started:

neighbourhood_counts <- (airbnb
                         %>% group_by(neighbourhood) %>% summarize(count=n())
                         %>% arrange(count))
top_ten <- neighbourhood_counts$neighbourhood[nrow(neighbourhood_counts) - 0:9]
sub_bnb <- filter(airbnb, neighbourhood %in% top_ten) %>% droplevels

Summarise

(sub_bnb
 %>% group_by(neighbourhood)
 %>% summarise(
           mean_price=mean(price, na.rm=TRUE),
           mean_cleaning_fee=mean(cleaning_fee, na.rm=TRUE),
           price_q1=quantile(price, probs=0.25, na.rm=TRUE),
           price_q3=quantile(price, probs=0.75, na.rm=TRUE)
 )
 %>% mutate(ratio=mean_price/mean_cleaning_fee)
 )
## # A tibble: 10 × 6
##    neighbourhood      mean_price mean_cleaning_fee price_q1 price_q3 ratio
##    <fct>                   <dbl>             <dbl>    <dbl>    <dbl> <dbl>
##  1 Boise-Eliot              118.              62.6     75       135   1.89
##  2 Buckman                  129.              58.6     84.5     146   2.20
##  3 Concordia                113.              55.8     70       130.  2.02
##  4 Downtown                 237.              85.5    101       300   2.77
##  5 Hosford-Abernethy        133.              58.8     84       153   2.26
##  6 King                     121.              60.8     69       138   1.99
##  7 Northwest District       142.              65.5     89       163   2.16
##  8 Overlook                 105.              54.7     65.8     119.  1.91
##  9 Richmond                 118.              59.7     75       129   1.98
## 10 Sunnyside                114.              56.8     73.2     134.  2.01
// reveal.js plugins