R for Biostatistical Analyses

Hannah Tavalire and Bill Cresko - University of Oregon

October 2019 - Advanced Biostats Review

Lecture 1 - Using R for Biostatistical Analyses

Why use `R`?

R is a statistical programming language (derived from S)
Superb data management & graphics capabilities
You can write your own functions
Powerful and flexible
Runs on all computer platforms
Well established system of packages and documentation
Active development and dedicated community
Can use a nice GUI front end such as RStudio
Reproducibility
- keep your scripts to see exactly what was done
- distribute these with your data
- embed your R analyses in polished RMarkdown files
FREE

`R` resources

The R Project Homepage: http://www.r-project.org
Quick R Homepage: http://www.statmethods.net
Bioconductor: http://www.bioconductor.org
An Introduction to R (long!): http://cran.r-project.org/doc/manuals/R-intro.html
R for Data Science: https://r4ds.had.co.nz
Google - tutorials, guides, demos, packages and more

Running `R`

Need to make sure that you have R installed
- locally or on a server
- https://www.r-project.org
Run R from the command line
- just type R
- can run it locally as well as on clusters
Install an R Integrated Development Environment (IDE)
- RStudio: http://www.rstudio.com
- Makes working with R much easier, particularly for a new R user
- Run on Windows, Mac or Linux OS

`RStudio`

Exercise 1.1 - Exploring `RStudio`

Open RStudio
Take a few minutes to familiarize yourself with the RStudio environment by locating the following features:
- See what types of new files can be made in RStudio by clicking the top-left icon, then open a new R script.
- The windows clockwise from top left are: the code editor, the workspace and history, the plots and files window, and the R console.
- In the plots and files window, click on the packages and help tabs to see what they offer.
Now open the file called ABS_2019_Exercises_for_R_Review.Rmd in ~/CLASS_MATERIALS/R_and_Rmd_Review/02.Exercises/
- This file will serve as your digital notebook for this review and contains the exercises.

Introduction to `RMarkdown`

`RMarkdown`

A great way to embed R code into descriptive files to keep your life organized
You can insert R code chunks into RMarkdown documents
You will be using the markdown language throughout the term!

The markdown language is very flexible

You can import RMarkdown templates into RStudio and open them as a new RMarkdown file
Better yet, there are packages that add templates and other functionality
When you install such a package, its templates show up in the ‘From Template’ section of the ‘new file’ startup screen
There are packages to make
- books
- journal articles
- slide shows
- interactive exercises
- many more

What is markdown?

Lightweight formal markup languages are used to add formatting to plaintext documents
- Adding basic syntax to the text will make elements look different once rendered/knit
- Available in many base editors (e.g., Atom text editor)
You then need a markdown application with a markdown processor/parser to render your text files into something more exciting
- Static and dynamic outputs!
- pdf, HTML, presentations, websites, scientific articles, books etc

What is Knitr and PANDOC?

knitr is an R package that runs the code chunks in an RMarkdown file and knits the results into a markdown file
Pandoc is a general-purpose converter that renders markdown files into other formats
https://pandoc.org
Can include math using LaTeX
GitHub will render markdown directly
Markdown can easily be rendered within most editors now
Within RStudio just use the knit button to render markdown
Markdown syntax is very easy

Formatting text

*Italic* or _Italic_
**Bold** or __Bold__

Italic or Italic
Bold or Bold

Formatting text

> "You know the greatest danger facing us is ourselves, an irrational fear of the unknown. 
But there’s no such thing as the unknown — only things temporarily hidden, temporarily not understood."
>
> --- Captain James T. Kirk

“You know the greatest danger facing us is ourselves, an irrational fear of the unknown. But there’s no such thing as the unknown — only things temporarily hidden, temporarily not understood.”

— Captain James T. Kirk

Formatting lists

- list_element
    - sub_list_element  #double tab to indent
    - sub_list_element  #double tab to indent
    - sub_list_element  #double tab to indent
- list_element
    - sub_list_element  #double tab to indent
# note the space after each dash: this is important!

list_element
- sub_list_element
- sub_list_element
- sub_list_element
list_element
- sub_list_element

Formatting lists

1. One
2. Two
3. Three
4. Four

One
Two
Three
Four

Inserting images or URLs

[Link](https://commonmark.org/help/)
![Image](https://i1.wp.com/evomics.org/wp-content/uploads/2012/07/20120115-IMG_0297.jpg)

Link

Exercise 1.2-1.3 - Intro to `RMarkdown` Files and `Rmarkdown` Advanced

Take a few minutes to familiarize yourself with RMarkdown files and the markdown language by completing Exercises 1.2 and 1.3 in your exercises document. Don’t worry if you don’t get all the way through.

BASICS of `R`

Commands can be submitted through
- terminal, console or scripts
- can be embedded as code chunks in RMarkdown
On these slides, code chunks are evaluated and their output is shown
- output is shown after the two # symbols
- the number in [] is the position of the first item on that line of output
R follows the normal order of mathematical operations (PEMDAS)
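
As a quick check of the evaluation order described above, the following sketch compares a few expressions. The variable names are illustrative only.

```r
# Exponents are evaluated first, then multiplication, then addition
no_parens <- 4 + 3 * 2^2      # same as 4 + (3 * 4)

# Parentheses override the default order
with_parens <- (4 + 3) * 2^2  # same as 7 * 4

# The exponent binds more tightly than a leading minus sign
neg_square <- -2^2            # same as -(2^2)

print(c(no_parens, with_parens, neg_square))
```

The last case is a common surprise: write (-2)^2 if you want the square of negative two.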

BASICS of `R`

Input code chunk and then output

4 * 4

## [1] 16

Input code chunk and then output

(4 + 3 * 2^2)

## [1] 16

Assigning Variables

A better way to do this is to assign variables
Variables are assigned values using the <- operator (preferred over = by convention).
Variable names must begin with a letter (or a dot that is not followed by a digit); after that they can contain letters, digits, dots and underscores.
Do keep in mind that R is case sensitive.
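
A small sketch of these naming rules, using made-up variable names and values. make.names() is a base R function that shows how R would repair a name that is not valid.

```r
# Names that differ only in case refer to different objects
weight <- 70
Weight <- 82
print(weight == Weight)

# Letters, digits, dots and underscores are all allowed after the first letter
body_mass.kg_2019 <- 70

# make.names() shows how R would repair an invalid name such as "3y"
print(make.names("3y"))
```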

Assigning Variables

x <- 2
x * 3

## [1] 6

y <- x * 3
y - 2

## [1] 4

These do not work

3y <- 3
3*y <- 3

Arithmetic operations on functions

Arithmetic operations can be performed easily on variables as well as on numbers, and functions can be applied to them too.

x <- 12
x + 2

## [1] 14

x^2

## [1] 144

log(x)

## [1] 2.484907

Arithmetic operations on functions

Note that the last of these - log - is a built-in function of R, and therefore the object of the function needs to be put in parentheses
These parentheses will be important, and we’ll come back to them later when we add arguments in the parentheses after the function
The outcome of calculations can be assigned to new variables as well, and the results can be checked using the print command

Arithmetic operations on functions

y <- 67
print(y)

## [1] 67

x <- 124
z <- (x * y)^2
print(z)

## [1] 69022864

STRINGS

Operations can be performed on character variables as well
Note that “characters” need to be set off by quotation marks to differentiate them from numbers
The c stands for concatenate
Note that we are using the same variable names as we did previously, which means that we’re overwriting our previous assignment
A good rule of thumb is to use new names for each variable, and make them short but still descriptive
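
Following the rule of thumb above, this sketch uses new, descriptive names (chosen here for illustration) and contrasts c() with paste(), a base R function that joins strings into one.

```r
greeting <- "I Love"
subject <- "Biostatistics"

# c() keeps the two strings as separate elements of one vector
as_vector <- c(greeting, subject)
print(length(as_vector))

# paste() joins them into a single string, separated by a space by default
as_sentence <- paste(greeting, subject)
print(as_sentence)

# nchar() counts the characters in each element
print(nchar(as_vector))
```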

STRINGS

x <- "I Love"
print(x)

## [1] "I Love"

y <- "Biostatistics"
print(y)

## [1] "Biostatistics"

z <- c(x, y)
print(z)

## [1] "I Love"        "Biostatistics"

VECTORS

In general R thinks in terms of vectors
- a list of characters, factors or numerical values (“I Love”)
- it will benefit any R user to try to write scripts with that in mind
- it will simplify most things
Vectors can be assigned directly using the c() function and then entering the exact values with commas separating each element.
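
To illustrate what "thinking in vectors" can look like in practice, here is a short sketch. The values and names are arbitrary.

```r
counts <- c(2, 3, 4, 2, 1)

# Arithmetic is applied to every element at once, no loop needed
doubled <- counts * 2

# Two vectors of equal length are combined element by element
total <- counts + doubled

# Square brackets pick out elements by position (R counts from 1)
first_two <- counts[1:2]

# A logical test returns one TRUE/FALSE per element, which can be used to filter
large <- counts[counts > 2]

print(total)
print(large)
```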

VECTORS

n <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)
print(n)

##  [1]  2  3  4  2  1  2  4  5 10  8  9

z <- n + 3
print(z)

##  [1]  5  6  7  5  4  5  7  8 13 11 12

FACTORS

The vector x is now what is called a character vector (“I Love”).
Sometimes we would like to treat the characters as if they were units for subsequent calculations.
These are called factors, and we can redefine our character variables as factors.
This might seem a bit strange, but it’s important for statistical analyses where we might want to see the mean or variance for two different treatments.
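
A minimal sketch of a factor with more than one level. The treatment labels are hypothetical and stand in for whatever grouping variable your data has.

```r
# Hypothetical treatment labels, one per measured individual
treatment <- c("drug", "control", "drug", "control", "drug")
treatment_factor <- factor(treatment)

# The distinct categories, sorted alphabetically by default
print(levels(treatment_factor))

# How many observations fall in each level
print(table(treatment_factor))

# Internally each level is stored as an integer code
print(as.integer(treatment_factor))
```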

FACTORS

x_factor <- as.factor(x)
print(x_factor)

## [1] I Love
## Levels: I Love

Note that factor levels are reported alphabetically

FACTORS

We can also determine how R “sees” a variable using str() or class() functions.
This is a useful check when importing datasets or verifying that you assigned a class correctly

str(x)

##  chr "I Love"

class(x)

## [1] "character"

Types or ‘classes’ of vectors of data

Types of vectors of data

int stands for integers
dbl stands for doubles, or real numbers (or num)
chr stands for character vectors, or strings
dttm stands for date-times (a date + a time)
lgl stands for logical, vectors that contain only TRUE or FALSE
fctr stands for factors, which R uses to represent categorical variables with fixed possible values
date stands for dates

Types of vectors of data

Logical vectors can take only three possible values:
- FALSE
- TRUE
- NA which is ‘not available’ and is the default coding for missing data in R
Integer and double vectors are known collectively as numeric vectors.
- In R numbers are doubles by default.
Integers have one special value: NA, while doubles have four:
- NA
- NaN which is ‘not a number’
- Inf
- -Inf
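
The special values listed above can be produced and tested directly. This is a small sketch using only base R; the variable names are illustrative.

```r
# Numbers are doubles unless you add the L suffix
print(typeof(1))
print(typeof(1L))

# The special values of doubles
pos_inf <- 1 / 0
neg_inf <- -1 / 0
not_a_number <- 0 / 0
print(c(pos_inf, neg_inf, not_a_number))

# NA propagates through calculations unless you ask for it to be removed
measurements <- c(1, NA, 3)
print(mean(measurements))
print(mean(measurements, na.rm = TRUE))
print(is.na(measurements))
```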

Basic Statistics

Many functions exist to operate on vectors.

mean(n)
median(n)
var(n)
log(n)
exp(n)
sqrt(n)
sum(n)
length(n)
sample(n, replace = TRUE)  # has an additional argument (replace = TRUE)

Arguments modify or direct the function in some way
- There are many arguments for each function, some of which are defaults
- Tab complete is helpful to view argument options
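
To see how an argument changes what a function does, here is a brief sketch. The trim argument of mean() and the digits argument of round() are part of base R; the numbers are arbitrary.

```r
values <- c(1, 2, 3, 100)

# Default behaviour: ordinary arithmetic mean
default_mean <- mean(values)

# trim = 0.25 drops the lowest and highest quarter of the values first
trimmed_mean <- mean(values, trim = 0.25)

# digits controls how many decimal places round() keeps
rounded <- round(3.14159, digits = 2)

# args() lists a function's arguments and their default values
print(args(sample))

print(c(default_mean, trimmed_mean, rounded))
```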

Getting Help

Getting help on any function is very easy - just type a question mark and the name of the function (or ?? followed by a search term to search the help pages of all installed packages).
There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of parameters/arguments to send to the function.
Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.
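
As an example of writing your own function, here is one possible sketch of a standard-error function. The name standard_error and its na.rm argument are choices made for this illustration, not something defined elsewhere in these slides.

```r
# Standard error of the mean: sd divided by the square root of the sample size
standard_error <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]  # drop missing values before counting
  }
  sd(x) / sqrt(length(x))
}

values <- c(2, 4, 4, 4, 5, 5, 7, 9)
print(standard_error(values))

# The optional argument is passed by name, just like for built-in functions
print(standard_error(c(values, NA), na.rm = TRUE))
```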

Getting Help

help(mean)
?mean
example(mean)
help.search("mean")
apropos("mean")
args(mean)

Creating vectors

Creating a vector of new data by entering it by hand can be a drag
However, it is also very easy to use functions such as
- seq
- sample
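
A short sketch of these helpers, together with rep(), which is another base R function often used for the same purpose. The seed value is arbitrary and only makes the random draw repeatable.

```r
# seq() can take a step size (by) or a target length (length.out)
evenly_spaced <- seq(0, 1, length.out = 5)
print(evenly_spaced)

# rep() repeats values: the whole vector (times) or each element (each)
labels_times <- rep(c("control", "drug"), times = 2)
labels_each <- rep(c("control", "drug"), each = 2)
print(labels_times)
print(labels_each)

# sample() draws at random; without replace = TRUE no value is drawn twice
set.seed(42)
three_draws <- sample(1:10, size = 3)
print(three_draws)
```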

Creating vectors

What do the arguments mean?

seq_1 <- seq(0, 10, by = 0.1)
print(seq_1)

##   [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3
##  [15]  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7
##  [29]  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1
##  [43]  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
##  [57]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
##  [71]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3
##  [85]  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##  [99]  9.8  9.9 10.0

Creating vectors

seq_2 <- seq(10, 0, by = -0.1)
print(seq_2)

##   [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7
##  [15]  8.6  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3
##  [29]  7.2  7.1  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9
##  [43]  5.8  5.7  5.6  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5
##  [57]  4.4  4.3  4.2  4.1  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1
##  [71]  3.0  2.9  2.8  2.7  2.6  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7
##  [85]  1.6  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
##  [99]  0.2  0.1  0.0

Creating vectors

seq_square <- (seq_2) * (seq_2)
print(seq_square)

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Creating vectors

seq_square_new <- (seq_2)^2
print(seq_square_new)

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

R Interlude

Complete Exercises 1.4-1.7

Drawing samples from distributions

Here is a way to create your own data sets that are random samples… we started doing this in class already!

x <- rnorm(n = 10000, mean = 0, sd = 10)
y <- sample(1:10000, 10000, replace = TRUE)
xy <- cbind(x, y)
plot(xy)

Drawing samples from distributions

x <- rnorm(10000, 0, 10)
y <- sample(1:10000, 10000, replace = TRUE)
xy <- cbind(x, y)
hist(x)

Drawing samples from distributions

You’ve probably figured out that y from the last example is drawing numbers with equal probability.
What if you want to draw from a distribution?
Again, play around with the arguments in the parentheses to see what happens.

x <- rnorm (10000, 0, 10)
y <- sample (???, 10000, replace = ???)

Drawing samples from distributions

dnorm() generates the probability density, which can be plotted using the curve() function.
Note that the curve is added to the existing plot using add = TRUE

x <- rnorm(1000, 0, 100)
hist(x, xlim = c(-500, 500))
curve(50000 * dnorm(x, 0, 100), xlim = c(-500, 500), add = TRUE, 
    col = "Red")
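
The factor 50000 in the curve() call rescales the density to the count scale of the histogram. It appears to correspond to the number of draws (1000) times a bin width of 50, although the bin width that hist() picks depends on the data. The sketch below computes the scaling from the histogram itself rather than hard-coding it, assuming equal-width bins. The seed and the names h, bin_width and scale_factor are choices made for this illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
x <- rnorm(n = 1000, mean = 0, sd = 100)

# hist() returns the break points and counts that it used
h <- hist(x, xlim = c(-500, 500))
bin_width <- diff(h$breaks)[1]

# Expected count per bin is roughly n * bin width * density
scale_factor <- length(x) * bin_width

# Inside curve(), x is the plotting variable, not the sample above
curve(scale_factor * dnorm(x, mean = 0, sd = 100), add = TRUE, col = "red")
```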

A Note About Arguments in `R` Functions

Sometimes R can guess what you mean because of order…

x <- rnorm(1000, 0, 10)  #n=, mean=, sd=
x[1:10]

##  [1]   5.7478597 -14.7850405   0.7835355 -10.0918965  11.9909998
##  [6]   2.2570687  15.9292746   3.9519431  -8.4260325  -4.0817148

But sometimes if the order isn’t right, you can confuse R and get something you really didn’t want…

x2 <- rnorm(10, 1000, 0)  #n=, mean=, sd=
x2

##  [1] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000

Arguments in `R` Functions

A work-around and best-practice: include the arguments!!

set.seed(145)
x <- rnorm(n = 1000, mean = 0, sd = 10)  #n=, mean=, sd=
x[1:10]

##  [1]  6.869129 10.663631  5.367006 19.060287 10.631596 13.703436  5.277918
##  [8]  4.030967 11.677516  7.926794

set.seed(145)
x2 <- rnorm(sd = 10, n = 1000, mean = 0)  # named arguments can be given in any order
x2[1:10]

##  [1]  6.869129 10.663631  5.367006 19.060287 10.631596 13.703436  5.277918
##  [8]  4.030967 11.677516  7.926794

Notice we also set the seed to replicate our sample results!

Visualizing Data in `R`

Visualizing Data

So far you’ve been visualizing just the list of output numbers
Except for the last example where I snuck in a hist function.
You can also visualize all of the variables that you’ve created using the plot function (as well as a number of more sophisticated plotting functions).
Each of these is called a high level plotting function, which sets the stage
Low level plotting functions will tweak the plots and make them beautiful
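
A minimal sketch of the distinction, using only base graphics. plot() is the high-level call that sets the stage; points(), abline() and legend() are low-level calls that add to it. The data and labels are invented for illustration.

```r
x_vals <- seq(0, 10, by = 0.1)
y_vals <- x_vals^2

# High-level call: opens a new plot with axes and labels
plot(x_vals, y_vals, type = "l", xlab = "x", ylab = "x squared")

# Low-level calls: add to the plot that already exists
picked <- c(1, 51, 101)
points(x_vals[picked], y_vals[picked], col = "red", pch = 19)
abline(h = 50, lty = 2)
legend("topleft", legend = c("curve", "selected points"),
       col = c("black", "red"), lty = c(1, NA), pch = c(NA, 19))
```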

Visualizing Data

seq_1 <- seq(0, 10, by = 0.1)
plot(seq_1, xlab = "space", ylab = "function of space", type = "p", 
    col = "red")

Putting plots in a single figure

The par() line in the script below tells R that you are going to create a composite figure that has two rows and two columns
- Can you tell how?

seq_1 <- seq(0, 10, by = 0.1)
seq_2 <- seq(10, 0, by = -0.1)

par(mfrow = c(2, 2))
plot(seq_1, xlab = "time", ylab = "p in population 1", type = "p", 
    col = "red")
plot(seq_2, xlab = "time", ylab = "p in population 2", type = "p", 
    col = "green")
plot(seq_square, xlab = "time", ylab = "p2 in population 2", 
    type = "p", col = "blue")
plot(seq_square_new, xlab = "time", ylab = "p in population 1", 
    type = "l", col = "yellow")
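
One thing to be aware of is that a layout set with par() stays in effect for later plots on the same device. A common pattern, sketched here with a simpler one-by-two layout, is to save the old settings and restore them afterwards. The name old_par is just a convention.

```r
# par() returns the previous settings when you change them
old_par <- par(mfrow = c(1, 2))  # one row, two columns

seq_1 <- seq(0, 10, by = 0.1)
plot(seq_1, xlab = "index", ylab = "value", type = "p", col = "red")
plot(seq_1^2, xlab = "index", ylab = "value squared", type = "l", col = "blue")

# Restore the saved settings so later plots fill the whole device again
par(old_par)
```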

Putting plots in a single figure

R Interlude

Complete Exercises 1.8-1.9

Working with Imported Datasets in `R`

Creating Data Frames in `R`

As you have seen, in R you can generate your own random data set drawn from nearly any distribution very easily.
Often we will want to use collected data.
Now, let’s make a dummy dataset to get used to dealing with data frames
- Set up three variables (habitat, temp and elevation) as vectors

habitat <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry", 
    "dry", "mixed"))
temp <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
elevation <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)

Creating Data Frames in R

Create a data frame where vectors become columns

mydata <- data.frame(habitat, temp, elevation)
row.names(mydata) <- c("Reedy Lake", "Pearcadale", "Warneet", 
    "Cranbourne", "Lysterfield", "Red Hill", "Devilbend", "Olinda")
head(mydata)

##             habitat temp elevation
## Reedy Lake    mixed  3.4       0.0
## Pearcadale      wet  3.4       9.2
## Warneet         wet  8.4       3.8
## Cranbourne      wet  3.0       5.0
## Lysterfield     dry  5.6       5.6
## Red Hill        dry  8.1       4.1

Now you have a hand-made data frame with row names

R Interlude: Reading in Data Frames in R

A strength of R is being able to import data from an external source
- Create the same table that you did above in a spreadsheet using Excel or similar
- Export it as both a comma-separated and a tab-separated text file for importing into R.
- Of the commands on the next slide, the first two read in a comma-delimited file, whereas the third reads in a tab-delimited file
- In all cases the header and row.names arguments indicate that there is a header row and a row label column
- Note that the name of the file by itself will have R look in the current working directory (see getwd()), whereas a full path can also be used

Reading in Data Frames in R

YourFile <- read.table("yourfile.csv", header = TRUE, row.names = 1, 
    sep = ",")
YourFile <- read.csv("yourfile.csv", header = TRUE, row.names = 1, 
    sep = ",")
YourFile <- read.table("yourfile.txt", header = TRUE, row.names = 1, 
    sep = "\t")

Exporting Data Frames in R

write.csv(YourFile, "yourfile.csv", quote = FALSE, row.names = TRUE)  # write.csv always uses a comma; setting sep only triggers a warning
write.table(YourFile, "yourfile.txt", quote = FALSE, row.names = TRUE, 
    sep = "\t")

You will get more practice with this during the next R interlude
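
To try exporting and importing without a spreadsheet, here is a self-contained round-trip sketch. It uses three of the sites from the hand-made data frame and writes to a temporary file; the names csv_path and reloaded are illustrative, and stringsAsFactors = TRUE is used so that habitat comes back as a factor.

```r
# Three rows taken from the hand-made data frame above
mydata <- data.frame(
  habitat = factor(c("mixed", "wet", "dry")),
  temp = c(3.4, 3.4, 5.6),
  elevation = c(0, 9.2, 5.6),
  row.names = c("Reedy Lake", "Pearcadale", "Lysterfield")
)

# Write to a temporary file so nothing in your working directory is overwritten
csv_path <- tempfile(fileext = ".csv")
write.csv(mydata, csv_path, quote = FALSE, row.names = TRUE)

# Read it back, telling R that column 1 holds the row labels
reloaded <- read.csv(csv_path, header = TRUE, row.names = 1,
                     stringsAsFactors = TRUE)
print(reloaded)
```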

Indexing in data frames

Next up - indexing just a subset of the data
This is a very important feature in R, that allows you to analyze just a subset of the data.

print(YourFile[, 2])
print(YourFile$temp)
print(YourFile[2, ])
plot(YourFile$temp, YourFile$elevation)

Indexing in data frames

You can also assign whole columns, or single values, from a data set to a new variable

x <- (YourFile[, 2])
y <- (YourFile$temp)
z <- (YourFile$elevation)
plot(y, z)
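
Indexing can also pick out rows that meet a condition. The sketch below rebuilds the hand-made data frame from earlier (without row names) so that it runs on its own; the names warm_sites and wet_elevation are illustrative.

```r
habitat <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry",
                    "dry", "mixed"))
temp <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
elevation <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)
mydata <- data.frame(habitat, temp, elevation)

# A single value: row 3 of the column named "temp"
print(mydata[3, "temp"])

# All columns for the rows that satisfy a condition
warm_sites <- mydata[mydata$temp > 5, ]
print(warm_sites)

# One column, restricted to one level of a factor
wet_elevation <- mydata$elevation[mydata$habitat == "wet"]
print(wet_elevation)
```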

Indexing in data frames

You can perform operations on particular levels of a factor
Note that the first argument is the numerical column vector, and the second is the factor column vector.
The third is the operation. Reversing the first two does not work
- Tab complete will tell you the correct order for arguments

tapply(YourFile$temp, YourFile$habitat, mean)
tapply(YourFile$temp, YourFile$habitat, var)
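
Here is the same idea as a self-contained sketch on the hand-made data frame, so it can be run without an imported file. aggregate() is a base R alternative that returns the summary as a data frame; the result names are illustrative.

```r
habitat <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry",
                    "dry", "mixed"))
temp <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
mydata <- data.frame(habitat, temp)

# Numeric column first, factor second, operation third
mean_by_habitat <- tapply(mydata$temp, mydata$habitat, mean)
print(mean_by_habitat)

# aggregate() expresses the same grouping with a formula
mean_table <- aggregate(temp ~ habitat, data = mydata, FUN = mean)
print(mean_table)
```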

R Interlude

Complete Exercises 1.10-1.11
