R for Biostatistical Analyses

Hannah Tavalire and Bill Cresko - University of Oregon

October 2019 - Advanced Biostats Review

Lecture 1 - Using R for Biostatistical Analyses

Why use R?

  • R is a statistical programming language (derived from S)
  • Superb data management & graphics capabilities
  • You can write your own functions
  • Powerful and flexible
  • Runs on all computer platforms
  • Well established system of packages and documentation
  • Active development and dedicated community
  • Can use a nice GUI front end such as RStudio
  • Reproducibility
    • keep your scripts to see exactly what was done
    • distribute these with your data
    • embed your R analyses in polished RMarkdown files
  • FREE

R resources

Running R

  • Need to make sure that you have R installed
  • Run R from the command line
    • just type R
    • can run it locally as well as on clusters
  • Install an R Integrated Development Environment (IDE)
    • RStudio: http://www.rstudio.com
    • Makes working with R much easier, particularly for a new R user
    • Run on Windows, Mac or Linux OS

RStudio

Exercise 1.1 - Exploring RStudio

  • Open RStudio
  • Take a few minutes to familiarize yourself with the RStudio environment by locating the following features:
    • See what types of new files can be made in RStudio by clicking the top left icon, then open a new R script.
    • The windows clockwise from top left are: the code editor, the workspace and history, the plots and files window, and the R console.
    • In the plots and files window, click on the packages and help tabs to see what they offer.
  • Now open the file called ABS_2019_Exercises_for_R_Review.Rmd in ~/CLASS_MATERIALS/R_and_Rmd_Review/02.Exercises/
    • This file will serve as your digital notebook for this review and contains the exercises.

Introduction to RMarkdown

RMarkdown

The markdown language is very flexible

  • You can import RMarkdown templates into RStudio and open them as a new RMarkdown file
  • Better yet there are packages that add functionality
  • When you install the package it will show up in the ‘From Template’ section of the ‘new file’ startup screen
  • There are packages to make
    • books
    • journal articles
    • slide shows
    • interactive exercises
    • many more

What is markdown?

  • Lightweight formal markup languages are used to add formatting to plaintext documents
    • Adding basic syntax to the text will make elements look different once rendered/knit
    • Available in many base editors (e.g., Atom text editor)
  • You then need a markdown application with a markdown processor/parser to render your text files into something more exciting
    • Static and dynamic outputs!
    • pdf, HTML, presentations, websites, scientific articles, books etc

What are knitr and Pandoc?

  • knitr is an R package that runs the code chunks in an RMarkdown file and weaves the results into a markdown file
  • Pandoc is a general-purpose converter that renders markdown files into other formats
  • https://pandoc.org
  • Can include math using LaTeX
  • GitHub will render markdown directly
  • Markdown can easily be rendered within most editors now
  • Within RStudio just use the knit button to render markdown
  • Markdown syntax is very easy

Formatting text

  • *Italic* or _Italic_
  • **Bold** or __Bold__

Formatting text

“You know the greatest danger facing us is ourselves, an irrational fear of the unknown. But there’s no such thing as the unknown — only things temporarily hidden, temporarily not understood.”

— Captain James T. Kirk

Formatting lists

  • list_element
    • sub_list_element
    • sub_list_element
    • sub_list_element
  • list_element
    • sub_list_element

Formatting lists

  1. One
  2. Two
  3. Three
  4. Four

Inserting images or URLs

[Link](url) ![Image](path)

Exercise 1.2-1.3 - Intro to RMarkdown Files and RMarkdown Advanced

  • Take a few minutes to familiarize yourself with RMarkdown files and the markdown language by completing exercises 1.2 & 1.3 in your exercises document. Don’t worry if you don’t get all the way through.

BASICS of R

BASICS of R

  • Commands can be submitted through
    • terminal, console or scripts
    • can be embedded as code chunks in RMarkdown
  • These slides evaluate code chunks and show their output
    • output is shown after the two # symbols
    • the number in [] is the index of the first output item on that line
  • R follows the standard order of mathematical operations (PEMDAS)

BASICS of R

Input code chunk and then output

## [1] 16

Input code chunk and then output

## [1] 16
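
The code chunks that produced these two outputs did not survive in this copy of the slides. As a minimal sketch, the two expressions below are assumed stand-ins (not necessarily the ones originally shown) that both evaluate to 16 and illustrate the order of operations:

```r
# Assumed examples; the original chunks are not visible on the slide.
first  <- 4 * 4          # simple multiplication
second <- 4 + 3 * 2^2    # exponent first, then multiplication, then addition
print(first)
print(second)
```

Both lines print `[1] 16`, matching the output shown above.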

Assigning Variables

  • A better way to do this is to assign variables
  • Variables are assigned values using the <- operator (preferred over =, which also works).
  • Variable names must begin with a letter (or a dot not followed by a digit) and may contain letters, digits, dots and underscores.
  • Do keep in mind that R is case sensitive.

Assigning Variables

## [1] 6
## [1] 4
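
The assignments behind this output are not shown here. One possible sketch that reproduces the two values, using assumed variable names and starting values, is:

```r
# Assumed values, chosen only to reproduce the outputs 6 and 4.
x <- 2        # assign the value 2 to x
x * 3         # use x in a calculation
y <- x * 3    # store the result in a new variable
y - 2
```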

These do not work
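
The failing examples are missing from this copy of the slide. The sketch below uses two hypothetical invalid assignments (a name starting with a digit, and an expression on the left-hand side). They are kept as strings and evaluated inside tryCatch() so that the script keeps running and simply reports the failure:

```r
# Hypothetical invalid assignments, stored as text so the script still parses.
bad_lines <- c("3y <- 3", "3*y <- 3")

results <- vapply(bad_lines, function(line) {
  tryCatch({
    eval(parse(text = line))   # parsing or evaluating the line should fail
    "ok"
  }, error = function(e) "error")
}, character(1))

print(results)
```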

Arithmetic operations on functions

  • Arithmetic operations can be performed easily on variables and with functions, as well as on numbers.
## [1] 14
## [1] 144
## [1] 2.484907
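
The chunk for this output is not shown. Assuming x holds the value 12 (an assumption, but one that reproduces all three results), the calculation could look like this:

```r
x <- 12     # assumed value; it reproduces the three outputs above
x + 2       # addition
x^2         # exponentiation
log(x)      # natural logarithm, a built-in function
```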

Arithmetic operations on functions

  • Note that the last of these - log - is a built-in function of R, and therefore its argument must be placed in parentheses
  • These parentheses will be important, and we’ll come back to them later when we add arguments in the parentheses after the function
  • The outcome of calculations can be assigned to new variables as well, and the results can be checked using the print command

Arithmetic operations on functions

## [1] 67
## [1] 69022864
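
As a hedged reconstruction of the missing chunk: the first output is consistent with printing y = 67, and the second with squaring the product of y and an assumed x = 124. The value of x is a guess that happens to match the output.

```r
y <- 67
print(y)

x <- 124            # assumed value, chosen to match the second output
z <- (x * y)^2      # assign the outcome of a calculation to a new variable
print(z)
```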

STRINGS

  • Operations can be performed on character variables as well
  • Note that “characters” need to be set off by quotation marks to differentiate them from numbers
  • The c in c() stands for concatenate; the function combines values into a single vector
  • Note that we are using the same variable names as we did previously, which means that we’re overwriting our previous assignment
  • A good rule of thumb is to use new names for each variable, and make them short but still descriptive

STRINGS

## [1] "I Love"
## [1] "Biostatistics"
## [1] "I Love"        "Biostatistics"
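
A minimal sketch that would produce the three lines of output above; the variable names are assumptions:

```r
x <- "I Love"            # character values are set off by quotation marks
print(x)

y <- "Biostatistics"
print(y)

z <- c(x, y)             # concatenate the two strings into one vector
print(z)
```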

VECTORS

  • In general R thinks in terms of vectors
    • a vector is an ordered set of character, factor or numerical values (a single value such as “I Love” is a vector of length one)
    • it will benefit any R user to try to write scripts with that in mind
    • it will simplify most things
  • Vectors can be assigned directly using the c() function and then entering the exact values with commas separating each element.

VECTORS

##  [1]  2  3  4  2  1  2  4  5 10  8  9
##  [1]  5  6  7  5  4  5  7  8 13 11 12
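
The output above is consistent with creating a numeric vector and then adding 3 to every element. A sketch, with an assumed variable name:

```r
n <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)   # enter the values directly
print(n)

n_plus <- n + 3   # arithmetic is applied to each element of the vector
print(n_plus)
```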

FACTORS

  • The vector x is now what is called a character vector (“I Love”).
  • Sometimes we would like to treat the characters as if they were units for subsequent calculations.
  • These are called factors, and we can redefine our character variables as factors.
  • This might seem a bit strange, but it’s important for statistical analyses where we might want to see the mean or variance for two different treatments.

FACTORS

## [1] I Love
## Levels: I Love
  • Note that factor levels are reported alphabetically
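
A short sketch of the conversion, assuming x still holds "I Love". The treatment vector is a hypothetical addition, included only to show the alphabetical ordering of levels:

```r
x <- "I Love"
x_factor <- as.factor(x)   # redefine the character variable as a factor
print(x_factor)

# Hypothetical treatment labels, to show how levels are ordered.
treatment <- factor(c("wet", "dry", "mixed", "dry"))
levels(treatment)          # "dry" "mixed" "wet"
```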

FACTORS

  • We can also determine how R “sees” a variable using str() or class() functions.
  • This is a useful check when importing datasets or verifying that you assigned a class correctly
##  chr "I Love"
## [1] "character"
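
The two lines of output above would come from calls like the following (assuming x is still the character value "I Love"):

```r
x <- "I Love"
str(x)                    # compact description of the structure
class(x)                  # "character"
class(as.factor(x))       # "factor" after conversion
```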

Types or ‘classes’ of vectors of data

Types of vectors of data

  • int stands for integers

  • dbl stands for doubles, or real numbers (or num)

  • chr stands for character vectors, or strings

  • dttm stands for date-times (a date + a time)

  • lgl stands for logical, vectors that contain only TRUE or FALSE

  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values

  • date stands for dates

Types of vectors of data

  • Logical vectors can take only three possible values:
    • FALSE
    • TRUE
    • NA which is ‘not available’ and is the default coding for missing data in R
  • Integer and double vectors are known collectively as numeric vectors.
    • In R numbers are doubles by default.
  • Integers have one special value: NA, while doubles have four:
    • NA
    • NaN which is ‘not a number’
    • Inf
    • -Inf
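
These types and special values can be checked directly at the console. A brief sketch:

```r
typeof(1)        # "double": numbers are doubles by default
typeof(1L)       # "integer": the L suffix requests an integer
typeof(TRUE)     # "logical"

0 / 0            # NaN
1 / 0            # Inf
-1 / 0           # -Inf

is.na(c(1, NA, NaN))   # is.na() flags both NA and NaN
```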

Basic Statistics

Many functions exist to operate on vectors.

  • Arguments modify or direct the function in some way
    • There are many arguments for each function, some of which are defaults
    • Tab complete is helpful to view argument options
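
As an illustration (the vector and the choice of functions are assumptions, not the original slide content), a few summary functions and one argument that changes their behaviour:

```r
n <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)

mean(n)      # arithmetic mean
median(n)    # middle value
var(n)       # sample variance
sd(n)        # sample standard deviation
sum(n)       # total
length(n)    # number of elements

# The na.rm argument directs how missing values are handled.
n_missing <- c(n, NA)
mean(n_missing)                 # NA, because one value is missing
mean(n_missing, na.rm = TRUE)   # drops the NA before averaging
```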

Getting Help

  • Getting help on any function is very easy: type a question mark and the name of the function (or ?? followed by a search term to search the help pages of all installed packages).
  • There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
  • In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of parameters/arguments to send to the function.
  • Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.
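
A sketch of the usual ways to ask for help. The help calls open a viewer, so here they are wrapped in interactive(); the functions looked up are arbitrary examples:

```r
if (interactive()) {
  ?mean                      # help page for a function
  help("seq")                # the same, written as a function call
  ??"standard deviation"     # search the help of all installed packages
}

# The arguments of a function can also be inspected without the help viewer.
args(rnorm)
arg_names <- names(formals(rnorm))
print(arg_names)             # "n" "mean" "sd"
```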

Getting Help

Creating vectors

  • Creating a vector of new data by entering it by hand can be a drag
  • However, it is also very easy to use functions such as
    • seq
    • sample

Creating vectors

  • What do the arguments mean?
##   [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3
##  [15]  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7
##  [29]  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1
##  [43]  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
##  [57]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
##  [71]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3
##  [85]  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##  [99]  9.8  9.9 10.0
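
The output above is what a call like the following would give; the variable name is an assumption. The arguments are the starting value, the end value and the step size:

```r
seq_1 <- seq(from = 0.0, to = 10.0, by = 0.1)
length(seq_1)      # 101 values
head(seq_1)

# length.out is an alternative to by: ask for a number of values instead.
seq(from = 0, to = 1, length.out = 5)
```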

Creating vectors

##   [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7
##  [15]  8.6  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3
##  [29]  7.2  7.1  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9
##  [43]  5.8  5.7  5.6  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5
##  [57]  4.4  4.3  4.2  4.1  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1
##  [71]  3.0  2.9  2.8  2.7  2.6  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7
##  [85]  1.6  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
##  [99]  0.2  0.1  0.0

Creating vectors

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Creating vectors

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00
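
The descending and squared sequences above can be reproduced with a sketch like this one. The variable names are assumptions, and the second way of squaring is included only to show that different code can give identical output:

```r
seq_2 <- seq(from = 10.0, to = 0.0, by = -0.1)   # a negative step counts down
head(seq_2, 3)

seq_square <- seq_2^2                # square every element
head(seq_square, 3)                  # 100.00 98.01 96.04

seq_square_new <- seq_2 * seq_2      # element-wise multiplication
all.equal(seq_square, seq_square_new)
```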

R Interlude

Complete Exercises 1.4-1.7

Drawing samples from distributions

  • Here is a way to create your own data sets that are random samples… we started doing this in class already!
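
A minimal sketch of random sampling with sample(). The range, sample size and seed are arbitrary choices for illustration:

```r
set.seed(42)   # arbitrary seed so the draw can be repeated

# Draw 20 numbers from 1 to 100; each value is equally likely.
y <- sample(1:100, size = 20, replace = TRUE)
print(y)

# Without replacement, sample() shuffles the values instead.
shuffled <- sample(1:10)
print(shuffled)
```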

Drawing samples from distributions

Drawing samples from distributions

  • You’ve probably figured out that y from the last example is drawing numbers with equal probability.
  • What if you want to draw from a distribution?
  • Again, play around with the arguments in the parentheses to see what happens.
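
To draw from a named distribution, base R provides functions such as rnorm(), rbinom() and rpois(). The parameter values below are assumptions chosen for illustration:

```r
set.seed(1)   # arbitrary seed

x_norm  <- rnorm(n = 1000, mean = 0, sd = 10)        # normal
x_binom <- rbinom(n = 1000, size = 20, prob = 0.5)   # binomial
x_pois  <- rpois(n = 1000, lambda = 3)               # Poisson

mean(x_norm)   # close to 0
sd(x_norm)     # close to 10
```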

Drawing samples from distributions

  • dnorm() computes the probability density, which can be plotted using the curve() function.
  • Note that the curve is added to an existing plot using add=TRUE
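
A sketch of overlaying a density curve on a histogram. The sample, the axis range and the colour are assumptions; the pdf(NULL) line only prevents a plot file from being written when the code is run as a script:

```r
if (!interactive()) pdf(NULL)   # no plot file when run non-interactively

set.seed(1)
draws <- rnorm(1000, mean = 0, sd = 10)

# freq = FALSE puts the histogram on the density scale.
hist(draws, freq = FALSE, main = "Sample with normal density")

# add = TRUE draws the curve on top of the existing histogram.
curve(dnorm(x, mean = 0, sd = 10), from = -40, to = 40,
      add = TRUE, col = "red")

peak <- dnorm(0, mean = 0, sd = 10)   # height of the density at the mean
print(peak)
```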

A Note About Arguments in R Functions

  • Sometimes R can guess what you mean because of order…
##  [1]   5.7478597 -14.7850405   0.7835355 -10.0918965  11.9909998
##  [6]   2.2570687  15.9292746   3.9519431  -8.4260325  -4.0817148
  • But sometimes if the order isn’t right, you can confuse R and get something you really didn’t want…
##  [1] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000

Arguments in R Functions

  • A work-around and best practice: name the arguments explicitly!
##  [1]  6.869129 10.663631  5.367006 19.060287 10.631596 13.703436  5.277918
##  [8]  4.030967 11.677516  7.926794
##  [1]  6.869129 10.663631  5.367006 19.060287 10.631596 13.703436  5.277918
##  [8]  4.030967 11.677516  7.926794
  • Notice we also set the seed to replicate our sample results!
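
The calls behind these outputs are not shown, so the following is a sketch with assumed values. It contrasts positional matching with named arguments and shows how set.seed() makes a draw repeatable:

```r
# Positional matching: the values are read as n, mean, sd in that order.
rnorm(10, 0, 10)

# Assumed example of a mix-up: this is read as mean = 1000 and sd = 0,
# so every value is exactly 1000.
mixed_up <- rnorm(10, 1000, 0)
print(mixed_up)

# With named arguments the order no longer matters.
set.seed(123)   # arbitrary seed
a <- rnorm(n = 10, mean = 10, sd = 5)
set.seed(123)   # resetting the seed replicates the sample
b <- rnorm(sd = 5, mean = 10, n = 10)
identical(a, b)
```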

Visualizing Data in R

Visualizing Data

  • So far you’ve been visualizing just the list of output numbers
  • Except for the last example where I snuck in a hist function.
  • You can also visualize all of the variables that you’ve created using the plot function (as well as a number of more sophisticated plotting functions).
  • Each of these is called a high level plotting function, which sets the stage
  • Low level plotting functions will tweak the plots and make them beautiful
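
A sketch of the two kinds of plotting function, using simulated data (the data, labels and colours are assumptions). plot() and hist() start a new plot; abline() and points() add to the current one:

```r
if (!interactive()) pdf(NULL)   # no plot file when run non-interactively

set.seed(1)
x <- rnorm(100, mean = 0, sd = 10)
y <- x + rnorm(100, mean = 0, sd = 5)

# High-level function: sets up axes, labels and the points.
plot(x, y, main = "Simulated data", xlab = "x", ylab = "y")

# Low-level functions: add elements to the existing plot.
abline(lm(y ~ x), col = "blue")                    # fitted line
points(mean(x), mean(y), pch = 19, col = "red")    # mark the centroid

hist(x, main = "Histogram of x")   # another high-level function
```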

Visualizing Data

Putting plots in a single figure

  • The first line of the lower script tells R that you are going to create a composite figure that has two rows and two columns (on next slide)
    • Can you tell how?
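
The script itself is missing from this copy. A sketch of a two-by-two composite figure, assuming the layout is set with par(mfrow = ...) and using simulated data with four arbitrary plot types:

```r
if (!interactive()) pdf(NULL)   # no plot file when run non-interactively

set.seed(1)
x <- rnorm(1000)

old_par <- par(mfrow = c(2, 2))   # two rows, two columns, filled by row
hist(x, main = "Histogram")
boxplot(x, main = "Boxplot")
plot(x, main = "Index plot")
qqnorm(x, main = "Normal Q-Q")
par(old_par)                      # restore the previous layout
```

Saving the old settings and restoring them afterwards keeps later plots from inheriting the four-panel layout.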

Putting plots in a single figure

R Interlude

Complete Exercises 1.8-1.9

Working with Imported Datasets in R

Creating Data Frames in R

  • As you have seen, in R you can generate your own random data set drawn from nearly any distribution very easily.
  • Often we will want to use collected data.
  • Now, let’s make a dummy dataset to get used to dealing with data frames
    • Set up three variables (habitat, temp and elevation) as vectors

Creating Data Frames in R

  • Create a data frame where vectors become columns
##             habitat temp elevation
## Reedy Lake    mixed  3.4       0.0
## Pearcadale      wet  3.4       9.2
## Warneet         wet  8.4       3.8
## Cranbourne      wet  3.0       5.0
## Lysterfield     dry  5.6       5.6
## Red Hill        dry  8.1       4.1
  • Now you have a hand-made data frame with row names
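
The values in the table above are enough to rebuild the data frame. A sketch, where the object name mydata is an assumption:

```r
# Three variables as vectors, using the values shown in the table above.
habitat   <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry"))
temp      <- c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1)
elevation <- c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1)

# Each vector becomes a column.
mydata <- data.frame(habitat, temp, elevation)

# Label the rows with the site names.
row.names(mydata) <- c("Reedy Lake", "Pearcadale", "Warneet",
                       "Cranbourne", "Lysterfield", "Red Hill")
print(mydata)
```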

R Interlude: Reading in Data Frames in R

  • A strength of R is being able to import data from an external source
    • Create the same table that you did above in a spreadsheet using Excel or similar
    • Export it to both a comma-separated and a tab-separated text file for importing into R.
    • The first command will read in a comma-delimited file, whereas the second reads a tab-delimited file
    • In both cases the header and row.names arguments indicate that there is a header row and a row label column
    • Note that the name of the file by itself will have R look in the current working directory, whereas a full path can also be used
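
The import commands are not shown in this copy. In the sketch below, two temporary files stand in for the spreadsheet exports (the file paths and object names are assumptions), and read.table() reads each one back:

```r
# Stand-ins for the files exported from a spreadsheet.
sites <- data.frame(
  habitat   = c("mixed", "wet", "dry"),
  temp      = c(3.4, 8.4, 5.6),
  elevation = c(0.0, 3.8, 5.6),
  row.names = c("Reedy Lake", "Warneet", "Lysterfield")
)
csv_path <- tempfile(fileext = ".csv")
tab_path <- tempfile(fileext = ".txt")
write.table(sites, csv_path, sep = ",",  col.names = NA)
write.table(sites, tab_path, sep = "\t", col.names = NA)

# header = TRUE: first line holds column names.
# row.names = 1: first column holds the row labels.
from_csv <- read.table(csv_path, header = TRUE, row.names = 1, sep = ",")
from_tab <- read.table(tab_path, header = TRUE, row.names = 1, sep = "\t")
print(from_csv)
```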

Reading in Data Frames in R

Exporting Data Frames in R

  • You will get more practice with this during the next R interlude
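
A sketch of exporting with write.table(). The data frame, file name and argument choices are assumptions; the file is written to a temporary directory so nothing is left behind:

```r
sites <- data.frame(
  habitat   = c("mixed", "wet", "dry"),
  temp      = c(3.4, 8.4, 5.6),
  elevation = c(0.0, 3.8, 5.6),
  row.names = c("Reedy Lake", "Warneet", "Lysterfield")
)

out_path <- file.path(tempdir(), "sites_out.csv")   # hypothetical file name

# quote = FALSE leaves text unquoted; col.names = NA adds a blank header
# cell above the row names so the columns line up.
write.table(sites, file = out_path, sep = ",", quote = FALSE, col.names = NA)

first_lines <- readLines(out_path, n = 2)
print(first_lines)
```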

Indexing in data frames

  • Next up - indexing just a subset of the data
  • This is a very important feature in R, that allows you to analyze just a subset of the data.
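
A sketch of common ways to index a data frame, using the dummy data set from earlier (the object name sites is an assumption). Square brackets take rows first, then columns:

```r
sites <- data.frame(
  habitat   = factor(c("mixed", "wet", "wet", "wet", "dry", "dry")),
  temp      = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1),
  elevation = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1),
  row.names = c("Reedy Lake", "Pearcadale", "Warneet",
                "Cranbourne", "Lysterfield", "Red Hill")
)

sites[1, ]                                # first row, all columns
sites[, 2]                                # all rows, second column
sites$temp                                # a column by name
sites[c(1, 3), c("temp", "elevation")]    # chosen rows and columns

wet_sites <- sites[sites$habitat == "wet", ]   # rows meeting a condition
print(wet_sites)
```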

Indexing in data frames

  • You can also assign multiple values, or a single value, from a data set to a new variable
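
For example, with the dummy data set from earlier (object and variable names are assumptions):

```r
sites <- data.frame(
  habitat   = factor(c("mixed", "wet", "wet", "wet", "dry", "dry")),
  temp      = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1),
  elevation = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1),
  row.names = c("Reedy Lake", "Pearcadale", "Warneet",
                "Cranbourne", "Lysterfield", "Red Hill")
)

all_temps    <- sites$temp                 # a whole column
warneet_temp <- sites["Warneet", "temp"]   # a single value
print(all_temps)
print(warneet_temp)
```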

Indexing in data frames

  • You can perform operations on particular levels of a factor
  • Note that the first argument is the numerical column vector, and the second is the factor column vector.
  • The third is the operation. Reversing the first two does not work
    • Tab complete will tell you the correct order for arguments
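
The description above matches the tapply() function, although the original call is not shown, so treat this as an assumed reconstruction using the dummy data set:

```r
sites <- data.frame(
  habitat   = factor(c("mixed", "wet", "wet", "wet", "dry", "dry")),
  temp      = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1),
  elevation = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1)
)

# Numeric column first, factor column second, operation third.
mean_temp <- tapply(sites$temp, sites$habitat, mean)
print(mean_temp)

# Any summary function can be used as the operation.
tapply(sites$elevation, sites$habitat, max)
```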

R Interlude

Complete Exercises 1.10-1.11