Chapter 6 An Introduction to the R language
6.1 Background
R is a computer programming language and environment especially useful for graphic visualization and statistical analysis of data. It is an offshoot of a language called S, developed in 1976 at Bell Laboratories. R is an interpreted language, meaning that every time code is run it must be translated to machine language by the R interpreter, as opposed to being compiled prior to running. R is a leading computational platform for statistical analysis thanks to its GNU open-source status and the many packages contributed by diverse members of the scientific community.
6.2 Why use R?
- Good general scripting tool for statistics and mathematics
- Powerful and flexible and free
- Runs on all computer platforms
- New packages released all the time
- Superb data management & graphics capabilities
- Reproducibility - can keep your scripts to see exactly what was done
- Can embed your R analyses in dynamic, polished files using R Markdown
- You can write your own functions
- Lots of online help available
- Can use a nice IDE such as RStudio
6.3 Important R terms and definitions
From Logan, M. 2010. Biostatistical Design and Analysis Using R
Operators are symbols in programming that have a specific meaning
From Logan, M. 2010. Biostatistical Design and Analysis Using R
6.4 Getting started with R via the RStudio Environment
To begin working with R, open RStudio. You should first see something that looks like this:
To open a new script editor (where you will keep track of your code and notes), go to File > New File > R Script. Note that there are other options for file types, which we will be using in the future. For now, though, we want a plain script, which when saved will have the extension .R.
It is easy to run code directly from the script editor. For single lines of code, simply make sure your cursor is on that line, and hit Ctrl-Enter. For multiple lines, highlight the block of code you want to run and hit Ctrl-Enter.
Now your display should look something like the one below (but without the red pane labels, of course):
Note that you can also type commands directly at the command line in the R Console (lower left pane), and the R interpreter will run them when you press Enter.
Any objects you define, and a summary of their values, will appear in the upper right pane. The lower right pane differs in appearance depending on the instructions you provide to RStudio. For instance, if you produce a plot, it will appear there by default. Another extremely important feature of R functions (we’ll get to them in a bit) is the help file. Recall from Chapter 5 our discussion of man pages for UNIX programs. Help files are the equivalent for R functions. They contain almost everything you need to know about a given function, and most of them even include an example at the bottom. These help files will appear in the lower right RStudio pane when you call them, for example when you run help(function_name) at the R Console.
6.4.1 R Programming Basics
For the code examples below, it might be useful for you to start your own RStudio session, open a new .R file and type/run code while reading.
- Commands can be submitted through the terminal, console or scripts
- In your scripts, anything that follows the # symbol (aka hash) is just for humans
- Notice that in these examples I’m evaluating the code chunks and showing the output
- The output is shown here after the two # symbols, and the index of the first output item on each line is in []
- Also notice that R follows the normal priority of mathematical evaluation
4*4
## [1] 16
(4+3*2^2)
## [1] 16
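To see the priority of evaluation in action, here is a small sketch (the particular expressions are just illustrations, not from the original examples). Exponents are evaluated before multiplication, multiplication before addition, and parentheses override both.

```r
4 + 3 * 2^2        # exponent first, then multiplication, then addition
## [1] 16
(4 + 3) * 2^2      # parentheses force the addition to happen first
## [1] 28
((4 + 3) * 2)^2    # nested parentheses are evaluated from the inside out
## [1] 196
```

When in doubt, adding parentheses costs nothing and makes the intended order obvious to the next reader.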
6.4.1.1 A note on R Markdown
This format provides a much better way to embed code and output, in an easily readable, reproducible manner. We will dive into R Markdown next week, so for now just be aware that it exists.
You can insert R chunks into R Markdown documents.
6.4.1.2 Assigning Variables
To “store” information for later use, like the arithmetic operation above, we can assign variables in R.
Variables are assigned values using the <- operator.
Variable names must begin with a letter, and should not contain spaces or R operators (see above), but other than that, just about anything goes. It is good practice to avoid periods in variable names, as they have other functionality in related programming languages, like Python.
Do keep in mind that R is case sensitive.
x <- 2
x * 3
## [1] 6
y <- x * 3
y - 2
## [1] 4
These do not work:
3y <- 3
3*y <- 3
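As a quick illustration of the naming rules and of case sensitivity, consider the sketch below. The names sample_size, Sample_Size and total are hypothetical and chosen only for this example.

```r
# Names can combine letters, digits and underscores after a leading letter
sample_size <- 10
Sample_Size <- 25   # a different object, because R is case sensitive

total <- sample_size + Sample_Size
print(total)
## [1] 35

# Assigning to an existing name replaces the old value
sample_size <- 12
print(sample_size)
## [1] 12
```

Underscores are used here in place of periods, in line with the advice above.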
6.4.1.3 Arithmetic operations with functions
Arithmetic operations can be used with functions as well as numbers.
Try the following, and then your own.
x+2
x^2
log(x) + log(x+1)
Note that the last of these, log(), is a built-in function of R, and therefore the argument for the function (in this case “x” or “x+1”) needs to be put in parentheses.
These parentheses will be important, and we’ll come back to them later when we add other arguments after the object in the parentheses.
The outcome of calculations can be assigned to new variables as well, and the results can be checked using the print() function.
y <- 67
print(y)
## [1] 67
x <- 124
z <- (x*y)^2
print(z)
## [1] 69022864
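As a preview of what adding other arguments inside the parentheses looks like, here is a minimal sketch using log(). The variable names are made up for illustration; base is a documented argument of log().

```r
x <- 100

natural_log <- log(x)             # natural logarithm (base e) is the default
log_base_10 <- log(x, base = 10)  # a second argument changes the base
round_trip  <- exp(log(x))        # exp() undoes the natural logarithm

print(log_base_10)
## [1] 2
print(round_trip)
## [1] 100
```

The first argument is the object being operated on; anything after it adjusts how the function behaves.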
6.4.1.4 Strings
Assignments and operations can be performed on characters as well.
Note that characters need to be set off by quotation marks to differentiate them from numeric objects.
The c() function stands for ‘concatenate’.
Note that we are using the same variable names as we did previously, which means that we’re overwriting our previous assignment.
A good general rule is to use new names for each variable, and make them short but still descriptive.
x <- "I Love"
print(x)
## [1] "I Love"
y <- "Biostatistics"
print(y)
## [1] "Biostatistics"
z <- c(x, y)
print(z)
## [1] "I Love" "Biostatistics"
The variable z is now a vector of character objects.
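It can help to contrast c(), which keeps the pieces as separate elements, with paste(), a base R function not otherwise covered in this chapter that joins them into a single string. This is only a sketch, and the variable names are hypothetical.

```r
first_part  <- "I Love"
second_part <- "Biostatistics"

as_vector     <- c(first_part, second_part)      # two separate elements
as_one_string <- paste(first_part, second_part)  # one string, joined by a space

length(as_vector)
## [1] 2
length(as_one_string)
## [1] 1
print(as_one_string)
## [1] "I Love Biostatistics"
```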
6.4.1.5 Factors
Sometimes we would like to treat character objects as if they were units for subsequent calculations.
These are called factors, and we can redefine our character object as one of class factor.
This might seem a bit strange, but it’s important for statistical analyses where we might want to calculate the mean or variance for two different treatments. In that case the two different treatments would be coded as two different “levels” of a factor we designate in our metadata. This will become clear when we get into hypothesis testing in R.
z_factor <- as.factor(z)
print(z_factor)
class(z_factor)
Note that factor levels are reported alphabetically. I used the class() function to ask R what type of object “z_factor” is. class() is one of the most important tools at your disposal. Often you can debug your code simply by checking, and if necessary changing, the class of an object. Because functions are written to work with specific classes, changing the class of a given object is crucial in many cases.
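To make the idea of treatments as factor levels more concrete, here is a small sketch with made-up numbers. The treatment and response vectors are hypothetical, and tapply() is a base R function that this chapter does not otherwise cover.

```r
# Made-up data: five observations from two treatments
treatment <- factor(c("control", "drug", "control", "drug", "drug"))
response  <- c(4.1, 6.3, 3.9, 5.8, 6.1)

levels(treatment)   # levels are stored in alphabetical order
## [1] "control" "drug"
class(treatment)
## [1] "factor"

# Mean response within each level of the factor
group_means <- tapply(response, treatment, mean)
print(group_means)
```

Because treatment is a factor, R knows there are exactly two groups and can summarize the response separately for each one.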
6.4.1.6 Null values
R (like many programming languages) has special values that mean ‘no value’, or ‘null’. In R, the most common is NA, although there are others as well (NULL and NaN).
Typically, NA is used to indicate a lack of data for a given observation, or a missing value where there normally should be one.
Blank entries in your data file will typically be read into R as NA.
NA is technically a logical data type, and is not equivalent to an empty string or the numeric 0. It is also a reserved word and can’t be used as a variable name.
class(NA)
## [1] "logical"
Many functions in R (e.g. mean()) will return NA by default if passed any NA values as an argument. So if we want to determine the mean of a vector of numeric values, we need to ensure there are either no NA values in the vector, or specify an additional ‘argument’ to the function telling it to ignore NA. Additionally, NA, like other ‘null’ values, is handled by a number of dedicated functions in R, such as is.na().
num <- c(0,1,2,NA,4)
mean(num)
## [1] NA
mean(num, na.rm = TRUE)
## [1] 1.75
is.na(num)
## [1] FALSE FALSE FALSE TRUE FALSE
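The sketch below shows a few common ways of working with missing values, and how NA differs from NaN and NULL. The variable names are hypothetical.

```r
measurements <- c(0, 1, 2, NA, 4)

n_missing <- sum(is.na(measurements))           # count the missing values
complete  <- measurements[!is.na(measurements)] # keep only the observed values
print(n_missing)
## [1] 1
print(complete)
## [1] 0 1 2 4

not_a_number <- 0 / 0   # an undefined numeric result is NaN, not NA
is.nan(not_a_number)
## [1] TRUE

length(NA)     # NA is a placeholder that still occupies a position
## [1] 1
length(NULL)   # NULL is the absence of any object at all
## [1] 0
```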
6.4.1.7 Vectors
In general, R thinks in terms of vectors (ordered collections of characters, factors, or numerical values), and it will benefit any R user to try to write programs with that in mind.
The simplest vectors in R are ‘atomic’ vectors, meaning that they consist of only one data type.
R operations, and therefore functions, are vectorized.
This means an operation or function will be performed for each element in a vector.
Vectors can be assigned directly using the ‘c()’ function and then entering the exact values.
x <- c(2,3,4,2,1,2,4,5,10,8,9)
print(x)
## [1] 2 3 4 2 1 2 4 5 10 8 9
x_plus <- x+1
print(x_plus)
## [1] 3 4 5 3 2 3 5 6 11 9 10
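Two consequences of the points above are worth seeing directly: an atomic vector holds a single data type, so mixed input is converted to a common type, and arithmetic and many functions act on every element. A minimal sketch, with made-up variable names:

```r
# Mixing numbers and text: everything is converted to character
mixed <- c(1, "two", 3)
class(mixed)
## [1] "character"

# Arithmetic is applied to every element
heights_cm <- c(150, 160, 170)
heights_m  <- heights_cm / 100
print(heights_m)
## [1] 1.5 1.6 1.7

# Many functions are vectorized too
sqrt(c(4, 9, 16))
## [1] 2 3 4
```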
Creating vectors of new data by entering it by hand can be a drag.
However, it is also very easy to use functions such as seq() and sample().
Try the examples below. Can you figure out what the three arguments in the parentheses mean?
Within reason, try varying the arguments to see what happens.
seq_1 <- seq(0.0, 10.0, by = 0.1)
print(seq_1)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4
## [16] 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
## [31] 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4
## [46] 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9
## [61] 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0 7.1 7.2 7.3 7.4
## [76] 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9
## [91] 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0
seq_2 <- seq(10.0, 0.0, by = -0.1)
print(seq_2)
## [1] 10.0 9.9 9.8 9.7 9.6 9.5 9.4 9.3 9.2 9.1 9.0 8.9 8.8 8.7 8.6
## [16] 8.5 8.4 8.3 8.2 8.1 8.0 7.9 7.8 7.7 7.6 7.5 7.4 7.3 7.2 7.1
## [31] 7.0 6.9 6.8 6.7 6.6 6.5 6.4 6.3 6.2 6.1 6.0 5.9 5.8 5.7 5.6
## [46] 5.5 5.4 5.3 5.2 5.1 5.0 4.9 4.8 4.7 4.6 4.5 4.4 4.3 4.2 4.1
## [61] 4.0 3.9 3.8 3.7 3.6 3.5 3.4 3.3 3.2 3.1 3.0 2.9 2.8 2.7 2.6
## [76] 2.5 2.4 2.3 2.2 2.1 2.0 1.9 1.8 1.7 1.6 1.5 1.4 1.3 1.2 1.1
## [91] 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
seq_square <- (seq_2)*(seq_2)
print(seq_square)
## [1] 100.00 98.01 96.04 94.09 92.16 90.25 88.36 86.49 84.64 82.81
## [11] 81.00 79.21 77.44 75.69 73.96 72.25 70.56 68.89 67.24 65.61
## [21] 64.00 62.41 60.84 59.29 57.76 56.25 54.76 53.29 51.84 50.41
## [31] 49.00 47.61 46.24 44.89 43.56 42.25 40.96 39.69 38.44 37.21
## [41] 36.00 34.81 33.64 32.49 31.36 30.25 29.16 28.09 27.04 26.01
## [51] 25.00 24.01 23.04 22.09 21.16 20.25 19.36 18.49 17.64 16.81
## [61] 16.00 15.21 14.44 13.69 12.96 12.25 11.56 10.89 10.24 9.61
## [71] 9.00 8.41 7.84 7.29 6.76 6.25 5.76 5.29 4.84 4.41
## [81] 4.00 3.61 3.24 2.89 2.56 2.25 1.96 1.69 1.44 1.21
## [91] 1.00 0.81 0.64 0.49 0.36 0.25 0.16 0.09 0.04 0.01
## [101] 0.00
seq_square_new <- (seq_2)^2
print(seq_square_new)
## [1] 100.00 98.01 96.04 94.09 92.16 90.25 88.36 86.49 84.64 82.81
## [11] 81.00 79.21 77.44 75.69 73.96 72.25 70.56 68.89 67.24 65.61
## [21] 64.00 62.41 60.84 59.29 57.76 56.25 54.76 53.29 51.84 50.41
## [31] 49.00 47.61 46.24 44.89 43.56 42.25 40.96 39.69 38.44 37.21
## [41] 36.00 34.81 33.64 32.49 31.36 30.25 29.16 28.09 27.04 26.01
## [51] 25.00 24.01 23.04 22.09 21.16 20.25 19.36 18.49 17.64 16.81
## [61] 16.00 15.21 14.44 13.69 12.96 12.25 11.56 10.89 10.24 9.61
## [71] 9.00 8.41 7.84 7.29 6.76 6.25 5.76 5.29 4.84 4.41
## [81] 4.00 3.61 3.24 2.89 2.56 2.25 1.96 1.69 1.44 1.21
## [91] 1.00 0.81 0.64 0.49 0.36 0.25 0.16 0.09 0.04 0.01
## [101] 0.00
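Once you have experimented with the sequences above, the short sketch below shows two related ways of building a sequence: fixing the step size with by, or fixing the number of values with length.out. Both are documented arguments of seq(); the variable names are made up.

```r
by_step   <- seq(0, 1, by = 0.25)         # fix the step size
by_length <- seq(0, 1, length.out = 5)    # fix the number of values instead
print(by_step)
## [1] 0.00 0.25 0.50 0.75 1.00
all.equal(by_step, by_length)
## [1] TRUE

1:5   # the colon operator is a shortcut for steps of 1
## [1] 1 2 3 4 5

length(seq(0.0, 10.0, by = 0.1))   # both end points are included
## [1] 101
```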
Here is a way to create your own data sets that are random samples.
Again, on your own, play around with the arguments in the parentheses to see what happens.
x <- rnorm(10000, 0, 10)
y <- sample(1:10000, 10000, replace = T)
xy <- cbind(x, y)
plot(x, y)
You’ve probably figured out that “y” from the last example is a draw of numbers with equal probability (what we call a flat, or uniform distribution).
What if you want to draw from a defined probability distribution, like the normal distribution?
Again, play around with the arguments in the parentheses to see what happens.
x <- rnorm(100, 0, 100)
print(x)
## [1] 119.594552 118.360688 146.985574 2.789787 9.453125 97.012481
## [7] -188.450692 -76.117016 -9.972779 -149.702734 62.268984 -172.160938
## [13] -130.005542 121.454286 -14.125039 74.409048 65.848147 -22.071723
## [19] 158.830803 -109.082609 -93.200418 39.107243 -1.893896 -79.210065
## [25] -6.655992 122.633964 -61.008755 -69.071462 154.232533 52.868288
## [31] 110.527800 99.003130 72.442599 -233.856342 -15.719491 98.744884
## [37] -39.205986 30.929730 -2.433160 -96.470640 3.789014 -37.002019
## [43] -56.254456 -56.455765 90.738564 28.275340 3.990721 -25.958500
## [49] 137.714206 -7.770096 -3.095180 104.334630 -81.901302 111.962132
## [55] 45.575460 2.175602 -46.958588 -5.018479 -114.410151 -124.176085
## [61] -94.710549 15.975731 -59.003959 48.607734 -184.114742 132.428233
## [67] 23.212103 179.804817 -9.460291 -114.895577 24.386068 -54.603461
## [73] -7.134520 -98.107571 68.578203 -10.245594 -68.504521 78.348423
## [79] -17.772234 121.774464 -72.615233 -72.984271 14.755449 -39.788984
## [85] 63.087094 5.234390 12.925797 -68.577694 243.467906 -92.173598
## [91] -171.001379 70.551491 87.085149 160.703198 29.525940 -50.746384
## [97] 5.951637 -129.092875 -193.137195 178.418330
hist(x, xlim = c(-50,50))
hist(x, xlim = c(-500,500))
Can you figure out what the three rnorm() arguments represent?
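Once you have had a go at that question, the sketch below spells the three arguments out by name (n, mean and sd, as documented for rnorm()). It also uses set.seed() so that the "random" draws can be reproduced; the seed value 42 is arbitrary and the variable names are hypothetical.

```r
set.seed(42)   # fix the random number generator so the draws can be repeated
draws_a <- rnorm(n = 1000, mean = 0, sd = 100)

set.seed(42)   # the same seed gives the same draws
draws_b <- rnorm(n = 1000, mean = 0, sd = 100)

identical(draws_a, draws_b)
## [1] TRUE

mean(draws_a)   # should be near 0, but not exactly 0
sd(draws_a)     # should be near 100
```

Setting a seed at the top of a script is a simple way to make an analysis that involves random sampling reproducible.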
6.4.1.8 Basic Summary Statistics
We will get into the details regarding summary statistics later, but for now, check out several of the R functions that calculate them.
mean(x)
median(x)
var(x)
log(x)  # natural logarithm; base R has no ln() function, and negative values give NaN with a warning
sqrt(x)
sum(x)
length(x)
sample(x, replace = T)
Notice that the last function (sample) has an argument (replace = T).
Arguments simply modify or direct the function in some way.
There are many arguments for each function, some of which have default values.
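Here is a minimal sketch of how default values work, using round() and sample(). The values vector is made up; digits and replace are documented arguments of those functions.

```r
values <- c(3.14159, 2.71828)

round(values)               # digits is not given, so its default (0) is used
## [1] 3 3
round(values, digits = 2)   # supplying the argument overrides the default
## [1] 3.14 2.72

formals(sample)$replace     # look up the default for sample()'s replace argument
## [1] FALSE
```

A small style note: T is an ordinary variable that can be reassigned, whereas TRUE is a reserved word, so writing replace = TRUE is the safer habit in your own scripts.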
6.4.1.9 Getting help to understand functions
Getting help on any function is very easy - just type a question mark and the name of the function.
There are functions for just about anything within R, and it is easy enough to write your own functions if none already exist to do what you want to do.
In general, function calls have a simple structure: a function name, a set of parentheses, and an optional set of arguments whose values you pass to the function.
Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.
Help can be accessed a few ways - try them:
- help(mean)
- ?mean
- example(mean)
- help.search("mean")
- apropos("mean")
- args(mean)
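To illustrate the structure of a function (a name, parentheses, and arguments, some with defaults), here is a sketch of a small user-written function. The function standard_error() is hypothetical and written only for this example; it is not part of base R.

```r
# Hypothetical helper: standard error of the mean
# values: a numeric vector; na.rm: whether to drop NA values first (default FALSE)
standard_error <- function(values, na.rm = FALSE) {
  if (na.rm) {
    values <- values[!is.na(values)]
  }
  sd(values) / sqrt(length(values))
}

standard_error(c(2, 4, 4, 6))
standard_error(c(2, 4, 4, 6, NA))                # NA, because na.rm defaults to FALSE
standard_error(c(2, 4, 4, 6, NA), na.rm = TRUE)  # same as the first call

args(standard_error)   # args() works on your own functions too
```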
6.6 Additional learning resources:
Logan, M. 2010. Biostatistical Design and Analysis Using R. - A great intro to R for statistical analysis
O’Neil, S.T. 2017. A Primer for Computational Biology. http://library.open.oregonstate.edu/computationalbiology/