9/25/2018 & 9/27/2018

But first a beautiful chair

Goals of the course

  • To help you perform your research through instruction in the core components of
    • data collection and organization
    • manipulation and analysis
    • interpretation and presentation
  • Provide a broad coverage of the core components of modern biological statistics
  • Provide you with the computational tools necessary to carry out your work - namely R and affiliated tools
  • This is a practical course and we will learn by doing

Why do we need statistics?

  • We almost never know the world perfectly, but still want to draw conclusions or make decisions
  • We need to estimate underlying parameters from samples of data
  • Sometimes we need to test hypotheses using data
  • Other times we need to more succinctly summarize and/or visualize large amounts of data
  • There are well known mathematical rules that help us
  • Statistics can be done by hand, but computers let us do most of the mathematics quickly

Why do we need statistics?

  • We want to turn data into conclusions about the world
    • experimental design
    • point estimates and confidence intervals
    • hypothesis testing
    • data reduction of highly dimensional data
  • We need a firm understanding of probability, sampling and distributions

A biological example to get us started

  • EXAMPLE - Say you perform an experiment on two different strains of stickleback fish, one from an ocean population (RS) and one from a freshwater lake (BP) by making them microbe free. Microbes in the gut are known to interact with the gut epithelium in ways that lead to a proper maturation of the immune system.
  • EXPERIMENTAL SETUP - You decide to carry out an experiment by treating multiple fish from each strain so that some of them have a conventional microbiota, and some of them are inoculated with only one bacterial species. You then measure the levels of gene expression in the stickleback gut using RNA-seq. Because you have a suspicion that the sex of the fish might be important, you track it too.
  • GETTING THE DATA READY TO ANALYZE - How should the data set be organized to best analyze it? What are the properties of the variables, and why does that matter?

Data set rules of thumb (aka Tidy Data)

  • Store a copy of data in nonproprietary software and hardware formats, such as plain ASCII text (aka a flat file)
  • Leave an uncorrected file when doing analyses
  • Use descriptive names for your data files and variables
  • Include a header line with descriptive variable names
  • Maintain effective metadata about the data
  • When you add observations to a database, add rows
  • When you add variables to a database, add columns, not rows
  • A column of data should contain only one data type

Why use the command line?

  • The commands work almost identically across platforms
  • You can even use them on a large computer cluster like Talapas
  • It is incredibly powerful particularly for repeated actions
  • It allows you to do thousands of ‘clicks’ with single commands

The power of the command line

  • Get a terminal application running on your computer
  • Make sure that you can display the path to your current directory
  • List the files in the current directory
  • Use arguments to your list command to show all files in long form
  • Move through your directories by changing your location them
  • Make a new directory and put an existing text file in it
  • Use the command less to see the contents of the file
  • Create a new file in the directory using nano
  • Copy the files that you made into a new directory
  • Delete the files and then the directory

Why use R?

  • Good general scripting tool for statistics and mathematics
  • Powerful and flexible and free
  • Runs on all computer platforms
  • New enhancements coming out all the time
  • Superb data management & graphics capabilities
  • Reproducibility - can keep your scripts to see exactly what was done
  • Can embed your R analyses in dynamic, polished files using R markdown
  • You can write your own functions
  • Lots of online help available
  • Can use a nice GUI front end such as Rstudio

BASICS of R

  • Commands can be submitted through the terminal, console or scripts
  • In your scripts, anything that follows ‘#’ symbol (aka hash) is just for humans
  • Notice on these slides I’m evaluating the code chunks and showing output
  • The output is shown here after the two # symbols and the number of output items is in []
  • Also notice that R follows the normal priority of mathematical evaluation
4*4
## [1] 16
(4+3*2^2)
## [1] 16

RMarkdown

Assigning Variables

  • A better way to do this is to assign variables
  • Variables are assigned values using the <- operator.
  • Variable names must begin with a letter, but other than that, just about anything goes.
  • Do keep in mind that R is case sensitive.

Assigning Variables

x <- 2
x*3
## [1] 6
y <- x * 3
y-2
## [1] 4

These do not work

3y <- 3
3*y <- 3

Arithmetic operations on functions

  • Arithmetic operations can be performed easily on functions as well as numbers.
  • Try the following, and then your own.
x+2
x^2
log(x)
  • Note that the last of these - log - is a built in function of R, and therefore the object of the function needs to be put in parentheses
  • These parentheses will be important, and we’ll come back to them later when we add arguments after the object in the parentheses
  • The outcome of calculations can be assigned to new variables as well, and the results can be checked using the ‘print’ command

Arithmetic operations on functions

y <- 67
print(y)
## [1] 67
x <- 124
z <- (x*y)^2
print(z)
## [1] 69022864

STRINGS

  • Variables and operations can be performed on characters as well
  • Note that characters need to be set off by quotation marks to differentiate them from numbers
  • The c stands for concatenate
  • Note that we are using the same variable names as we did previously, which means that we’re overwriting our previous assignment
  • A good rule of thumb is to use new names for each variable, and make them short but still descriptive

STRINGS

x <- "I Love"
print (x)
## [1] "I Love"
y <- "Biostatistics"
print (y)
## [1] "Biostatistics"
z <- c(x,y)
print (z)
## [1] "I Love"        "Biostatistics"

FACTORS

  • The variable z is now what is called a list of character values.
  • Sometimes we would like to treat the characters as if they were units for subsequent calculations.
  • These are called factors, and we can redefine our character variables as factors.
  • This might seem a bit strange, but it’s important for statistical analyses where we might want to see the mean or variance for two different treatments.

FACTORS

z_factor <- as.factor(z)
print (z_factor)
  • Note that factor levels are reported alphabetically

VECTORS

  • In general R thinks in terms of vectors (a list of characters, factors or numerical values) and it will benefit any R user to try to write programs with that in mind, as it will simplify most things.
  • Vectors can be assigned directly using the ‘c()’ function and then entering the exact values.

VECTORS

x <- c(2,3,4,2,1,2,4,5,10,8,9)
print(x)
##  [1]  2  3  4  2  1  2  4  5 10  8  9

Basic Statistics

  • Many functions exist to operate on vectors.
  • Combine these with your previous variable to see what happens.
  • Also, try to find other functions (e.g. standard deviation).

Basic Statistics

mean(x)
median(x)
var(x)
log(x)
ln(x)
sqrt(x)
sum(x)
length(x)
sample(x, replace = T)
  • Notice that the last function (sample) has an argument (replace=T)
  • Arguments simply modify or direct the function in some way
  • There are many arguments for each function, some of which are defaults

Getting Help

  • Getting Help on any function is very easy - just type a question mark and the name of the function.
  • There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
  • In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of parameters to send to the function.
  • Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.
  • Help can be accessed a few ways - try them :

Getting Help

- help(mean)
- ?mean
- example(mean)
- help.search("mean")
- apropos("mean")
- args(mean)

Creating vectors

  • Creating vector of new data by entering it by hand can be a drag
  • However, it is also very easy to use functions such as seq and sample
  • Try the examples below Can you figure out what the three arguments in the parentheses mean?
  • Try varying the arguments to see what happens.
  • Don’t go too crazy with the last one or your computer might slow way down

Creating vectors

seq_1 <- seq(0.0, 10.0, by = 0.1)
print(seq_1)
##   [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3
##  [15]  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7
##  [29]  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1
##  [43]  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
##  [57]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
##  [71]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3
##  [85]  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##  [99]  9.8  9.9 10.0
seq_2 <- seq(10.0, 0.0, by = -0.1)
print(seq_2)
##   [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7
##  [15]  8.6  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3
##  [29]  7.2  7.1  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9
##  [43]  5.8  5.7  5.6  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5
##  [57]  4.4  4.3  4.2  4.1  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1
##  [71]  3.0  2.9  2.8  2.7  2.6  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7
##  [85]  1.6  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
##  [99]  0.2  0.1  0.0

Creating vectors

seq_square <- (seq_2)*(seq_2)
print(seq_square)
##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Creating vectors

seq_square_new <- (seq_2)^2
print(seq_square_new)
##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Drawing samples from distributions

  • Here is a way to create your own data sets that are random samples.
  • Again, play around with the arguments in the parentheses to see what happens.

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
plot(x,y) 

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
plot(xy)

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
hist(x)

Drawing samples from distributions

  • You’ve probably figured out that y from the last example is drawing numbers with equal probability.
  • What if you want to draw from a distribution?
  • Again, play around with the arguments in the parentheses to see what happens.

Drawing samples from distributions

x <-rnorm(1000, 0, 100)
hist(x, xlim = c(-500,500))
curve(50000*dnorm(x, 0, 100), xlim = c(-500,500), add=TRUE, col='Red')

- dnorm() generates the probability density, which can be plotted using the curve() function. - Note that is curve is added to the plot using add=TRUE

Visualizing Data

  • So far you’ve been visualizing just the list of output numbers
  • Except for the last example where I snuck in a hist function.
  • You can also visualize all of the variables that you’ve created using the plot function (as well as a number of more sophisticated plotting functions).
  • Each of these is called a high level plotting function, which sets the stage
  • Low level plotting functions will tweak the plots and make them beautiful

Visualizing Data

  • What do you think that each of the arguments means for the plot function?
  • A cool thing about R is that the options for the arguments make sense.
  • Try adjusting an argument and see if it works
  • Note next week we will be exploring the plotting in GGPlot2

Visualizing Data

seq_1 <- seq(0.0, 10.0, by = 0.1) 
plot (seq_1, xlab="space", ylab ="function of space", type = "p", col = "red")

Putting plots in a single figure

  • On the next slide
  • The first line of the lower script tells R that you are going to create a composite figure that has two rows and two columns. Can you tell how?
  • Now, modify the code to add two more variables and add one more row of two panels.
seq_1 <- seq(0.0, 10.0, by = 0.1)
seq_2 <- seq(10.0, 0.0, by = -0.1)

Putting plots in a single figure

par(mfrow=c(2,2))
plot (seq_1, xlab="time", ylab ="p in population 1", type = "p", col = 'red')
plot (seq_2, xlab="time", ylab ="p in population 2", type = "p", col = 'green')
plot (seq_square, xlab="time", ylab ="p2 in population 2", type = "p", col = 'blue')
plot (seq_square_new, xlab="time", ylab ="p in population 1", type = "l", col = 'yellow')

Example using binomial distribution

  • As above for the normal distribution, data can be generated by being sampled from nearly any distribution and then visualized.
  • Below I’m having you use the ‘histogram’ function. What does it do?

Example using binomial distribution

  • 10 successes (out of 20 trials) is the most frequent outcome
heads <- rbinom(n=1000, size=20, prob=0.5)
hist(heads)

Example using binomial distribution

  • This kind of statement can be run in one line as well, which is sometimes easier.
hist(rbinom(n=1000, size=20, prob=0.5))

Creating Data Frames in R

  • As you have seen, in R you can generate your own random data set drawn from nearly any distribution very easily.
  • Often we will want to use collected data.
  • Now, let’s make a dummy dataset to get used to dealing with data frames
  • Set up three variables (habitat, temp and elevation) as vectors
habitat <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry", "dry","mixed"))
temp <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
elevation <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)
  • Create a data frame where vectors become columns
mydata <- data.frame(habitat, temp, elevation)
row.names(mydata) <- c("Reedy Lake", "Pearcadale", "Warneet", "Cranbourne", 
                       "Lysterfield", "Red Hill", "Devilbend", "Olinda")
  • Now you have a hand-made data frame with row names

Reading in Data Frames in R

  • A strength of R is being able to import data from an external source
  • Create the same table that you did above in a spreadsheet like Excel
  • Export it to comma separated and tab separated text files for importing into R.
  • The first will read in a comma-delimited file, whereas the second is a tab-delimited
  • In both cases the header and row.names arguments indicate that there is a header row and row label column
  • Note that the name of the file by itself will have R look in the CWD, whereas a full path can also be used

Reading in Data Frames in R

YourFile <- read.table('yourfile.csv', header=T, row.names=1, sep=',')
YourFile <- read.table('yourfile.txt', header=T, row.names=1, sep='\t')

Exporting Data Frames in R

write.table(YourFile, "yourfile.csv", quote=F, row.names=T, sep=",")
write.table(YourFile, "yourfile.txt", quote=F, row.names=T, sep="\t")

Indexing in data frames

  • Next up - indexing just a subset of the data
  • This is a very important idea in R, that you can analyze just a subset of the data.
  • This is analyzing only the data in the file you made that has the factor value ‘mixed’.
print (YourFile[,2])
print (YourFile$temp)
print (YourFile[2,])
plot (YourFile$temp, YourFile$elevation)

Indexing in data frames

  • You can perform operations on particular levels of a factor
  • Calculating the mean of the ‘mixed’ and ‘gipps’ levels of habitat.
  • Note that the first argument is the numerical column vector, and the second is the factor column vector.
  • The third is the operation. Reversing the first two does not work (the one below).
tapply(YourFile$temp, YourFile$habitat, mean)
tapply(YourFile$temp, YourFile$habitat, var)

R INTERLUDE

Some real transcriptomic data

  • Examine the text file: GacuRNAseq_Subset.csv
  • How many many rows and columns are there?
  • How many different variables are there?
  • What are the general types of variables?
  • Now let’s read the data file into R and analyze it
  • This exercise will help you get used to reading in and manipulating genomic data files
  • First off, remember to set your working directory to find your file correctly

Some real transcriptomic data

RNAseq_Data <- read.table('GacuRNAseq_subset.csv', header=TRUE, sep=',')

print (RNAseq_Data)
head (RNAseq_Data)
tail (RNAseq_Data)

print (RNAseq_Data[,2])
print (RNAseq_Data[1,])
print (RNAseq_Data[1,2])
print (RNAseq_Data$ENSGACG00000000010)
print (RNAseq_Data$ENSGACG00000000010>45.0)

Summary stats and figures

summary1 <- summary(RNAseq_Data $ENSGACG00000000003)
print (summary1)

hist(RNAseq_Data $ENSGACG00000000003)
boxplot(RNAseq_Data$ENSGACG00000000003)
boxplot(RNAseq_Data$ENSGACG00000000003~RNAseq_Data$Population)
plot(RNAseq_Data $ENSGACG00000000003, RNAseq_Data$ENSGACG00000000003)

boxplot(RNAseq_Data $ENSGACG00000000003~RNAseq_Data$Treatment, 
        col = "red", ylab = "Expression Level", xlab = "Treatment level", 
        border ="orange", 
        main = "Boxplot of variation in gene expression across microbiota treatments")