Advanced Biological Statistics Week 1

9/25/2018 & 9/27/2018

But first a beautiful chair

Goals of the course

To help you perform your research through instruction in the core components of
- data collection and organization
- manipulation and analysis
- interpretation and presentation
Provide a broad coverage of the core components of modern biological statistics
Provide you with the computational tools necessary to carry out your work - namely R and affiliated tools
This is a practical course and we will learn by doing

Why do we need statistics?

We almost never know the world perfectly, but still want to draw conclusions or make decisions
We need to estimate underlying parameters from samples of data
Sometimes we need to test hypotheses using data
Other times we need to more succinctly summarize and/or visualize large amounts of data
There are well known mathematical rules that help us
Statistics can be done by hand, but computers let us do most of the mathematics quickly

Why do we need statistics?

We want to turn data into conclusions about the world
- experimental design
- point estimates and confidence intervals
- hypothesis testing
- data reduction of highly dimensional data
We need a firm understanding of probability, sampling and distributions

A biological example to get us started

EXAMPLE - Say you perform an experiment on two different strains of stickleback fish, one from an ocean population (RS) and one from a freshwater lake (BP) by making them microbe free. Microbes in the gut are known to interact with the gut epithelium in ways that lead to a proper maturation of the immune system.
EXPERIMENTAL SETUP - You decide to carry out an experiment by treating multiple fish from each strain so that some of them have a conventional microbiota, and some of them are inoculated with only one bacterial species. You then measure the levels of gene expression in the stickleback gut using RNA-seq. Because you have a suspicion that the sex of the fish might be important, you track it too.
GETTING THE DATA READY TO ANALYZE - How should the data set be organized to best analyze it? What are the properties of the variables, and why does that matter?

Data set rules of thumb (aka Tidy Data)

Store a copy of data in nonproprietary software and hardware formats, such as plain ASCII text (aka a flat file)
Leave an uncorrected file when doing analyses
Use descriptive names for your data files and variables
Include a header line with descriptive variable names
Maintain effective metadata about the data
When you add observations to a database, add rows
When you add variables to a database, add columns, not rows
A column of data should contain only one data type

Why use the command line?

The commands work almost identically across platforms
You can even use them on a large computer cluster like Talapas
It is incredibly powerful particularly for repeated actions
It allows you to do thousands of ‘clicks’ with single commands

The power of the `command line`

Get a terminal application running on your computer
Make sure that you can display the path to your current directory
List the files in the current directory
Use arguments to your list command to show all files in long form
Move through your directories by changing your location them
Make a new directory and put an existing text file in it
Use the command less to see the contents of the file
Create a new file in the directory using nano
Copy the files that you made into a new directory
Delete the files and then the directory

Why use `R`?

Good general scripting tool for statistics and mathematics
Powerful and flexible and free
Runs on all computer platforms
New enhancements coming out all the time
Superb data management & graphics capabilities
Reproducibility - can keep your scripts to see exactly what was done
Can embed your R analyses in dynamic, polished files using R markdown
You can write your own functions
Lots of online help available
Can use a nice GUI front end such as Rstudio

BASICS of `R`

Commands can be submitted through the terminal, console or scripts
In your scripts, anything that follows ‘#’ symbol (aka hash) is just for humans
Notice on these slides I’m evaluating the code chunks and showing output
The output is shown here after the two # symbols and the number of output items is in []
Also notice that R follows the normal priority of mathematical evaluation

4*4

## [1] 16

(4+3*2^2)

## [1] 16

`RMarkdown`

A better way to embed code in R Markdown files
http://kbroman.org/knitr_knutshell/pages/Rmarkdown.html
You can insert R chunks into Rmarkdown documents

Assigning Variables

A better way to do this is to assign variables
Variables are assigned values using the <- operator.
Variable names must begin with a letter, but other than that, just about anything goes.
Do keep in mind that R is case sensitive.

Assigning Variables

x <- 2
x*3

## [1] 6

y <- x * 3
y-2

## [1] 4

These do not work

3y <- 3
3*y <- 3

Arithmetic operations on functions

Arithmetic operations can be performed easily on functions as well as numbers.
Try the following, and then your own.

x+2
x^2
log(x)

Note that the last of these - log - is a built in function of R, and therefore the object of the function needs to be put in parentheses
These parentheses will be important, and we’ll come back to them later when we add arguments after the object in the parentheses
The outcome of calculations can be assigned to new variables as well, and the results can be checked using the ‘print’ command

Arithmetic operations on functions

y <- 67
print(y)

## [1] 67

x <- 124
z <- (x*y)^2
print(z)

## [1] 69022864

STRINGS

Variables and operations can be performed on characters as well
Note that characters need to be set off by quotation marks to differentiate them from numbers
The c stands for concatenate
Note that we are using the same variable names as we did previously, which means that we’re overwriting our previous assignment
A good rule of thumb is to use new names for each variable, and make them short but still descriptive

STRINGS

x <- "I Love"
print (x)

## [1] "I Love"

y <- "Biostatistics"
print (y)

## [1] "Biostatistics"

z <- c(x,y)
print (z)

## [1] "I Love"        "Biostatistics"

FACTORS

The variable z is now what is called a list of character values.
Sometimes we would like to treat the characters as if they were units for subsequent calculations.
These are called factors, and we can redefine our character variables as factors.
This might seem a bit strange, but it’s important for statistical analyses where we might want to see the mean or variance for two different treatments.

FACTORS

z_factor <- as.factor(z)
print (z_factor)

Note that factor levels are reported alphabetically

VECTORS

In general R thinks in terms of vectors (a list of characters, factors or numerical values) and it will benefit any R user to try to write programs with that in mind, as it will simplify most things.
Vectors can be assigned directly using the ‘c()’ function and then entering the exact values.

VECTORS

x <- c(2,3,4,2,1,2,4,5,10,8,9)
print(x)

##  [1]  2  3  4  2  1  2  4  5 10  8  9

Basic Statistics

Many functions exist to operate on vectors.
Combine these with your previous variable to see what happens.
Also, try to find other functions (e.g. standard deviation).

Basic Statistics

mean(x)
median(x)
var(x)
log(x)
ln(x)
sqrt(x)
sum(x)
length(x)
sample(x, replace = T)

Notice that the last function (sample) has an argument (replace=T)
Arguments simply modify or direct the function in some way
There are many arguments for each function, some of which are defaults

Getting Help

Getting Help on any function is very easy - just type a question mark and the name of the function.
There are functions for just about anything within R and it is easy enough to write your own functions if none already exist to do what you want to do.
In general, function calls have a simple structure: a function name, a set of parentheses and an optional set of parameters to send to the function.
Help pages exist for all functions that, at a minimum, explain what parameters exist for the function.
Help can be accessed a few ways - try them :

Getting Help

- help(mean)
- ?mean
- example(mean)
- help.search("mean")
- apropos("mean")
- args(mean)

Creating vectors

Creating vector of new data by entering it by hand can be a drag
However, it is also very easy to use functions such as seq and sample
Try the examples below Can you figure out what the three arguments in the parentheses mean?
Try varying the arguments to see what happens.
Don’t go too crazy with the last one or your computer might slow way down

Creating vectors

seq_1 <- seq(0.0, 10.0, by = 0.1)
print(seq_1)

##   [1]  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3
##  [15]  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3  2.4  2.5  2.6  2.7
##  [29]  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7  3.8  3.9  4.0  4.1
##  [43]  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1  5.2  5.3  5.4  5.5
##  [57]  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5  6.6  6.7  6.8  6.9
##  [71]  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9  8.0  8.1  8.2  8.3
##  [85]  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##  [99]  9.8  9.9 10.0

seq_2 <- seq(10.0, 0.0, by = -0.1)
print(seq_2)

##   [1] 10.0  9.9  9.8  9.7  9.6  9.5  9.4  9.3  9.2  9.1  9.0  8.9  8.8  8.7
##  [15]  8.6  8.5  8.4  8.3  8.2  8.1  8.0  7.9  7.8  7.7  7.6  7.5  7.4  7.3
##  [29]  7.2  7.1  7.0  6.9  6.8  6.7  6.6  6.5  6.4  6.3  6.2  6.1  6.0  5.9
##  [43]  5.8  5.7  5.6  5.5  5.4  5.3  5.2  5.1  5.0  4.9  4.8  4.7  4.6  4.5
##  [57]  4.4  4.3  4.2  4.1  4.0  3.9  3.8  3.7  3.6  3.5  3.4  3.3  3.2  3.1
##  [71]  3.0  2.9  2.8  2.7  2.6  2.5  2.4  2.3  2.2  2.1  2.0  1.9  1.8  1.7
##  [85]  1.6  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
##  [99]  0.2  0.1  0.0

Creating vectors

seq_square <- (seq_2)*(seq_2)
print(seq_square)

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Creating vectors

seq_square_new <- (seq_2)^2
print(seq_square_new)

##   [1] 100.00  98.01  96.04  94.09  92.16  90.25  88.36  86.49  84.64  82.81
##  [11]  81.00  79.21  77.44  75.69  73.96  72.25  70.56  68.89  67.24  65.61
##  [21]  64.00  62.41  60.84  59.29  57.76  56.25  54.76  53.29  51.84  50.41
##  [31]  49.00  47.61  46.24  44.89  43.56  42.25  40.96  39.69  38.44  37.21
##  [41]  36.00  34.81  33.64  32.49  31.36  30.25  29.16  28.09  27.04  26.01
##  [51]  25.00  24.01  23.04  22.09  21.16  20.25  19.36  18.49  17.64  16.81
##  [61]  16.00  15.21  14.44  13.69  12.96  12.25  11.56  10.89  10.24   9.61
##  [71]   9.00   8.41   7.84   7.29   6.76   6.25   5.76   5.29   4.84   4.41
##  [81]   4.00   3.61   3.24   2.89   2.56   2.25   1.96   1.69   1.44   1.21
##  [91]   1.00   0.81   0.64   0.49   0.36   0.25   0.16   0.09   0.04   0.01
## [101]   0.00

Drawing samples from distributions

Here is a way to create your own data sets that are random samples.
Again, play around with the arguments in the parentheses to see what happens.

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
plot(x,y)

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
plot(xy)

Drawing samples from distributions

x <- rnorm (10000, 0, 10)
y <- sample (1:10000, 10000, replace = T)
xy <- cbind(x,y)
hist(x)

Drawing samples from distributions

You’ve probably figured out that y from the last example is drawing numbers with equal probability.
What if you want to draw from a distribution?
Again, play around with the arguments in the parentheses to see what happens.

Drawing samples from distributions

x <-rnorm(1000, 0, 100)
hist(x, xlim = c(-500,500))
curve(50000*dnorm(x, 0, 100), xlim = c(-500,500), add=TRUE, col='Red')

- dnorm() generates the probability density, which can be plotted using the curve() function. - Note that is curve is added to the plot using add=TRUE

Visualizing Data

So far you’ve been visualizing just the list of output numbers
Except for the last example where I snuck in a hist function.
You can also visualize all of the variables that you’ve created using the plot function (as well as a number of more sophisticated plotting functions).
Each of these is called a high level plotting function, which sets the stage
Low level plotting functions will tweak the plots and make them beautiful

Visualizing Data

What do you think that each of the arguments means for the plot function?
A cool thing about R is that the options for the arguments make sense.
Try adjusting an argument and see if it works
Note next week we will be exploring the plotting in GGPlot2

Visualizing Data

seq_1 <- seq(0.0, 10.0, by = 0.1) 
plot (seq_1, xlab="space", ylab ="function of space", type = "p", col = "red")

Putting plots in a single figure

On the next slide
The first line of the lower script tells R that you are going to create a composite figure that has two rows and two columns. Can you tell how?
Now, modify the code to add two more variables and add one more row of two panels.

seq_1 <- seq(0.0, 10.0, by = 0.1)
seq_2 <- seq(10.0, 0.0, by = -0.1)

Putting plots in a single figure

par(mfrow=c(2,2))
plot (seq_1, xlab="time", ylab ="p in population 1", type = "p", col = 'red')
plot (seq_2, xlab="time", ylab ="p in population 2", type = "p", col = 'green')
plot (seq_square, xlab="time", ylab ="p2 in population 2", type = "p", col = 'blue')
plot (seq_square_new, xlab="time", ylab ="p in population 1", type = "l", col = 'yellow')

Example using binomial distribution

As above for the normal distribution, data can be generated by being sampled from nearly any distribution and then visualized.
Below I’m having you use the ‘histogram’ function. What does it do?

Example using binomial distribution

10 successes (out of 20 trials) is the most frequent outcome

heads <- rbinom(n=1000, size=20, prob=0.5)
hist(heads)

Example using binomial distribution

This kind of statement can be run in one line as well, which is sometimes easier.

hist(rbinom(n=1000, size=20, prob=0.5))

Creating Data Frames in R

As you have seen, in R you can generate your own random data set drawn from nearly any distribution very easily.
Often we will want to use collected data.
Now, let’s make a dummy dataset to get used to dealing with data frames
Set up three variables (habitat, temp and elevation) as vectors

habitat <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry", "dry","mixed"))
temp <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
elevation <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)

Create a data frame where vectors become columns

mydata <- data.frame(habitat, temp, elevation)
row.names(mydata) <- c("Reedy Lake", "Pearcadale", "Warneet", "Cranbourne", 
                       "Lysterfield", "Red Hill", "Devilbend", "Olinda")

Now you have a hand-made data frame with row names

Reading in Data Frames in R

A strength of R is being able to import data from an external source
Create the same table that you did above in a spreadsheet like Excel
Export it to comma separated and tab separated text files for importing into R.
The first will read in a comma-delimited file, whereas the second is a tab-delimited
In both cases the header and row.names arguments indicate that there is a header row and row label column
Note that the name of the file by itself will have R look in the CWD, whereas a full path can also be used

Reading in Data Frames in R

YourFile <- read.table('yourfile.csv', header=T, row.names=1, sep=',')
YourFile <- read.table('yourfile.txt', header=T, row.names=1, sep='\t')

Exporting Data Frames in R

write.table(YourFile, "yourfile.csv", quote=F, row.names=T, sep=",")
write.table(YourFile, "yourfile.txt", quote=F, row.names=T, sep="\t")

Indexing in data frames

Next up - indexing just a subset of the data
This is a very important idea in R, that you can analyze just a subset of the data.
This is analyzing only the data in the file you made that has the factor value ‘mixed’.

print (YourFile[,2])
print (YourFile$temp)
print (YourFile[2,])
plot (YourFile$temp, YourFile$elevation)

Indexing in data frames

You can perform operations on particular levels of a factor
Calculating the mean of the ‘mixed’ and ‘gipps’ levels of habitat.
Note that the first argument is the numerical column vector, and the second is the factor column vector.
The third is the operation. Reversing the first two does not work (the one below).

tapply(YourFile$temp, YourFile$habitat, mean)
tapply(YourFile$temp, YourFile$habitat, var)

R INTERLUDE

Some real transcriptomic data

Examine the text file: GacuRNAseq_Subset.csv
How many many rows and columns are there?
How many different variables are there?
What are the general types of variables?
Now let’s read the data file into R and analyze it
This exercise will help you get used to reading in and manipulating genomic data files
First off, remember to set your working directory to find your file correctly

Some real transcriptomic data

RNAseq_Data <- read.table('GacuRNAseq_subset.csv', header=TRUE, sep=',')

print (RNAseq_Data)
head (RNAseq_Data)
tail (RNAseq_Data)

print (RNAseq_Data[,2])
print (RNAseq_Data[1,])
print (RNAseq_Data[1,2])
print (RNAseq_Data$ENSGACG00000000010)
print (RNAseq_Data$ENSGACG00000000010>45.0)

Summary stats and figures

summary1 <- summary(RNAseq_Data $ENSGACG00000000003)
print (summary1)

hist(RNAseq_Data $ENSGACG00000000003)
boxplot(RNAseq_Data$ENSGACG00000000003)
boxplot(RNAseq_Data$ENSGACG00000000003~RNAseq_Data$Population)
plot(RNAseq_Data $ENSGACG00000000003, RNAseq_Data$ENSGACG00000000003)

boxplot(RNAseq_Data $ENSGACG00000000003~RNAseq_Data$Treatment, 
        col = "red", ylab = "Expression Level", xlab = "Treatment level", 
        border ="orange", 
        main = "Boxplot of variation in gene expression across microbiota treatments")

But first a beautiful chair

Goals of the course

Why do we need statistics?

Why do we need statistics?

A biological example to get us started

Data set rules of thumb (aka Tidy Data)

Why use the command line?

The power of the command line

Why use R?

BASICS of R

RMarkdown

Assigning Variables

Assigning Variables

Arithmetic operations on functions

Arithmetic operations on functions

STRINGS

STRINGS

FACTORS

FACTORS

VECTORS

VECTORS

Basic Statistics

Basic Statistics

Getting Help

Getting Help

Creating vectors

Creating vectors

Creating vectors

Creating vectors

Drawing samples from distributions

Drawing samples from distributions

Drawing samples from distributions

Drawing samples from distributions

Drawing samples from distributions

Drawing samples from distributions

Visualizing Data

Visualizing Data

Visualizing Data

Putting plots in a single figure

Putting plots in a single figure

Example using binomial distribution

Example using binomial distribution

Example using binomial distribution

Creating Data Frames in R

Reading in Data Frames in R

Reading in Data Frames in R

Exporting Data Frames in R

Indexing in data frames

Indexing in data frames

R INTERLUDE

Some real transcriptomic data

Some real transcriptomic data

Summary stats and figures

The power of the `command line`

Why use `R`?

BASICS of `R`

`RMarkdown`