10/02/2018 & 10/04/2018

Goals for this week

  • Dynamical systems and stochastic processes
  • Probability distributions and density functions
  • Bernoulli, Binomial, Normal, Poisson
  • Data wrangling and Exploratory Data Analysis (EDA)
  • Figures - the good, the bad and the ugly
  • Grammar of Graphics (GG)
  • Creating beautiful graphics using GGPlot2
  • FOR THURSDAY - Your EDA exercises will be posted

Basics of probability and distributions

Dynamical systems

state space and transition rules

Probability of independent events

Random variables & probability

  • Probability is the expression of belief in some future outcome

  • A random variable can take on different values with different probabilities

  • The sample space of a random variable is the universe of all possible values

  • The sample space can be represented by a

    • probability distribution (for discrete variables)
    • probability density function (PDF - for continuous variables)
    • algebra and calculus are used for each respectively
    • the probabilities of an entire sample space always sum to 1.0
  • There are many families or forms of distributions or PDFs
    • depends on the nature of the dynamical system they represent
    • the exact instantiation of the form depends on their parameter values
    • in statistics we are often interested in estimating these parameter values

Bernoulli distribution

  • Describes the expected outcome of a single event with probability p

  • Example: flipping a fair coin once

\[Pr(X=\text{Head}) = \frac{1}{2} = 0.5 = p \]

\[Pr(X=\text{Tails}) = \frac{1}{2} = 0.5 = 1 - p \]

  • If the coin isn’t fair then \(p \neq 0.5\)
  • However, the probabilities still sum to 1

\[ p + (1-p) = 1 \]

  • Same is true for other binary possibilities
    • success or failure
    • yes or no answers
    • choosing an allele from a population based upon allele frequencies
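
A single Bernoulli trial can be simulated in R as a binomial with size = 1; a minimal sketch, using an illustrative value of p chosen here only for demonstration:

p <- 0.7                                          # illustrative probability of success
rbinom(n = 1, size = 1, prob = p)                 # one Bernoulli trial: returns 1 (success) or 0 (failure)
flips <- rbinom(n = 10000, size = 1, prob = p)    # many independent Bernoulli trials
mean(flips)                                       # proportion of successes, should be close to p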

Probability rules

  • Flip a coin twice
  • Represent the first flip as ‘X’ and the second flip as ‘Y’
  • First, pretend you determine the probability in advance of flipping both coins

\[ Pr(\text{X=H and Y=H}) = p \times p = p^2 \]

\[ Pr(\text{X=H and Y=T}) = p \times (1-p) \]

\[ Pr(\text{X=T and Y=H}) = (1-p) \times p \]

\[ Pr(\text{X=T and Y=T}) = (1-p) \times (1-p) = (1-p)^2 \]

  • For a fair coin (\(p = 0.5\)) each of these four ordered outcomes has probability 0.25

Probability rules

  • Now determine the probability if the H and T can occur in any order

\[ \text{Pr(one H and one T, in either order)} = \text{Pr(X=H and Y=T)} + \text{Pr(X=T and Y=H)} \]

\[ = p(1-p) + (1-p)p = 2p(1-p) \]

  • For a fair coin this is \(2 \times 0.25 = 0.5\)

  • These are the ‘and’ and ‘or’ rules of probability
    • ‘and’ means multiply the probabilities
    • ‘or’ means sum the probabilities
    • most probability distributions can be built up from these simple rules
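
These rules can be checked numerically in R; a quick sketch with an arbitrary illustrative value of p:

p <- 0.6                                   # illustrative probability of heads
p * p                                      # 'and' rule: Pr(H on flip 1 AND H on flip 2)
p * (1 - p) + (1 - p) * p                  # 'or' rule: Pr(exactly one H, in either order)
p^2 + 2 * p * (1 - p) + (1 - p)^2          # all possible outcomes together sum to 1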

Joint and conditional probability

Joint probability

\[Pr(X,Y) = Pr(X) * Pr(Y)\]

  • Note that this is true for two independent events
  • However, for two non-independent events we also have to take into account their covariance

Joint and conditional probability

Conditional probability

  • For two independent variables

\[Pr(Y|X) = Pr(Y)\text{ and }Pr(X|Y) = Pr(X)\]

  • For two non-independent variables

\[Pr(Y|X) \neq Pr(Y)\text{ and }Pr(X|Y) \neq Pr(X)\]

  • Variables that are non-independent have a shared variance, which is also known as the covariance
  • Covariance computed on variables standardized to a mean of zero and a unit standard deviation is the correlation
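
This relationship is easy to verify in R; a minimal sketch with simulated, non-independent variables:

x <- rnorm(1000)
y <- 0.5 * x + rnorm(1000, sd = 0.5)   # y depends on x, so the two covary
cov(x, y)                              # covariance (depends on the scales of x and y)
cor(x, y)                              # correlation (scale-free)
cov(scale(x), scale(y))                # covariance of the standardized variables equals cor(x, y)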

Binomial Distribution

  • A binomial distribution results from the combination of several independent Bernoulli events

  • Example - pretend that you flip 20 fair coins and record the number of heads
  • Now repeat that process and record the number of heads for each
  • We expect that most of the time we will get approximately 10 heads
  • Sometimes we get many fewer heads or many more heads
  • The distribution of probabilities for each combination of outcomes is

\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]

  • n is the total number of trials
  • k is the number of successes
  • p is the probability of success
  • q is the probability of failure (not success)
  • For the binomial, as with the Bernoulli, q = 1 - p
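
In R the binomial probability mass function is dbinom(); a minimal sketch using the 20-coin example:

n <- 20; p <- 0.5
probs <- dbinom(0:n, size = n, prob = p)   # Pr(k heads in 20 flips) for every possible k
sum(probs)                                 # probabilities over the whole sample space sum to 1
(0:n)[which.max(probs)]                    # most probable outcome is n * p = 10 heads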

Binomial Probability Distribution

R INTERLUDE

binomially distributed data

  • Pretend that you want to create your own binomial distribution
    • you could flip 20 fair coins: 20 independent Bernoulli “trials”
    • you could then replicate this 100 times: 100 “observations”
    • for each one of your replicates you record the number of heads
    • probability of heads (“success”) for any flip of a fair coin is 0.5
  • Draw a diagram of what you think the distribution would look like
    • what will the x-axis represent, and what are its values?
    • what will the y-axis represent, and what are its values?
    • where will the ‘center of mass’ be?
    • what parameter determines the value of the center of mass?
    • what parameter determines the spread of the distribution?

R INTERLUDE

binomially distributed data

  • Using R, simulate these experimental data
    • using the \(rbinom()\) function
    • look at the arguments to see how they map on to the trials, probability of each trial, and number of replicates
    • use the \(hist()\) function to plot the simulated data.
    • what do the y- and x-axes represent?
    • what is the most common outcome, in terms of the number of “successes” per trial? Does this make sense?
  • Now just change your script to do the following
    • perform 200 trials for each of the 100 replicates
    • perform 2000 trials for each of the 100 replicates
    • how do the distributions change when you go from 20 to 200 to 2000 trials?
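
One possible sketch of the simulation described in this interlude (20 flips per replicate, 100 replicates, fair coin):

n_replicates <- 100     # observations (replicates of the whole experiment)
n_flips <- 20           # independent Bernoulli trials per replicate
p <- 0.5                # probability of heads on any single flip
heads <- rbinom(n = n_replicates, size = n_flips, prob = p)   # heads counted per replicate
hist(heads, xlab = "Number of heads per replicate", main = "Simulated binomial data")
# Changing size to 200 or 2000 moves the center of mass (n * p) and widens the
# absolute spread, while the shape stays symmetric and bell-like.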

R INTERLUDE

binomially distributed data

  • Note that the binomial function incorporates both the ‘and’ and ‘or’ rules of probability
  • This part is the probability of each outcome (multiplication)

\[\large p^{k} (1-p)^{n-k}\]

  • This part (called the binomial coefficient) is the number of different ways each combination of outcomes can be achieved (summation)

\[\large {n \choose k}\]

  • Together they equal the probability of a specified number of successes

\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]
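
The two pieces can be checked against R's built-in dbinom(); a minimal sketch with n = 20, k = 10, p = 0.5:

n <- 20; k <- 10; p <- 0.5
p^k * (1 - p)^(n - k)                    # probability of any one specific ordering ('and' part)
choose(n, k)                             # number of orderings that give k successes ('or' part)
choose(n, k) * p^k * (1 - p)^(n - k)     # their product ...
dbinom(k, size = n, prob = p)            # ... matches the built-in binomial probability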

R INTERLUDE

binomially distributed data

  • Now perform your experiment using R
  • Use the rbinom function to replicate what you sketched out for coin flipping
  • Be sure to check out the other functions in the binom family
  • Make sure you know how you are mapping the parameters to values
  • Take the output and create a histogram
  • Does it look similar to what you expected?
  • Now change the values around so that
    • You have more or fewer coin flips per trial
    • You replicate the process with more or fewer trials
    • You change the coin from being fair to being slightly biased
    • You change the coin from being slightly to heavily biased

Poisson Probability Distribution

  • Another common situation in biology is when the outcomes are discrete events and we record the number of occurrences per unit of time or space

  • Some examples are
    • counts of snails in several plots of land
    • observations of the firing of a neuron in a unit of time
    • count of genes in a genome binned to units of 500 amino acids in length
  • Just like before you have ‘successes’, but
    • now you count them for each replicate
    • the replicates now are units of area or time
    • the values can now range from 0 to an arbitrarily large number

Poisson Probability Distribution

  • For example, you can examine 100 plots of land
    • count the number of snails in each plot
    • the probability of observing a plot with exactly ‘r’ snails is given below
  • Pr(Y=r) is the probability that the observed number of occurrences, Y, equals the count r

\[Pr(Y=r) = \frac{e^{-\mu}\mu^r}{r!}\]

  • Note that this is a single parameter function because \(\mu = \sigma^2\), and the two together are often just represented by \(\lambda\)

\[Pr(Y=r) = \frac{e^{-\lambda}\lambda^r}{r!}\]

  • This means that for a variable that is truly Poisson distributed:
    • the mean and variance should be equal to one another, a hypothesis that you can test
    • variables that are approximately Poisson distributed but have a larger variance than mean are often called ‘overdispersed’
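
A minimal Poisson sketch in R, using an arbitrary illustrative rate:

lambda <- 3                            # illustrative rate, e.g. mean snails per plot
counts <- rpois(10000, lambda = lambda)
mean(counts); var(counts)              # for a truly Poisson variable these should be nearly equal
dpois(0, lambda)                       # Pr(a plot contains 0 snails)
dpois(5, lambda)                       # Pr(a plot contains exactly 5 snails)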

Poisson Probability Distribution

gene length by bins of 500 nucleotides

Poisson Probability Distribution

increasing parameter values of \(\lambda\)

Log-normal PDF

Continuous version of Poisson (-ish)

Transformations to ‘normalize’ data

Binomial to Normal

Categorical to continuous

The Normal (aka Gaussian)

Probability Density Function (PDF)

Normal PDF

A function of two parameters (\(\mu\) and \(\sigma\)):

\[\large f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

where \[\large \pi \approx 3.14159\]

and \(e \approx 2.71828\) is the base of the natural logarithm

To write that a variable (v) is distributed as a normal distribution with mean \(\mu\) and variance \(\sigma^2\), we write the following:

\[\large v \sim \mathcal{N} (\mu,\sigma^2)\]
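
R evaluates this density with dnorm(); a minimal check against the formula above, using illustrative parameter values:

mu <- 50.5; sigma <- 5.5; v <- 53                                 # illustrative values
(1 / (sigma * sqrt(2 * pi))) * exp(-(v - mu)^2 / (2 * sigma^2))   # density from the formula
dnorm(v, mean = mu, sd = sigma)                                   # same value from the built-in function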

Normal PDF

estimates of mean and variance

Estimate of the mean from a single sample

\[\Large \bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i} \]

Estimate of the variance from a single sample

\[\Large s^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(x_i - \bar{x})^2} \]

The standard deviation is the square root of the variance

\[\Large s = \sqrt{s^2} \]
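
These estimators correspond directly to R's built-in functions; a minimal check on a small made-up sample:

v <- c(4.1, 5.3, 6.0, 5.5, 4.8)                 # small illustrative sample
mean(v)                                         # sample mean
sum((v - mean(v))^2) / (length(v) - 1)          # sample variance by the formula above
var(v)                                          # same value from the built-in function
sd(v)                                           # standard deviation = square root of the variance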

Normal PDF

Why is the Normal special in biology?

Parent-offspring resemblance

Genetic model of complex traits

Distribution of \(F_2\) genotypes

really just binomial sampling

Why else is the Normal special?

  • The normal distribution is immensely useful because of the central limit theorem
  • The central limit theorem states that, under mild conditions, the mean of many random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution
  • One can think of numerous situations in which many factors contribute to a biological process, such as when multiple genes contribute to a phenotype
  • In addition, the central limit theorem applies whenever variance is introduced by stochastic factors or sampling error
  • Thus, normal distributions occur throughout biology and biostatistics
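
A minimal illustration of the central limit theorem in R, using a strongly skewed parent distribution (exponential, chosen only for illustration):

skewed_pop <- rexp(100000, rate = 1)                           # right-skewed parent distribution
sample_means <- replicate(1000, mean(sample(skewed_pop, 30)))  # 1000 means of samples of size 30
hist(skewed_pop, main = "Parent distribution: strongly skewed")
hist(sample_means, main = "Sample means: approximately normal")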

z-scores of normal variables

Mean centering and ranging

  • Often we want to make variables more comparable to one another
  • For example, consider measuring the leg length of mice and of elephants
  • Which animal has longer legs in absolute terms?
  • Which has longer legs on average proportional to their body size? Which has more variation proportional to their body size?
  • A good way to answer this last question is to use ‘z-scores’, which are standardized to a mean of 0 and a s.d. of 1
  • We can modify any normal distribution to have a mean of 0 and a standard deviation of 1
  • Another term for this is the standard normal distribution

\[\huge z_i = \frac{(x_i - \bar{x})}{s}\]

R INTERLUDE

Simulate a population and sample it!

Simulate a population of 10,000 individual values for a variable x:

x <- rnorm(10000, mean=50.5, sd=5.5) 

Take 1000 random samples of size 20, take the mean of each sample, and plot the distribution of these 1000 sample means.

x_sample_means <- NULL
for (i in 1:1000) {
  x_samp <- sample(x, 20, replace = FALSE)
  x_sample_means[i] <- mean(x_samp)
}
hist(x_sample_means)   # distribution of the 1000 sample means

For one of your samples, use the equation from the previous slide to transform the values to z-scores.

Plot the distribution of the z-scores, and calculate the mean and the standard deviation.
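
One way to finish this exercise, continuing from the loop above (x_samp is the last sample drawn):

z <- (x_samp - mean(x_samp)) / sd(x_samp)   # z-scores for one sample
hist(z)                                     # same shape as the sample, recentered and rescaled
mean(z)                                     # essentially 0
sd(z)                                       # exactly 1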

R INTERLUDE

Simulate a population and sample it!

  • Now, create a second population (called x.lognorm) by log-transforming all 10,000 values from population “x”
  • Plot the histogram of these data
  • Repeat the taking of 1000 samples of size 20
  • Take the mean of each sample
  • Plot the distribution of these 1000 sample means from the known lognormal population.
  • What does the distribution of the sampled means look like?

More than one variable

Bivariate normal, correlation and covariation

Covariance and Correlation

Linear Models

Show the data!

Anscombe’s Quartet

  • Mean of x in each case 9 (exact)

  • Variance of x in each case 11 (exact)

  • Mean of y in each case 7.50 (to 2 decimal places)

  • Variance of y in each case 4.122 or 4.127 (to 3 decimal places)

  • Correlation between x and y in each case 0.816 (to 3 decimal places)

  • Linear regression line in each case \(y =3.00 + 0.50x\) (to 2 decimal places)
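
R ships with Anscombe's quartet as the built-in data frame anscombe, so these summaries can be reproduced directly; a minimal sketch:

sapply(anscombe[, c("x1", "x2", "x3", "x4")], mean)            # every x mean is 9
sapply(anscombe[, c("y1", "y2", "y3", "y4")], var)             # every y variance is about 4.12
cor(anscombe$x1, anscombe$y1); cor(anscombe$x4, anscombe$y4)   # both about 0.816
coef(lm(y1 ~ x1, data = anscombe))                             # intercept ~3.00, slope ~0.50
# Plotting the four x-y pairs shows how different the data are despite identical summaries.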

Sampling and estimation

Accuracy vs. Precision

  • Accuracy is the closeness of an estimated value to its true value
  • Precision is the closeness of repeated estimates to one another
  • Our goal is to have unbiased estimates that are the most precise
  • We have to estimate parameters and test hypotheses by taking samples that approximate the underlying distribution
  • The goal of replication is to quantify variation at as many levels in a study as possible
  • The goal of randomization is to avoid bias as much as possible

Parameter Estimation

  • Parametric formulas - exist only for a few special cases, like the sample mean and its standard error
  • Ordinary Least Squares (OLS) - optimized procedure that produces one definitive result, easy to use but no estimates of confidence
  • Resampling - bootstrapping and randomization
  • Maximum Likelihood (ML) - Can provide model-based estimates with confidence, but harder to calculate
  • Bayesian Approaches - Incorporates prior information into ML estimation

Sampling variation of a parameter

Estimation

  • Estimation is the process of inferring a population parameter from sample data
  • The value of one sample estimate is almost never the same as the population parameter because of random sampling error
  • Imagine taking multiple samples, each will be different from the true value
  • Most will be close, but some will be far away
  • The expected value of a very large number of sample estimates is the value of the parameter being estimated
  • Sampling distribution of an estimate
    • all values we might have obtained from our sample
    • probabilities of occurrence
  • Standard error of an estimate
    • standard deviation of a sampling distribution
    • measures the precision of the parameter estimate
    • NO ESTIMATE IS USEFUL WITHOUT IT!

Estimation and confidence intervals

Standard Error of the Mean (SEM)

\[\huge \sigma_{\bar{x}} \approx s_{\bar{x}} = \frac{s}{\sqrt{n}} \]

  • where \(s_{\bar{x}}\) is the estimated standard error of the distribution of the mean estimates
  • which is usually just referred to as the ‘standard error of the mean’ (SEM)
  • note that this is not the standard deviation of the original distribution
  • importantly, the SEM will go down as the sample size increases
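
A quick simulation showing that s divided by the square root of n tracks the spread of the sampling distribution of the mean, and shrinks as n grows (all values are illustrative):

pop <- rnorm(100000, mean = 50, sd = 10)               # an assumed 'population'
means_n10  <- replicate(2000, mean(sample(pop, 10)))   # sampling distribution of the mean, n = 10
means_n100 <- replicate(2000, mean(sample(pop, 100)))  # sampling distribution of the mean, n = 100
sd(means_n10);  10 / sqrt(10)                          # both near 3.16
sd(means_n100); 10 / sqrt(100)                         # both near 1.0 - larger n, smaller SEM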

Standard Error

  • Sadly, most other kinds of estimates do not have such a simple formula for their standard error.

  • What to do?

  • One answer: make your own sampling distribution for the estimate using the “bootstrap”.

  • Method invented by Efron (1979).

Parameter Estimation

Bootstrap Algorithm

  • Step 1: use R to take a random sample of individuals from the original data, with replacement and equal in size to the original sample
  • Step 2: calculate the estimate using the measurements in the bootstrap sample from step 1 - this is the first bootstrap replicate estimate
  • Step 3: repeat steps 1 and 2 a large number of times (1000 times is reasonable)
  • Step 4: calculate the sample standard deviation of all the bootstrap replicate estimates obtained in step 3
  • The resulting quantity is called the bootstrap standard error
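
A minimal R sketch of this algorithm for the mean, assuming a numeric vector of measurements called dat (simulated here only as a stand-in):

dat <- rnorm(50, mean = 10, sd = 2)                    # stand-in data; use real measurements in practice
boot_est <- replicate(1000, {                          # steps 1-3: 1000 bootstrap replicate estimates
  boot_samp <- sample(dat, size = length(dat), replace = TRUE)   # resample with replacement
  mean(boot_samp)                                      # the estimate recomputed on the pseudo-sample
})
sd(boot_est)                                           # step 4: the bootstrap standard error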

The etymology of the term ‘bootstrap’

Why the bootstrap is good

  • Can be applied to almost any sample statistic
    • This includes means, proportions, correlations, regression
  • Works when there is no ready formula for a standard error
    • For example the median, trimmed mean, correlation, eigenvalue, etc.
  • Is nonparametric, so doesn’t require normally-distributed data
  • Works for estimates based on complicated sampling procedures or calculations
    • For example, it is used in phylogeny estimation

R INTERLUDE

Bootstrapping to produce a confidence interval

Examine the data_frames.R script

Examine the simple_boostrap.R script

On your own - use R to figure out the bootstrap distribution for other parameters (such as variance).

R INTERLUDE

Bootstrapping to produce a confidence interval

x <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 1.6, 2.0, 2.0)

  • Use R to make 1000 “pseudo-samples” of size 10 (with replacement), using a for loop as before.
  • Name the pseudo-sample object “xboot”, and name the means of the xboot samples “z”.
  • Plot the histogram of the resampled means, and calculate the standard deviation of the sample means (the bootstrap SEM) using the sd() function.
  • How does it compare with the ordinary standard error of the mean calculated from the original, real sample?

sd(x)/sqrt(10)
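
One possible sketch of this exercise, using the object names given above:

z <- NULL
for (i in 1:1000) {
  xboot <- sample(x, size = 10, replace = TRUE)   # pseudo-sample of size 10, with replacement
  z[i] <- mean(xboot)                             # mean of this bootstrap replicate
}
hist(z)                   # bootstrap sampling distribution of the mean
sd(z)                     # bootstrap SEM, for comparison with sd(x)/sqrt(10)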

  • Now take one of the genes from the GacuRNAseq_Subset.csv data and obtain a bootstrapped estimate of the mean expression level.

R INTERLUDE

Bootstrapping to produce a confidence interval

  • The 2.5th and 97.5th percentiles of the bootstrap sampling distribution are a passable 95% confidence interval
  • Note that no transformations or normality assumptions needed
  • You can use the quantile() function to calculate these
  • On your own - use R to figure out the bootstrap distribution for other parameters (such as variance).
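
A one-line sketch, assuming the bootstrap replicate means are stored in z as in the previous interlude:

quantile(z, probs = c(0.025, 0.975))   # approximate 95% bootstrap confidence interval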

Parameter Estimation

Ordinary Least Squares (OLS)

  • Algorithmic approach to parameter estimation
  • One of the oldest and best developed statistical approaches
  • Used extensively in linear models (ANOVA and regression)
  • By itself only produces a single best estimate (No C.I.’s)
  • Can use resampling approaches to get C.I.’s
  • Many OLS estimators have been duplicated by ML estimators

Parameter Estimation

Ordinary Least Squares (OLS)