10/09/2018 & 10/11/2018

Goals for this week

  • Finish point estimation
  • Tidy data wrangling and Exploratory Data Analysis (EDA)
  • Hypothesis testing and Types of errors
  • Figures - the good, the bad and the ugly
  • Grammar of Graphics (GG) and Creating beautiful graphics using GGPlot2
  • Introduction to Linear Models
  • Simple linear regression and multiple linear regression (next week)
  • Git and GitHub (starting next Tuesday)

Hypothesis testing, test statistics, p-values

What is a hypothesis

  • A statement of belief about the world
  • Need a critical test to
    • accept or reject the hypothesis
    • compare the relative merits of different models
  • This is where statistical sampling distributions come into play

Hypothesis tests

\(H_0\) : Null hypothesis : Ponderosa pine trees are the same height on average as Douglas fir trees

\(H_A\) : Alternative Hypothesis: Ponderosa pine trees are not the same height on average as Douglas fir trees

Hypothesis tests

  • What is the probability that we would reject a true null hypothesis?

  • What is the probability that we would accept a false null hypothesis?

  • How do we decide when to reject a null hypothesis and support an alternative?

  • What can we conclude if we fail to reject a null hypothesis?

  • What parameter estimates of distributions are important to test hypotheses?

Null and alternative hypotheses

population distributions


Statistical sampling distributions

  • Statistical tests provide a way to perform critical tests of hypotheses
  • Just like raw data, statistics are random variables and depend on sampling distributions of the underlying data
  • The particular form of the statistical distribution depends on the test statistic and parameters such as the degrees of freedom that are determined by sample size.

Statistical sampling distributions

  • In many cases we create a null statistical distribution that models the distribution of a test statistic under the null hypothesis.
  • Similar to point estimates, we calculate an observed test statistic value for our data
  • Then see how probable it was by comparing against the null distribution
  • The probability of seeing that value or greater is called the p-value of the statistic

Four common statistical distributions

The t-test and t sampling distribution

\(H_0\) : Null hypothesis : Ponderosa pine trees are the same height on average as Douglas fir trees

\[H_0 : \mu_1 = \mu_2\]

\(H_A\) : Alternative Hypothesis: Ponderosa pine trees are not the same height on average as Douglas fir trees

\[H_A : \mu_1 \neq \mu_2\]

The t-test and t sampling distribution

\[\huge t = \frac{(\bar{y}_1-\bar{y}_2)}{s_{\bar{y}_1-\bar{y}_2}} \]

where

\[ s_{\bar{y}_1-\bar{y}_2} = \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \]

is the standard error of the difference between the means, with \(s_p^2\) the pooled sample variance.
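
As a quick check, the t statistic can be computed by hand and compared against R's built-in t.test(). A sketch using simulated tree heights (the means and standard deviations below are made up for illustration, not real measurements):

```r
# Two-sample t statistic by hand vs. t.test(), on simulated tree heights
set.seed(42)
pine <- rnorm(30, mean = 25, sd = 5)   # hypothetical Ponderosa pine heights (m)
fir  <- rnorm(30, mean = 28, sd = 5)   # hypothetical Douglas fir heights (m)

n1 <- length(pine); n2 <- length(fir)
# pooled variance (assumes equal population variances)
sp2 <- ((n1 - 1) * var(pine) + (n2 - 1) * var(fir)) / (n1 + n2 - 2)
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))   # standard error of the mean difference
t_by_hand <- (mean(pine) - mean(fir)) / se

# t.test() with var.equal = TRUE uses the same pooled formula
t_builtin <- unname(t.test(pine, fir, var.equal = TRUE)$statistic)
all.equal(t_builtin, t_by_hand)        # TRUE
```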

The t-test and t sampling distribution

under different degrees of freedom

The t-test and t sampling distribution

one-tailed test

The t-test and t sampling distribution

two-tailed test

Assumptions of parametric t-tests

  • The theoretical t-distributions for each degree of freedom were calculated for populations that are:
    • normally distributed
    • have equal variances (if comparing two means)
    • observations are independent (randomly drawn)
  • This is an example of a parametric test
  • What do you do if there is non-normality?
    • nonparametric tests such as Mann-Whitney-Wilcoxon
    • randomization tests to create a null distribution

Type 1 and Type 2 errors

Components of hypothesis testing

  • p-value = the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the data, if the null hypothesis is true
  • alpha = the critical p-value cutoff chosen for an experiment; the Type I error rate we are willing to tolerate
  • beta = the probability of accepting (failing to reject) a false null hypothesis; the Type II error rate
  • Power = the probability that a test will reject a false null hypothesis (1 - beta). It depends on effect size, sample size, chosen alpha, and population standard deviation
  • Multiple testing = performing the same or similar tests multiple times - need to correct
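
Base R's power.t.test() makes the dependence of power on sample size, effect size, alpha, and standard deviation concrete. A small sketch (the numbers are arbitrary):

```r
# Power of a two-sample t-test for a true mean difference of 2 with sd = 5,
# at alpha = 0.05, for two different per-group sample sizes
pow_10  <- power.t.test(n = 10,  delta = 2, sd = 5, sig.level = 0.05)$power
pow_100 <- power.t.test(n = 100, delta = 2, sd = 5, sig.level = 0.05)$power
pow_10; pow_100   # power increases with sample size
```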

Null distributions and p-values

Why do we use \(\alpha = 0.05\) as a cutoff?

R INTERLUDE

Parametric t-test in R

  1. Make a dummy data set that contains one continuous and one categorical value (with two levels). Draw individuals of the two different factor levels from normal distributions with slightly different means. Take 100 obsv. per factor level.

  2. Now, perform a t-test to see if the two populations are statistically different from one another

boxplot(continuous_variable ~ cat_variable, data = dataset_name) 
t.test(continuous_variable ~ cat_variable, data = dataset_name) 
  3. Repeat steps 1 and 2 above but use different sample sizes (e.g. 10, then 100, then 1,000, then 10,000 obsv. per factor level)

  4. Repeat steps 1 and 2 above but now test between a categorical and continuous variable in the perchlorate data set (or any of your favorite data sets)
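
One possible sketch of steps 1 and 2 (all variable names here are placeholders, not from the course scripts):

```r
# Step 1: dummy data - one continuous and one two-level categorical variable,
# drawn from normal distributions with slightly different means
set.seed(1)
n <- 100                                              # obsv. per factor level
dat <- data.frame(
  group = rep(c("A", "B"), each = n),
  value = c(rnorm(n, mean = 10), rnorm(n, mean = 11))
)

# Step 2: visualize, then test whether the two groups differ
boxplot(value ~ group, data = dat)
t.test(value ~ group, data = dat)
```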

R INTERLUDE

Parametric t-test in R

  • Now, apply the equations for the t-statistic above to your simulated sample to perform the same test, but using a resampling approach
    • hint: go back to your Bootstrapping script
  • Use this approach to calculate a distribution of random t statistics assuming the null hypothesis of no difference is true.
  • Now, compare your observed value to your created null distribution.
  • What is the probability of that value occurring by chance under the null?
    • This is your p-value! (Assume you’re doing a one-tailed test)
    • hint: Proportion of values in the null distribution equal to or larger than the observed t-value = p-value
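
A minimal sketch of this resampling approach on simulated data (shuffling the group labels builds the null distribution of no difference):

```r
# Null distribution of t statistics built by permuting group labels
set.seed(2)
grp <- rep(c("A", "B"), each = 100)
val <- c(rnorm(100, mean = 11), rnorm(100, mean = 10))  # A really is larger

obs_t <- unname(t.test(val ~ grp)$statistic)            # observed t

# Under the null, labels are exchangeable: shuffle them and recompute t
null_t <- replicate(1000, unname(t.test(val ~ sample(grp))$statistic))

# One-tailed p-value: proportion of null t values >= the observed t
p_value <- mean(null_t >= obs_t)
```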

R INTERLUDE

Permutation t-test in R

  • Randomization test example 1
  • Data: The number of successful broods of pseudoscorpion females that were mated twice to either a single male (SM) or two different males (DM).
    • SM: 4 0 3 1 2 3 4 2 4 2 0 2 0 1 2 6 0 2 3 3
    • DM: 2 0 2 6 4 3 4 4 2 7 4 1 6 3 6 4
  • Observed difference in the means between the two
    • \(\bar{Y}_{SM}\) − \(\bar{Y}_{DM}\) = 2.2−3.625 = −1.425
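
These values can be checked directly in R:

```r
# Successful broods per female, from the slide
SM <- c(4, 0, 3, 1, 2, 3, 4, 2, 4, 2, 0, 2, 0, 1, 2, 6, 0, 2, 3, 3)
DM <- c(2, 0, 2, 6, 4, 3, 4, 4, 2, 7, 4, 1, 6, 3, 6, 4)

mean(SM)             # 2.2
mean(DM)             # 3.625
mean(SM) - mean(DM)  # -1.425
```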

R INTERLUDE

Permutation t-test in R

  • Steps of the randomization test
  • First, create a randomized data set in which the measurements are randomly reassigned (without replacement) to the two groups
    • For example, the following
    • SM: 4 0 7 4 2 2 2 1 4 4 0 3 3 4 6 2 4 6 0 0
    • DM: 2 2 3 3 2 3 3 4 1 2 1 4 6 6 2 0
  • Second, calculate the test statistic measuring the association between variables
    • difference between group means
    • this is the first permutation replicate
  • Repeat steps 1 and 2 many times to produce many permutation replicates
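
The full procedure can be sketched in a few lines of R (10,000 replicates):

```r
# Randomization test for the pseudoscorpion brood data
set.seed(3)
SM <- c(4, 0, 3, 1, 2, 3, 4, 2, 4, 2, 0, 2, 0, 1, 2, 6, 0, 2, 3, 3)
DM <- c(2, 0, 2, 6, 4, 3, 4, 4, 2, 7, 4, 1, 6, 3, 6, 4)

broods   <- c(SM, DM)
n_SM     <- length(SM)
obs_diff <- mean(SM) - mean(DM)                      # -1.425

# Steps 1 and 2, repeated: reassign values to groups without replacement
# and recompute the difference in group means
null_diff <- replicate(10000, {
  shuffled <- sample(broods)
  mean(shuffled[1:n_SM]) - mean(shuffled[-(1:n_SM)])
})

# One-tailed p-value: proportion of replicates as extreme as the observed
p_value <- mean(null_diff <= obs_diff)               # close to 0.0176
```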

R INTERLUDE

Permutation t-test in R

  • Below is the null distribution of the test statistic from 10,000 replications
    • This is produced by the randomized assignment of values to each group (SM or DM)
    • Thus this is the distribution we’d expect under the null hypothesis of no difference
  • The observed value from the data is –1.425
  • The area in red is the tail beyond this observed value and is therefore the permutation p-value

R INTERLUDE

Permutation t-test in R

  • Of these 10,000 randomizations, only 176 had a test statistic (difference between means) equal to or less than the observed value, –1.425.
  • You can use the simulated null distribution in the same way as t or F distribution in conventional tests.
  • The proportion of values in the null distribution equal to or less than the observed value of the test statistic is 176/10000 = 0.0176; this is the p-value.

Data wrangling and exploratory data analysis (EDA)

Tidyverse family of packages


  • Hadley Wickham and others have written R packages to modify data

  • These packages do many of the same things as base functions in R

  • However, they are specifically designed to do them faster and more easily

  • Wickham also wrote the ggplot2 package for elegant graphics creation

  • GG stands for ‘Grammar of Graphics’

Example of a tibble


Types of vectors of data


int stands for integers

dbl stands for doubles, or real numbers

chr stands for character vectors, or strings

dttm stands for date-times (a date + a time)

lgl stands for logical, vectors that contain only TRUE or FALSE

fctr stands for factors, which R uses to represent categorical variables with fixed possible values

date stands for dates

Types of vectors of data

  • Logical vectors can take only three possible values:
    • FALSE
    • TRUE
    • NA which is ‘not available’.
  • Integer and double vectors are known collectively as numeric vectors.
    • In R numbers are doubles by default.
  • Integers have one special value: NA, while doubles have four:
    • NA
    • NaN which is ‘not a number’
    • Inf
    • -Inf
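
These special values are easy to poke at in the console:

```r
typeof(1L)   # "integer"
typeof(1)    # "double" - plain numbers are doubles by default

is.na(NA)    # TRUE
is.na(NaN)   # TRUE - NaN also counts as missing
is.nan(NA)   # FALSE - but NA is not "not a number"

1 / 0        # Inf
-1 / 0       # -Inf
0 / 0        # NaN
```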

Types of vectors of data

  • R will also implicitly coerce the lengths of vectors.
    • This is called vector recycling
    • The shorter vector is repeated (recycled) until it matches the length of the longer vector
    • Recycling is silent, except when the length of the longer vector is not an integer multiple of the length of the shorter, in which case base R issues a warning
  • The vectorised functions in the tidyverse are stricter and will throw an error when you recycle anything other than a scalar.
    • If you do want to recycle something other than a scalar
    • You’ll need to do it yourself with rep()
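
A quick demonstration of recycling, and of doing it explicitly with rep():

```r
# Lengths 4 and 2: c(10, 20) is recycled twice, silently
c(1, 2, 3, 4) + c(10, 20)        # 11 22 13 24

# Lengths 3 and 2: still recycled, but base R emits a warning
c(1, 2, 3) + c(10, 20)

# Explicit recycling, as the tidyverse expects you to do it
c(1, 2, 3, 4) + rep(c(10, 20), times = 2)   # 11 22 13 24
```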

Key functions in dplyr for vectors

  • Pick observations by their values with filter().
  • Reorder the rows with arrange().
  • Pick variables by their names with select().
  • Create new variables with functions of existing variables with mutate().
  • Collapse many values down to a single summary with summarise().

filter(), arrange() & select()

filter(flights, month == 11 | month == 12)
arrange(flights, year, month, day)
select(flights, year, month, day)

mutate() & transmute()

This function will add a new variable that is a function of other variable(s)

mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

This function will keep only the new variables, dropping the originals

transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

group_by() & summarise()

This first function allows you to aggregate data by values of categorical variables

by_day <- group_by(flights, year, month, day)

Once you have done this aggregation, you can then calculate values (in this case the mean) of other variables split by the new aggregated levels of the categorical variable

summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
  • Note - you can get a lot of missing values!
  • That’s because aggregation functions obey the usual rule of missing values:
    • if there’s any missing value in the input, the output will be a missing value.
    • fortunately, all aggregation functions have an na.rm argument which removes the missing values prior to computation
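
The rule is easy to see with mean():

```r
x <- c(1, 2, NA, 4)

mean(x)                # NA - any missing input makes the output missing
mean(x, na.rm = TRUE)  # 2.333... - NAs dropped before computing
```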

R INTERLUDE

Playing with Tidyverse functions

  • Examine my R script ‘TidyVerse.R’
  • Step 1 - Read in the GacuRNAseq_Subset.csv dataset
  • Step 2 - Make the dataset into a tibble
  • Step 3 - Select all of the categorical variables and only 4 of the gene count variables and put them into a new variable
  • Step 4 - Mutate each of the 4 gene expression values by performing a square root transformation making a new variable for each of the original (keep all 8 in the dataset).
  • Step 5 - Summarize the mean and standard deviation of each of the gene count variables, grouped by the ‘sex’, ‘population’, and ‘treatment’ categorical variables
  • Step 6 - Create a histogram for one of the original gene expression variables, and one of the derived variables
  • Step 7 - Create a box plot for one of the original gene expression variables, and one of the derived variables, split by treatment
  • Step 8 - Write the final data table to a .csv file and one of the figures to a .pdf file
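
A hedged sketch of the whole pipeline. The gene column names (gene_1, gene_2) and the factor levels are placeholders, since the real columns of GacuRNAseq_Subset.csv aren't shown here; with the real file you would start from read_csv("GacuRNAseq_Subset.csv") instead of the dummy tibble.

```r
library(tidyverse)

# Placeholder data standing in for GacuRNAseq_Subset.csv (steps 1-2);
# the real dataset's column names and levels will differ
set.seed(4)
gacu <- tibble(
  sex        = sample(c("M", "F"),   20, replace = TRUE),
  population = sample(c("P1", "P2"), 20, replace = TRUE),
  treatment  = sample(c("T1", "T2"), 20, replace = TRUE),
  gene_1     = rpois(20, 50),
  gene_2     = rpois(20, 30)
)

# Steps 3-4: keep the categorical variables plus the gene counts, and add
# square-root-transformed copies alongside the originals
gacu2 <- gacu %>%
  select(sex, population, treatment, starts_with("gene_")) %>%
  mutate(across(starts_with("gene_"), sqrt, .names = "sqrt_{.col}"))

# Step 5: mean and sd of each gene count, grouped by the categorical variables
gene_summary <- gacu2 %>%
  group_by(sex, population, treatment) %>%
  summarise(across(starts_with("gene_"), list(mean = mean, sd = sd)),
            .groups = "drop")

# Steps 6-7: a histogram and a treatment-split box plot
ggplot(gacu2, aes(x = gene_1)) + geom_histogram(bins = 10)
ggplot(gacu2, aes(x = treatment, y = gene_1)) + geom_boxplot()

# Step 8: write the table to .csv and the last figure to .pdf
write_csv(gacu2, "gacu_transformed.csv")
ggsave("gene_1_boxplot.pdf")
```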