10/23/2018 and 10/25/2018

Tuesday

General Linear Models

Goals for this week

  • Finish simple linear regression
  • Analysis of residuals
  • How to report your statistical results
  • Multiple linear regression
  • Git and GitHub
  • One factor ANOVA
  • Experimental Design

Model fitting in regression

\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]

full model - \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)

reduced model - \(y_i = \beta_0 + 0 \cdot x_i + \epsilon_i\)

  1. Fit a “reduced” model without the slope term (the null hypothesis, \(H_0\))
  2. Fit the “full” model with the slope term added back
  3. Compare the fit of the full and reduced models with an F test (sketched in R below)
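
The same comparison can be run explicitly in R. A minimal sketch, assuming a hypothetical data frame dat with a response y and predictor x:

# fit the reduced (intercept-only) and full models, then compare them with an F test
reduced_lm <- lm(y ~ 1, data = dat)
full_lm <- lm(y ~ x, data = dat)
anova(reduced_lm, full_lm)

This is equivalent to the test of the slope reported by summary(full_lm).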

Hypothesis tests in linear regression

  • Estimation of the variation that is explained by the model (SS_model)

  • \(SS_{model} = SS_{total(reduced)} - SS_{residual(full)}\)

  • The variation that is unexplained by the model (SS_residual)

\(r^2\) as a measure of model fit

\[r^2 = SS_{regression}/SS_{total} = 1 - (SS_{residual}/SS_{total})\] or \[r^2 = 1 - (SS_{residual(full)}/SS_{total(reduced)})\]

This is the proportion of the variance in Y that is explained by X.
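
As an illustration, these quantities can be computed by hand from a fitted model (reusing the hypothetical dat and full_lm from the sketch above):

# r^2 computed from the residual and total sums of squares
ss_residual <- sum(residuals(full_lm)^2)
ss_total <- sum((dat$y - mean(dat$y))^2)
1 - ss_residual/ss_total

The result should match the Multiple R-squared value reported by summary(full_lm).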

Residual Analysis

did we meet our assumptions?

  • Independent errors (residuals)
  • Equal variance of residuals in all groups
  • Normally-distributed residuals
  • Robustness to departures from these assumptions is improved when sample size is large and design is balanced

Residual Analysis

did we meet our assumptions?

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

Residual Analysis

Handling violations of the assumptions of linear models

  • What if your residuals aren’t normal because of outliers?

  • Nonparametric methods exist, but these don’t provide parameter estimates with CIs.

  • Robust regression (e.g., rlm() in the MASS package; see the sketch below)

  • Randomization tests
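
A brief sketch of both options, again using the hypothetical dat data frame (rlm() is in the MASS package):

# robust regression: iteratively down-weights observations with large residuals
library(MASS)
robust_lm <- rlm(y ~ x, data = dat)
summary(robust_lm)

# randomization test: permute y to build a null distribution for the slope
obs_slope <- coef(lm(y ~ x, data = dat))[2]
perm_slope <- replicate(1000, coef(lm(sample(dat$y) ~ dat$x))[2])
mean(abs(perm_slope) >= abs(obs_slope))   # two-tailed permutation p-value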

Anscombe’s quartet again

what would residual plots look like for these?

Residual Plots

Spotting assumption violations

Residuals

leverage and influence

  • 1 is an outlier for both Y and X
  • 2 is not an outlier for either Y or X but has a high residual
  • 3 is an outlier in just X - and thus has high leverage - and therefore has high influence as measured by Cook’s D

Residuals

leverage and influence

  • Residuals - as the residuals are the differences between the observed and predicted values along a vertical plane, they provide a measure of how much of an outlier each point is in y-space (on the y-axis). The patterns of residuals against predicted y values (the residual plot) are also useful diagnostic tools for investigating the linearity and homogeneity of variance assumptions.

  • Leverage - a measure of how much of an outlier each point is in x-space (on x-axis) and thus only applies to the predictor variable. (Values > 2*(2/n) for simple regression are cause for concern)

  • Cook’s D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values ≥ 1 (or even approaching 1) correspond to highly influential observations.
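
All three diagnostics can be extracted directly from a fitted lm object. A sketch, again assuming the hypothetical full_lm fit from above:

residuals(full_lm)        # outlyingness in y-space
hatvalues(full_lm)        # leverage (outlyingness in x-space); compare to 2*(2/n)
cooks.distance(full_lm)   # Cook's D (influence); values approaching 1 are worrying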

R INTERLUDE

residual analyses

Let’s look at the zebrafish diet data.

# assumes zfish_diet has already been read in (e.g. with read.table)
x <- zfish_diet$Protein
y <- zfish_diet$Weight

# fit the simple linear regression and inspect the parameter estimates
zfish_lm <- lm(y ~ x)
summary(zfish_lm)

# scatterplot with the fitted regression line
plot(y ~ x, col = "blue")
abline(zfish_lm, col = "red")

# distribution of the residuals
hist(residuals(zfish_lm), breaks = 30)

# residuals against the fitted values and against the predictor
plot(residuals(zfish_lm) ~ fitted.values(zfish_lm))
plot(residuals(zfish_lm) ~ x)

R INTERLUDE

residual analyses

Or apply the plot() function to the linear model object directly

plot(zfish_lm)

Figure out what these plots are telling you

R INTERLUDE

How about using the influence.measures() function?

influence.measures(zfish_lm)

Do we have any high-leverage observations we need to worry about? Now go back and try to recreate these graphs in ggplot2 (one possible version is sketched below).
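
One possible ggplot2 version of the residuals-vs-fitted plot (a sketch; the aesthetics are up to you):

library(ggplot2)

# residuals against fitted values for the zebrafish model, drawn with ggplot2
zfish_diag <- data.frame(fits = fitted.values(zfish_lm),
                         resids = residuals(zfish_lm))
ggplot(zfish_diag, aes(x = fits, y = resids)) +
  geom_point(colour = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed")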

Different Types of Regression

Model 1 and Model 2

Model I regression (ordinary least squares) assumes the predictor X is measured without error; Model II methods, such as major axis (MA) and standardized major axis (SMA) regression, are used when both variables are measured with error.

Fitting more complicated models

Smoothers and local regression

  • running medians (or less robust running means) generate predicted values that are the medians of the responses in the bands surrounding each observation.
  • loess and lowess (locally weighted scatterplot smoothing) - fit least squares regression lines to successive subsets of the observations, weighted according to their distance from the target observation, and thus depict changes in the trends throughout the data cloud (see the sketch after this list).
  • kernel smoothers - new smoothed y-values are computed as the weighted averages of points within a defined window (bandwidth) or neighbourhood of the original x-values. Hence the bandwidth depends on the scale of the x-axis. Weightings are determined by the type of kernel smoother specified. Nevertheless, the larger the window, the greater the degree of smoothing.
  • splines - join together a series of polynomial fits that have been generated after the entire data cloud is split up into a number of smaller windows, the widths of which determine the smoothness of the resulting piecewise polynomial.
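
A quick sketch of fitting and overlaying two of these smoothers, again using the hypothetical dat data frame:

# overlay a loess fit and a smoothing spline on the scatterplot
plot(y ~ x, data = dat, col = "blue")
lo_fit <- loess(y ~ x, data = dat)
ord <- order(dat$x)
lines(dat$x[ord], fitted(lo_fit)[ord], col = "red")    # loess
lines(smooth.spline(dat$x, dat$y), col = "darkgreen")  # cubic smoothing spline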

R INTERLUDE

Model I and II regression

x <- zfish$SL
y <- zfish$Weight

size_lm <- lm(y~x)
summary(size_lm)

plot(y~x, col = "blue")
abline(size_lm, col = "red")

Note that when we run the lm() function we use Model I regression. Is this appropriate?

R INTERLUDE

Model I and II regression

install.packages("lmodel2")
library(lmodel2)

size_lm_ModII <- lmodel2(y ~ x)
size_lm_ModII

# add the SMA line using the intercept and slope reported by lmodel2
abline(a = 0.1912797, b = 0.1097880, lty = 2)

You should have generated OLS, MA, and SMA regression models; the last line above plots the SMA line from its parameter estimates.
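
Rather than hard-coding the intercept and slope, you can pull them out of the lmodel2 object. A sketch, assuming the SMA fit appears in the regression.results table of the returned object:

# extract the SMA intercept and slope from the lmodel2 fit and add the line
res <- size_lm_ModII$regression.results
sma <- res[res$Method == "SMA", ]
abline(a = sma$Intercept, b = sma$Slope, lty = 2)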

R INTERLUDE

Model I and II regression

  • Say you measured green fluorescent protein abundance in cell culture across several key, carefully controlled temperatures.

  • You ultimately want to know whether protein expression changes as a function of temperature in your experiment.

  • Read in the gfp_temp.tsv data and perform a regression analysis (a starting sketch is given after the questions below).

  • Address the following questions
    • Which is X and which is Y?
    • Are residual assumptions met?
    • What model of regression should we use in this case and why?
    • What is our decision regarding our null hypothesis of no relationship?
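
A starting-point sketch for this exercise (the column names temperature and gfp are assumptions; check the real names with head() and adjust accordingly):

# read the tab-delimited data and fit a simple linear model
gfp_dat <- read.table("gfp_temp.tsv", header = TRUE, sep = "\t")
head(gfp_dat)   # check the actual column names

gfp_lm <- lm(gfp ~ temperature, data = gfp_dat)   # assumed column names
summary(gfp_lm)
plot(gfp_lm)    # residual diagnostics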

How to present your statistical results

Style of a results section

  • Write the text of the Results section concisely and objectively.
  • The passive voice will likely dominate here, but use the active voice as much as possible.
  • Use the past tense.
  • Avoid repetitive paragraph structures. Do not interpret the data here.

Function of a results section

  • The function is to objectively present your key results, without interpretation, in an orderly and logical sequence using both text and illustrative materials (Tables and Figures).

  • The results section always begins with text, reporting the key results and referring to figures and tables as you proceed.

  • The text of the Results section should be crafted to follow this sequence and highlight the evidence needed to answer the questions/hypotheses you investigated.

  • Important negative results should be reported, too. Authors usually write the text of the results section based upon the sequence of Tables and Figures.

Summaries of the statistical analyses

May appear either in the text (usually parenthetically) or in the relevant Tables or Figures (in the legend or as footnotes to the Table or Figure). Each Table and Figure must be referenced in the text portion of the results, and you must tell the reader what the key result(s) is that each Table or Figure conveys.

  • Tables and Figures are assigned numbers separately and in the sequence that you will refer to them from the text.
    • The first Table you refer to is Table 1, the next Table 2 and so forth.
    • Similarly, the first Figure is Figure 1, the next Figure 2, etc.
  • Each Table or Figure must include a brief description of the results being presented and other necessary information in a legend.
    • Table legends go above the Table; tables are read from top to bottom.
    • Figure legends go below the figure; figures are usually viewed from bottom to top.
  • When referring to a Figure in the text, “Figure” is abbreviated as “Fig.”, e.g., (Fig. 1). “Table” is never abbreviated, e.g., Table 1.

Example

For example, suppose you asked the question, “Is the average height of male students the same as that of female students in a pool of randomly selected Biology majors?” You would first collect height data from large random samples of male and female students. You would then calculate the descriptive statistics for those samples (mean, SD, n, range, etc.) and plot these numbers. Suppose you found that male Biology majors are, on average, 12.5 cm taller than female majors; this is the answer to the question. Notice that the outcome of a statistical analysis is not a key result, but rather an analytical tool that helps us understand what our key result is.

Differences, directionality, and magnitude

  • Report your results so as to provide as much information as possible to the reader about the nature of differences or relationships.

  • For example, if you are testing for differences among groups, and you find a significant difference, it is not sufficient to simply report that “groups A and B were significantly different”. How are they different? How much are they different?

  • It is much more informative to say something like, “Group A individuals were 23% larger than those in Group B”, or, “Group B pups gained weight at twice the rate of Group A pups.”

  • Report the direction of differences (greater, larger, smaller, etc) and the magnitude of differences (% difference, how many times, etc.) whenever possible.

Statistical results in text

  • Statistical test summaries (test name, p-value) are usually reported parenthetically in conjunction with the biological results they support. This parenthetical reference should include the statistical test used, the test statistic value, the degrees of freedom, and the level of significance.

  • For example, if you found that the mean height of male Biology majors was significantly larger than that of female Biology majors, you might report this result and your statistical conclusion as follows:

    • “Males (180.5 ± 5.1 cm; n=34) averaged 12.5 cm taller than females (168 ± 7.6 cm; n=34) in the pool of Biology majors (two-sample t-test, t = 5.78, 33 d.f., p < 0.001).”
  • If the summary statistics are shown in a figure, the sentence above need not report them specifically, but must include a reference to the figure where they may be seen:

    • “Males averaged 12.5 cm taller than females in the pool of Biology majors (two-sample t-test, t = 5.78, 33 d.f., p < 0.001; Fig. 1).”

Statistical results in text

  • Always enter the appropriate units when reporting data or summary statistics.
    • For an individual value you would write, “the mean length was 10 cm”, or, “the maximum time was 140 min.”
    • When including a measure of variability, place the unit after the error value, e.g., “…was 10 ± 2.3 m”.
    • Likewise place the unit after the last in a series of numbers all having the same unit. For example: “lengths of 5, 10, 15, and 20 m”, or “no differences were observed after 2, 4, 6, or 8 min. of incubation”.

Thursday

Goals for this week

  • Finish simple linear regression
  • Analysis of residuals
  • How to report your statistical results
  • Non-linear regression
  • Multiple linear regression
  • Git and GitHub
  • One factor ANOVA
  • Experimental Design (Tuesday)

Non-Linear Regression

Complex non-linear regression

one response and one predictor

  • power
  • exponential
  • polynomial (a brief fitting sketch follows this list)
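
A sketch of fitting two of these forms in R, using simulated data so that it runs on its own (nls() needs reasonable starting values):

# simulated example data for illustration only
set.seed(1)
x <- runif(50, 1, 10)
y <- 2 * x^1.5 + rnorm(50, sd = 2)

power_fit <- nls(y ~ a * x^b, start = list(a = 1, b = 1))   # power: y = a * x^b
summary(power_fit)

poly_fit <- lm(y ~ poly(x, 2))                              # quadratic polynomial
summary(poly_fit)

# an exponential, y = a * exp(b * x), can be fit the same way with nls()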

Multiple Linear Regression

Multiple Linear Regression - Goals

  • To develop a better predictive model than is possible from models based on single independent variables.

  • To investigate the relative individual effects of each of the multiple independent variables above and beyond the effects of the other variables.

  • The individual effects of each of the predictor variables on the response variable can be depicted by single partial regression lines.

  • The slope of any single partial regression line (partial regression slope) thereby represents the rate of change, or effect, of that specific predictor variable on the response variable, holding all the other predictor variables constant at their respective mean values.

Multiple Linear Regression

Additive and multiplicative models of 2 or more predictors

Additive model \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_jx_{ij} + \epsilon_i\]

Multiplicative model (with two predictors) \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i1}x_{i2} + \epsilon_i\]

Multiple Linear Regression

Additive and multiplicative models

R INTERLUDE

multiple regression with 2 predictor variables

  • Read in the RNAseq_lip.tsv and examine the continuous variables.
  • We are interested in whether Gene01 and/or Gene02 expression levels influence lipid content.
  • First plot \(Y\) vs \(Gene01\), \(Y\) vs \(Gene02\), then set up and test a multiplicative model.
  • What are the parameter estimates of interest in this case?
  • What are the outcomes from our hypothesis tests?

RNAseq_Data <- read.table('RNAseq_lip.tsv', header = TRUE, sep = '\t')

y <- RNAseq_Data$Lipid_Conc
g1 <- RNAseq_Data$Gene01
g2 <- RNAseq_Data$Gene02
g3 <- RNAseq_Data$Gene03
g4 <- RNAseq_Data$Gene04

Mult_lm <- lm(y ~ g1*g2)
summary(Mult_lm)

R INTERLUDE

multiple regression with 2 predictor variables

  • Now get rid of the interaction term, and set up a purely additive model
  • Did any of our estimates change? Why?
  • Did the degrees of freedom change? Why?

Add_lm <- lm(y ~ g1 + g2)
summary(Add_lm)

  • Finally, try different combinations of genes.

R INTERLUDE

Now see if a polynomial fits better

RNAseq_Data <- read.table('RNAseq_lip.tsv', header = TRUE, sep = '\t')

# linear fit of lipid concentration on Gene04 expression
RNAseq_lm_linear <- lm(y ~ g4)
summary(RNAseq_lm_linear)
influence.measures(RNAseq_lm_linear)

# second-degree (quadratic) polynomial fit
RNAseq_lm_poly <- lm(y ~ poly(g4, 2))
summary(RNAseq_lm_poly)

# overlay the polynomial fit on a scatterplot (points must be ordered by g4)
plot(y ~ g4, col = "blue")
ord <- order(g4)
lines(g4[ord], fitted(RNAseq_lm_poly)[ord], col = "red")
influence.measures(RNAseq_lm_poly)

  • Try this for different genes
  • Try this for different degree polynomials

Multiple linear regression assumptions

  • linearity
  • normality
  • homogeneity of variance
  • (absence of) multi-collinearity - a predictor variable must not be highly correlated with the combination of the other predictor variables.

checking for multi-collinearity

library(car)
scatterplotMatrix(~ var1 + var2 + var3, diag = "boxplot")

R INTERLUDE

multiple regression with three variables

RNAseq3_lm <- lm(y ~ g1+g2+g3)
summary(RNAseq3_lm)
plot(RNAseq3_lm)
library(car)
scatterplotMatrix(~ g1 + g2 + g3, diag = "boxplot")
scatterplotMatrix(~ y + g1 + g2 + g3, diag = "boxplot")

To get tolerance (\(1 - R^2\)) values, take the reciprocal of the variance inflation factors:

1/vif(lm(y ~ g1 + g2 + g3))

  • Is there any evidence for multi-collinearity?
  • Try running a simple linear regression for y and Gene03 (y ~ g3).
  • What is the slope estimate for Gene03, and how does it compare to the partial slope estimate above?

Model selection

the problems

  • How to decide the complexity of a polynomial: straight-line regression, quadratic, cubic, ….

  • Which variables to keep or discard when building a multiple regression model?

  • Selecting from candidate models representing different biological processes.

Model selection

a beetle example

Start with linear regression

Quadratic (2nd degree) polynomial?

Cubic (3rd degree) polynomial?

A polynomial of degree 5?

A polynomial of degree 10?

The problem with this approach

  • The log likelihood of the model increases with the number of parameters
  • So does \(r^2\)
  • Isn’t this good - the best fit to the data?

The problem with this approach

  • It violates the principle of parsimony:
  • Fit no more parameters than necessary.
  • If two or more models fit the data almost equally well, prefer the simpler model.

“models should be pared down until they are minimal and adequate”

Crawley 2007, p325

Let’s consider our objectives

  • The model should predict well.

  • It should approximate the true relationship between the variables.

  • We want to be able to evaluate a wider array of candidate models, not only “full” versus “reduced” models.

  • NOTE: reduced vs. full models are referred to as nested models; models that are not subsets of one another are called non-nested models.

  • Don’t confuse this with nested experimental designs or sampling designs.

How do we accomplish these goals?

To answer this, we need:

  • A criterion to compare models:
    • Mallow’s Cp
    • AIC (Akaike’s Information Criterion)
    • BIC (Bayesian Information Criterion)
  • A strategy for searching the candidate models

How do we accomplish these goals?

  • MSresiduals- represents the mean amount of variation unexplained by the model, and therefore the lowest value indicates the best fit.

  • Adjusted \(r^2\) - the proportion of variation in the response variable explained by the model, adjusted for both sample size and the number of model terms. Larger values indicate better fit.

  • Mallow’s Cp - is an index resulting from the comparison of the specific model to a model that contain all the possible terms. Models with the lowest value and/or values closest to their respective p (the number of model terms, including the y-intercept) indicate best fit.

  • Akaike Information Criteria (AIC) - there are several different versions of AIC, each of which adds a different constant designed to penalize according to the number of parameters and sample size to a likelihood function to produce a relative measure of the information content of a model. Smaller values indicate more parsimonious models. As a rule of thumb, if the difference between two AIC values (delta AIC) is greater than 2, the lower AIC is a significant improvement in parsimony.

  • Schwarz or Bayesian Information Criteria (BIC or SIC) - is outwardly similar to AIC. The constant added to the likelihood function penalizes models with more predictor terms more heavily (and thus selects simpler models) than AIC does. It is for this reason that BIC is favored by many researchers.
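
As a sketch, two of the fits from the earlier R interlude can be compared directly (assuming the RNAseq_lm_linear and RNAseq_lm_poly objects still exist):

# smaller AIC/BIC indicates the more parsimonious model;
# a difference (delta AIC) greater than 2 is the usual rule of thumb
AIC(RNAseq_lm_linear, RNAseq_lm_poly)
BIC(RNAseq_lm_linear, RNAseq_lm_poly)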

Mallow’s Cp

  • Frequently used in multiple regression
  • Typical use case: “all subsets regression”
  • Strategy: test all possible models and select the one with the smallest Cp
  • The leaps package in R provides a smart search among a potentially huge number of candidate models (see the sketch after this list)
  • Typically we are modeling observational data; experimental data often allow a more deliberate choice of variables
  • When investigating all possible subsets of variables, the only real modeling decision is the choice of variables to consider
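
A sketch of all-subsets selection with the leaps package, using the RNAseq predictors as an example (assumes leaps is installed):

library(leaps)

# evaluate every subset of the four gene predictors and extract Mallow's Cp
all_subs <- regsubsets(Lipid_Conc ~ Gene01 + Gene02 + Gene03 + Gene04,
                       data = RNAseq_Data)
summary(all_subs)$cp   # one Cp per model size; smaller values (near p) indicate better fit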

Mallow’s Cp

For prediction: all models with Cp < p predict about equally well; don’t get carried away hunting for a single “best” model.

For explanation: if numerous models fit the data about equally well, it is difficult to deduce which predictor “explains” the response.

General caveat: “regression is not causation”. An experiment is needed to get at a causal explanation.