10/23/2018 and 10/25/2018
\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]
full model - \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\)
reduced model - \(y_i = \beta_0 + 0 \cdot x_i + \epsilon_i\)
Estimation of the variation that is explained by the model (SS_model)
\[SS_{model} = SS_{total(reduced)} - SS_{residual(full)}\]
The variation that is unexplained by the model (SS_residual)
\[r^2 = \frac{SS_{regression}}{SS_{total}} = 1 - \frac{SS_{residual}}{SS_{total}} = 1 - \frac{SS_{residual(full)}}{SS_{total(reduced)}}\]
This is the proportion of the variance in Y that is explained by X.
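These quantities are easy to pull out of a fitted model in R. A minimal sketch (x and y are placeholders for your predictor and response vectors):

```r
fit <- lm(y ~ x)
ss_resid <- sum(residuals(fit)^2)    # SS_residual (full model)
ss_total <- sum((y - mean(y))^2)     # SS_total (reduced model)
ss_model <- ss_total - ss_resid      # SS_model
r2 <- 1 - ss_resid / ss_total        # should match summary(fit)$r.squared
```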
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
What if your residuals aren’t normal because of outliers?
Nonparametric methods exist, but these don’t provide parameter estimates with CIs.
Robust regression (rlm)
Randomization tests
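A minimal sketch of both alternatives, using MASS::rlm() for robust regression and a simple permutation test of the slope (x and y are placeholders for your data):

```r
library(MASS)
fit_rob <- rlm(y ~ x)    # robust regression; downweights outlying residuals
summary(fit_rob)

# Randomization test: permute y to break any x-y association,
# then compare the observed slope to the permutation distribution.
obs_slope <- coef(lm(y ~ x))[2]
perm_slopes <- replicate(999, coef(lm(sample(y) ~ x))[2])
p_val <- mean(abs(perm_slopes) >= abs(obs_slope))
```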
Residuals - Because residuals are the vertical differences between observed and predicted values, they measure how much of an outlier each point is in y-space (along the y-axis). Plots of residuals against predicted y values (residual plots) are also useful diagnostic tools for checking the linearity and homogeneity-of-variance assumptions.
Leverage - a measure of how much of an outlier each point is in x-space (along the x-axis), and thus applies only to the predictor variable. For simple regression, values > 2*(2/n) are cause for concern.
Cook’s D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values ≥ 1 (or even approaching 1) correspond to highly influential observations.
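These diagnostics can also be extracted individually; a sketch using the thresholds given above (x and y are placeholders):

```r
fit <- lm(y ~ x)
h <- hatvalues(fit)          # leverage for each observation
d <- cooks.distance(fit)     # Cook's D for each observation
n <- length(y)
which(h > 2 * (2 / n))       # high-leverage points (simple regression)
which(d >= 1)                # highly influential points
```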
Let’s look at the zebrafish diet data.
x <- zfish_diet$Protein
y <- zfish_diet$Weight
zfish_lm <- lm(y ~ x)
summary(zfish_lm)
plot(y ~ x, col = "blue")
abline(zfish_lm, col = "red")
hist(residuals(zfish_lm), breaks = 30)
plot(residuals(zfish_lm) ~ fitted.values(zfish_lm))
plot(residuals(zfish_lm) ~ x)
Or apply the plot() function to the linear model object directly
plot(zfish_lm)
Figure out what these plots are telling you
How about using the influence.measures() function?
influence.measures(zfish_lm)
Do we have any high-leverage observations we need to worry about?
Now go back and try to recreate these graphs in ggplot2.
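One possible ggplot2 version of the scatterplot, residual histogram, and residual plot (a sketch; assumes zfish_diet and zfish_lm from above):

```r
library(ggplot2)
diag_df <- data.frame(x = zfish_diet$Protein,
                      y = zfish_diet$Weight,
                      resid = residuals(zfish_lm),
                      fitted = fitted.values(zfish_lm))

ggplot(diag_df, aes(x, y)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE)

ggplot(diag_df, aes(resid)) + geom_histogram(bins = 30)

ggplot(diag_df, aes(fitted, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2)
```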
x <- zfish$SL
y <- zfish$Weight
size_lm <- lm(y ~ x)
summary(size_lm)
plot(y ~ x, col = "blue")
abline(size_lm, col = "red")
Note that when we run the lm() function we are performing Model I regression. Is this appropriate?
install.packages("lmodel2")
library(lmodel2)
size_lm_ModII <- lmodel2(y ~ x)
size_lm_ModII
abline(a = 0.1912797, b = 0.1097880, lty = 2)
You should have generated OLS, MA, and SMA regression models; the last line plots the SMA line from its parameter estimates.
Say you measured green fluorescent protein abundance in cell culture across several key, carefully controlled temperatures.
You ultimately want to know whether protein expression changes as a function of temperature in your experiment.
Read in the gfp_temp.tsv data and perform a regression analysis.
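A sketch of one way to start (the column names Temp and GFP below are assumptions; check names(gfp_temp) and adjust):

```r
gfp_temp <- read.table("gfp_temp.tsv", header = TRUE, sep = "\t")
# Column names are guesses; adjust to match names(gfp_temp)
gfp_lm <- lm(GFP ~ Temp, data = gfp_temp)
summary(gfp_lm)
plot(gfp_lm)    # diagnostic plots, as above
```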
The function of the Results section is to objectively present your key results, without interpretation, in an orderly and logical sequence using both text and illustrative materials (Tables and Figures).
The results section always begins with text, reporting the key results and referring to figures and tables as you proceed.
The text of the Results section should be crafted to follow this sequence and highlight the evidence needed to answer the questions/hypotheses you investigated.
Important negative results should be reported, too. Authors usually write the text of the results section based upon the sequence of Tables and Figures.
Statistical results may appear either in the text (usually parenthetically) or in the relevant Tables or Figures (in the legend or as footnotes to the Table or Figure). Each Table and Figure must be referenced in the text portion of the Results, and you must tell the reader what key result(s) each Table or Figure conveys.
For example, suppose you asked the question, “Is the average height of male students the same as female students in a pool of randomly selected Biology majors?” You would first collect height data from large random samples of male and female students. You would then calculate the descriptive statistics for those samples (mean, SD, n, range, etc.) and plot these numbers. Suppose you found that male Biology majors are, on average, 12.5 cm taller than female majors; this is the answer to the question. Notice that the outcome of a statistical analysis is not a key result, but rather an analytical tool that helps us understand what our key result is.
Report your results so as to provide as much information as possible to the reader about the nature of differences or relationships.
For example, if you are testing for differences among groups, and you find a significant difference, it is not sufficient to simply report that “groups A and B were significantly different”. How are they different? How much are they different?
It is much more informative to say something like, “Group A individuals were 23% larger than those in Group B”, or, “Group B pups gained weight at twice the rate of Group A pups.”
Report the direction of differences (greater, larger, smaller, etc) and the magnitude of differences (% difference, how many times, etc.) whenever possible.
Statistical test summaries (test name, p-value) are usually reported parenthetically in conjunction with the biological results they support. This parenthetical reference should include the statistical test used, the test statistic value, the degrees of freedom, and the level of significance.
For example, if you found that the mean height of male Biology majors was significantly larger than that of female Biology majors, you might report this result along with your statistical conclusion as follows:
If the summary statistics are shown in a figure, the sentence above need not report them specifically, but must include a reference to the figure where they may be seen:
To develop a better predictive model than is possible from models based on single independent variables.
To investigate the relative individual effects of each of the multiple independent variables above and beyond the effects of the other variables.
The individual effects of each of the predictor variables on the response variable can be depicted by single partial regression lines.
The slope of any single partial regression line (partial regression slope) thereby represents the rate of change or effect of that specific predictor variable (holding all the other predictor variables constant to their respective mean values) on the response variable.
Additive model \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_jx_{ij} + \epsilon_i\]
Multiplicative model (with two predictors) \[y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i1}x_{i2} + \epsilon_i\]
RNAseq_Data <- read.table("RNAseq_lip.tsv", header = TRUE, sep = "\t")
y <- RNAseq_Data$Lipid_Conc
g1 <- RNAseq_Data$Gene01
g2 <- RNAseq_Data$Gene02
g3 <- RNAseq_Data$Gene03
g4 <- RNAseq_Data$Gene04
Mult_lm <- lm(y ~ g1 * g2)
summary(Mult_lm)
Add_lm <- lm(y ~ g1 + g2)
summary(Add_lm)
RNAseq_Data <- read.table("RNAseq_lip.tsv", header = TRUE, sep = "\t")
RNAseq_lm_linear <- lm(y ~ g4)
summary(RNAseq_lm_linear)
influence.measures(RNAseq_lm_linear)
RNAseq_lm_poly <- lm(y ~ poly(g4, 2))
summary(RNAseq_lm_poly)
plot(y ~ g4)
o <- order(g4)    # sort by g4 so the fitted curve draws cleanly
lines(g4[o], fitted(RNAseq_lm_poly)[o], col = "blue")
influence.measures(RNAseq_lm_poly)
library(car)
scatterplotMatrix(~ var1 + var2 + var3, diag = "boxplot")
RNAseq3_lm <- lm(y ~ g1 + g2 + g3)
summary(RNAseq3_lm)
plot(RNAseq3_lm)
library(car)
scatterplotMatrix(~ g1 + g2 + g3, diag = "boxplot")
scatterplotMatrix(~ y + g1 + g2 + g3, diag = "boxplot")
To get tolerance (\(1 - R^2\)) calculations for each predictor:
1/vif(lm(y ~ g1 + g2 + g3))
How to decide the complexity of polynomial: straight line regression, quadratic, cubic, ….
Which variables to keep/ discard when building a multiple regression model?
Selecting from candidate models representing different biological processes.
“models should be pared down until they are minimal and adequate”
Crawley 2007, p325
The model should predict well.
It should approximate the true relationship between the variables.
We should be able to evaluate a wider array of models, not only progressively more “reduced” ones.
NOTE: Reduced vs. full models are referred to as nested models. Models that are not subsets of one another are called non-nested models.
Don’t confuse with nested experimental designs or sampling designs.
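Nested models can be compared directly with an F-test via anova(); a sketch using the RNAseq predictors from above:

```r
full    <- lm(y ~ g1 + g2 + g3)    # full model
reduced <- lm(y ~ g1 + g2)         # nested (reduced) model: drops g3
anova(reduced, full)               # F-test of whether g3 improves fit
```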
How do we accomplish these goals? To answer this, we need to decide:
How to decide the complexity of polynomial: straight line regression, quadratic, cubic, ….
Which variables to keep/ discard when building a multiple regression model?
Selecting from candidate models representing different biological processes.
MS_residual - represents the mean amount of variation unexplained by the model; the lowest value indicates the best fit.
Adjusted \(r^2\) - the proportion of variation in the response variable explained by the model, adjusted for both sample size and the number of model terms. Larger values indicate better fit.
Mallows’ Cp - an index resulting from comparing the specific model to a model that contains all possible terms. Models with the lowest value and/or values closest to their respective p (the number of model terms, including the y-intercept) indicate the best fit.
Akaike Information Criteria (AIC) - there are several different versions of AIC, each of which adds a different constant designed to penalize according to the number of parameters and sample size to a likelihood function to produce a relative measure of the information content of a model. Smaller values indicate more parsimonious models. As a rule of thumb, if the difference between two AIC values (delta AIC) is greater than 2, the lower AIC is a significant improvement in parsimony.
Schwarz or Bayesian Information Criteria (BIC or SIC) - outwardly similar to AIC. The constant added to the likelihood function penalizes models with more predictor terms more heavily (and thus selects simpler models) than AIC does. For this reason, BIC is favored by many researchers.
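AIC and BIC are available directly in base R; a sketch comparing the additive and multiplicative models fitted above:

```r
AIC(Add_lm, Mult_lm)        # smaller indicates a more parsimonious model
BIC(Add_lm, Mult_lm)        # BIC penalizes extra terms more heavily
AIC(Add_lm) - AIC(Mult_lm)  # delta AIC; a difference > 2 is meaningful
```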
The leaps package in R performs a smart search among a potentially huge number of candidate models.
For prediction: all models with Cp < p predict about equally well. Don’t get carried away chasing a single “best” model.
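A sketch of an all-subsets search with leaps::regsubsets(), using the RNAseq data from above:

```r
library(leaps)
subsets <- regsubsets(Lipid_Conc ~ Gene01 + Gene02 + Gene03 + Gene04,
                      data = RNAseq_Data, nbest = 2)
summary(subsets)$cp          # Mallows' Cp for each candidate model
plot(subsets, scale = "Cp")  # visual comparison of subsets by Cp
```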
For explanation: If numerous equally well fitting models fit the data, it is difficult to deduce which predictor “explains” the response.
General caveat: “regression is not causation.” An experiment is needed to establish a causal explanation.