Advanced Biological Statistics Week 4

10/16/2018 & 10/18/2018

Graphical representation

Goals for this week

Figures - the good, the bad and the ugly
Grammar of Graphics (GG)
Creating beautiful graphics using GGPlot2
Introduction to Linear Models
Simple linear regression
Analysis of residuals
Multiple linear regression
Git and GitHub (Thursday)

Graphical representation

general approaches

Distributions of data
- location
- spread
- shape
Associations between variables
- relationship among two or more variables
- differences among groups in their distributions

Graphical representation

general approaches

Distributions of data
- bar graph
- histogram
- box plot
Associations between variables
- pie chart
- grouped bar graph
- mosaic plot
- box plot
- scatter plot
- dot plot ‘stripchart’

Box Plot

Displays median, first and third quartile, range, and extreme observations
Can be combined with mean and standard error of the mean
Concise way to visualize many aspects of distribution

xxx

Scatter Plot

Displays association between two numerical variables
Non-zero baseline often ok
Goal is association not magnitude or frequency
Points fill the space available

xxx

Examples of the good, bad and the ugly of graphical representation

Examples of bad graphs and how to improve them.
Courtesy of K.W. Broman
www.biostat.wisc.edu/~kbroman/topten_worstgraphs/

Ticker tape parade

xxx

A line to no understanding

xxx

Distribution of TFBS

xxx

Carolyn’s favorite figure

xxx

A bake sale of pie charts

xxx

Wack a mole

xxx

Graphical representation best practices

Principles of effective display

“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”

— Edward Tufte

The best statistical graphic ever drawn

according to Edward Tufte

xxx

Principles of effective display

Show the data
Encourage the eye to compare differences
Represent magnitudes honestly and accurately
Draw graphical elements clearly, minimizing clutter
Make displays easy to interpret

“Above all else show the data”

Tufte 1983

xxx

“Maximize the data to ink ratio, within reason”

Tufte 1983

Draw graphical elements clearly, minimizing clutter

xxx

“A graphic does not distort if the visual representation of the data is consistent with the numerical representation” – Tufte 1983

Represent magnitudes honestly and accurately

xxx

How Fox News makes a figure ….

xxx

How Fox News makes a figure ….

xxx

xxx

“Graphical excellence begins with telling the truth about the data” – Tufte 1983

Make displays easy to interpret

“Graphical excellence consists of complex ideas communicated with clarity, precision and efficiency” – Tufte 1983

xxx

The Tidyverse

https://www.tidyverse.org

xxx

GGPlot2 and the Grammar of Graphics

GG stands for ‘Grammar of Graphics’
A good paragraph uses good grammar to convey information
A good figure uses good grammar in the same way
Seven general components can be used to create most figures

GGPlot2 and the Grammar of Graphics

xxx

The `geom_bar` function

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut))

Now try this…

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, colour=cut))

and this…

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, fill=cut))

and finally this…

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, fill=clarity), position="dodge")

The `geom_histogram` and `geom_freqpoly`function

With this function you can make a histogram

ggplot(data=diamonds) +
  geom_histogram(mapping=aes(x=carat), binwidth=0.5)

This allows you to make a frequency polygram

ggplot(data=diamonds) +
  geom_histogram(mapping=aes(x=carat), binwidth=0.5)

The `geom_boxplot` function

Boxplots are very useful for visualizing data

ggplot(data=diamonds, mapping=aes(x=cut, y=price)) +
  geom_boxplot()

ggplot(data=mpg, mapping=aes(x=reorder(class, hwy, FUN=median), y=hwy)) +
  coordflip()

ggplot(data=mpg, mapping=aes(x=class, y=hwy)) +
  geom_boxplot() +
  coordflip

The `geom_point` & `geom_smooth` functions

ggplot(data=diamonds2, mapping=aes(x=x, y=y)) +
  geompoint()

ggplot(data=mpg) +
  geompoint(mapping=aes(x=displ, y=hwy)) +
  facet_wrap(~class, nrow=2)

ggplot(data=mpg) +
  geompoint(mapping=aes(x=displ, y=hwy)) +
  facet_grid(drv~cyl)

ggplot(data=mpg) +
  geomsmooth(mapping=aes(x=displ, y=hwy))

Combining geoms

ggplot(data=mpg) +
  geompoint(mapping=aes(x=displ, y=hwy)) +
  geomsmooth(mapping=aes(x=displ, y=hwy))

Adding labels

ggplot(data=mpg, aes(displ, hwy)) +
  geompoint(aes(color=class)) +
  geomsmooth(se=FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size"
    subtitle = "Two seaters (sports cars) are an exception because of their light weight"
    caption = "Data from fueleconomy.gov"
  )

Themes

xxx

Linear Models and Regression

Parent offspring regression

Linear Models - a note on history

Bivariate normality

Covariance and correlation

Anscombe’s Quartet

Mean of x in each case 9 (exact)
Variance of x in each case 11 (exact)
Mean of y in each case 7.50 (to 2 decimal places)
Variance of y in each case 4.122 or 4.127 (to 3 decimal places)
Correlation between x and y in each case 0.816 (to 3 decimal places)
Linear regression line in each case \[ y = 3.00 + 0.50x\]

A linear model to relate two variables

Many approaches are linear models

Is flexible: Applicable to many different study designs
Provides a common set of tools (lm in R for fixed effects)
Includes tools to estimate parameters:
- (e.g. sizes of effects, like the slope, or change in means)
Is easier to work with, especially with multiple variables

Many approaches are linear models

Linear regression
Single factor ANOVA
Analysis of covariance
Multiple regression
Multi-factor ANOVA
Repeated-measures ANOVA

Plethora of linear models

General Linear Model (GLM) - two or more continuous variables
General Linear Mixed Model (GLMM) - a continuous response variable with a mix of continuous and categorical predictor variables
Generalized Linear Model - a GLMM that doesn’t assume normality of the response (we’ll get to this later)
Generalized Additive Model (GAM) - a model that doesn’t assume linearity (we won’t get to this later)

Linear models

All an be written in the form

response variable = intercept + (explanatory_variables) + random_error

in the general form:

\[ Y=\beta_0 +\beta_1*X_1 + \beta_2*X_2 +... + error\]

where \(\beta_0, \beta_1, \beta_2, ....\) are the parameters of the linear model

linear model parameters

linear models in R

All of these will include the intercept

Y~X
Y~1+X
Y~X+1

All of these will exclude the intercept

Y~-1+X
Y~X-1

Need to fit the model and then ‘read’ the output

trial_lm <- lm(Y~X)
summary (trial_lm)

R INTERLUDE

Write a script to read in the perchlorate data set. Now, add the code to perform a linear model of two continuous variables. Notice how the output of the linear model is specified to a new variable. Also note that the variables and dataset are placeholders

my_lm <- lm(dataset$variable1 ~ dataset$variable2)

Now look at a summary of the linear model

summary(my_lm)
print(my_lm)

Now let’s produce a nice regression plot

plot(dataset$variable1 ~ dataset$variable2, col = “blue”)
abline(my_lm, col = “red”)

Notice that you are adding the fitted line from your linear model Finally, remake this plot in GGPlot2

R INTERLUDE

zfish_diet_IA.tsv
- Protein - amount of protein in the diet
- Weight - weight of the fish
- SL - standard length of the fish
Perchlorate_Data.tsv
- Strain - of zebrafish
- Perchlorate Level - treatment of perchlorate
- T4 Hormone Level
- Follicle Area
- Number of Follicles
- Age_Category
- Testes Area
- Testes Stage

R INTERLUDE

Perchlorate_Data <- read.table(‘perchlorate_data.tsv’, header=T, sep=‘/t’)
head(Perchlorate_Data)

x <- Perchlorate_Data$T4_Hormone_Level
y <- Perchlorate_Data$Testes_Area

perc_lm <- lm(y ~ x)
summary (perc_lm)

plot(y ~ x, col = "blue")
abline(perc_lm, col = "red")

R INTERLUDE - Checking the output

Model fitting and hypothesis tests in regression

\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]

full model - \(y_i = \beta_0 + \beta_1*x_i + error_i\)

reduced model - \(y_i = \beta_0 + 0*x_i + error_i\)

fits a “reduced” model without slope term (H0)
fits the “full” model with slope term added back
compares fit of full and reduced models using an F test

Model fitting and hypothesis tests in regression

Hypothesis tests in linear regression

Estimation of the variation that is explained by the model (SS_model)

SS_model = SS_total(reduced model) - SS_residual(full model)

The variation that is unexplained by the model (SS_residual)

SS_residual(full model)

Hypothesis tests in linear regression

\(r^2\) as a measure of model fit

\[r^2 = SS_{regression}/SS_{total} = 1 - (SS_{residual}/SS_{total})\] or \[r^2 = 1 - (SS_{residual(full)}/SS_{total(reduced)})\] Which is the proportion of the variance in Y that is explained by X

Relationship of correlation and regression

\[\beta_{YX}=\rho_{YX}*\sigma_Y/\sigma_X\] \[b_{YX} = r_{YX}*S_Y/S_X\]

Residual Analysis

did we meet our assumptions?

Independent errors (residuals)
Equal variance of residuals in all groups
Normally-distributed residuals
Robustness to departures from these assumptions is improved when sample size is large and design is balanced

Residual Analysis

did we meet our assumptions?

\[y_i = \beta_0 + \beta_1 * x_I + \epsilon_i\]

Residual Analysis

Handling violations of the assumptions of linear models

What if your residuals aren’t normal because of outliers?
Nonparametric methods exist, but these don’t provide parameter estimates with CIs.
Robust regression (rlm)
Randomization tests

Anscombe’s quartet again

what would residual plots look like for these?

Anscombe’s quartet again

what would residual plots look like for these?

Residual Plots

Spotting assumption violations

Residuals

leverage and influence

1 is an outlier for both Y and X
2 is not an outlier for either Y or X but has a high residual
3 is an outlier in just X - and thus a high residual - and therefore has high influence as measured by Cook’s D

Residuals

leverage and influence

Leverage - a measure of how much of an outlier each point is in x-space (on x-axis) and thus only applies to the predictor variable. (Values > 2*(2/n) for simple regression are cause for concern)
Residuals - As the residuals are the differences between the observed and predicted values along a vertical plane, they provide a measure of how much of an outlier each point is in y-space (on y-axis). The patterns of residuals against predicted y values (residual plot) are also useful diagnostic tools for investigating linearity and homogeneity of variance assumptions
Cook’s D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values ≥ 1 (or even approaching 1) correspond to highly influential observations.

R INTERLUDE

residual analyses

Let’s look at the zebrafish diet data.

x <- zfish_diet$Protein
y <- zfish_diet$Weight

zfish_lm <- lm(y ~ x)
summary (zfish_lm)

plot(y ~ x, col = "blue")
abline(zfish_lm, col = "red")

hist(residuals(zfish_lm), breaks=30)

plot (residuals(zfish_lm) ~ fitted.values(zfish_lm))
plot (residuals(zfish_lm) ~ x)

R INTERLUDE

residual analyses

Or apply the plot() function to the linear model object directly

plot(zfish_lm)

Figure out what these plots are telling you

R INTERLUDE

How about using the influence.measures function???

influence.measures(zfish_lm)

Do we have any high leverage observations we need to worry about??? Now go back and try to recreate these graphs in GGPlot2

Graphical representation

Goals for this week

Graphical representation

general approaches

Graphical representation

general approaches

Box Plot

Scatter Plot

Examples of the good, bad and the ugly of graphical representation

Ticker tape parade

A line to no understanding

Distribution of TFBS

Carolyn’s favorite figure

A bake sale of pie charts

Wack a mole

Graphical representation best practices

Principles of effective display

The best statistical graphic ever drawn

according to Edward Tufte

Principles of effective display

“Above all else show the data”

Tufte 1983

“Maximize the data to ink ratio, within reason”

Tufte 1983

“A graphic does not distort if the visual representation of the data is consistent with the numerical representation” – Tufte 1983

How Fox News makes a figure ….

How Fox News makes a figure ….

Make displays easy to interpret

The Tidyverse

https://www.tidyverse.org

GGPlot2 and the Grammar of Graphics

GGPlot2 and the Grammar of Graphics

The geom_bar function

The geom_histogram and geom_freqpolyfunction

The geom_boxplot function

The geom_point & geom_smooth functions

Combining geoms

Adding labels

Themes

Linear Models and Regression

Parent offspring regression

Linear Models - a note on history

Linear Models - a note on history

Bivariate normality

Covariance and correlation

Anscombe’s Quartet

Anscombe’s Quartet

A linear model to relate two variables

Many approaches are linear models

Many approaches are linear models

Plethora of linear models

Linear models

linear model parameters

linear model parameters

linear models in R

R INTERLUDE

R INTERLUDE

R INTERLUDE

R INTERLUDE - Checking the output

Model fitting and hypothesis tests in regression

Model fitting and hypothesis tests in regression

Hypothesis tests in linear regression

Hypothesis tests in linear regression

Hypothesis tests in linear regression

\(r^2\) as a measure of model fit

Relationship of correlation and regression

Residual Analysis

did we meet our assumptions?

Residual Analysis

did we meet our assumptions?

Residual Analysis

Residual Analysis

Handling violations of the assumptions of linear models

Anscombe’s quartet again

what would residual plots look like for these?

Anscombe’s quartet again

what would residual plots look like for these?

Residual Plots

Spotting assumption violations

Residuals

The `geom_bar` function

The `geom_histogram` and `geom_freqpoly`function

The `geom_boxplot` function

The `geom_point` & `geom_smooth` functions