10/16/2018 & 10/18/2018

Graphical representation

Goals for this week

  • Figures - the good, the bad and the ugly
  • Grammar of Graphics (GG)
  • Creating beautiful graphics using GGPlot2
  • Introduction to Linear Models
  • Simple linear regression
  • Analysis of residuals
  • Multiple linear regression
  • Git and GitHub (Thursday)

Graphical representation

general approaches

  1. Distributions of data
    • location
    • spread
    • shape
  2. Associations between variables
    • relationship among two or more variables
    • differences among groups in their distributions

Graphical representation

general approaches

  1. Distributions of data
    • bar graph
    • histogram
    • box plot
  2. Associations between variables
    • pie chart
    • grouped bar graph
    • mosaic plot
    • box plot
    • scatter plot
    • dot plot ‘stripchart’

Box Plot

  • Displays median, first and third quartile, range, and extreme observations
  • Can be combined with mean and standard error of the mean
  • Concise way to visualize many aspects of distribution
xxx

xxx

Scatter Plot

  • Displays association between two numerical variables
  • Non-zero baseline often ok
  • Goal is association not magnitude or frequency
  • Points fill the space available
xxx

xxx

Examples of the good, bad and the ugly of graphical representation

  • Examples of bad graphs and how to improve them.
  • Courtesy of K.W. Broman
  • www.biostat.wisc.edu/~kbroman/topten_worstgraphs/

Ticker tape parade

xxx

xxx

A line to no understanding

xxx

xxx

Distribution of TFBS

xxx

xxx

Carolyn’s favorite figure

xxx

xxx

A bake sale of pie charts

xxx

xxx

Wack a mole

xxx

xxx

Graphical representation best practices

Principles of effective display

“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”

— Edward Tufte

The best statistical graphic ever drawn

according to Edward Tufte

xxx

xxx

Principles of effective display

  • Show the data
  • Encourage the eye to compare differences
  • Represent magnitudes honestly and accurately
  • Draw graphical elements clearly, minimizing clutter
  • Make displays easy to interpret

“Above all else show the data”

Tufte 1983

xxx

xxx

“Maximize the data to ink ratio, within reason”

Tufte 1983

Draw graphical elements clearly, minimizing clutter

xxx

xxx

“A graphic does not distort if the visual representation of the data is consistent with the numerical representation” – Tufte 1983

Represent magnitudes honestly and accurately

xxx

xxx

How Fox News makes a figure ….

xxx

xxx

How Fox News makes a figure ….

xxx

xxx

xxx

xxx

“Graphical excellence begins with telling the truth about the data” – Tufte 1983

Make displays easy to interpret

“Graphical excellence consists of complex ideas communicated with clarity, precision and efficiency” – Tufte 1983

xxx

xxx

The Tidyverse

https://www.tidyverse.org

xxx

xxx

GGPlot2 and the Grammar of Graphics

  • GG stands for ‘Grammar of Graphics’
  • A good paragraph uses good grammar to convey information
  • A good figure uses good grammar in the same way
  • Seven general components can be used to create most figures

GGPlot2 and the Grammar of Graphics

xxx

xxx

The geom_bar function

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut))

Now try this…

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, colour=cut))

and this…

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, fill=cut))

and finally this…

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut, fill=clarity), position="dodge")

The geom_histogram and geom_freqpolyfunction

With this function you can make a histogram

ggplot(data=diamonds) +
  geom_histogram(mapping=aes(x=carat), binwidth=0.5)

This allows you to make a frequency polygram

ggplot(data=diamonds) +
  geom_histogram(mapping=aes(x=carat), binwidth=0.5)

The geom_boxplot function

Boxplots are very useful for visualizing data

ggplot(data=diamonds, mapping=aes(x=cut, y=price)) +
  geom_boxplot()
ggplot(data=mpg, mapping=aes(x=reorder(class, hwy, FUN=median), y=hwy)) +
  coordflip()
ggplot(data=mpg, mapping=aes(x=class, y=hwy)) +
  geom_boxplot() +
  coordflip

The geom_point & geom_smooth functions

ggplot(data=diamonds2, mapping=aes(x=x, y=y)) +
  geompoint()
ggplot(data=mpg) +
  geompoint(mapping=aes(x=displ, y=hwy)) +
  facet_wrap(~class, nrow=2)
ggplot(data=mpg) +
  geompoint(mapping=aes(x=displ, y=hwy)) +
  facet_grid(drv~cyl)
ggplot(data=mpg) +
  geomsmooth(mapping=aes(x=displ, y=hwy))

Combining geoms

ggplot(data=mpg) +
  geompoint(mapping=aes(x=displ, y=hwy)) +
  geomsmooth(mapping=aes(x=displ, y=hwy))

Adding labels

ggplot(data=mpg, aes(displ, hwy)) +
  geompoint(aes(color=class)) +
  geomsmooth(se=FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size"
    subtitle = "Two seaters (sports cars) are an exception because of their light weight"
    caption = "Data from fueleconomy.gov"
  )

Themes

xxx

xxx

Linear Models and Regression

Parent offspring regression

Linear Models - a note on history

Linear Models - a note on history

Bivariate normality

Covariance and correlation

Anscombe’s Quartet

Anscombe’s Quartet

  • Mean of x in each case 9 (exact)

  • Variance of x in each case 11 (exact)

  • Mean of y in each case 7.50 (to 2 decimal places)

  • Variance of y in each case 4.122 or 4.127 (to 3 decimal places)

  • Correlation between x and y in each case 0.816 (to 3 decimal places)

  • Linear regression line in each case \[ y = 3.00 + 0.50x\]

A linear model to relate two variables

Many approaches are linear models

  • Is flexible: Applicable to many different study designs
  • Provides a common set of tools (lm in R for fixed effects)
  • Includes tools to estimate parameters:
    • (e.g. sizes of effects, like the slope, or change in means)
  • Is easier to work with, especially with multiple variables

Many approaches are linear models

  • Linear regression
  • Single factor ANOVA
  • Analysis of covariance
  • Multiple regression
  • Multi-factor ANOVA
  • Repeated-measures ANOVA

Plethora of linear models

  • General Linear Model (GLM) - two or more continuous variables

  • General Linear Mixed Model (GLMM) - a continuous response variable with a mix of continuous and categorical predictor variables

  • Generalized Linear Model - a GLMM that doesn’t assume normality of the response (we’ll get to this later)

  • Generalized Additive Model (GAM) - a model that doesn’t assume linearity (we won’t get to this later)

Linear models

All an be written in the form

response variable = intercept + (explanatory_variables) + random_error

in the general form:

\[ Y=\beta_0 +\beta_1*X_1 + \beta_2*X_2 +... + error\]

where \(\beta_0, \beta_1, \beta_2, ....\) are the parameters of the linear model

linear model parameters

linear model parameters

linear models in R

All of these will include the intercept

Y~X
Y~1+X
Y~X+1

All of these will exclude the intercept

Y~-1+X
Y~X-1

Need to fit the model and then ‘read’ the output

trial_lm <- lm(Y~X)
summary (trial_lm)

R INTERLUDE

Write a script to read in the perchlorate data set. Now, add the code to perform a linear model of two continuous variables. Notice how the output of the linear model is specified to a new variable. Also note that the variables and dataset are placeholders

my_lm <- lm(dataset$variable1 ~ dataset$variable2)

Now look at a summary of the linear model

summary(my_lm)
print(my_lm)

Now let’s produce a nice regression plot

plot(dataset$variable1 ~ dataset$variable2, col = “blue”)
abline(my_lm, col = “red”)

Notice that you are adding the fitted line from your linear model Finally, remake this plot in GGPlot2

R INTERLUDE

  • zfish_diet_IA.tsv
    • Protein - amount of protein in the diet
    • Weight - weight of the fish
    • SL - standard length of the fish
  • Perchlorate_Data.tsv
    • Strain - of zebrafish
    • Perchlorate Level - treatment of perchlorate
    • T4 Hormone Level
    • Follicle Area
    • Number of Follicles
    • Age_Category
    • Testes Area
    • Testes Stage

R INTERLUDE

Perchlorate_Data <- read.table(‘perchlorate_data.tsv’, header=T, sep=‘/t’)
head(Perchlorate_Data)

x <- Perchlorate_Data$T4_Hormone_Level
y <- Perchlorate_Data$Testes_Area

perc_lm <- lm(y ~ x)
summary (perc_lm)

plot(y ~ x, col = "blue")
abline(perc_lm, col = "red")

R INTERLUDE - Checking the output

Model fitting and hypothesis tests in regression

\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]

full model - \(y_i = \beta_0 + \beta_1*x_i + error_i\)

reduced model - \(y_i = \beta_0 + 0*x_i + error_i\)

  1. fits a “reduced” model without slope term (H0)
  2. fits the “full” model with slope term added back
  3. compares fit of full and reduced models using an F test

Model fitting and hypothesis tests in regression

Hypothesis tests in linear regression

Estimation of the variation that is explained by the model (SS_model)

SS_model = SS_total(reduced model) - SS_residual(full model)

The variation that is unexplained by the model (SS_residual)

SS_residual(full model)

Hypothesis tests in linear regression

Hypothesis tests in linear regression

\(r^2\) as a measure of model fit

\[r^2 = SS_{regression}/SS_{total} = 1 - (SS_{residual}/SS_{total})\] or \[r^2 = 1 - (SS_{residual(full)}/SS_{total(reduced)})\] Which is the proportion of the variance in Y that is explained by X

Relationship of correlation and regression

\[\beta_{YX}=\rho_{YX}*\sigma_Y/\sigma_X\] \[b_{YX} = r_{YX}*S_Y/S_X\]

Residual Analysis

did we meet our assumptions?

  • Independent errors (residuals)
  • Equal variance of residuals in all groups
  • Normally-distributed residuals
  • Robustness to departures from these assumptions is improved when sample size is large and design is balanced

Residual Analysis

did we meet our assumptions?

\[y_i = \beta_0 + \beta_1 * x_I + \epsilon_i\]

Residual Analysis

Residual Analysis

Handling violations of the assumptions of linear models

  • What if your residuals aren’t normal because of outliers?

  • Nonparametric methods exist, but these don’t provide parameter estimates with CIs.

  • Robust regression (rlm)

  • Randomization tests

Anscombe’s quartet again

what would residual plots look like for these?

Anscombe’s quartet again

what would residual plots look like for these?

Residual Plots

Spotting assumption violations

Residuals

leverage and influence

  • 1 is an outlier for both Y and X
  • 2 is not an outlier for either Y or X but has a high residual
  • 3 is an outlier in just X - and thus a high residual - and therefore has high influence as measured by Cook’s D

Residuals

leverage and influence

  • Leverage - a measure of how much of an outlier each point is in x-space (on x-axis) and thus only applies to the predictor variable. (Values > 2*(2/n) for simple regression are cause for concern)

  • Residuals - As the residuals are the differences between the observed and predicted values along a vertical plane, they provide a measure of how much of an outlier each point is in y-space (on y-axis). The patterns of residuals against predicted y values (residual plot) are also useful diagnostic tools for investigating linearity and homogeneity of variance assumptions

  • Cook’s D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values ≥ 1 (or even approaching 1) correspond to highly influential observations.

R INTERLUDE

residual analyses

Let’s look at the zebrafish diet data.

x <- zfish_diet$Protein
y <- zfish_diet$Weight

zfish_lm <- lm(y ~ x)
summary (zfish_lm)

plot(y ~ x, col = "blue")
abline(zfish_lm, col = "red")

hist(residuals(zfish_lm), breaks=30)

plot (residuals(zfish_lm) ~ fitted.values(zfish_lm))
plot (residuals(zfish_lm) ~ x)

R INTERLUDE

residual analyses

Or apply the plot() function to the linear model object directly

plot(zfish_lm)

Figure out what these plots are telling you

R INTERLUDE

How about using the influence.measures function???

influence.measures(zfish_lm)

Do we have any high leverage observations we need to worry about??? Now go back and try to recreate these graphs in GGPlot2

R INTERLUDE