10/16/2018 & 10/18/2018
“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”
— Edward Tufte
Draw graphical elements clearly, minimizing clutter
Represent magnitudes honestly and accurately
“Graphical excellence begins with telling the truth about the data” – Tufte 1983
“Graphical excellence consists of complex ideas communicated with clarity, precision and efficiency” – Tufte 1983
geom_bar function
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut))
Now try this…
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, colour=cut))
and this…
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, fill=cut))
and finally this…
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, fill=clarity), position="dodge")
geom_histogram and geom_freqpoly functions
With geom_histogram you can make a histogram
ggplot(data=diamonds) + geom_histogram(mapping=aes(x=carat), binwidth=0.5)
geom_freqpoly allows you to make a frequency polygon
ggplot(data=diamonds) + geom_freqpoly(mapping=aes(x=carat), binwidth=0.5)
geom_boxplot function
Boxplots are very useful for visualizing data
ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_boxplot()
ggplot(data=mpg, mapping=aes(x=reorder(class, hwy, FUN=median), y=hwy)) + geom_boxplot() + coord_flip()
ggplot(data=mpg, mapping=aes(x=class, y=hwy)) + geom_boxplot() + coord_flip()
geom_point & geom_smooth functions
ggplot(data=diamonds2, mapping=aes(x=x, y=y)) + geom_point()
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + facet_wrap(~class, nrow=2)
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + facet_grid(drv~cyl)
ggplot(data=mpg) + geom_smooth(mapping=aes(x=displ, y=hwy))
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + geom_smooth(mapping=aes(x=displ, y=hwy))
ggplot(data=mpg, aes(displ, hwy)) + geom_point(aes(color=class)) + geom_smooth(se=FALSE) + labs(title = "Fuel efficiency generally decreases with engine size", subtitle = "Two seaters (sports cars) are an exception because of their light weight", caption = "Data from fueleconomy.gov")
Mean of x in each case: 9 (exact)
Variance of x in each case: 11 (exact)
Mean of y in each case: 7.50 (to 2 decimal places)
Variance of y in each case: 4.122 or 4.127 (to 3 decimal places)
Correlation between x and y in each case: 0.816 (to 3 decimal places)
Linear regression line in each case: \[ y = 3.00 + 0.50x\]
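These summary statistics can be reproduced in R. As a minimal sketch, assuming the data being described are R's built-in `anscombe` data frame (four x/y pairs: x1..x4, y1..y4), which has exactly these properties:

```r
# Check the four datasets' near-identical summary statistics
# using R's built-in anscombe data frame (an assumption here).
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x),                       # 9 (exact)
    var_x  = var(x),                        # 11 (exact)
    mean_y = round(mean(y), 2),             # 7.50
    cor_xy = round(cor(x, y), 3),           # 0.816
    slope  = round(coef(lm(y ~ x))[[2]], 2) # 0.50
  )
})
```

Despite the matching statistics, plotting the four pairs shows very different patterns, which is why plotting before modeling matters.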
General Linear Model (GLM) - two or more continuous variables
General Linear Mixed Model (GLMM) - a continuous response variable with a mix of continuous and categorical predictor variables
Generalized Linear Model - a GLM that doesn’t assume normality of the response (we’ll get to this later)
Generalized Additive Model (GAM) - a model that doesn’t assume linearity (we won’t get to this)
All can be written in the form
response variable = intercept + (explanatory_variables) + random_error
in the general form:
\[ Y=\beta_0 +\beta_1*X_1 + \beta_2*X_2 +... + error\]
where \(\beta_0, \beta_1, \beta_2, ....\) are the parameters of the linear model
All of these will include the intercept
Y~X, Y~1+X, Y~X+1
All of these will exclude the intercept
Y~-1+X, Y~X-1
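A quick sketch with simulated data (the variable names are illustrative) showing that the intercept-including formulas fit the same model, while the `-1` forms drop the intercept term:

```r
# Simulated data for illustration only.
set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20)

fit_default  <- lm(y ~ x)      # intercept included by default
fit_explicit <- lm(y ~ 1 + x)  # intercept written explicitly; same fit
fit_no_int   <- lm(y ~ x - 1)  # intercept excluded

coef(fit_default)   # "(Intercept)" and "x"
coef(fit_no_int)    # "x" only
```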
Need to fit the model and then ‘read’ the output
trial_lm <- lm(Y~X)
summary(trial_lm)
Write a script to read in the perchlorate data set. Now add the code to perform a linear model of two continuous variables. Notice how the output of the linear model is assigned to a new variable. Also note that the variable and dataset names below are placeholders.
my_lm <- lm(dataset$variable1 ~ dataset$variable2)
Now look at a summary of the linear model
summary(my_lm)
print(my_lm)
Now let’s produce a nice regression plot
plot(dataset$variable1 ~ dataset$variable2, col = "blue")
abline(my_lm, col = "red")
Notice that you are adding the fitted line from your linear model. Finally, remake this plot in ggplot2.
Perchlorate_Data <- read.table('perchlorate_data.tsv', header=T, sep='\t')
head(Perchlorate_Data)
x <- Perchlorate_Data$T4_Hormone_Level
y <- Perchlorate_Data$Testes_Area
perc_lm <- lm(y ~ x)
summary(perc_lm)
plot(y ~ x, col = "blue")
abline(perc_lm, col = "red")
\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]
full model - \(y_i = \beta_0 + \beta_1*x_i + error_i\)
reduced model - \(y_i = \beta_0 + 0*x_i + error_i\)
Estimation of the variation that is explained by the model (SS_model)
SS_model = SS_total(reduced model) - SS_residual(full model)
The variation that is unexplained by the model (SS_residual)
SS_residual(full model)
\[r^2 = SS_{regression}/SS_{total} = 1 - (SS_{residual}/SS_{total})\] or \[r^2 = 1 - (SS_{residual(full)}/SS_{total(reduced)})\] Which is the proportion of the variance in Y that is explained by X
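The sums-of-squares decomposition above can be checked directly. A minimal sketch with simulated data (names are illustrative), confirming that \(1 - SS_{residual}/SS_{total}\) matches the r-squared that `summary()` reports:

```r
# Simulated data for illustration only.
set.seed(2)
x <- rnorm(50)
y <- 2 + 0.8 * x + rnorm(50)

fit <- lm(y ~ x)
ss_total    <- sum((y - mean(y))^2)   # SS_total (reduced model: intercept only)
ss_residual <- sum(residuals(fit)^2)  # SS_residual (full model)

r2 <- 1 - ss_residual / ss_total
all.equal(r2, summary(fit)$r.squared)
```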
\[\beta_{YX}=\rho_{YX}*\sigma_Y/\sigma_X\] \[b_{YX} = r_{YX}*S_Y/S_X\]
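The sample version of this identity, \(b_{YX} = r_{YX} \cdot S_Y/S_X\), holds exactly for the least-squares slope; a quick check on simulated data (illustrative names):

```r
# Simulated data for illustration only.
set.seed(3)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)

b1 <- unname(coef(lm(y ~ x))["x"])      # fitted slope
stopifnot(all.equal(b1, cor(x, y) * sd(y) / sd(x)))
```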
\[y_i = \beta_0 + \beta_1 * x_I + \epsilon_i\]
What if your residuals aren’t normal because of outliers?
Nonparametric methods exist, but these don’t provide parameter estimates with CIs.
Robust regression (e.g. rlm from the MASS package)
Randomization tests
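A randomization test for the slope can be sketched in base R: shuffle y to break any association with x, refit, and compare the observed slope to the permutation distribution. The data here are simulated for illustration:

```r
# Simulated data for illustration only.
set.seed(4)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)

obs_slope <- coef(lm(y ~ x))[[2]]

# 999 permutations: refit with y shuffled, record each slope.
perm_slopes <- replicate(999, coef(lm(sample(y) ~ x))[[2]])

# Two-sided p-value: how often a permuted slope is as extreme as observed.
p_value <- mean(abs(perm_slopes) >= abs(obs_slope))
p_value
```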
Leverage - a measure of how much of an outlier each point is in x-space (on x-axis) and thus only applies to the predictor variable. (Values > 2*(2/n) for simple regression are cause for concern)
Residuals - As the residuals are the differences between the observed and predicted values along a vertical plane, they provide a measure of how much of an outlier each point is in y-space (on y-axis). The patterns of residuals against predicted y values (residual plot) are also useful diagnostic tools for investigating linearity and homogeneity of variance assumptions
Cook’s D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values ≥ 1 (or even approaching 1) correspond to highly influential observations.
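All three diagnostics are available for a fitted lm object via `hatvalues()` (leverage) and `cooks.distance()` (Cook’s D). A sketch with simulated data containing one deliberately extreme point (names and cutoffs follow the rules of thumb above):

```r
# Simulated data with one high-leverage point, for illustration only.
set.seed(5)
x <- c(rnorm(20), 10)               # last point is far out in x-space
y <- 2 + 0.5 * x + c(rnorm(20), 5)  # ...and off the line in y-space

fit <- lm(y ~ x)

lev <- hatvalues(fit)
which(lev > 2 * (2 / length(x)))    # leverage > 2*(2/n): cause for concern

cooksd <- cooks.distance(fit)
which(cooksd >= 1)                  # Cook's D near or above 1: influential
```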
Let’s look at the zebrafish diet data.
x <- zfish_diet$Protein
y <- zfish_diet$Weight
zfish_lm <- lm(y ~ x)
summary(zfish_lm)
plot(y ~ x, col = "blue")
abline(zfish_lm, col = "red")
hist(residuals(zfish_lm), breaks=30)
plot(residuals(zfish_lm) ~ fitted.values(zfish_lm))
plot(residuals(zfish_lm) ~ x)
Or apply the plot() function to the linear model object directly
plot(zfish_lm)
Figure out what these plots are telling you
How about using the influence.measures function?
influence.measures(zfish_lm)
Do we have any high-leverage observations we need to worry about? Now go back and try to recreate these graphs in ggplot2.