10/16/2018 & 10/18/2018
“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”
— Edward Tufte
Draw graphical elements clearly, minimizing clutter
Represent magnitudes honestly and accurately
“Graphical excellence begins with telling the truth about the data” – Tufte 1983
“Graphical excellence consists of complex ideas communicated with clarity, precision and efficiency” – Tufte 1983
geom_bar function
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut))
Now try this…
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, colour=cut))
and this…
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, fill=cut))
and finally this…
ggplot(data=diamonds) + geom_bar(mapping=aes(x=cut, fill=clarity), position="dodge")
geom_histogram and geom_freqpoly functions
With geom_histogram you can make a histogram
ggplot(data=diamonds) + geom_histogram(mapping=aes(x=carat), binwidth=0.5)
geom_freqpoly allows you to make a frequency polygon
ggplot(data=diamonds) + geom_freqpoly(mapping=aes(x=carat), binwidth=0.5)
geom_boxplot function
Boxplots are very useful for visualizing data
ggplot(data=diamonds, mapping=aes(x=cut, y=price)) + geom_boxplot()
ggplot(data=mpg, mapping=aes(x=reorder(class, hwy, FUN=median), y=hwy)) + geom_boxplot() + coord_flip()
ggplot(data=mpg, mapping=aes(x=class, y=hwy)) + geom_boxplot() + coord_flip()
geom_point & geom_smooth functions
ggplot(data=diamonds2, mapping=aes(x=x, y=y)) + geom_point()
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + facet_wrap(~class, nrow=2)
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + facet_grid(drv~cyl)
ggplot(data=mpg) + geom_smooth(mapping=aes(x=displ, y=hwy))
ggplot(data=mpg) + geom_point(mapping=aes(x=displ, y=hwy)) + geom_smooth(mapping=aes(x=displ, y=hwy))
ggplot(data=mpg, aes(displ, hwy)) +
  geom_point(aes(color=class)) +
  geom_smooth(se=FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size",
       subtitle = "Two seaters (sports cars) are an exception because of their light weight",
       caption = "Data from fueleconomy.gov")
Anscombe's quartet: four datasets whose summary statistics are nearly identical
Mean of x in each case: 9 (exact)
Variance of x in each case: 11 (exact)
Mean of y in each case: 7.50 (to 2 decimal places)
Variance of y in each case: 4.122 or 4.127 (to 3 decimal places)
Correlation between x and y in each case: 0.816 (to 3 decimal places)
Linear regression line in each case: \[ y = 3.00 + 0.50x\]
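These values can be checked directly, since the quartet ships with base R as the `anscombe` data frame (columns x1–x4 and y1–y4). A minimal sketch:

```r
# The built-in anscombe data frame has columns x1..x4 and y1..y4
sapply(anscombe[, 1:4], mean)            # every x mean is 9
sapply(anscombe[, 1:4], var)             # every x variance is 11
sapply(anscombe[, 5:8], mean)            # every y mean is ~7.50
round(cor(anscombe$x1, anscombe$y1), 3)  # 0.816; the same for each x/y pair
coef(lm(y1 ~ x1, data = anscombe))       # intercept ~3.00, slope ~0.50
```

Plotting the four pairs shows why these identical summaries are misleading: the scatterplots look completely different.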
General Linear Model (GLM) - two or more continuous variables
General Linear Mixed Model (GLMM) - a continuous response variable with a mix of continuous and categorical predictor variables
Generalized Linear Model - a linear model that doesn't assume normality of the response (we'll get to this later)
Generalized Additive Model (GAM) - a model that doesn't assume linearity (we won't get to this)
All can be written in the form
response variable = intercept + (explanatory_variables) + random_error
in the general form:
\[ Y=\beta_0 +\beta_1*X_1 + \beta_2*X_2 +... + error\]
where \(\beta_0, \beta_1, \beta_2, ....\) are the parameters of the linear model
All of these will include the intercept
Y~X Y~1+X Y~X+1
All of these will exclude the intercept
Y~-1+X Y~X-1
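A minimal sketch of the two formula styles, using R's built-in cars data (dist ~ speed stands in for Y ~ X):

```r
# Fit with and without an intercept on the built-in cars data
with_int    <- lm(dist ~ speed, data = cars)      # intercept included (default)
without_int <- lm(dist ~ speed - 1, data = cars)  # intercept excluded

names(coef(with_int))     # "(Intercept)" and "speed"
names(coef(without_int))  # "speed" only
```

Dropping the intercept forces the fitted line through the origin, which is rarely what you want unless theory demands it.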
Need to fit the model and then ‘read’ the output
trial_lm <- lm(Y~X)
summary(trial_lm)
Write a script to read in the perchlorate data set. Now, add the code to perform a linear model of two continuous variables. Notice how the output of the linear model is assigned to a new variable. Also note that the variable and dataset names below are placeholders
my_lm <- lm(dataset$variable1 ~ dataset$variable2)
Now look at a summary of the linear model
summary(my_lm)
print(my_lm)
Now let’s produce a nice regression plot
plot(dataset$variable1 ~ dataset$variable2, col = "blue")
abline(my_lm, col = "red")
Notice that you are adding the fitted line from your linear model. Finally, remake this plot in ggplot2
Perchlorate_Data <- read.table('perchlorate_data.tsv', header=T, sep='\t')
head(Perchlorate_Data)
x <- Perchlorate_Data$T4_Hormone_Level
y <- Perchlorate_Data$Testes_Area
perc_lm <- lm(y ~ x)
summary(perc_lm)
plot(y ~ x, col = "blue")
abline(perc_lm, col = "red")
\[H_0 : \beta_0 = 0\] \[H_0 : \beta_1 = 0\]
full model - \(y_i = \beta_0 + \beta_1*x_i + error_i\)
reduced model - \(y_i = \beta_0 + 0*x_i + error_i\)
Estimation of the variation that is explained by the model (SS_model)
SS_model = SS_total(reduced model) - SS_residual(full model)
The variation that is unexplained by the model (SS_residual)
SS_residual(full model)
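This decomposition can be computed by hand, or obtained via anova(). A sketch using the built-in cars data, where dist ~ speed is the full model and the intercept-only fit is the reduced model:

```r
# Full vs reduced model on the built-in cars data
full    <- lm(dist ~ speed, data = cars)
reduced <- lm(dist ~ 1, data = cars)   # intercept only

ss_total_reduced <- sum(residuals(reduced)^2)  # SS_total (reduced model)
ss_resid_full    <- sum(residuals(full)^2)     # SS_residual (full model)
ss_model <- ss_total_reduced - ss_resid_full   # SS_model

anova(reduced, full)  # F test comparing the two models uses this same decomposition
```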
\[r^2 = SS_{regression}/SS_{total} = 1 - (SS_{residual}/SS_{total})\] or \[r^2 = 1 - (SS_{residual(full)}/SS_{total(reduced)})\] Which is the proportion of the variance in Y that is explained by X
\[\beta_{YX}=\rho_{YX}*\sigma_Y/\sigma_X\] \[b_{YX} = r_{YX}*S_Y/S_X\]
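Both identities are easy to verify in R; a sketch, again using the built-in cars data in place of X and Y:

```r
fit <- lm(dist ~ speed, data = cars)

# r^2 from the sums of squares
ss_resid <- sum(residuals(fit)^2)
ss_total <- sum((cars$dist - mean(cars$dist))^2)
1 - ss_resid / ss_total          # matches summary(fit)$r.squared
cor(cars$speed, cars$dist)^2     # equals r^2 for simple regression

# slope identity: b = r * S_y / S_x
cor(cars$speed, cars$dist) * sd(cars$dist) / sd(cars$speed)  # equals coef(fit)["speed"]
```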
\[y_i = \beta_0 + \beta_1 * x_i + \epsilon_i\]
What if your residuals aren’t normal because of outliers?
Nonparametric methods exist, but these don’t provide parameter estimates with CIs.
Robust regression (rlm)
Randomization tests
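A sketch of both approaches, assuming the MASS package (which ships with R) and using the built-in cars data:

```r
# Robust regression: M-estimation downweights outlying observations
library(MASS)
fit_rlm <- rlm(dist ~ speed, data = cars)
summary(fit_rlm)

# Simple randomization test of the slope: shuffle y, refit, compare slopes
obs_slope   <- coef(lm(dist ~ speed, data = cars))["speed"]
perm_slopes <- replicate(999, coef(lm(sample(cars$dist) ~ cars$speed))[2])
mean(abs(perm_slopes) >= abs(obs_slope))  # two-sided permutation p-value
```

Note that rlm() still gives parameter estimates with standard errors, which is its advantage over fully nonparametric methods.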
Leverage - a measure of how much of an outlier each point is in x-space (on x-axis) and thus only applies to the predictor variable. (Values > 2*(2/n) for simple regression are cause for concern)
Residuals - As the residuals are the differences between the observed and predicted values along a vertical plane, they provide a measure of how much of an outlier each point is in y-space (on y-axis). The patterns of residuals against predicted y values (residual plot) are also useful diagnostic tools for investigating linearity and homogeneity of variance assumptions
Cook’s D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values ≥ 1 (or even approaching 1) correspond to highly influential observations.
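All three diagnostics are available as extractor functions on a fitted lm object; a sketch with the built-in cars data:

```r
# Leverage, residuals, and Cook's D for a fitted model
fit <- lm(dist ~ speed, data = cars)

lev <- hatvalues(fit)                 # leverage: outlyingness in x-space
which(lev > 2 * (2 / nrow(cars)))     # flag points above the 2*(2/n) cutoff

rstandard(fit)                        # standardized residuals: outlyingness in y-space
cooks.distance(fit)                   # influence; watch for values approaching 1
```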
Let’s look at the zebrafish diet data.
x <- zfish_diet$Protein
y <- zfish_diet$Weight
zfish_lm <- lm(y ~ x)
summary(zfish_lm)
plot(y ~ x, col = "blue")
abline(zfish_lm, col = "red")
hist(residuals(zfish_lm), breaks=30)
plot(residuals(zfish_lm) ~ fitted.values(zfish_lm))
plot(residuals(zfish_lm) ~ x)
Or apply the plot() function to the linear model object directly
plot(zfish_lm)
Figure out what these plots are telling you
How about using the influence.measures() function?
influence.measures(zfish_lm)
Do we have any high leverage observations we need to worry about? Now go back and try to recreate these graphs in ggplot2