\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Homework 9: GLMs

Assignment:

Due: Submit your work via Canvas by the end of the day (midnight) on Thursday, December 10th. Please submit both the Rmd file and the resulting html or pdf file. You can work with other members of class, but I expect each of you to construct and run all of the scripts yourself.

Pick one of the following hypothetical situations, and briefly

You don’t need to write a lot - just a paragraph explaining your choice and simulation rationale, then a bit of code and a plot. Note: unlike in most of your homework assignments, please do show your code (but explain what it’s going to do, beforehand, in words).

Situations:

  1. How number of pumpkins per vine depends on amount of fertilizer.
  2. How (presence or absence of) hip dysplasia in dogs depends on age and breed.
  3. How doughnut weight varies between and within bakeries and doughnut types.

Example solutions

How height of human Eugene children ages 2-12 varies with age.

Human height varies relatively little about the mean, and has often been modeled as Normally distributed. As long as the SD isn’t too big we won’t get any negative heights. Some googling turns up average curves that look more or less straight over this age range, so we’ll go with a normal family GLM with an identity link function.

To simulate this, it looks like children at age 2 have a mean height of around 80cm, which goes up to about 150cm over the next 10 years. The standard deviation is probably around 7%, so 7cm.

n <- 100 # children
heights <- data.frame(
                      age = round(runif(n, 2, 12), 1) )
params <- list(
               age = 7, # = (150 - 80)/10
               intercept = 66, # = 80 - 2 * 7
               sd = 7)

heights$mean <- params$intercept + params$age * heights$age
heights$height <- rnorm(n, mean=heights$mean, sd=params$sd)

(ggplot(heights, aes(x=age, y=height)) + geom_point() + geom_smooth(method='loess', formula=y~x)
 + xlab("age (years)") + ylab("height (cm)"));
plot of chunk r sim_heights

How house prices are predicted by elevation and square footage.

House prices are (more or less) continuously distributed, nonnegative, and highly right-skewed (the most expensive houses cost a lot more than most houses). The Gamma distribution can fit all those criteria, and if we make the mean depend on the exponential of the linear predictor then we can ensure it’ll always be positive. So, we might use a gamma family GLM with a log link function.

To simulate this, let’s say that elevation varies by a thousand feet, and (googling) houses are between a thousand and (rarely) a few thousand feet. I think typical houses cost hundreds of thousands of dollars, and square footage is more important than elevation, so let’s say that mean house price goes up by a factor of 2 per thousand square feet, and by 50% over the thousand feet of elevation. (A factor of 5 produced billion-dollar houses, whoops.) We don’t want a lot of dispersion, so we’ll pick the shape parameter of the Gamma to be on the large side so there isn’t a lot of variance. In reality, elevation and square footage are probably correlated, but I’ll ignore that.

n <- 100 # houses
houses <- data.frame(
                     elevation = round(runif(n, 0, 1000), 0),
                     size = round(rgamma(n, shape=1.2, scale=1000), 0))

# note these are on a log scale
params <- list(gamma_shape = 5,
               intercept = log(1e5),
               elevation = log(1.5) / 1000, # since mean should go up by exp(1000 * this) over 1000 feet
               size = log(2) / 1000)

houses$mean <- exp(
                   params$intercept
                   + params$elevation * houses$elevation
                   + params$size * houses$size )
houses$price <- rgamma(n, shape=params$gamma_shape, scale=houses$mean/params$gamma_shape)

grid.arrange(
    (ggplot(houses, aes(x=size, y=price, col=elevation)) + geom_point()
     + xlab("house size (sq ft)") + ylab("house price ($)")),
    (ggplot(houses, aes(x=elevation, y=price, col=size)) + geom_point()
     + xlab("house elevation above reference (ft)") + ylab("house price ($)")),
    ncol=2);
plot of chunk r sim_houses

The simulation has a probably unrealistic range of house prices, but nothing is wildly unreasonable.