Assignment:
Due: Submit your work via Canvas by the end of the day (midnight) on Thursday, December 9th. Please submit both the Rmd file and the resulting html or pdf file. You can work with other members of the class, but I expect each of you to construct and run all of the scripts yourself.
Pick one of the three following hypothetical situations, and briefly
Situations:
You don’t need to write a lot - just a paragraph explaining your choice and simulation rationale, then a bit of code and a plot. Note: unlike in most of your homework assignments, please do show your code (but explain what it’s going to do, beforehand, in words).
Here are some examples of how I might answer the question in some other situations:
Human height varies relatively little about the mean, and has often been modeled as Normally distributed. As long as the SD isn’t too big we won’t get any negative heights. Some googling turns up average growth curves that look more or less straight over this age range, so we’ll go with a normal family GLM with an identity link function.
To simulate this, note that children at age 2 have a mean height of around 80cm, which goes up to about 150cm over the next 10 years. The standard deviation is probably around 7% of the mean, so roughly 7cm.
library(ggplot2)

n <- 100 # children
heights <- data.frame(
    age = round(runif(n, 2, 12), 1) )
params <- list(
    age = 7, # = (150 - 80)/10
    intercept = 66, # = 80 - 2 * 7
    sd = 7)
heights$mean <- params$intercept + params$age * heights$age
heights$height <- rnorm(n, mean=heights$mean, sd=params$sd)

(ggplot(heights, aes(x=age, y=height)) + geom_point() + geom_smooth(method='loess', formula=y~x)
    + xlab("age (years)") + ylab("height (cm)"));
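One quick sanity check on a simulation like this is to fit the same model back to the simulated data and see whether the estimates land near the values we chose. The following is a minimal sketch of that check: it re-creates the simulation so that it runs on its own, and the seed and the name `fit` are assumptions made for illustration.

```r
set.seed(123) # arbitrary seed, only so the check is reproducible

# re-create the simulated heights (intercept 66, slope 7, sd 7)
n <- 100
heights <- data.frame(age = round(runif(n, 2, 12), 1))
heights$height <- rnorm(n, mean = 66 + 7 * heights$age, sd = 7)

# fit the model we simulated from: normal family, identity link
fit <- glm(height ~ age, family = gaussian(link = "identity"), data = heights)

# estimates should be near intercept = 66 and slope = 7
print(coef(fit))
# for the gaussian family the dispersion is the residual variance,
# so its square root should be near sd = 7
print(sqrt(summary(fit)$dispersion))
```

With only 100 children the estimates won’t match exactly, but they should be within a couple of standard errors of the true values.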
House prices are (more or less) continuously distributed, nonnegative, and highly right-skewed (the most expensive houses cost a lot more than most houses). The Gamma distribution can fit all those criteria, and if we make the mean depend on the exponential of the linear predictor then we can ensure it’ll always be positive. So, we might use a gamma family GLM with a log link function.
To simulate this, let’s say that elevation varies by a thousand feet, and (googling) house sizes are between a thousand and (rarely) a few thousand square feet. I think typical houses cost hundreds of thousands of dollars, and square footage is more important than elevation, so let’s say that mean house price goes up by a factor of 2 per thousand square feet, and by 50% over the thousand feet of elevation. (A factor of 5 produced billion-dollar houses, whoops.) We don’t want a lot of dispersion, so we’ll pick the shape parameter of the Gamma to be on the large side so there isn’t a lot of variance. In reality, elevation and square footage are probably correlated, but I’ll ignore that.
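In symbols, the model might be written as follows. The notation is introduced here only for illustration: $\mu_i$ is the mean price of house $i$, $k$ is the Gamma shape parameter, and the $\beta$s are the coefficients on the log scale.

```latex
\begin{aligned}
\text{price}_i &\sim \text{Gamma}(\text{shape} = k,\ \text{scale} = \mu_i / k), \\
\log \mu_i &= \beta_0 + \beta_{\text{elev}} \, \text{elevation}_i + \beta_{\text{size}} \, \text{size}_i .
\end{aligned}
```

With this parameterization the mean is $k \cdot \mu_i / k = \mu_i$ and the standard deviation is $\mu_i / \sqrt{k}$, so a larger shape $k$ means less spread relative to the mean. A factor of 2 per thousand square feet corresponds to $\beta_{\text{size}} = \log(2)/1000$, and a 50% increase over a thousand feet of elevation to $\beta_{\text{elev}} = \log(1.5)/1000$.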
library(ggplot2)
library(gridExtra) # for grid.arrange()

n <- 100 # houses
houses <- data.frame(
    elevation = round(runif(n, 0, 1000), 0),
    size = round(rgamma(n, shape=1.2, scale=1000), 0))
# note these are on a log scale
params <- list(gamma_shape = 5,
    intercept = log(1e5),
    elevation = log(1.5) / 1000, # since mean should go up by exp(1000 * this) over 1000 feet
    size = log(2) / 1000)
houses$mean <- exp(
    params$intercept
    + params$elevation * houses$elevation
    + params$size * houses$size )
houses$price <- rgamma(n, shape=params$gamma_shape, scale=houses$mean/params$gamma_shape)

grid.arrange(
    (ggplot(houses, aes(x=size, y=price, col=elevation)) + geom_point()
        + xlab("house size (sq ft)") + ylab("house price ($)")),
    (ggplot(houses, aes(x=elevation, y=price, col=size)) + geom_point()
        + xlab("house elevation above reference (ft)") + ylab("house price ($)")),
    ncol=2);
That’s hard to see; here’s the same thing on a log scale:
grid.arrange(
    (ggplot(houses, aes(x=size, y=price, col=elevation)) + geom_point()
        + xlab("house size (sq ft)") + ylab("house price ($)")
        + scale_y_log10()),
    (ggplot(houses, aes(x=elevation, y=price, col=size)) + geom_point()
        + xlab("house elevation above reference (ft)") + ylab("house price ($)")
        + scale_y_log10()),
    ncol=2);
The simulation has a probably unrealistic range of house prices, but nothing is wildly unreasonable.
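As a further check, we could fit a Gamma GLM with a log link to the simulated prices and see whether it recovers the effects we put in. This is only a sketch under stated assumptions: it re-creates the simulation so that it runs on its own, and the seed and the names `fit` and `gamma_shape` are chosen for illustration.

```r
set.seed(123) # arbitrary seed, only so the check is reproducible

# re-create the simulated houses with the same parameter values
n <- 100
houses <- data.frame(
    elevation = round(runif(n, 0, 1000), 0),
    size = round(rgamma(n, shape = 1.2, scale = 1000), 0))
gamma_shape <- 5
houses$mean <- exp(log(1e5)
                   + log(1.5) / 1000 * houses$elevation
                   + log(2) / 1000 * houses$size)
houses$price <- rgamma(n, shape = gamma_shape, scale = houses$mean / gamma_shape)

# fit the model we simulated from: Gamma family, log link
fit <- glm(price ~ elevation + size, family = Gamma(link = "log"), data = houses)

# back-transform the slopes to multiplicative effects per 1000 ft of
# elevation and per 1000 sq ft; these should be roughly 1.5 and 2
print(exp(1000 * coef(fit)[c("elevation", "size")]))
# the estimated dispersion should be roughly 1 / shape = 0.2
print(summary(fit)$dispersion)
```

The elevation effect is the harder one to pin down with 100 houses, since it is smaller relative to the noise, so expect its estimate to wander more than the one for size.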