Peter Ralph
Advanced Biological Statistics
Today: link
Thursday: discussion of “the two cultures” and review.
In an experiment, we usually go to great lengths to measure the effect of just one thing on the response.
Out in the wild, we can’t always do that.
Easy: predict house prices in South Eugene using square footage, elevation, lot size, distance to transit, latitude, longitude.
Hard: predict building prices in Oregon using square footage, elevation, lot size, distance to transit, latitude, longitude.
Why is it hard? The effect of everything depends on everything else, and there are lots of nonlinearities.
Think of an example where the effect of one variable on the response changes direction depending on the value of another variable. Sketch the relationship.
Example: temperature with cloud cover and day of the year: it’s colder when it’s cloudy in the summer, but warmer when it’s cloudy in the winter.
How do we predict in highly nonlinear situations with a great many explanatory variables?
Split the data in two, optimally, according to the value of some variable.
Predict the value on each half.
If good enough, stop. Otherwise, repeat on each subset.
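Here is a minimal sketch of this procedure in R, using the rpart package (not otherwise used in this lecture) and the built-in trees dataset; the names here are illustrative:

library(rpart)

# recursively split the data to predict timber Volume from Girth and Height;
# each leaf predicts the mean Volume of the observations that fall in it
fit <- rpart(Volume ~ Girth + Height, data = trees)
plot(fit)
text(fit)
predict(fit, newdata = data.frame(Girth = 10, Height = 75))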
Benefits:
easy to explain
easy to visualize
Problems:
kinda arbitrary?
prone to overfitting
high variance
Potential solution: “bagging”.
Take lots of bootstrap subsamples from the data.
Build a decision tree on each one.
Average their predictions.
This works pretty well.
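For instance, randomForest( ) (used later in this lecture) does bagging when mtry is set to the full number of predictors; a sketch on a built-in dataset:

library(randomForest)

# bagging: 500 bootstrap samples, one tree each, predictions averaged;
# mtry = 10 lets every split consider all 10 predictors of mpg
bag <- randomForest(mpg ~ ., data = mtcars, mtry = 10, ntree = 500)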
Problem: with strong predictors, most trees look the same.
Solution: don’t use all the data.
Take lots of bootstrap subsamples from the data.
Build a decision tree on each, at each split of each tree using only a random subset of the variables.
Average their predictions.
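In randomForest( ) this is just the default behavior: for regression, mtry defaults to roughly a third of the predictors. Continuing the sketch above (rf_cars is a name introduced here):

# same call as the bagging sketch, but each split now considers
# only floor(10/3) = 3 randomly chosen predictors
rf_cars <- randomForest(mpg ~ ., data = mtcars, ntree = 500)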
The Bikeshare data, from the ISLR2 package:

##   season mnth day hr holiday weekday workingday   weathersit temp  atemp  hum windspeed casual registered bikers
## 1      1  Jan   1  0       0       6          0        clear 0.24 0.2879 0.81    0.0000      3         13     16
## 2      1  Jan   1  1       0       6          0        clear 0.22 0.2727 0.80    0.0000      8         32     40
## 3      1  Jan   1  2       0       6          0        clear 0.22 0.2727 0.80    0.0000      5         27     32
## 4      1  Jan   1  3       0       6          0        clear 0.24 0.2879 0.75    0.0000      3         10     13
## 5      1  Jan   1  4       0       6          0        clear 0.24 0.2879 0.75    0.0000      0          1      1
## 6      1  Jan   1  5       0       6          0 cloudy/misty 0.24 0.2576 0.75    0.0896      0          1      1
> ?Bikeshare
Bikeshare package:ISLR2 R Documentation
Bike sharing data
Description:
This data set contains the hourly and daily count of rental bikes
between 2011 and 2012 in the Capital Bikeshare system, along
with weather and seasonal information.
Usage:
Bikeshare
Format:
A data frame with 8645 observations on a number of variables.
‘season’ Season of the year, coded as Winter=1, Spring=2,
Summer=3, Fall=4.
‘mnth’ Month of the year, coded as a factor.
‘day’ Day of the year, from 1 to 365
‘hr’ Hour of the day, coded as a factor from 0 to 23.
‘holiday’ Is it a holiday? Yes=1, No=0.
‘weekday’ Day of the week, coded from 0 to 6, where Sunday=0,
Monday=1, Tuesday=2, etc.
‘workingday’ Is it a work day? Yes=1, No=0.
‘weathersit’ Weather, coded as a factor.
‘temp’ Normalized temperature in Celsius. The values are derived
via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39.
‘atemp’ Normalized feeling temperature in Celsius. The values are
derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50.
‘hum’ Normalized humidity. The values are divided by 100 (max).
‘windspeed’ Normalized wind speed. The values are divided by 67
(max).
‘casual’ Number of casual bikers.
‘registered’ Number of registered bikers.
‘bikers’ Total number of bikers.
Source:
The UCI Machine Learning Repository <URL:
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset>
References:
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021) _An
Introduction to Statistical Learning with applications in R,
Second Edition_, <URL: https://www.statlearning.com>,
Springer-Verlag, New York
Examples:
lm(bikers~hr, data=Bikeshare)
Exercise: load the ISLR2 and randomForest packages. Use randomForest( ) to predict bikers using the other variables except casual and registered, keeping 20% of the data aside for testing. Compare to lm( ). Does glm( ) do better?

library(ISLR2)
library(randomForest)

# TODO: subsample entire days? weeks?
# put 80% of the rows in the training set, at random
train <- sample(rep(c(TRUE, FALSE), c(0.8, 0.2) * nrow(Bikeshare)))

# fit a random forest and a linear model, each predicting bikers from
# everything except casual and registered (those two sum to bikers)
rf <- randomForest(bikers ~ . - casual - registered,
                   data = Bikeshare,
                   subset = train)
bike_lm <- lm(bikers ~ . - casual - registered,
              data = Bikeshare,
              subset = train)
# "heavy rain/snow" is so rare it may be missing from the training set,
# and lm( ) cannot predict at factor levels it has never seen
ut <- (Bikeshare$weathersit != "heavy rain/snow")
pred_rf <- predict(rf, newdata = Bikeshare[ut, ])
pred_lm <- predict(bike_lm, newdata = Bikeshare[ut, ])
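One quick overall comparison (a sketch; keep is a name introduced here) is test-set root-mean-squared error:

# restrict to the predicted rows (ut) that were held out of training
keep <- !train[ut]
sqrt(mean((pred_rf[keep] - Bikeshare$bikers[ut][keep])^2))  # RMSE, random forest
sqrt(mean((pred_lm[keep] - Bikeshare$bikers[ut][keep])^2))  # RMSE, linear model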
Let’s compare observed and predicted ridership by month:
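A sketch of one way to do this (the plotting choices are illustrative):

# mean ridership by month: observed vs. each model's predictions
month_means <- cbind(
    observed = tapply(Bikeshare$bikers[ut], Bikeshare$mnth[ut], mean),
    forest = tapply(pred_rf, Bikeshare$mnth[ut], mean),
    linear = tapply(pred_lm, Bikeshare$mnth[ut], mean))
matplot(month_means, type = "l", lty = 1, col = 1:3,
        xlab = "month", ylab = "mean bikers")
legend("topleft", lty = 1, col = 1:3, legend = colnames(month_means))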
And, between workdays and non-workdays:
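A quick tabular version of the same comparison (again a sketch):

# mean ridership on non-working (0) vs. working (1) days
wd <- Bikeshare$workingday[ut]
rbind(observed = tapply(Bikeshare$bikers[ut], wd, mean),
      forest = tapply(pred_rf, wd, mean),
      linear = tapply(pred_lm, wd, mean))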
Finally, compare to a quasi-Poisson GLM:
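Since bikers is a count, a quasi-Poisson GLM is a natural fit; a sketch (bike_glm and pred_glm are names introduced here):

# same formula as before, but a log-link count model,
# with overdispersion estimated rather than fixed at 1
bike_glm <- glm(bikers ~ . - casual - registered,
                data = Bikeshare, subset = train,
                family = quasipoisson)
pred_glm <- predict(bike_glm, newdata = Bikeshare[ut, ], type = "response")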
An Introduction to Statistical Learning, by James, Witten, Hastie, and Tibshirani
The two cultures (with comments), by Leo Breiman