Peter Ralph
Advanced Biological Statistics
Abstract: There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
What’s the difference between a “data model” and an “algorithmic model”?
At first glance Leo Breiman’s stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance it still looks that way, but the paper is stimulating, and Leo has some important points to hammer home.
What is Efron’s criticism here?
“In the mid-1980s two powerful new algorithms for fitting data became available: neural nets and decision trees. A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
Questions to consider:
Are we trying to learn about the underlying mechanism, or do we only care about good prediction?
Do we want to squeeze every possible drop of information out of the data, or do we have plenty of data and just need something that works?
Is our data well-described by a GLM, or not?
Exercise: make up some situations that have a wide variety of answers to these questions.
Where does visualization fit into all this?
The most obvious way to see how well the model box emulates nature’s box is this: put a case \(x\) down nature’s box getting an output \(y\). Similarly, put the same case \(x\) down the model box getting an output \(y'\). The closeness of \(y\) and \(y'\) is a measure of how good the emulation is.
Breiman contrasts crossvalidation with goodness-of-fit (e.g., residual analysis, or a posterior predictive check). What’s the difference?
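For concreteness, here is a minimal R sketch with made-up data: goodness-of-fit inspects residuals on the data the model was fit to, while crossvalidation compares predictions \(y'\) to held-out observations \(y\).

```r
# A minimal sketch with made-up data, contrasting the two checks.
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x + rnorm(200)          # "nature's box"

train <- sample(200, 100)              # split into training and test halves
fit <- lm(y ~ x, data = data.frame(x, y)[train, ])

# goodness-of-fit: how well does the model describe the data it was fit to?
plot(fitted(fit), resid(fit))          # residual analysis

# crossvalidation: how close are predictions y' to held-out observations y?
yprime <- predict(fit, newdata = data.frame(x = x[-train]))
mean((y[-train] - yprime)^2)           # out-of-sample prediction error
```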
Care, or at least think, about the data.
Look at the data.
Query the data.
Check the answers.
Communicate.
A statistic is: a numerical description of a dataset.
A parameter is: a numerical attribute of a model of reality.
Often, statistics are used to estimate parameters.
What makes this hard is uncertainty, thanks to randomness.
How do we understand randomness, concretely and quantitatively?
With models.
statistics are numerical summaries of data,
parameters are numerical attributes of a model.
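A tiny R sketch of the distinction, using a made-up Normal model:

```r
# The parameter mu belongs to the model; the statistic mean(x) is computed
# from data and used to estimate it.
mu <- 3                      # parameter: an attribute of the model
x <- rnorm(50, mean = mu)    # data generated by the model
mean(x)                      # statistic: a numerical description of the dataset
```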
confidence intervals
\(p\)-values
report effect sizes!
statistical significance does not imply real-world significance
Central Limit Theorems: sums and means of many independent contributions are approximately Normal.
experiment versus observational study
controls, randomization, replicates
samples: from what population?
statistical power: the standard error of a mean shrinks like \(\sigma/\sqrt{n}\), so power grows with sample size (see the simulation sketch after this list)
confounding factors
correlation versus causation
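The simulation sketch promised above: estimate power for a two-sample \(t\)-test by simulating data with a made-up effect size and noise level.

```r
# Estimate power by simulation: effect size, sd, and sample sizes are made up.
power_sim <- function(n, effect = 0.5, sd = 1, nreps = 1000) {
  reject <- replicate(nreps, {
    a <- rnorm(n, mean = 0, sd = sd)
    b <- rnorm(n, mean = effect, sd = sd)
    t.test(a, b)$p.value < 0.05
  })
  mean(reject)
}
power_sim(n = 20)    # power at n = 20 per group
power_sim(n = 80)    # the standard error shrinks like sigma/sqrt(n), so power rises
# compare to the analytic answer: power.t.test(n = 20, delta = 0.5)$power
```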
A good dataset is:
readable
descriptive
documented
columns are variables, rows are observations
semantically coherent
A good figure:
makes precise visual analogies
with real units
labeled
maximizes information per unit ink
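For instance, a small ggplot2 sketch (entirely made-up data) with labeled axes and real units:

```r
# Made-up measurements; the point is the labeling, not the data.
library(ggplot2)
df <- data.frame(mass_g = rexp(100, rate = 1/20),
                 length_mm = rnorm(100, mean = 50, sd = 8))
ggplot(df, aes(x = length_mm, y = mass_g)) +
  geom_point() +
  labs(x = "body length (mm)", y = "body mass (g)",
       title = "Made-up measurements, with units in the labels")
```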
\(z\)-scores and \(t\)-tests
ANOVA: ratios of mean-squares
Kaplan-Meier survival curves
Cox proportional hazard models
smoothing: loess
multiple comparisons: Bonferroni; FDR
the bootstrap: resampling (sketched after this list, along with a permutation test)
conditional probability
simulation
simulation
oh, and simulation
power analysis
permutation tests
goodness of fit
crossvalidation
imputation and interpolation
nonidentifiability
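Two of the resampling methods above, sketched in R with made-up data: a bootstrap confidence interval for a mean, and a permutation test for a difference in means.

```r
# Resampling sketches with made-up data.
set.seed(2)
a <- rnorm(30, mean = 1.0)
b <- rnorm(30, mean = 1.4)

# bootstrap: resample the data with replacement to get a CI for the mean of a
boot_means <- replicate(1000, mean(sample(a, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))

# permutation test: shuffle group labels to get the null distribution
obs <- mean(b) - mean(a)
perm <- replicate(1000, {
  z <- sample(c(a, b))                    # relabel at random
  mean(z[31:60]) - mean(z[1:30])
})
mean(abs(perm) >= abs(obs))               # two-sided permutation p-value
```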
lm()
\[ y_i = \mu + \alpha_{g_i} + \beta x_i + \epsilon_i , \] where \(g_i\) is the group of observation \(i\)
linear: describes the \(+\)s
R’s formulas are powerful (model.matrix()!!)
least-squares: implies Gaussian noise
model comparison: with ANOVA and the \(F\) test
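Putting those pieces together on made-up data (group labels, effect sizes, and sample size are all invented for illustration):

```r
# Made-up data matching the formula above: group effects alpha, slope beta.
set.seed(3)
n <- 120
g <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
x <- runif(n, 0, 10)
y <- 2 + c(A = 0, B = 1, C = -1)[as.character(g)] + 0.5 * x + rnorm(n)

fit <- lm(y ~ g + x)
head(model.matrix(~ g + x))     # the design matrix built from the formula
summary(fit)                    # least-squares estimates (Gaussian noise assumed)
anova(lm(y ~ x), fit)           # F test: do the group effects improve the fit?
```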
Random effects / mixed models:
ALGAE ~ TREAT + (1|PATCH)
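A sketch of that mixed model with lme4 on made-up data; the formula comes from the slide, everything else is invented:

```r
# Made-up data for the formula above; PATCH gets a random intercept.
library(lme4)
set.seed(4)
patches <- data.frame(PATCH = factor(1:20),
                      TREAT = rep(c("control", "removal"), each = 10))
dat <- patches[rep(1:20, each = 5), ]                  # 5 quadrats per patch
patch_effect <- rnorm(20, sd = 2)[as.integer(dat$PATCH)]
dat$ALGAE <- 10 + 3 * (dat$TREAT == "removal") + patch_effect + rnorm(nrow(dat))

fit <- lmer(ALGAE ~ TREAT + (1 | PATCH), data = dat)
summary(fit)     # fixed effect of TREAT, plus variance among PATCHes
```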
Care, or at least think, about the data.
Look at the data.
Query the data.
Check the answers.
Communicate.
What questions are you asking?
well, numbers aren’t racist?
but how we use them might be
More important than statistical technique:
What questions are being asked?
What data is being collected? (and how)
What assumptions are being made?
What are the important conclusions?
A \(p\)-value is:
the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.
A \(p\)-value is not:
an effect size.
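The definition, computed directly by simulation (the coin-flip example and numbers are made up):

```r
# Simulate data under the null hypothesis and ask how often the result is
# at least as surprising as what was observed.
observed <- 38                                       # made-up: 38 heads in 50 flips
null_sims <- rbinom(10000, size = 50, prob = 0.5)    # fair-coin null
mean(abs(null_sims - 25) >= abs(observed - 25))      # two-sided p-value
```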
Stan: kinda picky
can climb the posterior likelihood surface (optimizing())
or, can skateboard around on it (sampling())
needs checking up on
the basis for brms
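A minimal rstan sketch of those two moves, with a made-up Normal model and fake data (brms writes and runs code like this for you):

```r
library(rstan)

model_code <- "
data {
  int N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"
sm <- stan_model(model_code = model_code)
y <- rnorm(20, mean = 3, sd = 2)   # fake data

# climb the posterior surface: a point estimate (posterior mode)
fit_opt <- optimizing(sm, data = list(N = length(y), y = y))

# skateboard around on it: MCMC samples from the posterior
fit_mcmc <- sampling(sm, data = list(N = length(y), y = y), chains = 4, iter = 2000)

# check up on it: convergence diagnostics
print(fit_mcmc)   # look at Rhat and effective sample sizes
```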
Ingredients:
Examples:
brms versus glm()/glmer():
glm(er): maximum-likelihood fits; fast; standard families only
brms: Bayesian fits via Stan (MCMC); slower; but allows priors, many more families, and richer model structures
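A sketch of the same mixed model fit both ways, on made-up count data:

```r
# Same model, two ways, with made-up count data.
library(lme4)
library(brms)
set.seed(5)
dat <- data.frame(TREAT = rep(c("control", "removal"), each = 50),
                  PATCH = factor(rep(1:20, each = 5)))
dat$ALGAE <- rpois(100, lambda = exp(1 + 0.5 * (dat$TREAT == "removal") +
                                     rnorm(20, sd = 0.3)[as.integer(dat$PATCH)]))

# glmer: maximum likelihood, fast
fit_ml <- glmer(ALGAE ~ TREAT + (1 | PATCH), data = dat, family = poisson)

# brms: writes and runs a Stan model, returns posterior samples
fit_bayes <- brm(ALGAE ~ TREAT + (1 | PATCH), data = dat, family = poisson())

summary(fit_ml)
summary(fit_bayes)
```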
Time series and spatial statistics: both have to deal with autocorrelation.
Methods:
mechanistic models
multivariate Normal / Gaussian process
spline / smoothing / loess
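A sketch with a made-up autocorrelated series: acf() shows the autocorrelation, and loess() smooths out the trend.

```r
# Made-up series: a smooth trend plus AR(1) noise.
set.seed(6)
t <- 1:200
y <- sin(t / 20) + as.numeric(arima.sim(list(ar = 0.8), n = 200, sd = 0.3))

acf(y)                                    # nearby observations are correlated
sm <- loess(y ~ t, span = 0.3)            # loess smooth of the trend
plot(t, y)
lines(t, predict(sm), lwd = 2)
```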