Peter Ralph
11 March 2021 – Advanced Biological Statistics
Care, or at least think, about the data.
Look at the data.
Query the data.
Check the answers.
Communicate.
A statistic is a numerical description of a dataset.
A parameter is a numerical attribute of a model of reality.
Often, statistics are used to estimate parameters.
There is uncertainty, thanks to randomness.
How do we understand randomness, concretely and quantitatively?
With models.
statistics are numerical summaries of data,
parameters are numerical attributes of a model.
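The distinction can be made concrete with a small simulation. This is only a sketch: the Normal model, the values of `mu` and `sigma`, and the variable name `heights` are assumptions chosen for illustration.

```r
# Hypothetical model: heights are Normal(mu, sigma).
set.seed(1)
mu <- 170      # parameter: an attribute of the model (assumed value)
sigma <- 10    # parameter: an attribute of the model (assumed value)

# Simulated "data" drawn from the model.
heights <- rnorm(50, mean = mu, sd = sigma)

# Statistics: numerical summaries of the data,
# used here to estimate the parameters.
xbar <- mean(heights)
s <- sd(heights)

cat("sample mean:", round(xbar, 1), "(estimates mu =", mu, ")\n")
cat("sample sd:  ", round(s, 1), "(estimates sigma =", sigma, ")\n")
```

The parameters are fixed by the model; the statistics change every time a new sample is drawn.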
confidence intervals
\(p\)-values
report effect sizes!
statistical significance does not imply real-world significance
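As a sketch of reporting all three together, the following uses base R's `t.test()` on simulated data. The group names, sample sizes, and true means are assumptions made up for the example.

```r
# Hypothetical two-group comparison on simulated data.
set.seed(2)
control <- rnorm(30, mean = 10, sd = 2)
treated <- rnorm(30, mean = 11, sd = 2)

# Welch two-sample t-test (the default for t.test with two samples).
tt <- t.test(treated, control)

# Effect size: the estimated difference in means, in the data's own units.
effect <- mean(treated) - mean(control)
ci <- tt$conf.int   # 95% confidence interval for the difference in means
p <- tt$p.value

cat("difference in means:", round(effect, 2), "\n")
cat("95% CI:", round(ci[1], 2), "to", round(ci[2], 2), "\n")
cat("p-value:", signif(p, 2), "\n")
```

The effect size and its interval say how big the difference is; the \(p\)-value alone does not.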
Central Limit Theorems:
experiment versus observational study
controls, randomization, replicates
samples: from what population?
statistical power: \(\sigma/\sqrt{n}\)
confounding factors
correlation versus causation
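The \(\sigma/\sqrt{n}\) in the list above is the standard error of a sample mean, which is what sample size buys in terms of power. A minimal simulation sketch, with an assumed \(\sigma = 2\) and arbitrary sample sizes, compares it with the spread of simulated sample means:

```r
# Check by simulation that the sample mean has standard deviation sigma / sqrt(n).
set.seed(3)
sigma <- 2                       # assumed noise level
sample_sizes <- c(10, 40, 160)   # arbitrary sample sizes, each 4x the last

se_simulated <- sapply(sample_sizes, function(n) {
  # SD of the sample mean across 2000 simulated datasets of size n
  sd(replicate(2000, mean(rnorm(n, mean = 0, sd = sigma))))
})
se_theory <- sigma / sqrt(sample_sizes)

print(data.frame(n = sample_sizes,
                 simulated = se_simulated,
                 theory = se_theory))
```

Quadrupling \(n\) roughly halves the standard error.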
readable
descriptive
documented
columns are variables, rows are observations
semantically coherent
makes precise visual analogies
with real units
labeled
maximize information per unit ink
\(z\)-scores and \(t\)-tests
ANOVA: ratios of mean-squares
Kaplan-Meier survival curves
Cox proportional hazard models
smoothing: loess
multiple comparisons: Bonferroni; FDR
the bootstrap - resampling
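As one illustration from the list above, here is a minimal bootstrap sketch: resample the data with replacement and recompute the statistic each time. The data are simulated, and the choice of the median as the statistic is an assumption for the example.

```r
# Bootstrap standard error and percentile interval for a median.
set.seed(4)
x <- rexp(40, rate = 1 / 5)   # hypothetical skewed data

# Resample the data with replacement, recomputing the median each time.
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))

boot_se <- sd(boot_medians)                          # bootstrap standard error
boot_ci <- quantile(boot_medians, c(0.025, 0.975))   # percentile interval

cat("observed median:", round(median(x), 2), "\n")
cat("bootstrap SE:", round(boot_se, 2), "\n")
cat("95% interval:", round(boot_ci[[1]], 2), "to", round(boot_ci[[2]], 2), "\n")
```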
conditional probability
simulation
simulation
oh, and simulation
power analysis
permutation tests
goodness of fit
crossvalidation
imputation and interpolation
nonidentifiability
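Simulation and power analysis combine naturally: simulate data under an assumed effect, run the planned test, and count how often it detects the effect. The helper `power_sim()` below is a hypothetical name, and the effect size, noise level, and sample sizes are assumptions for illustration.

```r
# Power analysis by simulation for a two-sample t-test.
set.seed(5)

# power_sim() is an illustrative helper, not a library function.
power_sim <- function(n, effect, sigma = 1, nreps = 1000, alpha = 0.05) {
  pvals <- replicate(nreps, {
    x <- rnorm(n, mean = 0, sd = sigma)        # control group
    y <- rnorm(n, mean = effect, sd = sigma)   # group with the assumed effect
    t.test(x, y)$p.value
  })
  mean(pvals < alpha)   # proportion of simulated experiments that reject the null
}

power_small <- power_sim(n = 10, effect = 0.5)
power_large <- power_sim(n = 100, effect = 0.5)

cat("estimated power with n = 10 per group: ", power_small, "\n")
cat("estimated power with n = 100 per group:", power_large, "\n")
```

The same pattern works for any analysis that can be run on simulated data, not only \(t\)-tests.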
lm()
\[ y_i = \mu + \alpha_{g_i} + \beta x_i + \epsilon_i \]
linear: describes the +s
R’s formulas are powerful (model.matrix()!!)
least-squares regression: implies Gaussian noise
model comparison: with ANOVA and the \(F\) test
Random effects / mixed models:
ALGAE ~ TREAT + (1|PATCH)
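A sketch of how an R formula turns into a design matrix and a fitted model, for a model with a group effect and a covariate. The data are simulated, and the coefficient values and the names `dat`, `g`, and `x` are assumptions for the example.

```r
# Simulate data with a group effect and a linear covariate effect.
set.seed(6)
g <- factor(rep(c("a", "b", "c"), each = 20))   # group labels
x <- runif(60)                                  # covariate
group_effect <- c(a = 0, b = 1, c = -1)         # assumed group effects
y <- 2 + unname(group_effect[as.character(g)]) + 3 * x + rnorm(60, sd = 0.5)
dat <- data.frame(y = y, g = g, x = x)

# model.matrix() shows the columns the formula expands into:
# an intercept, indicator columns for groups b and c, and x.
X <- model.matrix(y ~ g + x, data = dat)
print(head(X))

# Least-squares fits, with and without the group term.
fit_full <- lm(y ~ g + x, data = dat)
fit_reduced <- lm(y ~ x, data = dat)
print(coef(fit_full))

# Model comparison with the F test.
print(anova(fit_reduced, fit_full))
```

Each `+` in the formula adds columns to the design matrix, which is the sense in which the model is linear.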
Care, or at least think, about the data.
Look at the data.
Query the data.
Check the answers.
Communicate.
What questions are you asking?
well, numbers aren’t racist?
but how we use them might be
More important than statistical technique:
What questions are being asked?
What data is being collected? (and how)
What assumptions are being made?
What are the important conclusions?
A \(p\)-value is:
the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.
A \(p\)-value is not:
an effect size.
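The definition can be turned directly into a computation with a permutation test: shuffle the group labels to mimic the null hypothesis, and count how often the shuffled result is at least as extreme as the observed one. This is a sketch on simulated data. The helper `diff_means()` is a hypothetical name, and the absolute difference in means is an assumed measure of "surprising".

```r
# A permutation p-value for a difference in group means.
set.seed(7)
group <- rep(c("a", "b"), each = 20)
y <- rnorm(40, mean = ifelse(group == "b", 1, 0))   # simulated data

# diff_means() is an illustrative helper, not a library function.
diff_means <- function(y, g) mean(y[g == "b"]) - mean(y[g == "a"])

observed <- diff_means(y, group)

# Under the null hypothesis the labels are exchangeable, so shuffle them.
permuted <- replicate(5000, diff_means(y, sample(group)))

# Proportion of shuffled results at least as extreme as the observed one.
p_perm <- mean(abs(permuted) >= abs(observed))

cat("observed difference (the effect size):", round(observed, 2), "\n")
cat("permutation p-value:", p_perm, "\n")
```

The observed difference and the \(p\)-value are different numbers answering different questions.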
kinda picky
can climb the posterior likelihood surface (optimizing())
or, can skateboard around on it (sampling())
needs checking up on
Ingredients:
Examples:
glm()/glmer
glm(er):
stan:
brms:
My recommendation: get familiar with
Time series and spatial statistics:
both have to deal with autocorrelation.
Methods:
mechanistic models
multivariate Normal / Gaussian process / Kriging
spline / smoothing / loess
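To see what autocorrelation looks like in practice, the sketch below compares a simulated autocorrelated series with independent noise, using base R's `arima.sim()` and `acf()`. The AR(1) model and its coefficient of 0.8 are assumptions chosen for illustration.

```r
# Compare lag-1 autocorrelation in an AR(1) series and in independent noise.
set.seed(8)
ar1 <- arima.sim(model = list(ar = 0.8), n = 500)   # each value depends on the previous one
white <- rnorm(500)                                 # independent noise

# acf() returns autocorrelations at lags 0, 1, 2, ...; element 2 is lag 1.
acf_ar1 <- acf(ar1, lag.max = 5, plot = FALSE)$acf[2]
acf_white <- acf(white, lag.max = 5, plot = FALSE)$acf[2]

cat("lag-1 autocorrelation, AR(1) series:     ", round(acf_ar1, 2), "\n")
cat("lag-1 autocorrelation, independent noise:", round(acf_white, 2), "\n")
```

Methods that assume independent observations will tend to understate uncertainty for a series like the first one.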