Looking back

Peter Ralph

11 March 2021 – Advanced Biological Statistics

Looking back

Steps in data analysis

Care, or at least think, about the data.
Look at the data.
Query the data.
Check the answers.
Communicate.

Statistics or parameters?

A statistic is: a numerical description of a dataset.

A parameter is: a numerical attribute of a model of reality.

Often, statistics are used to estimate parameters.

Lurking, behind everything:

is uncertainty

thanks to randomness.

How do we understand randomness, concretely and quantitatively?

With models.

Statistics

statistics are numerical summaries of data,

parameters are numerical attributes of a model.

confidence intervals
\(p\)-values
report effect sizes!
statistical significance does not imply real-world significance

Central Limit Theorems:
- sums of many independent sources of noise give a Gaussian
- count of many rare, independent events is Poisson
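
Both limit theorems can be checked by simulation. The following is a minimal sketch in base R; the Uniform noise, the sample sizes, and all variable names are illustrative assumptions rather than anything from the course materials.

```r
set.seed(1)

# Sums of many independent noise sources:
# add up 50 Uniform(-1, 1) draws, and repeat 10,000 times.
n_terms <- 50
sums <- replicate(10000, sum(runif(n_terms, min = -1, max = 1)))
# The variance of Uniform(-1, 1) is 1/3, so the sum has SD sqrt(50 / 3).
clt_sd <- sqrt(n_terms / 3)
c(observed_sd = sd(sums), predicted_sd = clt_sd)
# Under a Gaussian, about 95% of the sums fall within 1.96 SDs of zero.
gaussian_coverage <- mean(abs(sums) < 1.96 * clt_sd)
gaussian_coverage

# Counts of many rare, independent events:
# 10,000 opportunities, each with probability 3/10,000.
counts <- rbinom(10000, size = 10000, prob = 3 / 10000)
# A Poisson has mean equal to variance; both should be near 3.
c(mean = mean(counts), variance = var(counts))
```

Plotting `hist(sums)` and `table(counts)` against `dnorm()` and `dpois()` gives the visual version of the same check.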

Experimental design

experiment versus observational study
controls, randomization, replicates
samples: from what population?
statistical power: standard errors shrink like \(\sigma/\sqrt{n}\)
confounding factors
correlation versus causation
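
To see how \(\sigma/\sqrt{n}\) translates into power, here is a small simulation sketch. The effect size, noise level, sample sizes, and the helper `sim_power()` are all hypothetical choices made for illustration.

```r
set.seed(2)

# Hypothetical setup: true effect of 0.5 units, noise SD of 1,
# analyzed with a two-sided one-sample t-test.
effect <- 0.5
sigma <- 1

# Estimate power by simulation: the fraction of replicate
# experiments in which p < 0.05.
sim_power <- function(n, nreps = 2000) {
  pvals <- replicate(nreps, t.test(rnorm(n, mean = effect, sd = sigma))$p.value)
  mean(pvals < 0.05)
}

sample_sizes <- c(10, 40, 160)
power_table <- data.frame(
  n = sample_sizes,
  std_error = sigma / sqrt(sample_sizes),  # halves each time n quadruples
  power = sapply(sample_sizes, sim_power)
)
power_table
```

The built-in `power.t.test()` answers the same question analytically for this simple design; simulation is what generalizes to messier ones.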

Tidy data

readable
descriptive
documented
columns are variables, rows are observations
semantically coherent

Visualization

makes precise visual analogies
with real units
labeled
maximize information per unit ink

Concepts and skills

\(z\)-scores and \(t\)-tests
ANOVA: ratios of mean-squares
Kaplan-Meier survival curves
Cox proportional hazard models
smoothing: loess
multiple comparisons: Bonferroni; FDR
the bootstrap - resampling
conditional probability

simulation
simulation
oh, and simulation
power analysis
permutation tests
goodness of fit
crossvalidation
imputation and interpolation
nonidentifiability
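
Two of the resampling ideas above, the bootstrap and the permutation test, fit in a few lines of base R. This is only a sketch on made-up data; the group sizes, means, and variable names are assumptions.

```r
set.seed(3)

# Made-up data: one measurement in two groups of 30,
# simulated with a true difference in means of 1.
x <- rnorm(30, mean = 10, sd = 2)
y <- rnorm(30, mean = 11, sd = 2)
obs_diff <- mean(y) - mean(x)

# Bootstrap: resample each group with replacement to get
# a 95% percentile interval for the difference in means.
boot_diffs <- replicate(5000,
  mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE)))
boot_ci <- quantile(boot_diffs, c(0.025, 0.975))

# Permutation test: shuffle the group labels to see what
# differences arise when there is no real difference.
pooled <- c(x, y)
perm_diffs <- replicate(5000, {
  shuffled <- sample(pooled)
  mean(shuffled[31:60]) - mean(shuffled[1:30])
})
perm_p <- mean(abs(perm_diffs) >= abs(obs_diff))

list(obs_diff = obs_diff, boot_ci = boot_ci, perm_p = perm_p)
```

The bootstrap answers "how uncertain is the estimate?", while the permutation test answers "how surprising is it under the null?"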

Randomness: deal with it

Distributions

Normal (a.k.a. Gaussian)
logNormal
scale mixtures of Normals
multivariate Normal
Student’s \(t\)
the \(F\) distribution
Beta
Binomial
Beta-Binomial

Exponential
Gamma
Cauchy
Poisson
Weibull
chi-squared
Dirichlet

https://en.wikipedia.org/wiki/Relationships_among_probability_distributions

Linear models and `lm()`

\[ y_i = \mu + \alpha_{g_i} + \beta x_i + \epsilon_i \]

linear: describes the +s
R’s formulas are powerful (`model.matrix()`!!)

least-squares regression: equivalent to maximum likelihood with Gaussian noise
model comparison: with ANOVA and the \(F\) test

Random effects / mixed models:

ALGAE ~ TREAT + (1|PATCH)
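
As a reminder of how these pieces fit together, here is a sketch that simulates from a model with a group effect and a slope, then fits it with `lm()`. All parameter values and names are made up; the mixed-model formula above would additionally need a package such as lme4, which is not used here.

```r
set.seed(4)

# Simulate: response = intercept + group effect + slope * x + Gaussian noise.
n <- 300
g <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
x <- runif(n, 0, 10)
alpha <- c(0, 1, -2)  # made-up effects for groups a, b, c
y <- 5 + alpha[as.integer(g)] + 0.7 * x + rnorm(n, sd = 1)
dat <- data.frame(y = y, g = g, x = x)

# The formula becomes a design matrix:
# an intercept, indicator columns for groups b and c, and x.
design <- model.matrix(y ~ g + x, data = dat)
head(design)

# Least-squares fits, with and without the group effect.
fit_full <- lm(y ~ g + x, data = dat)
fit_reduced <- lm(y ~ x, data = dat)
coef(fit_full)

# Model comparison with the F test.
comparison <- anova(fit_reduced, fit_full)
comparison
```

The coefficients `gb` and `gc` are differences from the reference group `a`, which is why the design matrix has no column for `a`.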

Stepping back

Steps in data analysis

Care, or at least think, about the data.
Look at the data.
Query the data.
Check the answers.
Communicate.

What questions are you asking?

well, numbers aren’t racist?

but how we use them might be

More important than statistical technique:

What questions are being asked?
What data is being collected? (and how)
What assumptions are being made?
What are the important conclusions?

Models

A \(p\)-value is:

the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.

A \(p\)-value is not:

an effect size.
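
A quick illustration of why the two should not be confused, using made-up numbers: a negligible effect measured on an enormous sample, next to a large effect measured on a tiny one.

```r
set.seed(5)

# A tiny true effect (0.02 SDs) measured on a huge sample...
big <- rnorm(1e6, mean = 0.02, sd = 1)
# ...and a large true effect (1 SD) measured on a tiny sample.
small <- rnorm(5, mean = 1, sd = 1)

results <- data.frame(
  scenario = c("tiny effect, n = 1e6", "large effect, n = 5"),
  estimate = c(mean(big), mean(small)),
  p_value = c(t.test(big)$p.value, t.test(small)$p.value)
)
results
```

The first p-value is minuscule even though the effect is practically nil; the second depends heavily on the luck of five draws. Reporting the estimate with a confidence interval says more than either p-value.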

Bayesian what-not

bags of biased coins
probability, and Bayes’ rule
posterior \(\propto\) prior \(\times\) likelihood: \[ p(\theta | D) = \frac{ p(D | \theta) p(\theta) }{ p(D) } \]
“updating prior ideas based on (more) data”
credible intervals
hierarchical models
sharing power and shrinkage
posterior predictive sampling
overdispersion: make it random
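
The biased-coin case can be worked through without Stan, since a Beta prior with a Binomial likelihood gives a Beta posterior. The sketch below assumes a hypothetical coin with probability of heads 0.7 and a uniform prior, and checks the closed form against a direct application of Bayes' rule on a grid.

```r
set.seed(6)

# Hypothetical coin with true probability of heads 0.7, flipped 40 times.
flips <- rbinom(40, size = 1, prob = 0.7)
heads <- sum(flips)
tails <- length(flips) - heads

# Prior: theta ~ Beta(1, 1), i.e. uniform. With a Binomial likelihood,
# the posterior is Beta(1 + heads, 1 + tails).
post_a <- 1 + heads
post_b <- 1 + tails
post_mean <- post_a / (post_a + post_b)
cred_int <- qbeta(c(0.025, 0.975), post_a, post_b)  # 95% credible interval

# The same posterior on a grid, straight from Bayes' rule:
# prior times likelihood, divided by the normalizing constant p(D).
step <- 0.001
theta <- seq(step / 2, 1 - step / 2, by = step)
unnorm <- dbeta(theta, 1, 1) * dbinom(heads, size = 40, prob = theta)
grid_post <- unnorm / sum(unnorm * step)
grid_error <- max(abs(grid_post - dbeta(theta, post_a, post_b)))

list(post_mean = post_mean, cred_int = cred_int, grid_error = grid_error)
```

With more flips the credible interval narrows and the prior matters less, which is the "updating prior ideas based on (more) data" in action.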

MC Stan

kinda picky
can climb the posterior likelihood surface (`optimizing()`)
or, can skateboard around on it (`sampling()`)
needs checking up on

Examples:

AirBnB
pumpkins (ANOVA)
limpets (mixed models)
biased coins
baseball players
lung cancer survival
diabetes
Mauna Loa CO2
ocean temperatures
hair and eye color
beer
wine
gene expression
biketown

Austen versus Melville
blue tit nestlings
hurricane lizards

GLMs

Ingredients:

response distribution (“family”)
inverse link function
linear predictor

Examples:

Gaussian + identity
Binomial + logistic
Poisson + exponential
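
The three ingredients map directly onto a call to `glm()`. As a sketch, here is a Poisson GLM with the canonical log link, whose inverse is `exp`, fit to simulated counts; the coefficients and sample size are made up.

```r
set.seed(7)

# Simulate counts whose mean is exp(linear predictor).
n <- 500
x <- runif(n, -1, 1)
linear_predictor <- 0.5 + 1.2 * x               # linear predictor
y <- rpois(n, lambda = exp(linear_predictor))   # Poisson response, inverse link exp

# Family = Poisson, link = log (so the inverse link is exp).
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)

# A family object carries the inverse link as a function.
fam <- poisson(link = "log")
fam$linkinv(0.5)        # same as exp(0.5)
binomial()$linkinv(0)   # the logistic function at 0, which is 0.5
```

Swapping `poisson(link = "log")` for `binomial()` or `gaussian()` changes the response distribution and default link while leaving the formula untouched.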

parametric survival analysis

Stan versus `glm()`/`glmer()`

glm(er):

fast
easy
quick
not so picky about syntax
uses formulas

stan:

does not sweep convergence issues under the rug
doesn’t use formulas
more control, for:
overdispersion
hierarchical modeling

brms:

best of both worlds?

My recommendation: get familiar with

https://paul-buerkner.github.io/brms/reference/index.html

Things that aren’t obviously models that we did in Stan anyhow

robust regression: response is Cauchy
sparse regression: coefficients are Cauchy
dimension reduction: PCA, t-SNE
deconvolution: NMF
spatial smoothing: multivariate Gaussian

Space and time

Time series and spatial statistics:

both have to deal with

autocorrelation.

Methods:

mechanistic models
multivariate Normal / Gaussian process / Kriging
spline / smoothing / loess
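
Autocorrelation is easy to see in a simulated series. The sketch below uses an AR(1) process with a made-up coefficient of 0.8, compares the sample autocorrelation to its expected value, and shows how much a naive standard error understates uncertainty.

```r
set.seed(8)

# AR(1) series: each value is 0.8 times the previous one plus fresh noise.
n <- 2000
rho <- 0.8
series <- as.numeric(arima.sim(model = list(ar = rho), n = n))

# Sample autocorrelation at lags 0 to 5;
# for an AR(1) process the expected value at lag k is rho^k.
sample_acf <- acf(series, lag.max = 5, plot = FALSE)$acf[, 1, 1]
rbind(sample = sample_acf, expected = rho^(0:5))

# Treating the values as independent overstates the information in the data:
# for a long AR(1) series the naive standard error of the mean is too small
# by a factor of about sqrt((1 + rho) / (1 - rho)), which is 3 here.
naive_se <- sd(series) / sqrt(n)
adjusted_se <- naive_se * sqrt((1 + rho) / (1 - rho))
c(naive = naive_se, adjusted = adjusted_se)
```

The same issue arises in space: nearby observations are not independent replicates, so methods need to model the correlation rather than ignore it.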

Conclusion

Be confident!

Thank you!!!
