Peter Ralph
Advanced Biological Statistics
Abstract: There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
What’s the difference between a “data model” and an “algorithmic model”?
At first glance Leo Breiman’s stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance it still looks that way, but the paper is stimulating, and Leo has some important points to hammer home.
What is Efron’s criticism here?
“In the mid-1980s two powerful new algorithms for fitting data became available: neural nets and decision trees. A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”
Questions to consider:
Are we trying to learn about the underlying mechanism, or do we only care about good prediction?
Do we want to squeeze every possible drop of information out of the data, or do we have plenty of data and just need something that works?
Is our data well-described by a GLM, or not?
Exercise: make up some situations that have a wide variety of answers to these questions.
Where does visualization fit into all this?
The most obvious way to see how well the model box emulates nature’s box is this: put a case \(x\) down nature’s box getting an output \(y\). Similarly, put the same case \(x\) down the model box getting an output \(y'\). The closeness of \(y\) and \(y'\) is a measure of how good the emulation is.
Breiman contrasts crossvalidation with goodness-of-fit (e.g., residual analysis, or a posterior predictive check). What’s the difference?
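For concreteness, here is a minimal R sketch with made-up data: goodness-of-fit inspects residuals on the data the model was fit to, while crossvalidation compares predictions \(y'\) to held-out observations \(y\).

```r
# A minimal sketch with made-up data, contrasting the two checks.
set.seed(1)
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x + rnorm(200)          # "nature's box"

train <- sample(200, 100)              # split into training and test halves
fit <- lm(y ~ x, data = data.frame(x, y)[train, ])

# goodness-of-fit: how well does the model describe the data it was fit to?
plot(fitted(fit), resid(fit))          # residual analysis

# crossvalidation: how close are predictions y' to held-out observations y?
yprime <- predict(fit, newdata = data.frame(x = x[-train]))
mean((y[-train] - yprime)^2)           # out-of-sample prediction error
```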
Care, or at least think, about the data.
Look at the data.
Query the data.
Check the answers.
Communicate.
A statistic is: a numerical description of a dataset.
A parameter is: a numerical attribute of a model of reality.
Often, statistics are used to estimate parameters.
What makes this hard is uncertainty, thanks to randomness.
How do we understand randomness, concretely and quantitatively?
With models.
statistics are numerical summaries of data,
parameters are numerical attributes of a model.
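A tiny R sketch of the distinction, using a made-up Normal model:

```r
# The parameter mu belongs to the model; the statistic mean(x) is computed
# from data and used to estimate it.
mu <- 3                      # parameter: an attribute of the model
x <- rnorm(50, mean = mu)    # data generated by the model
mean(x)                      # statistic: a numerical description of the dataset
```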
confidence intervals
\(p\)-values
report effect sizes!
statistical significance does not imply real-world significance
Central Limit Theorems: sums and means of many independent contributions are approximately Normal.
experiment versus observational study
controls, randomization, replicates
samples: from what population?
statistical power: the standard error of a mean shrinks like \(\sigma/\sqrt{n}\), so power grows with sample size (see the simulation sketch after this list)
confounding factors
correlation versus causation
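The simulation sketch promised above: estimate power for a two-sample \(t\)-test by simulating data with a made-up effect size and noise level.

```r
# Estimate power by simulation: effect size, sd, and sample sizes are made up.
power_sim <- function(n, effect = 0.5, sd = 1, nreps = 1000) {
  reject <- replicate(nreps, {
    a <- rnorm(n, mean = 0, sd = sd)
    b <- rnorm(n, mean = effect, sd = sd)
    t.test(a, b)$p.value < 0.05
  })
  mean(reject)
}
power_sim(n = 20)    # power at n = 20 per group
power_sim(n = 80)    # the standard error shrinks like sigma/sqrt(n), so power rises
# compare to the analytic answer: power.t.test(n = 20, delta = 0.5)$power
```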
A good dataset is:
readable
descriptive
documented
columns are variables, rows are observations
semantically coherent
A good figure:
makes precise visual analogies
with real units
labeled
maximizes information per unit ink
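For instance, a small ggplot2 sketch (entirely made-up data) with labeled axes and real units:

```r
# Made-up measurements; the point is the labeling, not the data.
library(ggplot2)
df <- data.frame(mass_g = rexp(100, rate = 1/20),
                 length_mm = rnorm(100, mean = 50, sd = 8))
ggplot(df, aes(x = length_mm, y = mass_g)) +
  geom_point() +
  labs(x = "body length (mm)", y = "body mass (g)",
       title = "Made-up measurements, with units in the labels")
```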
\(z\)-scores and \(t\)-tests
ANOVA: ratios of mean-squares
Kaplan-Meier survival curves
Cox proportional hazard models
smoothing: loess
multiple comparisons: Bonferroni; FDR
the bootstrap: resampling (sketched after this list, along with a permutation test)
conditional probability
simulation
simulation
oh, and simulation
power analysis
permutation tests
goodness of fit
crossvalidation
imputation and interpolation
nonidentifiability
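Two of the resampling methods above, sketched in R with made-up data: a bootstrap confidence interval for a mean, and a permutation test for a difference in means.

```r
# Resampling sketches with made-up data.
set.seed(2)
a <- rnorm(30, mean = 1.0)
b <- rnorm(30, mean = 1.4)

# bootstrap: resample the data with replacement to get a CI for the mean of a
boot_means <- replicate(1000, mean(sample(a, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))

# permutation test: shuffle group labels to get the null distribution
obs <- mean(b) - mean(a)
perm <- replicate(1000, {
  z <- sample(c(a, b))                    # relabel at random
  mean(z[31:60]) - mean(z[1:30])
})
mean(abs(perm) >= abs(obs))               # two-sided permutation p-value
```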
lm()
\[ y_i = \mu + \alpha_{g_i} + \beta x_i + \epsilon_i , \] where \(g_i\) is the group of observation \(i\)
linear: describes the \(+\)s
R’s formulas are powerful (model.matrix()!!)
least-squares: implies Gaussian noise
model comparison: with ANOVA and the \(F\) test
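Putting those pieces together on made-up data (group labels, effect sizes, and sample size are all invented for illustration):

```r
# Made-up data matching the formula above: group effects alpha, slope beta.
set.seed(3)
n <- 120
g <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
x <- runif(n, 0, 10)
y <- 2 + c(A = 0, B = 1, C = -1)[as.character(g)] + 0.5 * x + rnorm(n)

fit <- lm(y ~ g + x)
head(model.matrix(~ g + x))     # the design matrix built from the formula
summary(fit)                    # least-squares estimates (Gaussian noise assumed)
anova(lm(y ~ x), fit)           # F test: do the group effects improve the fit?
```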
Random effects / mixed models:
ALGAE ~ TREAT + (1|PATCH)
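A sketch of that mixed model with lme4 on made-up data; the formula comes from the slide, everything else is invented:

```r
# Made-up data for the formula above; PATCH gets a random intercept.
library(lme4)
set.seed(4)
patches <- data.frame(PATCH = factor(1:20),
                      TREAT = rep(c("control", "removal"), each = 10))
dat <- patches[rep(1:20, each = 5), ]                  # 5 quadrats per patch
patch_effect <- rnorm(20, sd = 2)[as.integer(dat$PATCH)]
dat$ALGAE <- 10 + 3 * (dat$TREAT == "removal") + patch_effect + rnorm(nrow(dat))

fit <- lmer(ALGAE ~ TREAT + (1 | PATCH), data = dat)
summary(fit)     # fixed effect of TREAT, plus variance among PATCHes
```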
Care, or at least think, about the data.
Look at the data.
Query the data.
Check the answers.
Communicate.
What questions are you asking?
well, numbers aren’t racist?
but how we use them might be
More important than statistical technique:
What questions are being asked?
What data is being collected? (and how)
What assumptions are being made?
What are the important conclusions?
A \(p\)-value is:
the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.
A \(p\)-value is not:
an effect size.
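The definition, computed directly by simulation (the coin-flip example and numbers are made up):

```r
# Simulate data under the null hypothesis and ask how often the result is
# at least as surprising as what was observed.
observed <- 38                                       # made-up: 38 heads in 50 flips
null_sims <- rbinom(10000, size = 50, prob = 0.5)    # fair-coin null
mean(abs(null_sims - 25) >= abs(observed - 25))      # two-sided p-value
```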
Stan: kinda picky
can climb the posterior likelihood surface (optimizing())
or, can skateboard around on it (sampling())
needs checking up on
the basis for brms
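A minimal rstan sketch of those two moves, with a made-up Normal model and fake data (brms writes and runs code like this for you):

```r
library(rstan)

model_code <- "
data {
  int N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"
sm <- stan_model(model_code = model_code)
y <- rnorm(20, mean = 3, sd = 2)   # fake data

# climb the posterior surface: a point estimate (posterior mode)
fit_opt <- optimizing(sm, data = list(N = length(y), y = y))

# skateboard around on it: MCMC samples from the posterior
fit_mcmc <- sampling(sm, data = list(N = length(y), y = y), chains = 4, iter = 2000)

# check up on it: convergence diagnostics
print(fit_mcmc)   # look at Rhat and effective sample sizes
```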
Ingredients:
Examples:
brms versus glm()/glmer():
glm(er): maximum-likelihood fits; fast; standard families only
brms: Bayesian fits via Stan (MCMC); slower; but allows priors, many more families, and richer model structures
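A sketch of the same mixed model fit both ways, on made-up count data:

```r
# Same model, two ways, with made-up count data.
library(lme4)
library(brms)
set.seed(5)
dat <- data.frame(TREAT = rep(c("control", "removal"), each = 50),
                  PATCH = factor(rep(1:20, each = 5)))
dat$ALGAE <- rpois(100, lambda = exp(1 + 0.5 * (dat$TREAT == "removal") +
                                     rnorm(20, sd = 0.3)[as.integer(dat$PATCH)]))

# glmer: maximum likelihood, fast
fit_ml <- glmer(ALGAE ~ TREAT + (1 | PATCH), data = dat, family = poisson)

# brms: writes and runs a Stan model, returns posterior samples
fit_bayes <- brm(ALGAE ~ TREAT + (1 | PATCH), data = dat, family = poisson())

summary(fit_ml)
summary(fit_bayes)
```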
Time series and spatial statistics: both have to deal with autocorrelation.
Methods:
mechanistic models
multivariate Normal / Gaussian process
spline / smoothing / loess
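A sketch with a made-up autocorrelated series: acf() shows the autocorrelation, and loess() smooths out the trend.

```r
# Made-up series: a smooth trend plus AR(1) noise.
set.seed(6)
t <- 1:200
y <- sin(t / 20) + as.numeric(arima.sim(list(ar = 0.8), n = 200, sd = 0.3))

acf(y)                                    # nearby observations are correlated
sm <- loess(y ~ t, span = 0.3)            # loess smooth of the trend
plot(t, y)
lines(t, predict(sm), lwd = 2)
```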