
Looking back

Peter Ralph

11 March 2021 – Advanced Biological Statistics

Looking back

a box of tools

Steps in data analysis

  1. Care, or at least think, about the data.

  2. Look at the data.

  3. Query the data.

  4. Check the answers.

  5. Communicate.

impostor syndrome

Statistics or parameters?

A statistic is

a numerical description of a dataset.

A parameter is

a numerical attribute of a model of reality.

Often, statistics are used to estimate parameters.

Lurking, behind everything:

is uncertainty

thanks to randomness.

How do we understand randomness, concretely and quantitatively?

With models.

Statistics

statistics are numerical summaries of data,

parameters are numerical attributes of a model.

  • confidence intervals

  • \(p\)-values

  • report effect sizes!

  • statistical significance does not imply real-world significance

  • Central Limit Theorems:

    • sums of many independent sources of noise give a Gaussian
    • count of many rare, independent events is Poisson
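
Both limits are easy to check by simulation. The following is a minimal sketch in base R; the particular choices (uniform noise, 50 terms, a rate of 3 events) are made up for illustration and are not from the course materials.

```r
set.seed(1)

# Gaussian limit: each observation is the sum of many small, independent,
# decidedly non-Gaussian (uniform) noise terms.
n_terms <- 50
sums <- replicate(5000, sum(runif(n_terms, min = -1, max = 1)))
# A Uniform(-1, 1) has variance 1/3, so the sum has sd sqrt(n_terms / 3).
expected_sd <- sqrt(n_terms / 3)
c(observed_sd = sd(sums), expected_sd = expected_sd)
# If the sums are close to Gaussian, about 68% fall within one sd of zero.
frac_within_1sd <- mean(abs(sums) < expected_sd)
frac_within_1sd

# Poisson limit: the number of successes among many rare, independent trials.
n_trials <- 10000
p_rare <- 3 / n_trials
counts <- rbinom(5000, size = n_trials, prob = p_rare)
# A Poisson has mean equal to variance; here both should be close to 3.
c(mean = mean(counts), var = var(counts))
```

Plotting `hist(sums)` and `table(counts)` against `dnorm()` and `dpois()` makes the same comparison visually.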

Experimental design

  • experiment versus observational study

  • controls, randomization, replicates

  • samples: from what population?

  • statistical power: depends on the standard error, \(\sigma/\sqrt{n}\)

  • confounding factors

  • correlation versus causation
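
Power is easiest to understand by simulating the planned experiment many times. Here is a rough sketch, assuming a two-group comparison with a \(t\)-test; the function `estimate_power()` and its default effect size, noise level, and significance threshold are hypothetical choices for illustration.

```r
set.seed(2)

# Hypothetical setup: two groups whose true means differ by `effect`,
# with noise standard deviation `sigma`, and `n` samples per group.
estimate_power <- function(n, effect = 0.5, sigma = 1,
                           alpha = 0.05, n_sims = 2000) {
  p_values <- replicate(n_sims, {
    control <- rnorm(n, mean = 0, sd = sigma)
    treated <- rnorm(n, mean = effect, sd = sigma)
    t.test(treated, control)$p.value
  })
  # Power is the fraction of simulated experiments that detect the effect.
  mean(p_values < alpha)
}

sample_sizes <- c(10, 40, 160)
power <- sapply(sample_sizes, estimate_power)
# The standard error of a group mean, sigma / sqrt(n) with sigma = 1,
# shrinks as n grows, and power rises along with it.
data.frame(n = sample_sizes,
           se_of_mean = 1 / sqrt(sample_sizes),
           power = power)
```

Quadrupling the sample size halves the standard error, which is why power climbs so quickly across these three sample sizes.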

Tidy data

  • readable

  • descriptive

  • documented

  • columns are variables, rows are observations

  • semantically coherent

Visualization

  • makes precise visual analogies

  • with real units

  • labeled

  • maximize information per unit ink

Concepts and skills

  • \(z\)-scores and \(t\)-tests

  • ANOVA: ratios of mean-squares

  • Kaplan-Meier survival curves

  • Cox proportional hazard models

  • smoothing: loess

  • multiple comparisons: Bonferroni; FDR

  • the bootstrap - resampling

  • conditional probability

  • simulation

  • simulation

  • oh, and simulation

  • power analysis

  • permutation tests

  • goodness of fit

  • crossvalidation

  • imputation and interpolation

  • nonidentifiability
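
Two of the resampling tools above, the bootstrap and the permutation test, fit in a few lines of base R. This is a minimal sketch on made-up data (two groups of 30 with different means), not an example from the course.

```r
set.seed(3)

# Made-up data: a measurement in two groups of 30.
a <- rnorm(30, mean = 10, sd = 2)
b <- rnorm(30, mean = 12, sd = 2)
observed_diff <- mean(b) - mean(a)

# Bootstrap: resample each group with replacement to see how much
# the difference in means varies from dataset to dataset.
boot_diffs <- replicate(5000, {
  mean(sample(b, replace = TRUE)) - mean(sample(a, replace = TRUE))
})
boot_ci <- quantile(boot_diffs, c(0.025, 0.975))

# Permutation test: shuffle the group labels to see what differences
# arise when group membership does not matter (the null hypothesis).
values <- c(a, b)
is_b <- rep(c(FALSE, TRUE), times = c(length(a), length(b)))
perm_diffs <- replicate(5000, {
  shuffled <- sample(is_b)
  mean(values[shuffled]) - mean(values[!shuffled])
})
# Two-sided p-value: how often is a shuffled difference at least as large?
perm_p <- mean(abs(perm_diffs) >= abs(observed_diff))

list(observed_diff = observed_diff, boot_ci = boot_ci, perm_p = perm_p)
```

The bootstrap answers "how uncertain is the estimate?", while the permutation test answers "could this have happened with no real difference?".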

Randomness: deal with it

Distributions

  • Normal (a.k.a. Gaussian)
  • logNormal
  • scale mixtures of Normals
  • multivariate Normal
  • Student’s \(t\)
  • the \(F\) distribution
  • Beta
  • Binomial
  • Beta-Binomial
  • Exponential
  • Gamma
  • Cauchy
  • Poisson
  • Weibull
  • chi-squared
  • Dirichlet

https://en.wikipedia.org/wiki/Relationships_among_probability_distributions

Linear models and lm()

\[ y_i = \mu + \alpha_{g_i} + \beta x_i + \epsilon_i \]

  • linear: describes the +s

  • R’s formulas are powerful (model.matrix( )!!)

  • least-squares regression: the maximum-likelihood fit under Gaussian noise

  • model comparison: with ANOVA and the \(F\) test
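
As a sketch of how these pieces fit together, the code below simulates data with a group effect and a slope, inspects the design matrix, and compares nested models with an \(F\) test. The parameter values and variable names (`g`, `x`, `dat`) are invented for illustration.

```r
set.seed(4)

# Simulated data: a baseline mu, a group effect alpha, a slope beta on x,
# and Gaussian noise. All parameter values are made up.
n <- 90
g <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
x <- runif(n, 0, 10)
mu <- 2
alpha <- c(a = 0, b = 1, c = -1)
beta <- 0.5
y <- mu + unname(alpha[as.character(g)]) + beta * x + rnorm(n, sd = 1)
dat <- data.frame(y = y, g = g, x = x)

# model.matrix() shows how the formula becomes columns of predictors:
# an intercept, indicators for groups b and c, and x itself.
design <- model.matrix(y ~ g + x, data = dat)
head(design)

# Least-squares fits, then an F test of whether the group terms help.
full <- lm(y ~ g + x, data = dat)
reduced <- lm(y ~ x, data = dat)
coef(full)
comparison <- anova(reduced, full)
comparison
```

Group `a` is absorbed into the intercept, so the `gb` and `gc` coefficients estimate differences from group `a`.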

Random effects / mixed models:

ALGAE ~ TREAT + (1|PATCH)

Stepping back

Steps in data analysis

  1. Care, or at least think, about the data.

  2. Look at the data.

  3. Query the data.

  4. Check the answers.

  5. Communicate.

What questions are you asking?

well, numbers aren’t racist?

but how we use them might be

More important than statistical technique:

  • What questions are being asked?

  • What data is being collected? (and how)

  • What assumptions are being made?

  • What are the important conclusions?

Models

A \(p\)-value is:

the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.

A \(p\)-value is not:

an effect size.
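
To see why a \(p\)-value is not an effect size, compare a tiny effect measured with an enormous sample to a large effect measured with a handful of observations. This is an illustrative sketch with made-up numbers; `summarize()` is a hypothetical helper defined here.

```r
set.seed(5)

# Scenario 1: a tiny true difference (0.02 sd) but an enormous sample.
big_a <- rnorm(200000, mean = 0)
big_b <- rnorm(200000, mean = 0.02)
tiny_effect <- t.test(big_b, big_a)

# Scenario 2: a large true difference (1 sd) but five samples per group.
small_a <- rnorm(5, mean = 0)
small_b <- rnorm(5, mean = 1)
large_effect <- t.test(small_b, small_a)

# Report the estimated difference and its confidence interval
# alongside the p-value, never the p-value alone.
summarize <- function(tt) {
  c(estimate = unname(tt$estimate[1] - tt$estimate[2]),
    ci_low = tt$conf.int[1],
    ci_high = tt$conf.int[2],
    p_value = tt$p.value)
}
results <- rbind(tiny_effect = summarize(tiny_effect),
                 large_effect = summarize(large_effect))
results
```

The first scenario will typically give a very small \(p\)-value for a difference too small to matter, while the second may well fail to reach significance despite a much larger true effect. The confidence intervals tell the more useful story.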

Bayesian what-not

  • bags of biased coins
  • probability, and Bayes’ rule
  • posterior \(\propto\) prior \(\times\) likelihood: \[ p(\theta | D) = \frac{ p(D | \theta) p(\theta) }{ p(D) } \]
  • “updating prior ideas based on (more) data”
  • credible intervals
  • hierarchical models
  • sharing power and shrinkage
  • posterior predictive sampling
  • overdispersion: make it random
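
For a single biased coin, Bayes' rule can be worked out exactly. The sketch below assumes a uniform Beta(1, 1) prior and hypothetical data (7 heads in 10 flips), then checks the closed-form posterior against a brute-force grid calculation.

```r
# Hypothetical data: a coin with unknown probability of heads, theta,
# comes up heads 7 times in 10 flips.
n_flips <- 10
n_heads <- 7

# Prior: theta ~ Beta(1, 1), i.e., uniform on [0, 1].
prior_a <- 1
prior_b <- 1

# With a Beta prior and a Binomial likelihood, the posterior is also Beta.
post_a <- prior_a + n_heads
post_b <- prior_b + (n_flips - n_heads)

posterior_mean <- post_a / (post_a + post_b)
# A 95% credible interval from the posterior quantiles.
credible_interval <- qbeta(c(0.025, 0.975), post_a, post_b)

# The same posterior by brute force: prior times likelihood on a grid,
# divided by its total (which plays the role of p(D)).
theta <- seq(0.0005, 0.9995, by = 0.001)
unnormalized <- dbeta(theta, prior_a, prior_b) *
  dbinom(n_heads, n_flips, theta)
grid_posterior <- unnormalized / sum(unnormalized)
grid_mean <- sum(theta * grid_posterior)

c(posterior_mean = posterior_mean, grid_mean = grid_mean)
credible_interval
```

The posterior mean, 8/12, sits between the prior mean (1/2) and the observed frequency (7/10): a small-scale instance of shrinkage.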

MC Stan

  • kinda picky

  • can climb the posterior likelihood surface (optimizing( ))

  • or, can skateboard around on it (sampling( ))

  • needs checking up on

Stan

Examples:

  • AirBnB
  • pumpkins (ANOVA)
  • limpets (mixed models)
  • biased coins
  • baseball players
  • lung cancer survival
  • diabetes
  • Mauna Loa CO2
  • ocean temperatures
  • hair and eye color
  • beer
  • wine
  • gene expression
  • biketown
  • Austen versus Melville
  • blue tit nestlings
  • hurricane lizards

GLMs

Ingredients:

  • response distribution (“family”)
  • inverse link function
  • linear predictor

Examples:

  • Gaussian + identity
  • Binomial + logistic
  • Poisson + exponential
  • parametric survival analysis
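
The three ingredients map directly onto arguments of `glm()`. Below is a minimal sketch on simulated data, assuming the usual logit link for the Binomial case and log link for the Poisson case; the coefficients (-0.5 and 1.0) are made up.

```r
set.seed(7)

# Simulated predictor and made-up coefficients for the linear predictor.
n <- 2000
x <- rnorm(n)
eta <- -0.5 + 1.0 * x

# Binomial response, logistic inverse link: P(y = 1) = 1 / (1 + exp(-eta)).
y_binom <- rbinom(n, size = 1, prob = plogis(eta))
fit_binom <- glm(y_binom ~ x, family = binomial(link = "logit"))

# Poisson response, exponential inverse link (log link): E[y] = exp(eta).
y_pois <- rpois(n, lambda = exp(eta))
fit_pois <- glm(y_pois ~ x, family = poisson(link = "log"))

# Both fits should roughly recover the intercept -0.5 and slope 1.0.
estimates <- rbind(binomial = coef(fit_binom), poisson = coef(fit_pois))
estimates
```

Note that `glm()` is told the link function (logit, log), whereas a simulation or a Stan model applies its inverse (logistic, exponential) to the linear predictor.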

Stan versus glm()/glmer

glm(er):

  • fast
  • easy
  • quick
  • not so picky about syntax
  • uses formulas

stan:

  • does not sweep convergence issues under the rug
  • doesn’t use formulas
  • more control, for:
    • overdispersion
    • hierarchical modeling

brms:

  • best of both worlds?

My recommendation: get familiar with

https://paul-buerkner.github.io/brms/reference/index.html

brms function reference

Things that aren’t obviously models that we did in Stan anyhow

  • robust regression: response is Cauchy
  • sparse regression: coefficients are Cauchy
  • dimension reduction: PCA, t-SNE
  • deconvolution: NMF
  • spatial smoothing: multivariate Gaussian

Space and time

Time series and spatial statistics:

both have to deal with

autocorrelation.

Methods:

  • mechanistic models

  • multivariate Normal / Gaussian process / Kriging

  • spline / smoothing / loess
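
A small simulation shows what autocorrelation looks like and why it matters. This sketch uses a first-order autoregressive series, a standard toy model that is an assumption here rather than something taken from the course examples; the value `rho = 0.8` is made up.

```r
set.seed(8)

# A simple autocorrelated series: each value is rho times the previous one
# plus fresh Gaussian noise (an AR(1) process).
n <- 2000
rho <- 0.8
noise <- rnorm(n)
x <- numeric(n)
x[1] <- noise[1]
for (t in 2:n) {
  x[t] <- rho * x[t - 1] + noise[t]
}

# Empirical autocorrelation at lags 0 to 5; theory says roughly rho^lag.
empirical <- acf(x, lag.max = 5, plot = FALSE)$acf[, 1, 1]
rbind(lag = 0:5, empirical = empirical, theoretical = rho^(0:5))

# Ignoring autocorrelation overstates certainty: the naive standard error
# of the mean treats all n values as independent.
naive_se <- sd(x) / sqrt(n)
# Approximate effective sample size for an AR(1) process.
n_eff <- n * (1 - rho) / (1 + rho)
adjusted_se <- sd(x) / sqrt(n_eff)
c(naive_se = naive_se, adjusted_se = adjusted_se)
```

With `rho = 0.8`, the 2000 correlated observations carry about as much information about the mean as 220 or so independent ones.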

Conclusion

Be confident!


Thank you!!!
