\[
%%
% Add your macros here; they'll be included in pdf and html output.
%%
\newcommand{\R}{\mathbb{R}} % reals
\newcommand{\E}{\mathbb{E}} % expectation
\renewcommand{\P}{\mathbb{P}} % probability
\DeclareMathOperator{\logit}{logit}
\DeclareMathOperator{\logistic}{logistic}
\DeclareMathOperator{\SE}{SE}
\DeclareMathOperator{\sd}{sd}
\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}
\DeclareMathOperator{\cor}{cor}
\DeclareMathOperator{\Normal}{Normal}
\DeclareMathOperator{\MVN}{MVN}
\DeclareMathOperator{\LogNormal}{logNormal}
\DeclareMathOperator{\Poisson}{Poisson}
\DeclareMathOperator{\Beta}{Beta}
\DeclareMathOperator{\Binom}{Binomial}
\DeclareMathOperator{\Gam}{Gamma}
\DeclareMathOperator{\Exp}{Exponential}
\DeclareMathOperator{\Cauchy}{Cauchy}
\DeclareMathOperator{\Unif}{Unif}
\DeclareMathOperator{\Dirichlet}{Dirichlet}
\DeclareMathOperator{\Wishart}{Wishart}
\DeclareMathOperator{\StudentsT}{StudentsT}
\DeclareMathOperator{\Weibull}{Weibull}
\newcommand{\given}{\;\vert\;}
\]

Looking back

Peter Ralph

Advanced Biological Statistics

Statistical modeling

Breiman (2001), “Statistical Modeling: The Two Cultures”

Abstract: There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.

What’s the difference between a “data model” and an “algorithmic model”?

Efron’s summary:

At first glance Leo Breiman’s stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance it still looks that way, but the paper is stimulating, and Leo has some important points to hammer home.

What is Efron’s criticism here?

Some history

“In the mid-1980s two powerful new algorithms for fitting data became available: neural nets and decision trees. A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”


“a few aging statisticians” (Grace Wahba in 1986)

Questions to consider:

  1. Are we trying to learn about the underlying mechanism, or do we only care about good prediction?

  2. Do we want to squeeze every possible drop of information out of the data, or do we have plenty of data and just need something that works?

  3. Is our data well-described by a GLM, or not?

Exercise: make up some situations that have a wide variety of answers to these questions.

Where does visualization fit into all this?

Prediction

The most obvious way to see how well the model box emulates nature’s box is this: put a case \(x\) down nature’s box getting an output \(y\). Similarly, put the same case \(x\) down the model box getting an output \(y'\). The closeness of \(y\) and \(y'\) is a measure of how good the emulation is.

Breiman contrasts crossvalidation with goodness-of-fit (e.g., residual analysis, or a posterior predictive check). What’s the difference?
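Breiman’s prediction criterion is easy to sketch: hold out part of the data, fit on the rest, and compare predictions \(y'\) to the held-out \(y\). A minimal \(k\)-fold crossvalidation sketch in R, on made-up data (all numbers here are illustrative):

```r
# 5-fold crossvalidation for lm(), on made-up data
set.seed(1)
n <- 100
df <- data.frame(x = rnorm(n))
df$y <- 2 + 3 * df$x + rnorm(n)
k <- 5
fold <- sample(rep(1:k, length.out = n))  # random fold assignment
cv_err <- sapply(1:k, function(i) {
  fit <- lm(y ~ x, data = df[fold != i, ])         # fit on k-1 folds
  pred <- predict(fit, newdata = df[fold == i, ])  # predict the held-out fold
  mean((df$y[fold == i] - pred)^2)                 # held-out mean squared error
})
mean(cv_err)  # out-of-sample estimate of prediction error
```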

The conversation continues

interpretability in machine learning

See also: Algorithmic bias

fairness in machine learning

Looking back

a box of tools

Steps in data analysis

  1. Care, or at least think, about the data.

  2. Look at the data.

  3. Query the data.

  4. Check the answers.

  5. Communicate.

impostor syndrome

Statistics or parameters?

A statistic is

a numerical description of a dataset.

A parameter is

a numerical attribute of a model of reality.

Often, statistics are used to estimate parameters.

Lurking, behind everything:

is uncertainty

thanks to randomness.

How do we understand randomness, concretely and quantitatively?

With models.

Statistics

statistics are numerical summaries of data,

parameters are numerical attributes of a model.

  • confidence intervals

  • \(p\)-values

  • report effect sizes!

  • statistical significance does not imply real-world significance

  • Central Limit Theorems:

    • sums of many independent sources of noise give a Gaussian
    • count of many rare, independent events is Poisson
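Both limit theorems are easy to check by simulation (made-up noise terms; the numbers are purely illustrative):

```r
# Sums of many independent noise terms look Gaussian;
# counts of many rare independent events look Poisson.
set.seed(2)
sums <- replicate(1e4, sum(runif(50, -1, 1)))       # sum of 50 uniform noise terms
c(mean = mean(sums), sd = sd(sums))                 # near 0 and sqrt(50/3) ~ 4.08
counts <- rbinom(1e4, size = 1000, prob = 0.002)    # 1000 chances at a rare event
c(mean = mean(counts), var = var(counts))           # mean ~ variance ~ 2, like Poisson(2)
```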

Experimental design

  • experiment versus observational study

  • controls, randomization, replicates

  • samples: from what population?

  • statistical power: precision improves like \(\sigma/\sqrt{n}\)

  • confounding factors

  • correlation versus causation
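Power can be estimated by simulation or, for simple designs, analytically; a sketch with R’s built-in power.t.test(), using an invented effect size:

```r
# Power of a two-sample t-test: by simulation, and analytically
set.seed(3)
power_sim <- mean(replicate(1000, {
  a <- rnorm(20, mean = 0, sd = 1)
  b <- rnorm(20, mean = 0.8, sd = 1)  # true effect: 0.8 standard deviations
  t.test(a, b)$p.value < 0.05         # did we detect it?
}))
power_sim                                       # simulated power
power.t.test(n = 20, delta = 0.8, sd = 1)$power # analytic answer, about 0.7
```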

Tidy data

  • readable

  • descriptive

  • documented

  • columns are variables, rows are observations

  • semantically coherent

Visualization

  • makes precise visual analogies

  • with real units

  • labeled

  • maximize information per unit ink

Concepts and skills

  • \(z\)-scores and \(t\)-tests

  • ANOVA: ratios of mean-squares

  • Kaplan-Meier survival curves

  • Cox proportional hazard models

  • smoothing: loess

  • multiple comparisons: Bonferroni; FDR

  • the bootstrap: resampling

  • conditional probability

  • simulation

  • simulation

  • oh, and simulation

  • power analysis

  • permutation tests

  • goodness of fit

  • crossvalidation

  • imputation and interpolation

  • nonidentifiability
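As a reminder of how several of these fit together (simulation, permutation tests, \(p\)-values), a minimal permutation test for a difference in group means, on made-up data:

```r
# Permutation test for a difference in group means, on made-up data
set.seed(4)
x <- c(rnorm(15, mean = 0), rnorm(15, mean = 1))
g <- rep(c("A", "B"), each = 15)
obs <- diff(tapply(x, g, mean))     # observed difference in means
perm <- replicate(2000, {
  gs <- sample(g)                   # shuffle labels: breaks any real association
  diff(tapply(x, gs, mean))
})
p_value <- mean(abs(perm) >= abs(obs))  # two-sided permutation p-value
```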

Randomness: deal with it

Distributions

  • Normal (a.k.a., Gaussian)
  • logNormal
  • scale mixtures of Normals
  • multivariate Normal
  • Student’s \(t\)
  • the \(F\) distribution
  • Beta
  • Binomial
  • Beta-Binomial
  • Exponential
  • Gamma
  • Cauchy
  • Poisson
  • Weibull
  • chi-squared
  • Dirichlet

https://en.wikipedia.org/wiki/Relationships_among_probability_distributions

Linear models and lm()

\[ y_i = \mu + \alpha_{g_i} + \beta x_i + \epsilon_i \]

  • linear: describes the +s

  • R’s formulas are powerful (model.matrix()!!)

  • least-squares: implies Gaussian noise

  • model comparison: with ANOVA and the \(F\) test

Random effects / mixed models:

ALGAE ~ TREAT + (1|PATCH)
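A sketch of what the formula machinery builds, on a tiny invented dataset (the commented lmer() line echoes the mixed-model formula above):

```r
# What R's formula machinery builds, on a tiny invented dataset
d <- data.frame(g = factor(c("a", "a", "b", "b")), x = c(1, 2, 3, 4))
model.matrix(~ g + x, data = d)  # the design matrix lm() uses: intercept, gb, x
d$y <- c(2.1, 3.9, 7.2, 8.8)
fit <- lm(y ~ g + x, data = d)
coef(fit)  # (Intercept) = mu, gb = alpha, x = beta, in the notation above
# the mixed-model formula instead fits random patch intercepts, e.g. with lme4:
#   lmer(ALGAE ~ TREAT + (1 | PATCH), data = ...)
```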

Stepping back

Steps in data analysis

  1. Care, or at least think, about the data.

  2. Look at the data.

  3. Query the data.

  4. Check the answers.

  5. Communicate.

What questions are you asking?

well, numbers aren’t racist?

but how we use them might be

More important than statistical technique:

  • What questions are being asked?

  • What data is being collected? (and how)

  • What assumptions are being made?

  • What are the important conclusions?

Models

A \(p\)-value is:

the probability of seeing a result at least as surprising as what was observed in the data, if the null hypothesis is true.

A \(p\)-value is not:

an effect size.
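A quick simulated illustration of the distinction (invented numbers: a tiny true effect with a huge sample):

```r
# A tiny true effect with a huge sample: small p-value, small effect
set.seed(9)
x <- rnorm(1e5, mean = 0.01)  # true effect is only 0.01 standard deviations
t.test(x)$p.value             # often "statistically significant" anyway...
mean(x)                       # ...but the effect size is still tiny
```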

Bayesian what-not

  • bags of biased coins
  • probability, and Bayes’ rule
  • posterior = prior \(\times\) likelihood: \[ p(\theta | D) = \frac{ p(D | \theta) p(\theta) }{ p(D) } \]
  • “updating prior ideas based on (more) data”
  • credible intervals
  • hierarchical models
  • sharing power and shrinkage
  • posterior predictive sampling
  • overdispersion: make it random
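For the biased-coins example, the posterior is available in closed form, since the Beta prior is conjugate to the Binomial likelihood; a sketch with invented counts:

```r
# Conjugate Beta-Binomial update: posterior = prior x likelihood, in closed form
a <- 1; b <- 1            # Beta(1, 1) = uniform prior on theta, P(heads)
heads <- 7; flips <- 10   # invented coin-flip data
post_a <- a + heads       # posterior is Beta(a + heads, b + tails)
post_b <- b + flips - heads
post_a / (post_a + post_b)              # posterior mean: 2/3
qbeta(c(0.025, 0.975), post_a, post_b)  # a 95% credible interval
```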

MC Stan

  • kinda picky

  • can climb the posterior likelihood surface (optimizing())

  • or, can skateboard around on it (sampling())

  • needs checking up on

  • the basis for brms

Stan

Examples:

  • AirBnB
  • pumpkins (ANOVA)
  • limpets (mixed models)
  • biased coins
  • baseball players
  • lung cancer survival
  • diabetes
  • hair and eye color
  • beer
  • wine
  • gene expression
  • biketown
  • Austen versus Melville
  • blue tit nestlings
  • hurricane lizards
  • bikeshare

GLMs

Ingredients:

  • response distribution (“family”)
  • inverse link function
  • linear predictor

Examples:

  • Gaussian + identity
  • Binomial + logistic
  • Poisson + exponential (i.e., log link)
  • parametric survival analysis
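All three ingredients appear in one glm() call; a sketch with simulated Poisson counts (the true coefficients are invented):

```r
# family + inverse link + linear predictor, all at once in glm()
set.seed(7)
d <- data.frame(x = runif(50, 0, 2))
d$y <- rpois(50, lambda = exp(0.5 + 1.0 * d$x))  # Poisson response, log link
fit <- glm(y ~ x, data = d, family = poisson)    # the family implies the link
coef(fit)  # roughly the true values (0.5, 1.0)
```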

brms versus glm()/glmer()

glm(er):

  • fast
  • easy
  • quick

brms:

  • easy
  • does not sweep convergence issues under the rug
  • more control, for:
    • overdispersion
    • hierarchical modeling

Space and time

Time series and spatial statistics:

both have to deal with

autocorrelation.

Methods:

  • mechanistic models

  • multivariate Normal / Gaussian process

  • spline / smoothing / loess
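A sketch of why autocorrelation matters, with a simulated AR(1) series (the parameters are invented):

```r
# An AR(1) series: each value depends on the previous one
set.seed(8)
n <- 200
x <- numeric(n)
for (t in 2:n) x[t] <- 0.8 * x[t - 1] + rnorm(1)
acf(x, plot = FALSE)$acf[2]  # lag-1 autocorrelation, roughly 0.8
# treating these n points as independent would overstate the effective sample size
```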

Conclusion

Be confident!


Thank you!!!
