\[ %% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \DeclareMathOperator{\logit}{logit} \DeclareMathOperator{\logistic}{logistic} \DeclareMathOperator{\SE}{SE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\var}{var} \DeclareMathOperator{\cov}{cov} \DeclareMathOperator{\cor}{cor} \DeclareMathOperator{\Normal}{Normal} \DeclareMathOperator{\LogNormal}{logNormal} \DeclareMathOperator{\Poisson}{Poisson} \DeclareMathOperator{\Beta}{Beta} \DeclareMathOperator{\Binom}{Binomial} \DeclareMathOperator{\Gam}{Gamma} \DeclareMathOperator{\Exp}{Exponential} \DeclareMathOperator{\Cauchy}{Cauchy} \DeclareMathOperator{\Unif}{Unif} \DeclareMathOperator{\Dirichlet}{Dirichlet} \DeclareMathOperator{\Wishart}{Wishart} \DeclareMathOperator{\StudentsT}{StudentsT} \DeclareMathOperator{\Weibull}{Weibull} \newcommand{\given}{\;\vert\;} \]

Looking back

Peter Ralph

12 March 2020 – Advanced Biological Statistics

Looking back

a box of tools
a box of tools

Steps in data analysis

  1. Care, or at least think, about the data.

  2. Look at the data.

  3. Query the data.

  4. Sanity check.

  5. Communicate.

impostor syndrome
impostor syndrome

Statistics

statistics are numerical summaries of data,

parameters are numerical attributes of a model.

  • confidence intervals

  • \(p\)-values

  • report effect sizes!

  • statistical significance does not imply real-world significance
  • Central Limit Theorems:

    • sums of many independent sources of noise gives a Gaussian
    • sums of squared deviations is Chi-squared
    • count of many rare events is Poisson
cor
cor

Experimental design

  • experiment versus observational study

  • controls, randomization, replicates

  • samples: from what population?

  • statistical power : \(\sigma/\sqrt{n}\)

  • confounding factors

  • correlation versus causation

Tidy data

  • readable

  • descriptive

  • documented

  • columns are variables, rows are observations

  • semantically coherent

Visualization

  • makes precise visual analogies

  • with real units

  • labeled

  • maximize information per unit ink

Tests and skills

  • \(t\)-tests

  • ANOVA: ratios of mean-squares

  • Kaplan-Meier survival curves

  • Cox proportional hazard models

  • smoothing: loess

  • multiple comparisons: Bonferroni; FDR

  • simulation

  • simulation

  • oh, and simulation

  • permutation tests

  • goodness of fit

  • crossvalidation

  • interpolation

  • nonidentifiability

Randomness: deal with it

Distributions

  • Normal (a.k.a, Gaussian)
  • logNormal
  • scale mixtures of Normals
  • multivariate Normal
  • Student’s \(t\)
  • the \(F\) distribution
  • Beta
  • Binomial
  • Beta-Binomial
  • Exponential
  • Gamma
  • Cauchy
  • Poisson
  • Weibull
  • chi-squared
  • Dirichlet

In-class: brainstormed examples

Negative-binomial: Differential gene expression

Poisson: non-normalized transcript countsi

Cauchy: # of cases of Coronavirus in cities around the world

Normal (gaussian): height

Beta-binomial: Coin flips with random coin

Exponential: radioactive decay

LogNormal: milk production by cows

Beta: probability that a baseball player gets a hit

Gamma: Amount of snowfall accumulated

Dirichlet: probability distribution of n proportions sizes (which must all sum to 1). Such as if you have three subspecies making up a species population: this would be an appropriate prior for the proportion sizes

Poisson: Total counts of coffee’s sold at starbucks at any particular time 

Weibull: Survival analysis 

Multivariate Normal: probability of something happening in 3D space

Poisson--birth rates

Poisson - Counts of rare plant species (from presence/absence surveys)

Weibull: time until someone falls ill with coronavirus

Cauchy: Distance commuted on a daily basis

Multivariate normal: height, weight, and shoe size

Weibull: hazard rate in survival analysis

F distribution: the F statistic

Multivariate Normal: spatial correlation of bike trip
Poisson distribution: observing transcript counts, looking at the effects of age and exposure

Negative binomial: how many eggs a chicken lays until the 5th egg with two yolks

logNormal: rent prices across Manhattan

Poisson: number of bursts of EEG activity

Binomial: number of people on the bus who should be at home ‘cause they are sick

Poisson: number of cosmic particles going through a detector

Poisson: number of toilet paper sheets with errors in perforation

Beta binomial - number of polls (across different political analysis sources?) that say that Bernie will win 

https://en.wikipedia.org/wiki/Relationships_among_probability_distributions
https://en.wikipedia.org/wiki/Relationships_among_probability_distributions

Linear models and lm()

\[ X_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk} \]

  • linear: describes the +s

  • R’s formulas are powerful (model.matrix( )!!)

  • least-squares regression: implies Gaussian noise

  • model comparison: with ANOVA and the \(F\) test

Random effects:

ALGAE ~ TREAT + (1|PATCH)

Models

Bayesian what-not

  • bags of biased coins
  • probability, and Bayes’ rule
  • posterior = prior \(\times\) likelihood: \[ p(\theta | D) = \frac{ p(D | \theta) p(\theta) }{ p(D) } \]
  • “updating prior ideas based on (more) data”
  • credible intervals
  • hierarchical models
  • sharing power and shrinkage
  • posterior predictive sampling
  • overdispersion: make it random

MC Stan

  • kinda picky

  • can climb the posterior likelihood surface (optimizing( ))

  • or, can skateboard around on it (sampling( ))

Stan
Stan

Examples:

  • pumpkins (ANOVA)
  • limpits (mixed models)
  • biased coins
  • baseball players
  • snail parasites
  • lung cancer survival
  • Mauna Loa C02
  • ocean temperatures
  • hair and eye color
  • gene ontonlogy
  • cream cheese
  • beer
  • wine
  • Austen versus Melville
  • gene expression
cat
cat

GLMs

\[\begin{aligned} &\text{(response distribution)} \\ &\qquad \sim \text{(inverse link function)} + \text{(linear predictor)} \end{aligned}\]

  • Gaussian + identity
  • Binomial + logistic
  • Poisson + gamma
  • parametric survival analysis

Stan versus glm()/glmer

glm(er):

  • fast
  • easy
  • quick
  • not so picky about syntax
  • uses formulas

stan:

  • does not sweep convergence issues under the rug
  • doesn’t use formulas
  • more control, for:
  • overdispersion
  • hierarchical modeling

brms:

  • best of both worlds?

Things that aren’t obviously models that we did in Stan anyhow

  • robust regression: response is Cauchy
  • sparse regression: coefficients are Cauchy
  • dimension reduction: PCA, t-SNE
  • deconvolution: NMF
  • spatial smoothing: multivariate Gaussian

Thank you!!!