Looking back

Peter Ralph

12 March 2020 – Advanced Biological Statistics

Looking back

Steps in data analysis

Care, or at least think, about the data.
Look at the data.
Query the data.
Sanity check.
Communicate.

Statistics

statistics are numerical summaries of data,

parameters are numerical attributes of a model.

confidence intervals
\(p\)-values
report effect sizes!
statistical significance does not imply real-world significance

Central Limit Theorems:
- sums of many independent sources of noise gives a Gaussian
- sums of squared deviations is Chi-squared
- count of many rare events is Poisson

Experimental design

experiment versus observational study
controls, randomization, replicates
samples: from what population?
statistical power : \(\sigma/\sqrt{n}\)
confounding factors
correlation versus causation

Tidy data

readable
descriptive
documented
columns are variables, rows are observations
semantically coherent

Visualization

makes precise visual analogies
with real units
labeled
maximize information per unit ink

Tests and skills

\(t\)-tests
ANOVA: ratios of mean-squares
Kaplan-Meier survival curves
Cox proportional hazard models
smoothing: loess
multiple comparisons: Bonferroni; FDR

simulation
simulation
oh, and simulation
permutation tests
goodness of fit
crossvalidation
interpolation
nonidentifiability

Randomness: deal with it

Distributions

Normal (a.k.a, Gaussian)
logNormal
scale mixtures of Normals
multivariate Normal
Student’s \(t\)
the \(F\) distribution
Beta
Binomial
Beta-Binomial

Exponential
Gamma
Cauchy
Poisson
Weibull
chi-squared
Dirichlet

In-class: brainstormed examples

Negative-binomial: Differential gene expression

Poisson: non-normalized transcript countsi

Cauchy: # of cases of Coronavirus in cities around the world

Normal (gaussian): height

Beta-binomial: Coin flips with random coin

Exponential: radioactive decay

LogNormal: milk production by cows

Beta: probability that a baseball player gets a hit

Gamma: Amount of snowfall accumulated

Dirichlet: probability distribution of n proportions sizes (which must all sum to 1). Such as if you have three subspecies making up a species population: this would be an appropriate prior for the proportion sizes

Poisson: Total counts of coffee’s sold at starbucks at any particular time 

Weibull: Survival analysis 

Multivariate Normal: probability of something happening in 3D space

Poisson--birth rates

Poisson - Counts of rare plant species (from presence/absence surveys)

Weibull: time until someone falls ill with coronavirus

Cauchy: Distance commuted on a daily basis

Multivariate normal: height, weight, and shoe size

Weibull: hazard rate in survival analysis

F distribution: the F statistic

Multivariate Normal: spatial correlation of bike trip
Poisson distribution: observing transcript counts, looking at the effects of age and exposure

Negative binomial: how many eggs a chicken lays until the 5th egg with two yolks

logNormal: rent prices across Manhattan

Poisson: number of bursts of EEG activity

Binomial: number of people on the bus who should be at home ‘cause they are sick

Poisson: number of cosmic particles going through a detector

Poisson: number of toilet paper sheets with errors in perforation

Beta binomial - number of polls (across different political analysis sources?) that say that Bernie will win

https://en.wikipedia.org/wiki/Relationships_among_probability_distributions

Linear models and `lm()`

\[ X_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk} \]

linear: describes the +s
R’s formulas are powerful (model.matrix( )!!)

least-squares regression: implies Gaussian noise
model comparison: with ANOVA and the \(F\) test

Random effects:

ALGAE ~ TREAT + (1|PATCH)

Models

Bayesian what-not

bags of biased coins
probability, and Bayes’ rule
posterior = prior \(\times\) likelihood: \[ p(\theta | D) = \frac{ p(D | \theta) p(\theta) }{ p(D) } \]
“updating prior ideas based on (more) data”
credible intervals
hierarchical models
sharing power and shrinkage
posterior predictive sampling
overdispersion: make it random

MC Stan

kinda picky
can climb the posterior likelihood surface (optimizing( ))
or, can skateboard around on it (sampling( ))

Examples:

pumpkins (ANOVA)
limpits (mixed models)
biased coins
baseball players
snail parasites
lung cancer survival
Mauna Loa C02
ocean temperatures
hair and eye color
gene ontonlogy
cream cheese
beer
wine
Austen versus Melville
gene expression

GLMs

\[\begin{aligned} &\text{(response distribution)} \\ &\qquad \sim \text{(inverse link function)} + \text{(linear predictor)} \end{aligned}\]

Gaussian + identity
Binomial + logistic
Poisson + gamma

parametric survival analysis

Stan versus `glm()`/`glmer`

glm(er):

fast
easy
quick
not so picky about syntax
uses formulas

stan:

does not sweep convergence issues under the rug
doesn’t use formulas
more control, for:
overdispersion
hierarchical modeling

brms:

best of both worlds?

Things that aren’t obviously models that we did in Stan anyhow

robust regression: response is Cauchy
sparse regression: coefficients are Cauchy
dimension reduction: PCA, t-SNE
deconvolution: NMF
spatial smoothing: multivariate Gaussian

Looking back

Looking back

Steps in data analysis

Statistics

Experimental design

Tidy data

Visualization

Tests and skills

Randomness: deal with it

Distributions

In-class: brainstormed examples

Linear models and lm()

Models

Bayesian what-not

MC Stan

Examples:

GLMs

Stan versus glm()/glmer

Things that aren’t obviously models that we did in Stan anyhow

Thank you!!!

Linear models and `lm()`

Stan versus `glm()`/`glmer`