
On ordination and dimension reduction methods

Peter Ralph

Advanced Biological Statistics

An ordination of dimension reduction techniques?

The menagerie

There are many dimension reduction methods, e.g.:

principal components analysis (PCA)
non-negative matrix factorization (NMF)
independent components analysis (ICA)
canonical correspondence analysis (CCA)
principal coordinates analysis (PCoA)
multidimensional scaling (MDS)
redundancy analysis (RDA)
Sammon mapping
kernel PCA
t-SNE
UMAP
locally linear embedding (LLE)
Laplacian eigenmaps
autoencoders

Using distances or similarities?

PCA uses the covariance matrix, which measures similarity.

t-SNE begins with the matrix of distances, measuring dissimilarity.
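The two starting points can carry the same information. As a minimal sketch on simulated data (the array `X` and all variable names here are made up for illustration), the code below computes PCA from the covariance matrix and classical MDS (PCoA) from the matrix of squared Euclidean distances, and checks that the two give the same coordinates up to the sign of each axis. It uses classical MDS rather than t-SNE, since t-SNE transforms the distances nonlinearly and would not match PCA.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))            # simulated: 20 samples, 5 variables
Xc = X - X.mean(axis=0)                 # center each variable

# Similarity route: PCA from the covariance matrix of the variables.
cov = np.cov(Xc, rowvar=False)          # 5 x 5 covariance matrix
evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
top = np.argsort(evals)[::-1][:2]       # indices of the two largest
pcs = Xc @ evecs[:, top]                # 20 x 2 PC scores

# Dissimilarity route: classical MDS (PCoA) from squared distances.
sq_dists = ((Xc[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
n = X.shape[0]
J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
B = -0.5 * J @ sq_dists @ J             # double-centered Gram matrix
b_evals, b_evecs = np.linalg.eigh(B)
b_top = np.argsort(b_evals)[::-1][:2]
coords = b_evecs[:, b_top] * np.sqrt(b_evals[b_top])  # 20 x 2 MDS coordinates

# The two sets of coordinates agree up to the sign of each axis.
same = np.allclose(np.abs(pcs), np.abs(coords), atol=1e-6)
print("PCA and classical MDS agree up to sign:", same)
```

With Euclidean distances, then, "similarity" and "distance" lead to the same picture; methods differ more in what they do with that information afterwards.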

Metric or non-metric?

Are distances interpretable?

metric: In PCA, each axis is a fixed linear combination of variables. So, distances always mean the same thing no matter where you are on the plot.

non-metric: In t-SNE, distances within different clusters are not comparable.
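The "metric" property of PCA can be checked directly. In this sketch (simulated data; `project`, `a`, `b`, and `shift` are hypothetical names chosen for illustration), the same pair of points is moved to a different part of the space, and the distance between their projections onto the first two PCs stays the same, because the projection is a fixed linear map.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))            # simulated: 50 samples, 4 variables
mean = X.mean(axis=0)

# Rows of Vt are the principal axes; keep the first two as loadings.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
W = Vt[:2].T                            # 4 x 2 loading matrix

def project(points):
    """Project points onto the first two principal components."""
    return (points - mean) @ W

a = np.array([0.0, 0.0, 0.0, 0.0])
b = np.array([1.0, 0.5, -0.5, 2.0])
shift = np.array([3.0, -1.0, 2.0, 0.5])  # move both points elsewhere

d_here = np.linalg.norm(project(a) - project(b))
d_there = np.linalg.norm(project(a + shift) - project(b + shift))
print("distance is unchanged by the shift:", np.isclose(d_here, d_there))
```

No such guarantee holds for t-SNE, whose embedding is not a fixed function of the variables.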

Why ordination?

From ordination.okstate.edu, about ordination in ecology:

Graphical results often lead to intuitive interpretations of species-environment relationships.
A single multivariate analysis saves time, in contrast to a separate univariate analysis for each species.
Ideally and typically, dimensions of this ‘low dimensional space’ will represent important and interpretable environmental gradients.
If statistical tests are desired, problems of multiple comparisons are diminished when species composition is studied in its entirety.
Statistical power is enhanced when species are considered in aggregate, because of redundancy.
By focusing on ‘important dimensions’, we avoid interpreting (and misinterpreting) noise.

Beware overinterpretation

Ordination methods are strongly influenced by sampling: ordination may ignore large-scale patterns in favor of describing variation within a highly oversampled area.
Ordination methods also describe patterns common to many variables: measuring the same thing many times may drive results.
Many methods are designed to find clusters, because our brains like to categorize things. This doesn’t mean those clusters are well-separated in reality.
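The second caution above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: `u` and `v` are two independent quantities with equal variance, and `u` is then "measured" ten times with a little independent noise while `v` is measured once. The helper `pc1_loadings` is hypothetical, not from any library.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
u = rng.normal(size=n)                  # one underlying quantity
v = rng.normal(size=n)                  # another, independent of u

def pc1_loadings(X):
    """Return the (unit-length) loadings of the first principal component."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# u measured ten times (with small independent noise), v measured once.
copies = [u + 0.1 * rng.normal(size=n) for _ in range(10)]
repeated = np.column_stack(copies + [v])

load = pc1_loadings(repeated)
share_on_v = load[-1] ** 2              # squared loadings sum to 1
print("share of PC1 loading on v:", share_on_v)
```

Although `u` and `v` vary equally, PC1 is almost entirely made up of the ten copies of `u`: repeating a measurement is enough to make it look like the dominant pattern.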

Some questions to ask

The goal is usually to produce a picture in which similar things are near each other, while also capturing global structure.

How is similarity measured in the original data?
How does the algorithm use that information?