On ordination and dimension reduction methods

Peter Ralph

23 February 2020 – Advanced Biological Statistics

An ordination of dimension reduction techniques?

The menagerie

There are many dimension reduction methods, e.g.:

principal components analysis (PCA)
non-negative matrix factorization (NMF)
independent components analysis (ICA)
canonical correspondence analysis (CCA)
principal coordinates analysis (PCoA)
multidimensional scaling (MDS)
redundancy analysis (RDA)
Sammon mapping
kernel PCA
t-SNE
UMAP
locally linear embedding (LLE)
Laplacian eigenmaps
autoencoders

Using distances or similarities?

PCA uses the covariance matrix, which measures similarity.

t-SNE begins with the matrix of distances, measuring dissimilarity.
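
As a minimal sketch of how these two starting points relate, using made-up data (the matrix `x` and its size are illustrative assumptions, not from the lecture): covariances and inner products measure similarity, Euclidean distances measure dissimilarity, and for centered data the squared distances can be recovered from the inner products.

```r
set.seed(1)
# Made-up data: 6 observations (rows) of 3 variables (columns).
x <- matrix(rnorm(18), nrow=6)
xc <- scale(x, center=TRUE, scale=FALSE)

# Similarity between variables: the covariance matrix that PCA uses.
S <- cov(x)
# Similarity between observations: inner products of the centered rows.
G <- xc %*% t(xc)
# Dissimilarity between observations: Euclidean distances.
D <- as.matrix(dist(x))

# Squared distance between i and j equals G[i,i] + G[j,j] - 2 * G[i,j].
D2 <- outer(diag(G), diag(G), "+") - 2 * G
all.equal(D^2, D2, check.attributes=FALSE)
```

So for Euclidean distance the choice between "similarity" and "distance" is partly one of bookkeeping; the methods differ more in what they do with the matrix afterwards.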

Metric or non-metric?

Are distances interpretable?

metric: In PCA, each axis is a fixed linear combination of variables. So, distances always mean the same thing no matter where you are on the plot.

non-metric: In t-SNE, distances within different clusters are not comparable.
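
The "metric" property of PCA can be checked directly. The following is a small sketch on made-up data (the matrix `x` is an illustrative assumption): the PC scores are the centered data multiplied by one fixed rotation matrix, so distances computed from all of the PCs equal the original distances, and distances in a two-dimensional PC plot can only shrink them.

```r
set.seed(2)
# Made-up data: 10 observations of 4 variables.
x <- matrix(rnorm(40), nrow=10)
pcs <- prcomp(x)

# Every observation's scores use the same linear combination of variables:
# scores = (centered data) %*% rotation.
xc <- scale(x, center=TRUE, scale=FALSE)
manual_scores <- xc %*% pcs$rotation
all.equal(manual_scores, pcs$x, check.attributes=FALSE)

# Keeping all PCs is a rotation, so distances are unchanged;
# keeping only the first two can only make distances smaller.
d_orig <- dist(x)
d_all <- dist(pcs$x)
d_two <- dist(pcs$x[, 1:2])
all.equal(as.vector(d_orig), as.vector(d_all))
all(d_two <= d_orig + 1e-12)
```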

Why ordination?

From ordination.okstate.edu, about ordination in ecology:

Graphical results often lead to intuitive interpretations of species-environment relationships.
A single multivariate analysis saves time, in contrast to a separate univariate analysis for each species.
Ideally and typically, dimensions of this ‘low dimensional space’ will represent important and interpretable environmental gradients.
If statistical tests are desired, problems of multiple comparisons are diminished when species composition is studied in its entirety.
Statistical power is enhanced when species are considered in aggregate, because of redundancy.
By focusing on ‘important dimensions’, we avoid interpreting (and misinterpreting) noise.

Beware overinterpretation

Ordination methods are strongly influenced by sampling: ordination may ignore large-scale patterns in favor of describing variation within a highly oversampled area.
Ordination methods also describe patterns common to many variables: measuring the same thing many times may drive results.
Many methods are designed to find clusters, because our brains like to categorize things. This doesn’t mean those clusters are well-separated in reality.
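
The second point, that measuring the same thing many times may drive results, can be illustrated with a small simulation. This is a sketch under made-up assumptions (two independent quantities `a` and `b`, with `a` measured ten times with a little noise), not an analysis from the lecture.

```r
set.seed(3)
n <- 200
a <- rnorm(n)   # one underlying quantity
b <- rnorm(n)   # a second, independent quantity

# Measure `a` ten times with a little noise, but `b` only once.
a_copies <- sapply(1:10, function (k) a + rnorm(n, sd=0.1))
x <- cbind(a_copies, b)

pcs <- prcomp(x, scale.=TRUE)
# Proportion of total variance on PC1.
pc1_var <- pcs$sdev[1]^2 / sum(pcs$sdev^2)
# How closely the PC1 scores track each underlying quantity.
c(pc1_var=pc1_var,
  cor_with_a=abs(cor(pcs$x[, 1], a)),
  cor_with_b=abs(cor(pcs$x[, 1], b)))
```

Here PC1 should take up most of the variance and follow `a` almost exactly, even though `a` and `b` are equally variable: the repetition, not the biology, decides what comes first.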

Some questions to ask

The goal is usually to produce a picture in which similar things are near each other, while also capturing global structure.

How is similarity measured in the original data?
How does the algorithm use that information?

Text analysis

Identifying authors

In data/passages.txt we have a number of short passages from a few different books.

Can we identify the authors of each passage?

The true sources of the passages are in data/passage_sources.tsv.

Turn the data into a matrix

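# Read the passages (one per line) and the table of their true sources.
passages <- readLines("data/passages.txt")
sources <- read.table("data/passage_sources.tsv", header=TRUE, stringsAsFactors=TRUE)
# The vocabulary: every distinct word in any passage, sorted.
words <- sort(unique(strsplit(paste(passages, collapse=" "), " +")[[1]]))
# Count how many times each word in w appears in the passage x.
tabwords <- function (x, w) { tabulate(match(strsplit(x, " ")[[1]], w), nbins=length(w)) }
# One row per word, one column per passage.
wordmat <- sapply(passages, tabwords, words)
dimnames(wordmat) <- list(words, NULL)
# Every word should appear in at least one passage.
stopifnot( min(rowSums(wordmat)) > 0 )
wordmat[1:20, 1:20]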

##              [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## a              15    9   12    8    7    7    5   12    8     3    18    12    13    11     2     2    12    14    11    14
## aback           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abaft           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abandon         0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abandoned       0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abandonment     0    0    0    0    0    1    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abased          0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abasement       0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abashed         0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abate           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abated          0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abatement       0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abating         0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abbeyland       0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abbreviate      0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abbreviation    0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abeam           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abednego        0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abhor           0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
## abhorred        0    0    0    0    0    0    0    0    0     0     0     0     0     0     0     0     0     0     0     0
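
To see what this word-count matrix looks like at a readable size, here is the same construction applied to two made-up passages (the strings in `toy` are invented stand-ins for lines of data/passages.txt, and the `toy_` names are hypothetical):

```r
# Two made-up passages, standing in for lines of data/passages.txt.
toy <- c("the whale and the sea", "she and her sister")
# The vocabulary, built the same way as `words`.
toy_words <- sort(unique(strsplit(paste(toy, collapse=" "), " +")[[1]]))
# Same counting function as in the lecture code.
tabwords <- function (x, w) { tabulate(match(strsplit(x, " ")[[1]], w), nbins=length(w)) }
# One row per word, one column per passage.
toy_mat <- sapply(toy, tabwords, toy_words)
dimnames(toy_mat) <- list(toy_words, NULL)
toy_mat
```

Each column sums to the number of words in that passage, which is why `colSums(wordmat)` can serve as a measure of passage length.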

PCA?

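# Rows of wordmat are words, so the columns (passages) are the variables:
# wordpcs$rotation has one row per passage, and wordpcs$x has one row per word.
wordpcs <- prcomp(wordmat, scale.=TRUE)
layout(t(1:2))
plot(wordpcs$rotation[,1:2], col=sources$source, pch=20, xlab="PC1", ylab="PC2")
plot(wordpcs$rotation[,2:3], col=sources$source, pch=20, xlab="PC2", ylab="PC3")
# Colors 1:3 match the factor levels, assuming three sources.
legend("topright", pch=20, col=1:3, legend=levels(sources$source))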

PC1 is shortness

plot(colSums(wordmat), wordpcs$rotation[,1], col=sources$source, xlab='length', ylab='PC1')


PC2 is book

##                   PC1        PC2         PC3
## her        -594.34798 -369.50681 -320.553975
## to        -1240.72704 -323.87211  -13.293113
## i          -491.93174 -267.06979  369.332262
## she        -386.87463 -258.52708 -165.767955
## you        -282.97126 -156.95806  229.581460
## not        -363.11421 -136.89716   43.290928
## be         -347.78507 -136.58763   52.664086
## was        -522.83075 -136.22638 -141.902635
## had        -304.53160  -98.53497  -76.705108
## my         -177.13942  -93.01227  134.849801
## have       -225.44005  -81.33521   81.851304
## it         -484.51769  -74.21097   93.288019
## could      -142.25006  -73.11607  -40.469689
## for        -371.58339  -59.91131   29.838338
## your        -93.51307  -57.03041   80.831901
## very       -123.01954  -54.21827   13.100543
## elinor      -70.48250  -52.55216  -16.329420
## he         -375.61713  -49.66022   15.594216
## would      -140.85426  -49.03480    9.479050
## me         -114.77231  -48.79689   95.229970
## as         -403.15746  -48.75619   18.801570
## been       -136.39247  -48.21771    6.551887
## said        -92.80372  -46.61745   33.125661
## am          -55.87480  -43.56158   42.113707
## elizabeth   -61.38129  -40.80064  -10.895442
## him        -229.87966  -40.67213   34.093291
## no         -139.07409  -39.68028   15.575598
## herself     -59.37697  -38.31004  -38.679588
## do          -77.68537  -36.83336   41.386279
## know        -53.51507  -36.41102   27.944743
## marianne    -50.96373  -36.20272  -23.321135
## what       -109.76315  -35.84450   29.308141
## sister      -44.21808  -34.84423   -5.767636
## such       -106.38272  -33.68966   11.373512
## miss        -56.38328  -33.60602   -6.279509
## think       -51.63823  -32.33123   25.786284
## will        -94.31849  -32.09140   52.125262
## much        -84.29715  -31.23711   -3.004868
## which      -182.14128  -30.76793  -23.845607
## any        -100.25109  -30.32870    8.360409
## must        -76.92666  -30.27011   15.851418
## mrs         -45.17642  -29.56358  -21.998778
## than        -99.42283  -29.37513   -4.707476
## should      -65.32619  -29.16489   17.106205
## them       -139.34498  -27.98789  -15.384618
## on         -231.34602  -27.40464  -10.929738
## and       -1226.79489  -27.01957  -63.646246
## mother      -38.18706  -26.96834  -15.208029
## mr          -47.16713  -25.67153  -12.360595
## did         -69.05182  -25.26154   10.888321

##                  PC1       PC2         PC3
## the     -1938.848597 679.25472 -15.8599522
## whale     -52.409687  72.64643  18.4030059
## in       -729.432446  55.44280   3.8135028
## this     -177.745979  44.18572  36.1075787
## a        -742.159320  39.07534  38.5873415
## like      -54.401485  36.88865  12.0032804
## upon      -59.279446  36.62920  19.9052683
## ahab      -24.108156  34.27340   5.4928174
## sea       -22.390040  32.28631   4.6829825
## ship      -21.772015  29.76893   4.7971335
## his      -446.007445  29.52948  -6.7113140
## old       -31.478667  28.92232  12.1107257
## whales    -19.959899  28.44594   7.0712904
## one      -124.792228  28.22723  25.7945336
## into      -67.132210  26.19590   4.9371075
## now       -91.983077  22.97597  11.0948656
## these     -37.944027  21.97420   5.9644468
## of      -1287.261454  21.73291 -61.9511010
## boat      -12.757167  20.89801   0.7750276
## white     -14.249543  20.34029   3.8079378
## all      -236.207991  20.29192   8.1020127
## ye        -16.517589  20.13232  10.4005112
## then      -71.850275  19.83070  18.8825398
## out       -60.387437  19.14103   5.6808385
## up        -52.052288  17.78528   3.8663572
## sperm     -10.859947  16.81080   4.2079551
## its       -55.040431  16.69443  -0.4461618
## down      -37.213502  16.20793   6.2333193
## through   -21.993116  16.20333   0.4409778
## round     -17.918369  16.16908   1.0900027
## is       -244.347214  16.12722 117.9122463
## captain   -12.446603  16.05703   3.8943171
## from     -190.342540  15.91746 -14.5418029
## yet       -37.055639  15.68260   6.3217292
## those     -30.706626  15.62040   4.7922181
## still     -43.674287  15.41750  -5.2932666
## some      -95.410309  15.23838  -1.0026757
## men       -18.224598  15.10218   3.7872895
## deck       -9.081950  14.37372   1.9517430
## boats      -8.020552  14.26438   0.7798884
## head      -20.834646  14.04222   3.2905213
## crew       -7.655209  13.45631   0.5844938
## air       -15.710056  13.31316   0.2249265
## thou      -10.013584  13.15565   7.4352051
## water      -8.452857  13.15480   3.2918904
## ships      -8.079648  12.97220   2.1174358
## seemed    -37.338906  12.90973  -4.8322828
## feet       -8.144120  12.56765   3.2873123
## stubb      -8.909317  12.50658   3.4643008
## over      -48.103826  11.69170   1.8487713
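
The code that produced these two tables is not shown. One way to make tables of this kind, assuming they list the per-word scores (`wordpcs$x` in the lecture code) sorted by PC2, is sketched below on made-up counts; `toy_mat`, `toy_pcs`, and the choice of five rows are illustrative assumptions.

```r
set.seed(4)
# Made-up counts: 30 hypothetical words (rows) in 6 passages (columns).
toy_mat <- matrix(rpois(30 * 6, lambda=5), nrow=30,
                  dimnames=list(paste0("word", 1:30), NULL))
toy_pcs <- prcomp(toy_mat, scale.=TRUE)

# Order words by their PC2 score, most negative first.
pc2_order <- order(toy_pcs$x[, "PC2"])
# The words at the negative end of PC2 ...
low_pc2 <- toy_pcs$x[head(pc2_order, 5), 1:3]
# ... and at the positive end, most positive first.
high_pc2 <- toy_pcs$x[rev(tail(pc2_order, 5)), 1:3]
low_pc2
high_pc2
```

With the real data one would replace `toy_pcs` by `wordpcs` and take 50 rows rather than 5; sorting on `"PC3"` instead would give the analogous tables for the third component.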

PC3 ???

##                    PC1         PC2         PC3
## her        -594.347982 -369.506815 -320.553975
## she        -386.874631 -258.527076 -165.767955
## was        -522.830748 -136.226376 -141.902635
## had        -304.531600  -98.534971  -76.705108
## and       -1226.794885  -27.019568  -63.646246
## of        -1287.261454   21.732911  -61.951101
## their      -152.207591   -3.769349  -45.613796
## could      -142.250056  -73.116065  -40.469689
## herself     -59.376969  -38.310040  -38.679588
## were       -158.026382   -5.357269  -30.453443
## by         -239.810894   -5.266732  -24.018741
## which      -182.141283  -30.767930  -23.845607
## marianne    -50.963734  -36.202720  -23.321135
## mrs         -45.176423  -29.563584  -21.998778
## every       -82.911029  -20.341334  -20.004948
## they       -166.865546  -18.890090  -16.964416
## elinor      -70.482501  -52.552157  -16.329420
## the       -1938.848597  679.254717  -15.859952
## them       -139.344979  -27.987894  -15.384618
## mother      -38.187061  -26.968343  -15.208029
## from       -190.342540   15.917459  -14.541803
## though      -73.286464   -5.727032  -14.461336
## to        -1240.727040 -323.872106  -13.293113
## mr          -47.167135  -25.671530  -12.360595
## lady        -31.310833  -14.364470  -12.316318
## jennings    -19.331479  -13.554029  -11.558537
## being       -55.035550   -5.479227  -11.316167
## on         -231.346021  -27.404638  -10.929738
## elizabeth   -61.381287  -40.800640  -10.895442
## who         -87.147855  -11.164063   -9.994290
## house       -28.554967  -11.133732   -9.296765
## after       -62.709360  -10.020566   -9.154717
## day         -40.133291   -5.288760   -9.068357
## with       -352.093260  -12.644256   -8.416187
## mariannes    -7.428993   -6.258819   -8.036769
## however     -36.872137  -14.075439   -8.013655
## dashwood    -22.084048  -15.985772   -7.960745
## gave        -16.089620   -5.657903   -7.607681
## whom        -21.444114   -6.833535   -6.999343
## daughter    -10.847104   -8.762217   -6.905520
## arrival      -5.658775   -4.116137   -6.827031
## jane        -25.052823  -19.911961   -6.754992
## his        -446.007445   29.529484   -6.711314
## collins     -10.313165   -4.338964   -6.502801
## visit       -12.533368   -8.636846   -6.487316
## sisters     -21.643420  -15.057484   -6.463277
## miss        -56.383281  -33.606023   -6.279509
## middleton    -8.380983   -4.208276   -6.218767
## soon        -49.583737  -21.027316   -6.161786
## saw         -29.922605   -8.130940   -6.135733

##                PC1         PC2       PC3
## i       -491.93174 -267.069786 369.33226
## you     -282.97126 -156.958060 229.58146
## my      -177.13942  -93.012272 134.84980
## is      -244.34721   16.127225 117.91225
## me      -114.77231  -48.796894  95.22997
## it      -484.51769  -74.210965  93.28802
## have    -225.44005  -81.335209  81.85130
## your     -93.51307  -57.030410  80.83190
## that    -527.06116  -21.559708  71.45501
## be      -347.78507 -136.587632  52.66409
## but     -301.79180  -14.722531  52.57835
## will     -94.31849  -32.091402  52.12526
## are      -85.98520    4.509834  46.27442
## if      -103.82385  -20.073529  45.01196
## not     -363.11421 -136.897156  43.29093
## am       -55.87480  -43.561584  42.11371
## do       -77.68537  -36.833362  41.38628
## can      -52.24709  -16.748397  39.02358
## a       -742.15932   39.075342  38.58734
## this    -177.74598   44.185716  36.10758
## him     -229.87966  -40.672129  34.09329
## said     -92.80372  -46.617451  33.12566
## has      -56.67006  -16.185706  31.75367
## so      -197.10548  -21.638563  30.05256
## for     -371.58339  -59.911312  29.83834
## we       -68.23087    3.179111  29.52713
## may      -49.01391   -9.671180  29.41729
## what    -109.76315  -35.844503  29.30814
## know     -53.51507  -36.411024  27.94474
## one     -124.79223   28.227226  25.79453
## think    -51.63823  -32.331230  25.78628
## myself   -24.11155  -16.729988  21.07018
## say      -43.87668  -14.953912  20.29900
## shall    -31.73066  -16.384084  20.10284
## upon     -59.27945   36.629204  19.90527
## then     -71.85027   19.830696  18.88254
## as      -403.15746  -48.756189  18.80157
## whale    -52.40969   72.646425  18.40301
## should   -65.32619  -29.164886  17.10621
## there   -109.60786    3.924406  16.92776
## us       -32.27530    2.478811  16.10351
## must     -76.92666  -30.270110  15.85142
## he      -375.61713  -49.660219  15.59422
## no      -139.07409  -39.680275  15.57560
## cannot   -21.46326  -10.728569  14.56389
## replied  -20.04468  -17.392070  13.70480
## sure     -26.02845  -20.264634  13.34910
## here     -30.28855   11.181846  13.26546
## very    -123.01954  -54.218271  13.10054
## our      -24.80363    2.248309  12.55457

Visualizing expression space

A conceptual model

Let’s build a conceptual model for descriptive analysis of “mixture” expression data.

Data: expression data from tissue samples that consist of various mixtures of different cell types.

Goal: identify shared coexpression patterns corresponding to cell type.

Similar situations: identify different developmental stages from whole-organism expression; common community structures from metagenomic data.

Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.

Mean expression by cell type.
Cell type proportions by sample.

\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).

\(Z_{ij}\) : expression level in sample \(i\) of gene \(j\).

\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
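
In matrix form this says that Z is approximately the product of w and x. A minimal simulation of the model, with every number made up (the sizes, the gamma-distributed means, and the Poisson noise are all assumptions chosen for illustration, not part of the model as stated), might look like:

```r
set.seed(5)
n_samples <- 8; n_genes <- 20; K <- 3   # illustrative sizes

# x[k, j]: mean expression of gene j in cell type k (made-up gamma values).
x <- matrix(rgamma(K * n_genes, shape=2, rate=0.1), nrow=K)
# w[i, k]: proportion of sample i made up of cell type k; rows sum to 1.
w <- matrix(rexp(n_samples * K), nrow=n_samples)
w <- w / rowSums(w)

# Expected expression is the matrix product: sum over k of w[i,k] * x[k,j].
mean_Z <- w %*% x
# Poisson noise is an assumption, standing in for the "approximately".
Z <- matrix(rpois(length(mean_Z), lambda=mean_Z), nrow=n_samples)
dim(Z)
```

The descriptive task is then the reverse direction: given only `Z`, recover something like `w` and `x`.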