Peter Ralph
Advanced Biological Statistics
There are many dimension reduction methods, e.g.:
PCA uses the covariance matrix, which measures similarity.
t-SNE begins with the matrix of distances, measuring dissimilarity.
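As a small illustration of these two starting points, the sketch below uses made-up data (the matrix `X` and its column names are assumptions for the example, not part of the course data). It checks that the principal components from `prcomp()` agree with an eigendecomposition of the covariance matrix, and computes the pairwise distance matrix that a distance-based method would begin with.

```r
set.seed(1)
# toy data: 50 observations of 3 variables (made up for illustration)
X <- matrix(rnorm(50 * 3), ncol = 3, dimnames = list(NULL, c("a", "b", "c")))
X[, "b"] <- X[, "a"] + 0.3 * X[, "b"]   # make b correlated with a

# similarity between variables: the 3 x 3 covariance matrix
C <- cov(X)
eig <- eigen(C)
pca <- prcomp(X)

# the eigenvalues of the covariance matrix are the variances of the PCs
same_variances <- isTRUE(all.equal(eig$values, pca$sdev^2))
# the eigenvectors are the PC directions, up to a sign flip of each column
max_direction_diff <- max(abs(abs(eig$vectors) - abs(pca$rotation)))

# dissimilarity between observations: all pairwise Euclidean distances
D <- dist(X)
print(same_variances)
print(max_direction_diff)
print(length(D))   # number of pairs of observations
```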
Are distances interpretable?
metric: In PCA, each axis is a fixed linear combination of variables. So, distances always mean the same thing no matter where you are on the plot.
non-metric: In t-SNE, distances within different clusters are not comparable.
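The "metric" property of PCA can be checked directly. The sketch below uses made-up data (the matrix `X` is an assumption for the example): every point's PC scores come from the same linear combination of the centered variables, and keeping all the PCs leaves the pairwise Euclidean distances unchanged.

```r
set.seed(42)
# toy data: 100 observations of 4 variables (made up for illustration)
X <- matrix(rnorm(100 * 4), ncol = 4, dimnames = list(NULL, paste0("v", 1:4)))
X[, 2] <- X[, 1] + 0.5 * X[, 2]   # make v2 correlated with v1

pca <- prcomp(X)   # centers each column by default

# each PC score is the same linear combination of centered variables
# for every observation: centered data times the rotation matrix
manual_scores <- scale(X, center = TRUE, scale = FALSE) %*% pca$rotation
score_diff <- max(abs(manual_scores - pca$x))

# the rotation is orthogonal, so distances computed from all the PCs
# equal distances computed from the original variables
dist_diff <- max(abs(as.vector(dist(X)) - as.vector(dist(pca$x))))

print(score_diff)
print(dist_diff)
```

A plot of only the first two PCs shows distances within that two-dimensional projection, so they are measured on the same scale everywhere on the plot, although they can be shorter than the full-dimensional distances.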
From ordination.okstate.edu, about ordination in ecology:
Graphical results often lead to intuitive interpretations of species-environment relationships.
A single multivariate analysis saves time, in contrast to a separate univariate analysis for each species.
Ideally and typically, dimensions of this ‘low dimensional space’ will represent important and interpretable environmental gradients.
If statistical tests are desired, problems of multiple comparisons are diminished when species composition is studied in its entirety.
Statistical power is enhanced when species are considered in aggregate, because of redundancy.
By focusing on ‘important dimensions’, we avoid interpreting (and misinterpreting) noise.
Ordination methods are strongly influenced by sampling: ordination may ignore large-scale patterns in favor of describing variation within a highly oversampled area.
Ordination methods also describe patterns common to many variables: measuring the same thing many times may drive results.
Many methods are designed to find clusters, because our brains like to categorize things. This doesn’t mean those clusters are well-separated in reality.
The goal is usually to produce a picture in which similar things are near each other, while also capturing global structure.
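The caution about measuring the same thing many times can be seen in a small simulation. This is a sketch on made-up data: the signals `a` and `b`, the noise level, and the helper names `noisy` and `prop_var` are all assumptions for the example. Two independent signals have equal variance, but once signal `a` is measured five times, the first PC is mostly about `a`.

```r
set.seed(7)
n <- 200
# two independent underlying signals with equal variance (made up)
a <- rnorm(n)
b <- rnorm(n)
# hypothetical helper: one noisy measurement of a signal
noisy <- function (x) x + rnorm(length(x), sd = 0.1)

# each signal measured once
once <- cbind(a1 = noisy(a), b1 = noisy(b))
# signal a measured five times, signal b measured once
repeated <- cbind(sapply(1:5, function (i) noisy(a)), noisy(b))
colnames(repeated) <- c(paste0("a", 1:5), "b1")

# hypothetical helper: proportion of variance explained by each PC
prop_var <- function (x) { v <- prcomp(x)$sdev^2; v / sum(v) }
pv_once <- prop_var(once)
pv_repeated <- prop_var(repeated)
# how much each variable contributes to PC1 in the repeated design
pc1_loadings <- prcomp(repeated)$rotation[, 1]

print(round(pv_once, 2))
print(round(pv_repeated, 2))
print(round(pc1_loadings, 2))
```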
In data/passages.txt we have a number of short passages from a few different books.
Can we identify the author of each passage?
The true sources of the passages are in data/passage_sources.tsv.
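# read the passages (one per line) and the table of their true sources
passages <- readLines("data/passages.txt")
sources <- read.table("data/passage_sources.tsv", header=TRUE, stringsAsFactors=TRUE)
# the vocabulary: every distinct word appearing in any passage
words <- sort(unique(strsplit(paste(passages, collapse=" "), " +")[[1]]))
# count how many times each word in w occurs in the passage x
tabwords <- function (x, w) { tabulate(match(strsplit(x, " +")[[1]], w), nbins=length(w)) }
# word-count matrix: one row per word, one column per passage
wordmat <- sapply(passages, tabwords, words)
dimnames(wordmat) <- list(words, NULL)
# every word in the vocabulary should occur at least once
stopifnot( min(rowSums(wordmat)) > 0 )
wordmat[1:20, 1:20]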
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## a 15 9 12 8 7 7 5 12 8 3 18 12 13 11 2 2 12 14 11 14
## aback 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abaft 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandon 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandoned 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abandonment 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abased 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abasement 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abashed 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abate 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abated 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abatement 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abating 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbeyland 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbreviate 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abbreviation 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abeam 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abednego 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abhor 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## abhorred 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
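To see what the word-count matrix looks like at a readable scale, here is the same construction applied to three made-up passages. The vector `toy` and the names `toy_words` and `toymat` are assumptions for the example; `tabwords` mirrors the function defined above.

```r
# three made-up "passages" (not from the course data)
toy <- c("the whale and the sea",
         "she said to her sister",
         "the ship and her crew")

# vocabulary: every distinct word, sorted
toy_words <- sort(unique(strsplit(paste(toy, collapse = " "), " +")[[1]]))

# count occurrences of each vocabulary word w in the passage x
tabwords <- function (x, w) {
    tabulate(match(strsplit(x, " +")[[1]], w), nbins = length(w))
}

# one row per word, one column per passage
toymat <- sapply(toy, tabwords, toy_words)
dimnames(toymat) <- list(toy_words, NULL)
print(toymat)
```

Each column sums to the number of words in that passage, and a word that occurs twice in a passage (here "the" in the first one) gets a count of 2.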
Words sorted by increasing PC2 (most negative first):
## PC1 PC2 PC3
## her -594.34798 -369.50681 -320.553975
## to -1240.72704 -323.87211 -13.293113
## i -491.93174 -267.06979 369.332262
## she -386.87463 -258.52708 -165.767955
## you -282.97126 -156.95806 229.581460
## not -363.11421 -136.89716 43.290928
## be -347.78507 -136.58763 52.664086
## was -522.83075 -136.22638 -141.902635
## had -304.53160 -98.53497 -76.705108
## my -177.13942 -93.01227 134.849801
## have -225.44005 -81.33521 81.851304
## it -484.51769 -74.21097 93.288019
## could -142.25006 -73.11607 -40.469689
## for -371.58339 -59.91131 29.838338
## your -93.51307 -57.03041 80.831901
## very -123.01954 -54.21827 13.100543
## elinor -70.48250 -52.55216 -16.329420
## he -375.61713 -49.66022 15.594216
## would -140.85426 -49.03480 9.479050
## me -114.77231 -48.79689 95.229970
## as -403.15746 -48.75619 18.801570
## been -136.39247 -48.21771 6.551887
## said -92.80372 -46.61745 33.125661
## am -55.87480 -43.56158 42.113707
## elizabeth -61.38129 -40.80064 -10.895442
## him -229.87966 -40.67213 34.093291
## no -139.07409 -39.68028 15.575598
## herself -59.37697 -38.31004 -38.679588
## do -77.68537 -36.83336 41.386279
## know -53.51507 -36.41102 27.944743
## marianne -50.96373 -36.20272 -23.321135
## what -109.76315 -35.84450 29.308141
## sister -44.21808 -34.84423 -5.767636
## such -106.38272 -33.68966 11.373512
## miss -56.38328 -33.60602 -6.279509
## think -51.63823 -32.33123 25.786284
## will -94.31849 -32.09140 52.125262
## much -84.29715 -31.23711 -3.004868
## which -182.14128 -30.76793 -23.845607
## any -100.25109 -30.32870 8.360409
## must -76.92666 -30.27011 15.851418
## mrs -45.17642 -29.56358 -21.998778
## than -99.42283 -29.37513 -4.707476
## should -65.32619 -29.16489 17.106205
## them -139.34498 -27.98789 -15.384618
## on -231.34602 -27.40464 -10.929738
## and -1226.79489 -27.01957 -63.646246
## mother -38.18706 -26.96834 -15.208029
## mr -47.16713 -25.67153 -12.360595
## did -69.05182 -25.26154 10.888321
Words sorted by decreasing PC2 (most positive first):
## PC1 PC2 PC3
## the -1938.848597 679.25472 -15.8599522
## whale -52.409687 72.64643 18.4030059
## in -729.432446 55.44280 3.8135028
## this -177.745979 44.18572 36.1075787
## a -742.159320 39.07534 38.5873415
## like -54.401485 36.88865 12.0032804
## upon -59.279446 36.62920 19.9052683
## ahab -24.108156 34.27340 5.4928174
## sea -22.390040 32.28631 4.6829825
## ship -21.772015 29.76893 4.7971335
## his -446.007445 29.52948 -6.7113140
## old -31.478667 28.92232 12.1107257
## whales -19.959899 28.44594 7.0712904
## one -124.792228 28.22723 25.7945336
## into -67.132210 26.19590 4.9371075
## now -91.983077 22.97597 11.0948656
## these -37.944027 21.97420 5.9644468
## of -1287.261454 21.73291 -61.9511010
## boat -12.757167 20.89801 0.7750276
## white -14.249543 20.34029 3.8079378
## all -236.207991 20.29192 8.1020127
## ye -16.517589 20.13232 10.4005112
## then -71.850275 19.83070 18.8825398
## out -60.387437 19.14103 5.6808385
## up -52.052288 17.78528 3.8663572
## sperm -10.859947 16.81080 4.2079551
## its -55.040431 16.69443 -0.4461618
## down -37.213502 16.20793 6.2333193
## through -21.993116 16.20333 0.4409778
## round -17.918369 16.16908 1.0900027
## is -244.347214 16.12722 117.9122463
## captain -12.446603 16.05703 3.8943171
## from -190.342540 15.91746 -14.5418029
## yet -37.055639 15.68260 6.3217292
## those -30.706626 15.62040 4.7922181
## still -43.674287 15.41750 -5.2932666
## some -95.410309 15.23838 -1.0026757
## men -18.224598 15.10218 3.7872895
## deck -9.081950 14.37372 1.9517430
## boats -8.020552 14.26438 0.7798884
## head -20.834646 14.04222 3.2905213
## crew -7.655209 13.45631 0.5844938
## air -15.710056 13.31316 0.2249265
## thou -10.013584 13.15565 7.4352051
## water -8.452857 13.15480 3.2918904
## ships -8.079648 12.97220 2.1174358
## seemed -37.338906 12.90973 -4.8322828
## feet -8.144120 12.56765 3.2873123
## stubb -8.909317 12.50658 3.4643008
## over -48.103826 11.69170 1.8487713
Words sorted by increasing PC3 (most negative first):
## PC1 PC2 PC3
## her -594.347982 -369.506815 -320.553975
## she -386.874631 -258.527076 -165.767955
## was -522.830748 -136.226376 -141.902635
## had -304.531600 -98.534971 -76.705108
## and -1226.794885 -27.019568 -63.646246
## of -1287.261454 21.732911 -61.951101
## their -152.207591 -3.769349 -45.613796
## could -142.250056 -73.116065 -40.469689
## herself -59.376969 -38.310040 -38.679588
## were -158.026382 -5.357269 -30.453443
## by -239.810894 -5.266732 -24.018741
## which -182.141283 -30.767930 -23.845607
## marianne -50.963734 -36.202720 -23.321135
## mrs -45.176423 -29.563584 -21.998778
## every -82.911029 -20.341334 -20.004948
## they -166.865546 -18.890090 -16.964416
## elinor -70.482501 -52.552157 -16.329420
## the -1938.848597 679.254717 -15.859952
## them -139.344979 -27.987894 -15.384618
## mother -38.187061 -26.968343 -15.208029
## from -190.342540 15.917459 -14.541803
## though -73.286464 -5.727032 -14.461336
## to -1240.727040 -323.872106 -13.293113
## mr -47.167135 -25.671530 -12.360595
## lady -31.310833 -14.364470 -12.316318
## jennings -19.331479 -13.554029 -11.558537
## being -55.035550 -5.479227 -11.316167
## on -231.346021 -27.404638 -10.929738
## elizabeth -61.381287 -40.800640 -10.895442
## who -87.147855 -11.164063 -9.994290
## house -28.554967 -11.133732 -9.296765
## after -62.709360 -10.020566 -9.154717
## day -40.133291 -5.288760 -9.068357
## with -352.093260 -12.644256 -8.416187
## mariannes -7.428993 -6.258819 -8.036769
## however -36.872137 -14.075439 -8.013655
## dashwood -22.084048 -15.985772 -7.960745
## gave -16.089620 -5.657903 -7.607681
## whom -21.444114 -6.833535 -6.999343
## daughter -10.847104 -8.762217 -6.905520
## arrival -5.658775 -4.116137 -6.827031
## jane -25.052823 -19.911961 -6.754992
## his -446.007445 29.529484 -6.711314
## collins -10.313165 -4.338964 -6.502801
## visit -12.533368 -8.636846 -6.487316
## sisters -21.643420 -15.057484 -6.463277
## miss -56.383281 -33.606023 -6.279509
## middleton -8.380983 -4.208276 -6.218767
## soon -49.583737 -21.027316 -6.161786
## saw -29.922605 -8.130940 -6.135733
Words sorted by decreasing PC3 (most positive first):
## PC1 PC2 PC3
## i -491.93174 -267.069786 369.33226
## you -282.97126 -156.958060 229.58146
## my -177.13942 -93.012272 134.84980
## is -244.34721 16.127225 117.91225
## me -114.77231 -48.796894 95.22997
## it -484.51769 -74.210965 93.28802
## have -225.44005 -81.335209 81.85130
## your -93.51307 -57.030410 80.83190
## that -527.06116 -21.559708 71.45501
## be -347.78507 -136.587632 52.66409
## but -301.79180 -14.722531 52.57835
## will -94.31849 -32.091402 52.12526
## are -85.98520 4.509834 46.27442
## if -103.82385 -20.073529 45.01196
## not -363.11421 -136.897156 43.29093
## am -55.87480 -43.561584 42.11371
## do -77.68537 -36.833362 41.38628
## can -52.24709 -16.748397 39.02358
## a -742.15932 39.075342 38.58734
## this -177.74598 44.185716 36.10758
## him -229.87966 -40.672129 34.09329
## said -92.80372 -46.617451 33.12566
## has -56.67006 -16.185706 31.75367
## so -197.10548 -21.638563 30.05256
## for -371.58339 -59.911312 29.83834
## we -68.23087 3.179111 29.52713
## may -49.01391 -9.671180 29.41729
## what -109.76315 -35.844503 29.30814
## know -53.51507 -36.411024 27.94474
## one -124.79223 28.227226 25.79453
## think -51.63823 -32.331230 25.78628
## myself -24.11155 -16.729988 21.07018
## say -43.87668 -14.953912 20.29900
## shall -31.73066 -16.384084 20.10284
## upon -59.27945 36.629204 19.90527
## then -71.85027 19.830696 18.88254
## as -403.15746 -48.756189 18.80157
## whale -52.40969 72.646425 18.40301
## should -65.32619 -29.164886 17.10621
## there -109.60786 3.924406 16.92776
## us -32.27530 2.478811 16.10351
## must -76.92666 -30.270110 15.85142
## he -375.61713 -49.660219 15.59422
## no -139.07409 -39.680275 15.57560
## cannot -21.46326 -10.728569 14.56389
## replied -20.04468 -17.392070 13.70480
## sure -26.02845 -20.264634 13.34910
## here -30.28855 11.181846 13.26546
## very -123.01954 -54.218271 13.10054
## our -24.80363 2.248309 12.55457
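The code that produced the tables above is not shown here. One plausible way to get per-word values of this kind is to run `prcomp()` with words as the rows and then sort the words by a chosen PC. This is a sketch under that assumption, on a small made-up count matrix (`toy_counts` and `word_pcs` are names invented for the example), and not necessarily how the tables above were computed.

```r
# made-up counts: 5 words (rows) in 6 passages (columns);
# passages 1-3 use "her"/"she" more, passages 4-6 use "whale"/"sea" more
toy_counts <- rbind(
    the   = c(9, 8, 9, 7, 8, 9),
    her   = c(5, 6, 4, 0, 1, 0),
    she   = c(4, 3, 5, 0, 0, 1),
    whale = c(0, 0, 1, 4, 5, 3),
    sea   = c(0, 1, 0, 3, 2, 4)
)

# PCA treating each word as an observation (an assumption, see above)
pca <- prcomp(toy_counts)
word_pcs <- pca$x[, 1:3]

# words sorted by increasing PC2, then by decreasing PC2
pc2_increasing <- word_pcs[order(word_pcs[, "PC2"]), ]
pc2_decreasing <- word_pcs[order(word_pcs[, "PC2"], decreasing = TRUE), ]
print(pc2_increasing)
print(pc2_decreasing)
```

In this toy example the very common word "the" has the largest PC1 value in absolute terms, while PC2 puts "her" and "she" on one side and "whale" and "sea" on the other. The sign of each PC is arbitrary, so which group comes out negative can differ between runs or software.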