Dimension reduction: PCA, t-SNE and others

Peter Ralph

18 February 2020 – Advanced Biological Statistics

Multivariate data

We need multivariate statistics when we’re dealing with lots (more than one?) variables.

Goals: for this and next week:

Describe and visualize multivariate data.
Distinguish groups using many variables (e.g., cases from controls).

Both things involve dimension reduction, because we see things in ${} \le 3$ dimensions, and categorization is low-dimensional.

The setting

We have observations of many variables:

\[ \mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}) . \]

\[ x_{ij} = (\text{measurement of $j^\text{th}$ variable in $i^\text{th}$ subject}) \]

head(diabetes, 20)

##      y     age    sex     bmi     map      tc      ldl      hdl     tch     ltg     glu
## 1  151  0.0381  0.051  0.0617  0.0219 -0.0442 -3.5e-02 -0.04340 -0.0026  0.0199 -0.0176
## 2   75 -0.0019 -0.045 -0.0515 -0.0263 -0.0084 -1.9e-02  0.07441 -0.0395 -0.0683 -0.0922
## 3  141  0.0853  0.051  0.0445 -0.0057 -0.0456 -3.4e-02 -0.03236 -0.0026  0.0029 -0.0259
## 4  206 -0.0891 -0.045 -0.0116 -0.0367  0.0122  2.5e-02 -0.03604  0.0343  0.0227 -0.0094
## 5  135  0.0054 -0.045 -0.0364  0.0219  0.0039  1.6e-02  0.00814 -0.0026 -0.0320 -0.0466
## 6   97 -0.0927 -0.045 -0.0407 -0.0194 -0.0690 -7.9e-02  0.04128 -0.0764 -0.0412 -0.0963
## 7  138 -0.0455  0.051 -0.0472 -0.0160 -0.0401 -2.5e-02  0.00078 -0.0395 -0.0629 -0.0384
## 8   63  0.0635  0.051 -0.0019  0.0666  0.0906  1.1e-01  0.02287  0.0177 -0.0358  0.0031
## 9  110  0.0417  0.051  0.0617 -0.0401 -0.0140  6.2e-03 -0.02867 -0.0026 -0.0150  0.0113
## 10 310 -0.0709 -0.045  0.0391 -0.0332 -0.0126 -3.5e-02 -0.02499 -0.0026  0.0677 -0.0135
## 11 101 -0.0963 -0.045 -0.0838  0.0081 -0.1034 -9.1e-02 -0.01395 -0.0764 -0.0629 -0.0342
## 12  69  0.0272  0.051  0.0175 -0.0332 -0.0071  4.6e-02 -0.06549  0.0712 -0.0964 -0.0591
## 13 179  0.0163 -0.045 -0.0288 -0.0091 -0.0043 -9.8e-03  0.04496 -0.0395 -0.0308 -0.0425
## 14 185  0.0054  0.051 -0.0019  0.0081 -0.0043 -1.6e-02 -0.00290 -0.0026  0.0384 -0.0135
## 15 118  0.0453 -0.045 -0.0256 -0.0126  0.0177 -6.1e-05  0.08177 -0.0395 -0.0320 -0.0756
## 16 171 -0.0527  0.051 -0.0181  0.0804  0.0892  1.1e-01 -0.03972  0.1081  0.0361 -0.0425
## 17 166 -0.0055 -0.045  0.0423  0.0494  0.0246 -2.4e-02  0.07441 -0.0395  0.0523  0.0279
## 18 144  0.0708  0.051  0.0121  0.0563  0.0342  4.9e-02 -0.03972  0.0343  0.0274 -0.0011
## 19  97 -0.0382 -0.045 -0.0105 -0.0367 -0.0373 -1.9e-02 -0.02867 -0.0026 -0.0181 -0.0176
## 20 168 -0.0273 -0.045 -0.0181 -0.0401 -0.0029 -1.1e-02  0.03760 -0.0395 -0.0089 -0.0549

head(cc[,-(2:6)], 20)

##    Product.name N.Cream N.Acidic N.Butter N.Old.milk E.White E.Grey E.Yellow E.Green H.Resistance E.Grainy E.Shiny M.Firm M.Melt.down M.Resistance M.Creaminess M.Grainy M.Chalky M.Cream M.Fat M.Butter M.Salt
## 1           16%     5.5      5.4      5.4        5.1     9.4    2.4      4.8    0.75          5.4     1.50    12.8    7.3         8.4          7.3          7.7      1.8      2.7     6.9   8.2      8.2    6.2
## 2           16%     8.6      6.8      7.8        3.0     9.9    3.6      3.8    3.00          4.2     4.05    12.9    2.7         4.8          5.4          9.2      4.8      4.5     8.6   7.0      6.9    6.3
## 3           16%     7.0      9.0      9.6        1.9    10.8    1.6      2.2    0.75          6.6     1.65     9.0    5.5        10.9          6.5          7.7      2.4      3.0     6.6   8.0      9.4    5.2
## 4           16%     8.6      6.0     10.8        1.9     8.0    4.0      4.7    1.80          4.5     7.95    10.3    4.2        13.2          1.6          2.9      3.9      5.1     5.4   5.0      5.5    5.5
## 5           16%     8.8      8.7      9.8        5.0     9.8    2.9      1.9    1.80          5.2     8.10    11.1    6.6        10.3          4.8          8.7      3.3      6.3     8.1  10.5      9.4    6.8
## 6           16%    10.7      3.9      9.6        3.0    10.2    2.1      3.3    2.55          4.2     3.30    12.3    7.3        10.1          5.4          9.8      9.3      2.4     9.2   8.8      7.0   10.3
## 7           16%     8.0     10.8     10.3        1.8    12.2    3.3      1.2    0.45          5.8     1.65     9.2    3.3         9.2          4.3         11.7      1.5      3.5     9.3   8.1      9.0    8.2
## 8           16%     9.0      6.8      6.5        3.1     6.8    5.5      4.2    3.30          6.0     5.25     5.8    6.0         5.0          6.3          6.3      4.0      3.3     6.9   4.3      7.7    4.3
## 9           16%     6.0      8.4      7.2        1.6    10.8    2.7      1.4    0.90          6.8     0.45    13.7    7.3         9.4          7.0          8.4      1.1      3.0     6.9   8.2      8.1    6.8
## 10          16%     7.7      6.6      6.9        2.7     9.2    2.7      4.2    2.70          5.5     3.90    10.5    4.2         7.2          4.3         10.5      4.5      4.2     9.8  10.2      8.6    6.2
## 11          16%     7.5     10.8     10.3        1.9    10.8    2.4      1.9    1.35          5.7     1.50    11.4    6.2         8.2          6.8          8.1      3.8      2.4     9.4  10.9     10.7    6.6
## 12          16%     5.1     12.0      4.0        7.0     9.4    4.2      5.0    1.80          2.9     2.70    10.9    5.7        10.3          4.7          7.7      6.6      6.0     6.9   6.3      3.3    5.5
## 13          16%     7.7      6.9      9.4        5.5    10.2    2.4      3.8    1.95          4.0     6.00    10.5    6.5        10.3          6.2          7.0      3.3      6.6     7.5  10.5     10.1    7.0
## 14          16%     7.8      8.0      9.3        1.5    10.3    1.9      1.9    1.50          3.0     7.20    10.9    3.6        10.7          4.3          6.2      3.3      2.5     8.8   8.4      9.6   10.7
## 15          16%     7.8      8.8      7.5        1.1    11.7    1.1      2.7    1.05          5.1     1.65     9.9    2.4         9.9          3.5         10.8      1.4      3.1    10.2  10.7     10.3    8.6
## 16          16%     7.2      5.5      6.6        1.9     8.1    2.5      5.5    1.65          6.3     3.60     6.0    5.2         6.8          5.0          7.5      3.5      2.5     6.5   7.5      6.6    4.8
## 17          16%     6.0      7.2      6.2        1.8    10.5    2.4      0.9    0.00          5.4     1.20    13.1    7.7         7.3          7.0          8.0      1.1      4.0     6.2   6.9      7.8    6.5
## 18          16%     8.6      8.0      7.7        3.0     9.9    2.4      3.0    2.10          2.9     3.45    12.3    5.1         6.6         10.2          6.6      7.3      8.0     8.1   4.2      5.7    5.4
## 19          16%     8.7     10.9     10.3        1.6     9.8    3.3      2.5    1.50          3.8     1.50    11.4    6.9         7.7          7.7          8.6      4.5      2.7     8.7  10.9     11.8    7.3
## 20          16%     2.4      8.4      7.8        5.4     9.0    2.4      3.3    1.95          3.9     5.25    12.6    4.7        12.0          2.5          3.9      6.0      5.2     6.5   7.5      5.5    4.7
##    M.Sour M.Sweet
## 1     6.8     4.0
## 2     5.2     4.8
## 3     8.8     1.8
## 4    12.6     1.8
## 5     7.3     2.9
## 6     9.8     9.3
## 7     7.3     4.8
## 8     5.0     2.5
## 9     6.9     4.7
## 10    5.5     4.7
## 11    5.2     1.5
## 12    9.2     3.5
## 13    7.7     3.5
## 14    8.4     5.4
## 15    9.6     3.0
## 16    6.0     4.2
## 17    7.2     3.6
## 18    8.4     3.5
## 19    5.8     1.9
## 20    8.7     2.4

Principal component analysis (PCA)

Primary aims of PCA

Variable reduction - reduce to a small number of “composite” variables that adequately summarize the original information (dimension reduction).
Ordination - Reveal patterns in the data that could not be found by analyzing each variable separately.

Quick example: diabetes data

plot of chunk prcomp_dia

Quick example: diabetes data

plot of chunk dia2

Quick example: cream cheese

plot of chunk prcomp_cc

Quick example: cream cheese

plot of chunk cc2

PCA: how it works

Reducing dimesions

Say we’ve got a data matrix $x_{ij}$ of $n$ observations in $m$ variables,

but we want to make do with fewer variables - say, only $k$ of them.

Idea: Let’s pick a few linear combinations of the variables that best captures variability within the dataset. For instance, strongly correlated variables will be combined into one.

Notation:

These new variables are the principal components, \[u^{(1)}, \ldots, u^{(k)} . \]
The importance of each variable to the principal components - i.e., the coefficients of the linear combination - are the loadings, \[v^{(1)}, \ldots, v^{(k)} . \]
So, the position of the $i$th data point on the $\ell$th PC is \[ u_i^{(\ell)} = v_1^{(\ell)} x_{i1} + v_2^{(\ell)} x_{i2} + \cdots v_m^{(\ell)} x_{im} . \]

Geometric interpretation

The loadings are the directions in multidimensional space that explain most variation in the data.
The PCs are the coordinates of the data along these directions.
These directions are orthogonal, and the resulting variables are uncorrelated.

A model for PCA

The approximation to the data matrix using only $k$ PCs is: \[ x_{ij} \approx u_i^{(1)} v_j^{(1)} + u_i^{(2)} v_j^{(2)} + \cdots + u_i^{(k)} v_j^{(k)} . \]

This is the best possible $k$-dimensional approximation, in the least-squares sense, so is the MLE for the low-dimesional model with Gaussian noise.

Also known as

PCA is also called “eigenanalysis” because (as usually computed), the PCs are eigenvectors of the covariance matrix.

The eigenvalues are \[ \lambda_\ell = \sum_{i=1}^n \left(u_i^{(\ell)}\right)^2, \] and their sum is the total variance: \[ \lambda_1 + \lambda_2 + \cdots + \lambda_m = \sum_{ij} (x_{ij} - \bar x_j)^2 . \]

Interpretation

PCs: Observations that are close in PC space are similar.
Loadings: high values indicate a strong correlation with that PC (positive or negative).

Sometimes PCs are rotated to improve interpretability.

What next?

The PCs are nice new variables you can use in any analysis!

Example: Does mouse body size correlate with the top three climate PCs?

Beer

Example: local beer

beer <- read.csv("data/Beer_Specs.csv")
table(beer$Beer_Type)

## 
##           Ale    Citrus_IPA    Double_IPA        Helles        NW_IPA Oatmeal_Stout   Oktoberfest      Pale_Ale       Red_IPA Vanilla_Stout    Winter_Ale 
##            21            14            23            20            94            12             4             8            39            12            10

beer_vars <- c("Volume", "CO2", "Color", "DO", "pH", "Bitterness_Units", "ABV", "Real_Extract", "Real_Degrees_Fermentation", "Final_Gravity")
head(beer[,c("Beer_Type", beer_vars)])

##   Beer_Type Volume CO2 Color DO  pH Bitterness_Units ABV Real_Extract Real_Degrees_Fermentation Final_Gravity
## 1       Ale    250 2.4   4.7 80 4.1               56 4.9          3.6                        68             1
## 2       Ale    178 2.4   4.7 86 4.2               53 5.0          3.4                        70             1
## 3       Ale     70 2.4   4.9 89 4.3               61 4.8          3.6                        68             1
## 4       Ale    102 2.4   5.0 89 4.3               58 4.7          3.6                        68             1
## 5       Ale    173 2.4   4.4 82 4.4               59 4.6          3.7                        67             1
## 6       Ale    254 2.4   4.8 94 4.4               57 4.8          3.5                        69             1

Goals:

Describe major axes of variation between beer batches.
Look for variation not related to taste/style.
Make a pretty plot.

plot of chunk beerpairs

PCA: how to do it

Practical considerations

Only use numeric variables - omit factors.
Variables should all be on the same scale, at least roughly.
Variables should also probably be centered.

… however,

Beware outliers.
Transformations may be a good idea.
Missing data must be removed or imputed.
Consider replacing highly skewed variables with ranks.

`?princomp`

princomp                 package:stats                 R Documentation

Principal Components Analysis

Description:

     ‘princomp’ performs a principal components analysis on the given
     numeric data matrix and returns the results as an object of class
     ‘princomp’.

Usage:

     princomp(x, ...)
     
     ## S3 method for class 'formula'
     princomp(formula, data = NULL, subset, na.action, ...)
     
     ## Default S3 method:
     princomp(x, cor = FALSE, scores = TRUE, covmat = NULL,
              subset = rep_len(TRUE, nrow(as.matrix(x))), fix_sign = TRUE, ...)
     
     ## S3 method for class 'princomp'
     predict(object, newdata, ...)
     
Arguments:

 formula: a formula with no response variable, referring only to
          numeric variables.

    data: an optional data frame (or similar: see ‘model.frame’)
          containing the variables in the formula ‘formula’.  By
          default the variables are taken from ‘environment(formula)’.

  subset: an optional vector used to select rows (observations) of the
          data matrix ‘x’.

na.action: a function which indicates what should happen when the data
          contain ‘NA’s.  The default is set by the ‘na.action’ setting
          of ‘options’, and is ‘na.fail’ if that is unset. The
          ‘factory-fresh’ default is ‘na.omit’.

       x: a numeric matrix or data frame which provides the data for
          the principal components analysis.

     cor: a logical value indicating whether the calculation should use
          the correlation matrix or the covariance matrix.  (The
          correlation matrix can only be used if there are no constant
          variables.)

  scores: a logical value indicating whether the score on each
          principal component should be calculated.

  covmat: a covariance matrix, or a covariance list as returned by
          ‘cov.wt’ (and ‘cov.mve’ or ‘cov.mcd’ from package ‘MASS’).
          If supplied, this is used rather than the covariance matrix
          of ‘x’.

fix_sign: Should the signs of the loadings and scores be chosen so that
          the first element of each loading is non-negative?

     ...: arguments passed to or from other methods. If ‘x’ is a
          formula one might specify ‘cor’ or ‘scores’.

  object: Object of class inheriting from ‘"princomp"’.

 newdata: An optional data frame or matrix in which to look for
          variables with which to predict.  If omitted, the scores are
          used.  If the original fit used a formula or a data frame or
          a matrix with column names, ‘newdata’ must contain columns
          with the same names. Otherwise it must contain the same
          number of columns, to be used in the same order.

Details:

     ‘princomp’ is a generic function with ‘"formula"’ and ‘"default"’
     methods.

     The calculation is done using ‘eigen’ on the correlation or
     covariance matrix, as determined by ‘cor’.  This is done for
     compatibility with the S-PLUS result.  A preferred method of
     calculation is to use ‘svd’ on ‘x’, as is done in ‘prcomp’.

     Note that the default calculation uses divisor ‘N’ for the
     covariance matrix.

     The ‘print’ method for these objects prints the results in a nice
     format and the ‘plot’ method produces a scree plot (‘screeplot’).
     There is also a ‘biplot’ method.

     If ‘x’ is a formula then the standard NA-handling is applied to
     the scores (if requested): see ‘napredict’.

     ‘princomp’ only handles so-called R-mode PCA, that is feature
     extraction of variables.  If a data matrix is supplied (possibly
     via a formula) it is required that there are at least as many
     units as variables.  For Q-mode PCA use ‘prcomp’.

Value:

     ‘princomp’ returns a list with class ‘"princomp"’ containing the
     following components:

    sdev: the standard deviations of the principal components.

loadings: the matrix of variable loadings (i.e., a matrix whose columns
          contain the eigenvectors).  This is of class ‘"loadings"’:
          see ‘loadings’ for its ‘print’ method.

  center: the means that were subtracted.

   scale: the scalings applied to each variable.

   n.obs: the number of observations.

  scores: if ‘scores = TRUE’, the scores of the supplied data on the
          principal components.  These are non-null only if ‘x’ was
          supplied, and if ‘covmat’ was also supplied if it was a
          covariance list.  For the formula method, ‘napredict()’ is
          applied to handle the treatment of values omitted by the
          ‘na.action’.

    call: the matched call.

na.action: If relevant.

Note:

     The signs of the columns of the loadings and scores are arbitrary,
     and so may differ between different programs for PCA, and even
     between different builds of R: ‘fix_sign = TRUE’ alleviates that.

References:

     Mardia, K. V., J. T. Kent and J. M. Bibby (1979).  _Multivariate
     Analysis_, London: Academic Press.

     Venables, W. N. and B. D. Ripley (2002).  _Modern Applied
     Statistics with S_, Springer-Verlag.

See Also:

     ‘summary.princomp’, ‘screeplot’, ‘biplot.princomp’, ‘prcomp’,
     ‘cor’, ‘cov’, ‘eigen’.

beer_pc <- princomp(na.omit(beer[, beer_vars]), cor=TRUE, scores=TRUE)
str(beer_pc)

## List of 7
##  $ sdev    : Named num [1:10] 2.05 1.441 1.118 0.965 0.825 ...
##   ..- attr(*, "names")= chr [1:10] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ loadings: 'loadings' num [1:10, 1:10] 0.0194 -0.3338 0.2763 -0.0465 0.1402 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:10] "Volume" "CO2" "Color" "DO" ...
##   .. ..$ : chr [1:10] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ center  : Named num [1:10] 306.47 2.43 18.12 89.8 4.4 ...
##   ..- attr(*, "names")= chr [1:10] "Volume" "CO2" "Color" "DO" ...
##  $ scale   : Named num [1:10] 134.589 0.0627 29.1922 13.5223 0.1113 ...
##   ..- attr(*, "names")= chr [1:10] "Volume" "CO2" "Color" "DO" ...
##  $ n.obs   : int 223
##  $ scores  : num [1:223, 1:10] -2.98 -3.35 -2.84 -2.77 -2.36 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:223] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:10] "Comp.1" "Comp.2" "Comp.3" "Comp.4" ...
##  $ call    : language princomp(x = na.omit(beer[, beer_vars]), cor = TRUE, scores = TRUE)
##  - attr(*, "class")= chr "princomp"

What it means

> str(beer_pc)
List of 7

$ sdev: standard deviations of the PCs
$ loadings: loadings of the variables on each PC
$ center: subtracted from columns of the data
$ scale: divided from columns of the data
$ scores: the actual PCs

How many PCs should I use?

The full covariance matrix is contained in the PCs.
PCA gives you as many PCs as variables.

Depends - what for?

Some answers:

Visualization: as many as make sense (but beware overinterpretation).
Further analysis: as many as your method can handle.
Until an obvious break in the scree plot.

The scree plot

Shows the variance explained by each PC (these are always decreasing).

plot(beer_pc)

plot of chunk scree

The top four PCs

layout(t(1:2))
biplot(beer_pc, choices=1:2, xlabs=na.omit(beer[,c("Beer_Type", beer_vars)])$Beer_Type)
biplot(beer_pc, choices=3:4, xlabs=na.omit(beer[,c("Beer_Type", beer_vars)])$Beer_Type)

plot of chunk plot_prcomp_beer

plot of chunk plot_b1

Gotchas

PCs are only well-defined up to sign ($\pm$) and scale.
Some programs report PCs normalized to have SD 1.
Many ways to do basically the same thing: e.g., prcomp and princomp.

Your turn: wine

Using this dataset of chemical concentrations in wine:

(wine <- read.table("data/wine.tsv", header=TRUE))

##     Vineyard C1   C2  C3 C4  C5   C6   C7   C8   C9  C10  C11 C12  C13
## 1          1 14 1.71 2.4 16 127 2.80 3.06 0.28 2.29  5.6 1.04 3.9 1065
## 2          1 13 1.78 2.1 11 100 2.65 2.76 0.26 1.28  4.4 1.05 3.4 1050
## 3          1 13 2.36 2.7 19 101 2.80 3.24 0.30 2.81  5.7 1.03 3.2 1185
## 4          1 14 1.95 2.5 17 113 3.85 3.49 0.24 2.18  7.8 0.86 3.5 1480
## 5          1 13 2.59 2.9 21 118 2.80 2.69 0.39 1.82  4.3 1.04 2.9  735
## 6          1 14 1.76 2.5 15 112 3.27 3.39 0.34 1.97  6.8 1.05 2.9 1450
## 7          1 14 1.87 2.5 15  96 2.50 2.52 0.30 1.98  5.2 1.02 3.6 1290
## 8          1 14 2.15 2.6 18 121 2.60 2.51 0.31 1.25  5.0 1.06 3.6 1295
## 9          1 15 1.64 2.2 14  97 2.80 2.98 0.29 1.98  5.2 1.08 2.9 1045
## 10         1 14 1.35 2.3 16  98 2.98 3.15 0.22 1.85  7.2 1.01 3.5 1045
## 11         1 14 2.16 2.3 18 105 2.95 3.32 0.22 2.38  5.8 1.25 3.2 1510
## 12         1 14 1.48 2.3 17  95 2.20 2.43 0.26 1.57  5.0 1.17 2.8 1280
## 13         1 14 1.73 2.4 16  89 2.60 2.76 0.29 1.81  5.6 1.15 2.9 1320
## 14         1 15 1.73 2.4 11  91 3.10 3.69 0.43 2.81  5.4 1.25 2.7 1150
## 15         1 14 1.87 2.4 12 102 3.30 3.64 0.29 2.96  7.5 1.20 3.0 1547
## 16         1 14 1.81 2.7 17 112 2.85 2.91 0.30 1.46  7.3 1.28 2.9 1310
## 17         1 14 1.92 2.7 20 120 2.80 3.14 0.33 1.97  6.2 1.07 2.6 1280
## 18         1 14 1.57 2.6 20 115 2.95 3.40 0.40 1.72  6.6 1.13 2.6 1130
## 19         1 14 1.59 2.5 16 108 3.30 3.93 0.32 1.86  8.7 1.23 2.8 1680
## 20         1 14 3.10 2.6 15 116 2.70 3.03 0.17 1.66  5.1 0.96 3.4  845
## 21         1 14 1.63 2.3 16 126 3.00 3.17 0.24 2.10  5.7 1.09 3.7  780
## 22         1 13 3.80 2.6 19 102 2.41 2.41 0.25 1.98  4.5 1.03 3.5  770
## 23         1 14 1.86 2.4 17 101 2.61 2.88 0.27 1.69  3.8 1.11 4.0 1035
## 24         1 13 1.60 2.5 18  95 2.48 2.37 0.26 1.46  3.9 1.09 3.6 1015
## 25         1 14 1.81 2.6 20  96 2.53 2.61 0.28 1.66  3.5 1.12 3.8  845
## 26         1 13 2.05 3.2 25 124 2.63 2.68 0.47 1.92  3.6 1.13 3.2  830
## 27         1 13 1.77 2.6 16  93 2.85 2.94 0.34 1.45  4.8 0.92 3.2 1195
## 28         1 13 1.72 2.1 17  94 2.40 2.19 0.27 1.35  4.0 1.02 2.8 1285
## 29         1 14 1.90 2.8 19 107 2.95 2.97 0.37 1.76  4.5 1.25 3.4  915
## 30         1 14 1.68 2.2 16  96 2.65 2.33 0.26 1.98  4.7 1.04 3.6 1035
## 31         1 14 1.50 2.7 22 101 3.00 3.25 0.29 2.38  5.7 1.19 2.7 1285
## 32         1 14 1.66 2.4 19 106 2.86 3.19 0.22 1.95  6.9 1.09 2.9 1515
## 33         1 14 1.83 2.4 17 104 2.42 2.69 0.42 1.97  3.8 1.23 2.9  990
## 34         1 14 1.53 2.7 20 132 2.95 2.74 0.50 1.35  5.4 1.25 3.0 1235
## 35         1 14 1.80 2.6 19 110 2.35 2.53 0.29 1.54  4.2 1.10 2.9 1095
## 36         1 13 1.81 2.4 20 100 2.70 2.98 0.26 1.86  5.1 1.04 3.5  920
## 37         1 13 1.64 2.8 16 110 2.60 2.68 0.34 1.36  4.6 1.09 2.8  880
## 38         1 13 1.65 2.5 18  98 2.45 2.43 0.29 1.44  4.2 1.12 2.5 1105
## 39         1 13 1.50 2.1 16  98 2.40 2.64 0.28 1.37  3.7 1.18 2.7 1020
## 40         1 14 3.99 2.5 13 128 3.00 3.04 0.20 2.08  5.1 0.89 3.5  760
## 41         1 14 1.71 2.3 16 117 3.15 3.29 0.34 2.34  6.1 0.95 3.4  795
## 42         1 13 3.84 2.1 19  90 2.45 2.68 0.27 1.48  4.3 0.91 3.0 1035
## 43         1 14 1.89 2.6 15 101 3.25 3.56 0.17 1.70  5.4 0.88 3.6 1095
## 44         1 13 3.98 2.3 18 103 2.64 2.63 0.32 1.66  4.4 0.82 3.0  680
## 45         1 13 1.77 2.1 17 107 3.00 3.00 0.28 2.03  5.0 0.88 3.4  885
## 46         1 14 4.04 2.4 19 111 2.85 2.65 0.30 1.25  5.2 0.87 3.3 1080
## 47         1 14 3.59 2.3 16 102 3.25 3.17 0.27 2.19  4.9 1.04 3.4 1065
## 48         1 14 1.68 2.1 16 101 3.10 3.39 0.21 2.14  6.1 0.91 3.3  985
## 49         1 14 2.02 2.4 19 103 2.75 2.92 0.32 2.38  6.2 1.07 2.8 1060
## 50         1 14 1.73 2.3 17 108 2.88 3.54 0.32 2.08  8.9 1.12 3.1 1260
## 51         1 13 1.73 2.0 12  92 2.72 3.27 0.17 2.91  7.2 1.12 2.9 1150
## 52         1 14 1.65 2.6 17  94 2.45 2.99 0.22 2.29  5.6 1.24 3.4 1265
## 53         1 14 1.75 2.4 14 111 3.88 3.74 0.32 1.87  7.0 1.01 3.3 1190
## 54         1 14 1.90 2.7 17 115 3.00 2.79 0.39 1.68  6.3 1.13 2.9 1375
## 55         1 14 1.67 2.2 16 118 2.60 2.90 0.21 1.62  5.8 0.92 3.2 1060
## 56         1 14 1.73 2.5 20 116 2.96 2.78 0.20 2.45  6.2 0.98 3.0 1120
## 57         1 14 1.70 2.3 16 118 3.20 3.00 0.26 2.03  6.4 0.94 3.3  970
## 58         1 13 1.97 2.7 17 102 3.00 3.23 0.31 1.66  6.0 1.07 2.8 1270
## 59         1 14 1.43 2.5 17 108 3.40 3.67 0.19 2.04  6.8 0.89 2.9 1285
## 60         2 12 0.94 1.4 11  88 1.98 0.57 0.28 0.42  1.9 1.05 1.8  520
## 61         2 12 1.10 2.3 16 101 2.05 1.09 0.63 0.41  3.3 1.25 1.7  680
## 62         2 13 1.36 2.0 17 100 2.02 1.41 0.53 0.62  5.8 0.98 1.6  450
## 63         2 14 1.25 1.9 18  94 2.10 1.79 0.32 0.73  3.8 1.23 2.5  630
## 64         2 12 1.13 2.2 19  87 3.50 3.10 0.19 1.87  4.5 1.22 2.9  420
## 65         2 12 1.45 2.5 19 104 1.89 1.75 0.45 1.03  3.0 1.45 2.2  355
## 66         2 12 1.21 2.6 18  98 2.42 2.65 0.37 2.08  4.6 1.19 2.3  678
## 67         2 13 1.01 1.7 15  78 2.98 3.18 0.26 2.28  5.3 1.12 3.2  502
## 68         2 12 1.17 1.9 20  78 2.11 2.00 0.27 1.04  4.7 1.12 3.5  510
## 69         2 13 0.94 2.4 17 110 2.53 1.30 0.55 0.42  3.2 1.02 1.9  750
## 70         2 12 1.19 1.8 17 151 1.85 1.28 0.14 2.50  2.9 1.28 3.1  718
## 71         2 12 1.61 2.2 20 103 1.10 1.02 0.37 1.46  3.0 0.91 1.8  870
## 72         2 14 1.51 2.7 25  86 2.95 2.86 0.21 1.87  3.4 1.36 3.2  410
## 73         2 13 1.66 2.2 24  87 1.88 1.84 0.27 1.03  3.7 0.98 2.8  472
## 74         2 13 1.67 2.6 30 139 3.30 2.89 0.21 1.96  3.4 1.31 3.5  985
## 75         2 12 1.09 2.3 21 101 3.38 2.14 0.13 1.65  3.2 0.99 3.1  886
## 76         2 12 1.88 1.9 16  97 1.61 1.57 0.34 1.15  3.8 1.23 2.1  428
## 77         2 13 0.90 1.7 16  86 1.95 2.03 0.24 1.46  4.6 1.19 2.5  392
## 78         2 12 2.89 2.2 18 112 1.72 1.32 0.43 0.95  2.6 0.96 2.5  500
## 79         2 12 0.99 1.9 15 136 1.90 1.85 0.35 2.76  3.4 1.06 2.3  750
## 80         2 13 3.87 2.4 23 101 2.83 2.55 0.43 1.95  2.6 1.19 3.1  463
## 81         2 12 0.92 2.0 19  86 2.42 2.26 0.30 1.43  2.5 1.38 3.1  278
## 82         2 13 1.81 2.2 19  86 2.20 2.53 0.26 1.77  3.9 1.16 3.1  714
## 83         2 12 1.13 2.5 24  78 2.00 1.58 0.40 1.40  2.2 1.31 2.7  630
## 84         2 13 3.86 2.3 22  85 1.65 1.59 0.61 1.62  4.8 0.84 2.0  515
## 85         2 12 0.89 2.6 18  94 2.20 2.21 0.22 2.35  3.0 0.79 3.1  520
## 86         2 13 0.98 2.2 18  99 2.20 1.94 0.30 1.46  2.6 1.23 3.2  450
## 87         2 12 1.61 2.3 23  90 1.78 1.69 0.43 1.56  2.5 1.33 2.3  495
## 88         2 12 1.67 2.6 26  88 1.92 1.61 0.40 1.34  2.6 1.36 3.2  562
## 89         2 12 2.06 2.5 22  84 1.95 1.69 0.48 1.35  2.8 1.00 2.8  680
## 90         2 12 1.33 2.3 24  70 2.20 1.59 0.42 1.38  1.7 1.07 3.2  625
## 91         2 12 1.83 2.3 18  81 1.60 1.50 0.52 1.64  2.4 1.08 2.3  480
## 92         2 12 1.51 2.4 22  86 1.45 1.25 0.50 1.63  3.6 1.05 2.6  450
## 93         2 13 1.53 2.3 21  80 1.38 1.46 0.58 1.62  3.0 0.96 2.1  495
## 94         2 12 2.83 2.2 18  88 2.45 2.25 0.25 1.99  2.1 1.15 3.3  290
## 95         2 12 1.99 2.3 18  98 3.02 2.26 0.17 1.35  3.2 1.16 3.0  345
## 96         2 12 1.52 2.2 19 162 2.50 2.27 0.32 3.28  2.6 1.16 2.6  937
## 97         2 12 2.12 2.7 22 134 1.60 0.99 0.14 1.56  2.5 0.95 2.3  625
## 98         2 12 1.41 2.0 16  85 2.55 2.50 0.29 1.77  2.9 1.23 2.7  428
## 99         2 12 1.07 2.1 18  88 3.52 3.75 0.24 1.95  4.5 1.04 2.8  660
## 100        2 12 3.17 2.2 18  88 2.85 2.99 0.45 2.81  2.3 1.42 2.8  406
## 101        2 12 2.08 1.7 18  97 2.23 2.17 0.26 1.40  3.3 1.27 3.0  710
## 102        2 13 1.34 1.9 18  88 1.45 1.36 0.29 1.35  2.5 1.04 2.8  562
## 103        2 12 2.45 2.5 21  98 2.56 2.11 0.34 1.31  2.8 0.80 3.4  438
## 104        2 12 1.72 1.9 20  86 2.50 1.64 0.37 1.42  2.1 0.94 2.4  415
## 105        2 13 1.73 2.0 20  85 2.20 1.92 0.32 1.48  2.9 1.04 3.6  672
## 106        2 12 2.55 2.3 22  90 1.68 1.84 0.66 1.42  2.7 0.86 3.3  315
## 107        2 12 1.73 2.1 19  80 1.65 2.03 0.37 1.63  3.4 1.00 3.2  510
## 108        2 13 1.75 2.3 22  84 1.38 1.76 0.48 1.63  3.3 0.88 2.4  488
## 109        2 12 1.29 1.9 19  92 2.36 2.04 0.39 2.08  2.7 0.86 3.0  312
## 110        2 12 1.35 2.7 20  94 2.74 2.92 0.29 2.49  2.6 0.96 3.3  680
## 111        2 11 3.74 1.8 20 107 3.18 2.58 0.24 3.58  2.9 0.75 2.8  562
## 112        2 13 2.43 2.2 21  88 2.55 2.27 0.26 1.22  2.0 0.90 2.8  325
## 113        2 12 2.68 2.9 20 103 1.75 2.03 0.60 1.05  3.8 1.23 2.5  607
## 114        2 11 0.74 2.5 21  88 2.48 2.01 0.42 1.44  3.1 1.10 2.3  434
## 115        2 12 1.39 2.5 22  84 2.56 2.29 0.43 1.04  2.9 0.93 3.2  385
## 116        2 11 1.51 2.2 22  85 2.46 2.17 0.52 2.01  1.9 1.71 2.9  407
## 117        2 12 1.47 2.0 21  86 1.98 1.60 0.30 1.53  1.9 0.95 3.3  495
## 118        2 12 1.61 2.2 22 108 2.00 2.09 0.34 1.61  2.1 1.06 3.0  345
## 119        2 13 3.43 2.0 16  80 1.63 1.25 0.43 0.83  3.4 0.70 2.1  372
## 120        2 12 3.43 2.0 19  87 2.00 1.64 0.37 1.87  1.3 0.93 3.0  564
## 121        2 11 2.40 2.4 20  96 2.90 2.79 0.32 1.83  3.2 0.80 3.4  625
## 122        2 12 2.05 3.2 28 119 3.18 5.08 0.47 1.87  6.0 0.93 3.7  465
## 123        2 12 4.43 2.7 26 102 2.20 2.13 0.43 1.71  2.1 0.92 3.1  365
## 124        2 13 5.80 2.1 22  86 2.62 2.65 0.30 2.01  2.6 0.73 3.1  380
## 125        2 12 4.31 2.4 21  82 2.86 3.03 0.21 2.91  2.8 0.75 3.6  380
## 126        2 12 2.16 2.2 21  85 2.60 2.65 0.37 1.35  2.8 0.86 3.3  378
## 127        2 12 1.53 2.3 22  86 2.74 3.15 0.39 1.77  3.9 0.69 2.8  352
## 128        2 12 2.13 2.8 28  92 2.13 2.24 0.58 1.76  3.0 0.97 2.4  466
## 129        2 12 1.63 2.3 24  88 2.22 2.45 0.40 1.90  2.1 0.89 2.8  342
## 130        2 12 4.30 2.4 22  80 2.10 1.75 0.42 1.35  2.6 0.79 2.6  580
## 131        3 13 1.35 2.3 18 122 1.51 1.25 0.21 0.94  4.1 0.76 1.3  630
## 132        3 13 2.99 2.4 20 104 1.30 1.22 0.24 0.83  5.4 0.74 1.4  530
## 133        3 13 2.31 2.4 24  98 1.15 1.09 0.27 0.83  5.7 0.66 1.4  560
## 134        3 13 3.55 2.4 22 106 1.70 1.20 0.17 0.84  5.0 0.78 1.3  600
## 135        3 13 1.24 2.2 18  85 2.00 0.58 0.60 1.25  5.5 0.75 1.5  650
## 136        3 13 2.46 2.2 18  94 1.62 0.66 0.63 0.94  7.1 0.73 1.6  695
## 137        3 12 4.72 2.5 21  89 1.38 0.47 0.53 0.80  3.9 0.75 1.3  720
## 138        3 13 5.51 2.6 25  96 1.79 0.60 0.63 1.10  5.0 0.82 1.7  515
## 139        3 13 3.59 2.2 20  88 1.62 0.48 0.58 0.88  5.7 0.81 1.8  580
## 140        3 13 2.96 2.6 24 101 2.32 0.60 0.53 0.81  4.9 0.89 2.1  590
## 141        3 13 2.81 2.7 21  96 1.54 0.50 0.53 0.75  4.6 0.77 2.3  600
## 142        3 13 2.56 2.4 20  89 1.40 0.50 0.37 0.64  5.6 0.70 2.5  780
## 143        3 14 3.17 2.7 24  97 1.55 0.52 0.50 0.55  4.3 0.89 2.1  520
## 144        3 14 4.95 2.4 20  92 2.00 0.80 0.47 1.02  4.4 0.91 2.0  550
## 145        3 12 3.88 2.2 18 112 1.38 0.78 0.29 1.14  8.2 0.65 2.0  855
## 146        3 13 3.57 2.1 21 102 1.50 0.55 0.43 1.30  4.0 0.60 1.7  830
## 147        3 14 5.04 2.2 20  80 0.98 0.34 0.40 0.68  4.9 0.58 1.3  415
## 148        3 13 4.61 2.5 22  86 1.70 0.65 0.47 0.86  7.7 0.54 1.9  625
## 149        3 13 3.24 2.4 22  92 1.93 0.76 0.45 1.25  8.4 0.55 1.6  650
## 150        3 13 3.90 2.4 22 113 1.41 1.39 0.34 1.14  9.4 0.57 1.3  550
## 151        3 14 3.12 2.6 24 123 1.40 1.57 0.22 1.25  8.6 0.59 1.3  500
## 152        3 13 2.67 2.5 22 112 1.48 1.36 0.24 1.26 10.8 0.48 1.5  480
## 153        3 13 1.90 2.8 26 116 2.20 1.28 0.26 1.56  7.1 0.61 1.3  425
## 154        3 13 3.30 2.3 18  98 1.80 0.83 0.61 1.87 10.5 0.56 1.5  675
## 155        3 13 1.29 2.1 20 103 1.48 0.58 0.53 1.40  7.6 0.58 1.6  640
## 156        3 13 5.19 2.3 22  93 1.74 0.63 0.61 1.55  7.9 0.60 1.5  725
## 157        3 14 4.12 2.4 20  89 1.80 0.83 0.48 1.56  9.0 0.57 1.6  480
## 158        3 12 3.03 2.6 27  97 1.90 0.58 0.63 1.14  7.5 0.67 1.7  880
## 159        3 14 1.68 2.7 25  98 2.80 1.31 0.53 2.70 13.0 0.57 2.0  660
## 160        3 13 1.67 2.6 22  89 2.60 1.10 0.52 2.29 11.8 0.57 1.8  620
## 161        3 12 3.83 2.4 21  88 2.30 0.92 0.50 1.04  7.7 0.56 1.6  520
## 162        3 14 3.26 2.5 20 107 1.83 0.56 0.50 0.80  5.9 0.96 1.8  680
## 163        3 13 3.27 2.6 22 106 1.65 0.60 0.60 0.96  5.6 0.87 2.1  570
## 164        3 13 3.45 2.4 18 106 1.39 0.70 0.40 0.94  5.3 0.68 1.8  675
## 165        3 14 2.76 2.3 22  90 1.35 0.68 0.41 1.03  9.6 0.70 1.7  615
## 166        3 14 4.36 2.3 22  88 1.28 0.47 0.52 1.15  6.6 0.78 1.8  520
## 167        3 13 3.70 2.6 23 111 1.70 0.92 0.43 1.46 10.7 0.85 1.6  695
## 168        3 13 3.37 2.3 20  88 1.48 0.66 0.40 0.97 10.3 0.72 1.8  685
## 169        3 14 2.58 2.7 24 105 1.55 0.84 0.39 1.54  8.7 0.74 1.8  750
## 170        3 13 4.60 2.9 25 112 1.98 0.96 0.27 1.11  8.5 0.67 1.9  630
## 171        3 12 3.03 2.3 19  96 1.25 0.49 0.40 0.73  5.5 0.66 1.8  510
## 172        3 13 2.39 2.3 20  86 1.39 0.51 0.48 0.64  9.9 0.57 1.6  470
## 173        3 14 2.51 2.5 20  91 1.68 0.70 0.44 1.24  9.7 0.62 1.7  660
## 174        3 14 5.65 2.5 20  95 1.68 0.61 0.52 1.06  7.7 0.64 1.7  740
## 175        3 13 3.91 2.5 23 102 1.80 0.75 0.43 1.41  7.3 0.70 1.6  750
## 176        3 13 4.28 2.3 20 120 1.59 0.69 0.43 1.35 10.2 0.59 1.6  835
## 177        3 13 2.59 2.4 20 120 1.65 0.68 0.53 1.46  9.3 0.60 1.6  840
## 178        3 14 4.10 2.7 24  96 2.05 0.76 0.56 1.35  9.2 0.61 1.6  560

Look at the data.
Do PCA, and plot the results colored by vineyard.
What happens if you set cor=FALSE?
Which variables most strongly differentiate the three vineyards?

PCA: Squaring the math and the code:

The data matrix $X$ can be written using the singular value decomposition, as \[ X = U \Lambda V^T, \] where $U^T U = I$ and $V^T V = I$ and $\Lambda$ has the singular values $\lambda_1, \ldots, \lambda_m$ on the diagonal.

The best $k$-dimensional approximation to $X$ is \[ \hat X = \sum_{i=1}^k \lambda_i u_i v_i^T , \] where $u_i$ and $v_i$ are the $i$th columns of $U$ and $V$.

Furthermore, \[ \sum_{ij} X_{ij} = \sum_i \lambda_i^2 . \]

A translation table

$u_\ell$ is the $\ell$th PC, standardized.
$v_\ell$ gives the loadings of the $\ell$th PC.
$\lambda_\ell^2 / \sum_{ij} X_{ij}^2$ is the percent variation explained by the $\ell$th PC.

Furthermore, since $\Lambda U = X V$,

$\lambda_\ell u_\ell$ is the vector of values given by the linear combination of the variables with weights given by $v_\ell$.

Translation to `prcomp`:

X <- cc[,-(1:6)]
X.svd <- svd(X)
X.pca <- prcomp(X, center=FALSE, scale=FALSE)

# Singular values:
#  prcomps's sdevs are singular values / sqrt(n-1)
stopifnot(all(X.svd$d / sqrt(nrow(X) - 1) == X.pca$sdev))

# Loadings:
#  prcomp's loadings (returned as "rotation") are V
stopifnot(all(X.pca$rotation == X.svd$v))

# PCs:
#  prcomp's PCs (returned as "x")
#  are Lambda * U
stopifnot(all(abs(X.pca$x - X.svd$u * X.svd$d[col(X.svd$u)]) < 1e-10))

Translation to `princomp`:

princomp uses eigendecomposition of the covariance matrix, and so necessarily recenters the variables.

X.svd2 <- svd(scale(X, center=TRUE, scale=FALSE))
X.pca2 <- princomp(X, cor=FALSE, scores=TRUE)

# Singular values:
#  princomps's sdevs are singular values / sqrt(n)
stopifnot(all(abs(sqrt(eigen(cov(X))$values * (nrow(X)-1)/nrow(X)) - X.pca2$sdev) < 1e-10))
stopifnot(all(abs(X.svd2$d / sqrt(nrow(X)) - X.pca2$sdev) < 1e-10))

# Loadings:
#  princomp's loadings are V, up to sign
the_signs <- sign(X.pca2$loadings[1,] * X.svd2$v[1,])
stopifnot(all(abs(X.pca2$loadings - X.svd2$v * the_signs[col(X.svd2$v)]) < 1e-10))

# PCs:
#  princomp's PCs (returned as "scores")
#  are Lambda * U
stopifnot(all(abs(X.pca2$scores
                  - X.svd2$u * X.svd2$d[col(X.svd2$u)] * the_signs[col(X.svd2$u)]) < 1e-10))

t-SNE

http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

t-SNE is a recently published method for dimension reduction (visualization of struture in high-dimensional data).

It makes very nice visualizations.

The generic idea is to find a low dimensional representation (i.e., a picture) in which proximity of points reflects similarity of the corresponding (high dimensional) observations.

Suppose we have $n$ data points, $x_1, \ldots, x_n$ and each one has $p$ variables: $x_1 = (x_{11}, \ldots, x_{1p})$.

For each, we want to find a low-dimensional point $y$: \[\begin{aligned} x_i \mapsto y_i \end{aligned}\] similarity of $x_i$ and $x_j$ is (somehow) reflected by proximity of $y_i$ and $y_j$

We need to define:

“low dimension” (usually, $k=2$ or $3$)
“similarity” of $x_i$ and $x_j$
“reflected by proximity” of $y_i$ and $y_j$

Similarity

To measure similarity, we just use (Euclidean) distance: \[\begin{aligned} d(x_i, x_j) = \sqrt{\sum_{\ell=1}^n (x_{i\ell} - x_{j\ell})^2} , \end{aligned}\] after first normalizing variables to vary on comparable scales.

Proximity

A natural choice is to require distances in the new space ($d(y_i, y_j)$) to be as close as possible to distances in the original space ($d(x_i, x_j)$).

That’s a pairwise condition. t-SNE instead tries to measure how faithfully relative distances are represented in each point’s neighborhood.

Neighborhoods

For each data point $x_i$, define \[\begin{aligned} p_{ij} = \frac{ e^{-d(x_i, x_j)^2 / (2\sigma^2)} }{ \sum_{\ell \neq i} e^{-d(x_i, x_\ell)^2 / (2 \sigma^2)} } , \end{aligned}\] and $p_{ii} = 0$.

This is the probability that point $i$ would pick point $j$ as a neighbor if these are chosen according to the Gaussian density centered at $x_i$.

Similarly, in the output space, define \[\begin{aligned} q_{ij} = \frac{ (1 + d(y_i, y_j)^2)^{-1} }{ \sum_{\ell \neq i} (1 + d(y_i, y_\ell)^2)^{-1} } \end{aligned}\] and $q_{ii} = 0$.

This is the probability that point $i$ would pick point $j$ as a neighbor if these are chosen according to the Cauchy centered at $x_i$.

Similiarity of neighborhoods

We want neighborhoods to look similar, i.e., choose $y$ so that $q$ looks like $p$.

To do this, we minimize \[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]

What’s KL()?

Kullback-Leibler divergence

\[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]

This is the average log-likelihood ratio between $p$ and $q$ for a single observation drawn from $p$.

It is a measure of how “surprised” you would be by a sequence of samples from $p$ that you think should be coming from $q$.

Facts:

$\text{KL}(p \given q) \ge 0$
$\text{KL}(p \given q) = 0$ only if $p_i = q_i$ for all $i$.

Simulate data

A high-dimensional donut

Let’s first make some data. This will be some points distributed around a ellipse in $n$ dimensions.

n <- 20
npts <- 1e3
xy <- matrix(rnorm(n*npts), ncol=n)
theta <- runif(npts) * 2 * pi
ab <- matrix(rnorm(2*n), nrow=2)
ab[,2] <- ab[,2] - ab[,1] * sum(ab[,1] * ab[,2]) / sqrt(sum(ab[,1]^2))
ab <- sweep(ab, 1, sqrt(rowSums(ab^2)), "/")
for (k in 1:npts) {
    dxy <- 4 * c(cos(theta[k]), sin(theta[k])) %*% ab
    xy[k,] <- xy[k,] + dxy
}

Here’s what the data look like: plot of chunk plot_data

But there is hidden, two-dimensional structure: plot of chunk plot_ring

Now let’s build the distance matrix.

dmat <- as.matrix(dist(xy))

Implementation

A Stan block

tsne_block <- '
data {
    int N; // number of points
    int n; // input dimension
    int k; // output dimension
    matrix[N,N] dsq;  // distance matrix, squared
}
parameters {
    real<lower=0> sigma_sq; // in kernel for p 
    matrix[N-2,k] y1;
    real<lower=0> y21;
}
transformed parameters {
    matrix[N,k] y;
    y[1,] = rep_row_vector(0.0, k);
    y[3:N,] = y1;
    y[2,] = rep_row_vector(0.0, k);
    y[2,1] = y21;
}
model {
    matrix[N,N] q;
    real dt;
    matrix[N,N] p;
    q[N,N] = 0.0;
    for (i in 1:(N-1)) {
        q[i,i] = 0.0;
        for (j in (i+1):N) {
            q[i,j] = 1 / (1 + squared_distance(y[i], y[j]));
            q[j,i] = q[i,j];
        }
    }
    for (i in 1:N) {
        q[i] = q[i] / sum(q[i]);
    }
    // create p matrix
    p = exp(-dsq/(2*sigma_sq));
    for (i in 1:N) {
        p[i,i] = 0.0;
        p[i] = p[i] / sum(p[i]);
    }
    // compute the target
    for (i in 1:(N-1)) {
        for (j in (i+1):N) {
            target += (-1) * (p[i,j] .* log(p[i,j] ./ q[i,j]));
            target += (-1) * (p[j,i] .* log(p[j,i] ./ q[j,i]));
        }
    }
    sigma_sq ~ normal(0, 10);
}'


tk <- 2
tsne_model <- stan_model(model_code=tsne_block)

Run the model with `optimizing`

runtime <- system.time(tsne_optim <- optimizing(tsne_model,
                                 data=list(N=nrow(xy),
                                           n=ncol(xy),
                                           k=tk,
                                           dsq=(dmat/max(dmat))^2)))
runtime

##    user  system elapsed 
##  110.09    0.18  110.31

It works!

plot of chunk plot_tsne

Another case

High-dimensional random walk

n <- 40
npts <- 1e2
# rw <- colCumsums(matrix(rnorm(n*npts), ncol=n))
rw <- matrix(rnorm(n*npts), ncol=n)
for (i in 2:npts) {
    rw[i,] <- sqrt(.9) * rw[i-1,] + sqrt(.1) * rw[i,]
}

Here’s what the data look like: plot of chunk plot_rw

plot of chunk plot_rw_sub

Now let’s build the distance matrix.

rw_dmat <- as.matrix(dist(rw))

Run again

rw_runtime <- system.time(
     rw_tsne_optim <- optimizing(tsne_model,
                                 data=list(N=nrow(rw),
                                           n=ncol(rw),
                                           k=tk,
                                           dsq=(rw_dmat/max(rw_dmat))^2),
                                 tol_rel_grad=1e5))
rw_runtime

##    user  system elapsed 
##       1       0       1

It works!

plot of chunk plot_rw_tsne

Overview of dimension reduction

The menagerie

There are many dimension reduction methods, e.g.:

principal components analysis (PCA)
non-negative matrix factorization (NMF)
independent components analysis (ICA)
canonical correpondence analysis (CCA)
principal coordinates analysis (PCoA)
multidimensional scaling (MDS)
redundancy analysis (RDA)
Sammon mapping
kernel PCA
t-SNE
UMAP
locally linear embedding (LLE)
Laplacian eigenmaps
autoencoders

Using distances or similarities?

PCA uses the covariance matrix, which measures similarity.

t-SNE begins with the matrix of distances, measuring dissimilarity.

Metric or non-Metric?

Are distances interpretable?

metric: In PCA, each axis is a fixed linear combination of variables. So, distances always mean the same thing no matter where you are on the plot.

non-metric: In t-SNE, distances within different clusters are not comparable.

Why ordination?

From ordination.okstate.edu, about ordination in ecology:

Graphical results often lead to intuitive interpretations of species-environment relationships.
A single multivariate analysis saves time, in contrast to a separate univariate analysis for each species.
Ideally and typically, dimensions of this ‘low dimensional space’ will represent important and interpretable environmental gradients.
If statistical tests are desired, problems of multiple comparisons are diminished when species composition is studied in its entirety.
Statistical power is enhanced when species are considered in aggregate, because of redundancy.
By focusing on ‘important dimensions’, we avoid interpreting (and misinterpreting) noise.

Beware overinterpretation

Ordination methods are strongly influenced by sampling: ordination may ignore large-scale patterns in favor of describing variation within a highly oversampled area.
Ordination methods also describe patterns common to many variables: measuring the same thing many times may drive results.
Many methods are designed to find clusters, because our brain likes to categorize things. This doesn’t mean those clusters are well-separated in reality.

Dimension reduction: PCA, t-SNE and others

Multivariate data

The setting

Principal component analysis (PCA)

Primary aims of PCA

Quick example: diabetes data

Quick example: diabetes data

Quick example: cream cheese

Quick example: cream cheese

PCA: how it works

Reducing dimesions

Notation:

Geometric interpretation

A model for PCA

Also known as

Interpretation

What next?

Beer

Example: local beer

Goals:

PCA: how to do it

Practical considerations

?princomp

What it means

How many PCs should I use?

The scree plot

The top four PCs

Gotchas

Your turn: wine

PCA: Squaring the math and the code:

A translation table

Translation to prcomp:

Translation to princomp:

t-SNE

t-SNE

Similarity

Proximity

Neighborhoods

Similiarity of neighborhoods

Kullback-Leibler divergence

Simulate data

A high-dimensional donut

Implementation

A Stan block

Run the model with optimizing

It works!

Another case

High-dimensional random walk

Run again

It works!

Overview of dimension reduction

The menagerie

Using distances or similarities?

Metric or non-Metric?

Why ordination?

Beware overinterpretation

`?princomp`

Translation to `prcomp`:

Translation to `princomp`:

Run the model with `optimizing`