Clustering and dimension reduction

Peter Ralph

25 February 2019 – Advanced Biological Statistics

Visualizing expression space

A conceptual model

Let’s build a conceptual model for descriptive analysis of “mixture” expression data.

Data: expression data from tissue samples that consist of various mixtures of different cell types.

Goal: identify shared coexpression patterns corresponding to cell type.

Similar situations: identify different developmental stages from whole-organism expression; common community structures from metagenomic data.

Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.

Mean expression by cell type.
Cell type proportions by sample.

$x_{kj}$ : Mean expression of gene $j$ in cell type $k$.
$w_{ik}$ : Proportion of sample $i$ of cell type $k$.

$Z_{ij}$ : expression level in sample $i$ of gene $j$.

\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]

Nonnegative matrix factorization

… aka “NMF”

We are decomposing $Z$ into the product of two lower-dimensional, nonnegative factors:

\[\begin{aligned} Z_{ij} &\approx \sum_k w_{ik} x_{kj} \\ w_{ik} &\ge 0 \\ x_{kj} &\ge 0 . \end{aligned}\]

A simple NMF model

simple_nmf <- stan_model(model_code="
data {
    int N; // samples
    int L; // variables
    int K; // factors
    real Z[L,N];
}
parameters {
    matrix<lower=0>[L,K] x;
    matrix<lower=0>[K,N] w;
    real<lower=0> sigma;
}
model {
    for (j in 1:L) {
        Z[j] ~ normal(x[j] * w, sigma);
    }
}
")

Relationship to PCA

PCA finds $w$ and $z$ to minimize \[\begin{aligned} \sum_{ij} \| Z_{ij} - \sum_k w_{ik} x_{kj} \|^2 . \end{aligned}\]

In other words, it is the maximum-likelihood solution to \[\begin{aligned} Z_{ij} &\sim \Normal(\sum_k w_{ik} x_{kj}, \sigma^2) . \end{aligned}\] (The eigenvectors are the columns of $x$, and the eigenvectors are related to the size of $w$ and $x$.)

PCA, in Stan

stan_pca <- stan_model(model_code="
data {
    int N; // samples
    int L; // variables
    int K; // factors
    real Z[L,N];
}
parameters {
    matrix[L,K] x;
    matrix[K,N] w;
    real<lower=0> sigma;
}
model {
    for (j in 1:L) {
        Z[j] ~ normal(x[j] * w, sigma);
    }
}
")

(note: needs some priors to work well; see here.)

Stochastic minute

the Dirichlet distribution

A random set of $k$ proportions $0 \le P_i \le 1$ has a $\Dirichlet(\alpha_1, \ldots, \alpha_k)$ if it has probability density \[\begin{aligned} \frac{1}{B(\alpha)} \prod_{i=1}^k p_i^{\alpha_i} \end{aligned}\] over the set of possible values \[\begin{aligned} P_1 + \cdots + P_k = 1 . \end{aligned}\]

This is useful as a prior on proportions.
The mean is \[ \left( \frac{\alpha_1}{\sum_j \alpha_j}, \frac{\alpha_2}{\sum_j \alpha_j}, \cdots, \frac{\alpha_k}{\sum_j \alpha_j} \right) . \]
This generalized the Beta: if $X \sim \Beta(a, b)$ then $(X, 1-X) \sim \Dirichlet(a, b)$.
Marginal distributions are Beta distributed: $P_i \sim \Beta(\alpha_i, \sum_{j=1}^k \alpha_j - \alpha_i)$.

If $X_i \sim \Gamma(\text{shape}=\alpha_i)$, and \[\begin{aligned} P_i = X_i / \sum_{j=1}^k X_j \end{aligned}\] then $P \sim \Dirichlet(\alpha)$.

“Simplex” parameters

“The $k$-simplex” is the set of proportions, i.e., nonnegative numbers $p$ satisfying \[\begin{aligned} p_1 + \cdots p_k = 1 . \end{aligned}\]

parameters {
    simplex[K] p;
}
model {
    p ~ dirichlet(alpha);
}

Back to expression space

Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.

Mean expression by cell type.
Cell type proportions by sample.

$x_{kj}$ : Mean expression of gene $j$ in cell type $k$.
$w_{ik}$ : Proportion of sample $i$ of cell type $k$.

$Z_{ij}$ : expression in sample $i$ of gene $j$.

\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]

Mean expression by cell type.
Cell type proportions by sample.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.

$Z_{ij}$ : expression level in sample $i$ of gene $j$.

\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]

$y_j$, $\eta_j$ : mean and SD of expression of gene $j$ across all cell types; shrink $x_{kj}$ towards $y_j$.
(omit this)

data {
  int N; // # samples
  int L; // # genes
  int K; // # cell types
  int Z[N,L];
}

parameters {
  matrix<lower=0>[L,K] x;
  vector[L] y;
  simplex[K] w[N];
  vector<lower=0>[L] eta;
  vector<lower=0>[K] alpha;
  real<lower=0> d_alpha;
}

model {
  for (i in 1:N) {
      Z[i] ~ poisson(eta .* (x * w[i]));
      w[i] ~ dirichlet(d_alpha * alpha);
  }
  for (j in 1:K) 
      { x[,j] ~ normal(y ./ eta, 1); }
  y ~ normal(0, 20);
  alpha ~ normal(0, 1);
  d_alpha ~ exponential(0.2);
  eta ~ cauchy(0, 10);
}

$x_{kj}$ : Mean expression of gene $j$ in cell type $k$.
$w_{ik}$ : Proportion of sample $i$ of cell type $k$.

\[\begin{aligned} Z_{ij} \approx \sum_k w_{ik} x_{kj} . \end{aligned}\]

$y_j$, $\eta_j$ : mean and SD of expression of gene $j$ across all cell types; shrink $x_{kj}$ towards $y_j$.

Testing: compiles?

nmf_model <- stan_model(model_code=nmf_block)

Testing: runs?

sampling(nmf_model,
         data=list(N=10,
                   L=5,
                   K=2,
                   Z=matrix(rpois(50, 100), ncol=5)),
         chains=1, iter=100)

## 
## SAMPLING FOR MODEL '027c9a0b92082dc735bfb6350861f7ee' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 5.7e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.57 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: WARNING: There aren't enough warmup iterations to fit the
## Chain 1:          three stages of adaptation as currently configured.
## Chain 1:          Reducing each adaptation stage to 15%/75%/10% of
## Chain 1:          the given number of warmup iterations:
## Chain 1:            init_buffer = 7
## Chain 1:            adapt_window = 38
## Chain 1:            term_buffer = 5
## Chain 1: 
## Chain 1: Iteration:  1 / 100 [  1%]  (Warmup)
## Chain 1: Iteration: 10 / 100 [ 10%]  (Warmup)
## Chain 1: Iteration: 20 / 100 [ 20%]  (Warmup)
## Chain 1: Iteration: 30 / 100 [ 30%]  (Warmup)
## Chain 1: Iteration: 40 / 100 [ 40%]  (Warmup)
## Chain 1: Iteration: 50 / 100 [ 50%]  (Warmup)
## Chain 1: Iteration: 51 / 100 [ 51%]  (Sampling)
## Chain 1: Iteration: 60 / 100 [ 60%]  (Sampling)
## Chain 1: Iteration: 70 / 100 [ 70%]  (Sampling)
## Chain 1: Iteration: 80 / 100 [ 80%]  (Sampling)
## Chain 1: Iteration: 90 / 100 [ 90%]  (Sampling)
## Chain 1: Iteration: 100 / 100 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.710808 seconds (Warm-up)
## Chain 1:                1.04427 seconds (Sampling)
## Chain 1:                1.75508 seconds (Total)
## Chain 1:

## Warning: There were 3 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

## Warning: There were 31 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded

## Warning: There were 1 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low

## Warning: Examine the pairs() plot to diagnose sampling problems

## Inference for Stan model: 027c9a0b92082dc735bfb6350861f7ee.
## 1 chains, each with iter=100; warmup=50; thin=1; 
## post-warmup draws per chain=50, total post-warmup draws=50.
## 
##              mean se_mean     sd     2.5%      25%      50%      75%
## x[1,1]       1.19    0.12   0.84     0.18     0.62     0.95     1.52
## x[1,2]       0.93    0.14   0.79     0.11     0.45     0.65     1.19
## x[2,1]       1.39    0.47   1.10     0.11     0.54     1.07     2.11
## x[2,2]       1.05    0.12   0.77     0.04     0.47     0.90     1.28
## x[3,1]       1.49    0.29   0.98     0.13     0.58     1.36     2.10
## x[3,2]       1.16    0.15   0.81     0.13     0.53     1.12     1.52
## x[4,1]       1.33    0.10   0.72     0.16     0.91     1.33     1.62
## x[4,2]       1.05    0.25   0.89     0.00     0.31     0.86     1.59
## x[5,1]       1.94    0.33   1.23     0.33     1.07     1.54     2.60
## x[5,2]       1.62    0.17   1.03     0.08     0.93     1.37     2.54
## y[1]         9.77    6.92  16.48   -19.69    -4.64    12.59    24.19
## y[2]         9.75   13.07  25.19   -28.61    -6.47     5.02    34.16
## y[3]        24.10    9.26  21.39    -8.88     7.76    18.70    47.09
## y[4]         5.94    6.41  15.09   -21.34    -3.24     5.40    19.94
## y[5]        40.01    3.96  14.87    14.17    27.96    41.65    51.49
## w[1,1]       0.77    0.16   0.32     0.08     0.65     0.97     0.99
## w[1,2]       0.23    0.16   0.32     0.00     0.01     0.03     0.35
## w[2,1]       0.76    0.16   0.31     0.08     0.65     0.93     1.00
## w[2,2]       0.24    0.16   0.31     0.00     0.00     0.07     0.35
## w[3,1]       0.76    0.17   0.30     0.13     0.64     0.92     1.00
## w[3,2]       0.24    0.17   0.30     0.00     0.00     0.08     0.36
## w[4,1]       0.77    0.16   0.31     0.08     0.68     0.94     1.00
## w[4,2]       0.23    0.16   0.31     0.00     0.00     0.06     0.32
## w[5,1]       0.76    0.17   0.32     0.06     0.65     0.93     1.00
## w[5,2]       0.24    0.17   0.32     0.00     0.00     0.07     0.35
## w[6,1]       0.77    0.16   0.31     0.16     0.68     0.96     1.00
## w[6,2]       0.23    0.16   0.31     0.00     0.00     0.04     0.32
## w[7,1]       0.77    0.16   0.31     0.12     0.61     0.95     1.00
## w[7,2]       0.23    0.16   0.31     0.00     0.00     0.05     0.39
## w[8,1]       0.78    0.16   0.31     0.12     0.61     0.97     1.00
## w[8,2]       0.22    0.16   0.31     0.00     0.00     0.03     0.39
## w[9,1]       0.77    0.16   0.30     0.10     0.60     0.96     1.00
## w[9,2]       0.23    0.16   0.30     0.00     0.00     0.04     0.40
## w[10,1]      0.75    0.17   0.32     0.08     0.61     0.94     0.99
## w[10,2]      0.25    0.17   0.32     0.00     0.01     0.06     0.39
## eta[1]     131.11   17.50 104.77    31.83    68.49   104.77   169.52
## eta[2]     113.86   17.72 124.60    26.25    46.53    85.96   142.42
## eta[3]     105.16   23.88  91.35    32.74    47.74    80.49   128.30
## eta[4]      96.10   12.16  93.52    29.53    59.67    72.76    97.18
## eta[5]      70.69   10.19  44.27    20.59    36.89    58.75    90.14
## alpha[1]     1.09    0.10   0.73     0.20     0.54     0.88     1.61
## alpha[2]     0.53    0.35   0.73     0.02     0.05     0.18     0.72
## d_alpha      8.14    2.70   5.71     0.69     3.69     6.42    12.55
## lp__     18013.92    8.93  14.57 17994.67 17999.31 18011.47 18029.52
##             97.5% n_eff Rhat
## x[1,1]       3.15    46 0.99
## x[1,2]       2.81    32 1.04
## x[2,1]       3.59     5 1.49
## x[2,2]       2.41    39 1.02
## x[3,1]       3.14    11 1.28
## x[3,2]       2.87    28 0.99
## x[4,1]       3.22    53 0.98
## x[4,2]       2.78    12 1.17
## x[5,1]       4.95    14 0.98
## x[5,2]       3.51    36 1.04
## y[1]        30.90     6 1.06
## y[2]        45.45     4 1.89
## y[3]        60.20     5 1.31
## y[4]        30.95     6 1.00
## y[5]        66.78    14 0.99
## w[1,1]       1.00     4 1.60
## w[1,2]       0.92     4 1.60
## w[2,1]       1.00     4 1.71
## w[2,2]       0.92     4 1.71
## w[3,1]       1.00     3 1.81
## w[3,2]       0.87     3 1.81
## w[4,1]       1.00     4 1.57
## w[4,2]       0.92     4 1.57
## w[5,1]       1.00     3 1.73
## w[5,2]       0.94     3 1.73
## w[6,1]       1.00     4 1.65
## w[6,2]       0.84     4 1.65
## w[7,1]       1.00     3 1.72
## w[7,2]       0.88     3 1.72
## w[8,1]       1.00     4 1.66
## w[8,2]       0.88     4 1.66
## w[9,1]       1.00     4 1.65
## w[9,2]       0.90     4 1.65
## w[10,1]      1.00     4 1.68
## w[10,2]      0.92     4 1.68
## eta[1]     344.99    36 1.00
## eta[2]     247.98    49 1.00
## eta[3]     334.45    15 1.15
## eta[4]     332.98    59 1.02
## eta[5]     182.97    19 1.01
## alpha[1]     2.66    52 0.98
## alpha[2]     2.40     4 1.50
## d_alpha     20.43     4 1.72
## lp__     18034.18     3 2.98
## 
## Samples were drawn using NUTS(diag_e) at Tue Mar  5 22:41:23 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Simulate data

Outline

How many cell types?
How many genes?
How many samples?
How much noise in expression?
How many genes distinguish cell types, and by how much relative to expression?
How many “noisy” genes? How many “noisy” samples?
How much variation in mixture proportions?

Easier case

Sample mixture proportions (the rows of w), from the Dirichlet distribution with parameters $= 0.3333333, 0.3333333, 0.3333333 $.

w <- matrix(rgamma(nsamples * ntypes, rate=1, 
                   shape=rep(mean_props, each=nsamples)), 
            ncol=ntypes)
w <- w / rowSums(w)
stopifnot(all(abs(rowSums(w) - 1) < 1e-8))

Sample cell type expression levels:

mean_x <- abs(100 * rnorm(ngenes))
x <- matrix(rep(mean_x, each=ntypes), nrow=ntypes)
stopifnot(all(colMeans(x) == mean_x))
for (k in 1:ntypes) {
    diff_genes <- sample.int(ngenes, ndiff_genes)
    x[k, diff_genes] <- rgamma(ndiff_genes, shape=0.5, scale=2) * x[k, diff_genes]
}

Simulate expression:

mean_z <- w %*% x
noisy_samples <- sample.int(nsamples, num_noisy_samples)
mean_z[noisy_samples,] <- (mean_z[noisy_samples,] 
                           * exp(rnorm(noisy_samples * ngenes, mean=0, sd=0.1)))
noisy_mean_z <- mean_z * exp(rnorm(nsamples * ngenes, mean=0, sd=indiv_sd))
Z <- matrix(rpois(nsamples * ngenes, lambda=noisy_mean_z), ncol=ngenes)

Run Stan:

ez_results <- optimizing(nmf_model,
                          data=list(N=nsamples,
                                    L=ngenes,
                                    K=ntypes,
                                    Z=Z))

Look at output:

est_w <- matrix(ez_results$par[grepl("^w", names(ez_results$par))], nrow=nsamples)
est_x <- t(matrix(ez_results$par[grepl("^x", names(ez_results$par))], nrow=ngenes))
est_x <- sweep(est_x, 2, ez_results$par[grepl("^eta", names(ez_results$par))], "*")

Here is the correlation between estimated w and observed w:

cor(est_w, w)

##            [,1]       [,2]       [,3]
## [1,] -0.4479292  0.9974603 -0.5419578
## [2,]  0.9991217 -0.4967387 -0.5112457
## [3,] -0.5412021 -0.4560523  0.9986858

This is very good! We are estimating relative mixtures very well. (But, note they are not in the same order.)

Here is the correlation between estimated x and observed x:

cor(t(est_x), t(x))

##           [,1]      [,2]      [,3]
## [1,] 0.6392301 0.9986823 0.4427321
## [2,] 0.9948616 0.5748984 0.5025609
## [3,] 0.5244367 0.4837158 0.9999782

This is also very good!

In-class implementation (harder case)

Sample mixture proportions (the rows of w), from the Dirichlet distribution with parameters $= 0.0277778, 0.0555556, 0.0833333, 0.1111111, 0.1388889, 0.1666667, 0.1944444, 0.2222222 $.

w <- matrix(rgamma(nsamples * ntypes, rate=1, 
                   shape=rep(mean_props, each=nsamples)), 
            ncol=ntypes)
w <- w / rowSums(w)
stopifnot(all(abs(rowSums(w) - 1) < 1e-8))

Sample cell type expression levels:

mean_x <- abs(100 * rcauchy(ngenes))
x <- matrix(rep(mean_x, each=ntypes), nrow=ntypes)
stopifnot(all(colMeans(x) == mean_x))
for (k in 1:ntypes) {
    diff_genes <- sample.int(ngenes, ndiff_genes)
    x[k, diff_genes] <- rgamma(ndiff_genes, shape=0.5, scale=diff_size) * x[k, diff_genes]
}

Simulate expression:

mean_z <- w %*% x
noisy_samples <- sample.int(nsamples, num_noisy_samples)
mean_z[noisy_samples,] <- (mean_z[noisy_samples,] 
                           * exp(rnorm(noisy_samples * ngenes, mean=0, sd=0.1)))
noisy_mean_z <- mean_z * exp(rnorm(nsamples * ngenes, mean=0, sd=indiv_sd))
Z <- matrix(rpois(nsamples * ngenes, lambda=noisy_mean_z), ncol=ngenes)

Is there signal? It looks like there is.

diff8 <- x[8,] - matrixStats::colMedians(x)
image(log(t(Z[order(w[,8]),order(abs(diff8))])))

plot of chunk image

Run Stan:

opt_results <- optimizing(nmf_model,
                          data=list(N=nsamples,
                                    L=ngenes,
                                    K=ntypes,
                                    Z=Z))

Look at output:

est_w <- matrix(opt_results$par[grepl("^w", names(opt_results$par))], nrow=nsamples)
est_x <- t(matrix(opt_results$par[grepl("^x", names(opt_results$par))], nrow=ngenes))
est_x <- sweep(est_x, 2, opt_results$par[grepl("^eta", names(opt_results$par))], "*")

Here is the correlation between estimated w and observed w:

cor(est_w, w)

##             [,1]        [,2]        [,3]        [,4]       [,5]
## [1,] -0.06511751 -0.08548045 -0.12145422  0.99779515 -0.1207468
## [2,] -0.08882160 -0.10209420 -0.17533424 -0.15455090 -0.2230952
## [3,] -0.01964063 -0.06896749  0.99732363 -0.12371316 -0.1475214
## [4,]  0.99886737 -0.03238973 -0.02588937 -0.06523943 -0.0677335
## [5,] -0.06712511 -0.10251975 -0.14701999 -0.11864029  0.9990249
## [6,] -0.07116101 -0.11989702 -0.13666643 -0.15799797 -0.1881662
## [7,] -0.03917093  0.99906854 -0.06805167 -0.08390533 -0.1050867
## [8,] -0.10300235 -0.11957976 -0.15060561 -0.17508342 -0.2109410
##             [,6]        [,7]        [,8]
## [1,] -0.16227159 -0.15239993 -0.17445705
## [2,] -0.22241746  0.99900054 -0.24794625
## [3,] -0.13752298 -0.17502168 -0.14926717
## [4,] -0.07129504 -0.09014155 -0.09375712
## [5,] -0.18860476 -0.22231870 -0.21174459
## [6,]  0.99877567 -0.22528295 -0.23389558
## [7,] -0.11215778 -0.10425900 -0.11965410
## [8,] -0.23685561 -0.24854604  0.99801810

This is very good! We are estimating relative mixtures very well.

Here is the correlation between estimated x and observed x:

cor(t(est_x), t(x), method='spearman')

##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.6794778 0.6597489 0.6875596 0.9654864 0.6601344 0.6379879 0.6936902
## [2,] 0.6294459 0.6552449 0.6702929 0.6746403 0.6604643 0.6561367 0.9637838
## [3,] 0.6842969 0.6876411 0.9850094 0.7000887 0.6933407 0.7085986 0.6933872
## [4,] 0.9471063 0.7060617 0.6755484 0.6769583 0.7027036 0.6669958 0.6787414
## [5,] 0.6461627 0.6386977 0.6547104 0.6382085 0.9589411 0.6283625 0.6382968
## [6,] 0.6026265 0.5693803 0.6429443 0.6218183 0.6138628 0.9346406 0.6169345
## [7,] 0.6502485 0.9406994 0.6676147 0.6750623 0.6652670 0.6344473 0.6785245
## [8,] 0.6419115 0.6039992 0.6440546 0.6293570 0.6225426 0.6058642 0.6328787
##           [,8]
## [1,] 0.6886014
## [2,] 0.6930210
## [3,] 0.7292227
## [4,] 0.7381844
## [5,] 0.6623493
## [6,] 0.6557481
## [7,] 0.6609027
## [8,] 0.9395845

This is also amazingly good! We can reconstruct gene expression profiles almost perfectly, despite the substantial overdispersion in expression that we didn’t model using Stan.

t-SNE

http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

t-SNE is a recently published method for dimension reduction (visualization of struture in high-dimensional data).

It makes very nice visualizations.

The generic idea is to find a low dimensional representation (i.e., a picture) in which proximity of points reflects similarity of the corresponding (high dimensional) observations.

Suppose we have $n$ data points, $x_1, \ldots, x_n$ and each one has $p$ variables: $x_1 = (x_{11}, \ldots, x_{1p})$.

For each, we want to find a low-dimensional point $y$: \[\begin{aligned} x_i \mapsto y_i \end{aligned}\] similarity of $x_i$ and $x_j$ is (somehow) reflected by proximity of $y_i$ and $y_j$

We need to define:

“low dimension” (usually, $k=2$ or $3$)
“similarity” of $x_i$ and $x_j$
“reflected by proximity” of $y_i$ and $y_j$

Similarity

To measure similarity, we just use (Euclidean) distance: \[\begin{aligned} d(x_i, x_j) = \sqrt{\sum_{\ell=1}^n (x_{i\ell} - x_{j\ell})^2} , \end{aligned}\] after first normalizing variables to vary on comparable scales.

Proximity

A natural choice is to require distances in the new space ($d(y_i, y_j)$) to be as close as possible to distances in the original space ($d(x_i, x_j)$).

That’s a pairwise condition. t-SNE instead tries to measure how faithfully relative distances are represented in each point’s neighborhood.

Neighborhoods

For each data point $x_i$, define \[\begin{aligned} p_{ij} = \frac{ e^{-d(x_i, x_j)^2 / (2\sigma^2)} }{ \sum_{\ell \neq i} e^{-d(x_i, x_\ell)^2 / (2 \sigma^2)} } , \end{aligned}\] and $p_{ii} = 0$.

This is the probability that point $i$ would pick point $j$ as a neighbor if these are chosen according to the Gaussian density centered at $x_i$.

Similarly, in the output space, define \[\begin{aligned} q_{ij} = \frac{ (1 + d(y_i, y_j)^2)^{-1} }{ \sum_{\ell \neq i} (1 + d(y_i, y_\ell)^2)^{-1} } \end{aligned}\] and $q_{ii} = 0$.

This is the probability that point $i$ would pick point $j$ as a neighbor if these are chosen according to the Cauchy centered at $x_i$.

Similiarity of neighborhoods

We want neighborhoods to look similar, i.e., choose $y$ so that $q$ looks like $p$.

To do this, we minimize \[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]

What’s KL()?

Kullback-Leibler divergence

\[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]

This is the average log-likelihood ratio between $p$ and $q$ for a single observation drawn from $p$.

It is a measure of how “surprised” you would be by a sequence of samples from $p$ that you think should be coming from $q$.

Facts:

$\text{KL}(p \given q) \ge 0$
$\text{KL}(p \given q) = 0$ only if $p_i = q_i$ for all $i$.

Simulate data

A high-dimensional donut

Let’s first make some data. This will be some points distributed around a ellipse in $n$ dimensions.

n <- 20
npts <- 1e3
xy <- matrix(rnorm(n*npts), ncol=n)
theta <- runif(npts) * 2 * pi
ab <- matrix(rnorm(2*n), nrow=2)
ab[,2] <- ab[,2] - ab[,1] * sum(ab[,1] * ab[,2]) / sqrt(sum(ab[,1]^2))
ab <- sweep(ab, 1, sqrt(rowSums(ab^2)), "/")
for (k in 1:npts) {
    dxy <- 4 * c(cos(theta[k]), sin(theta[k])) %*% ab
    xy[k,] <- xy[k,] + dxy
}

Here’s what the data look like: plot of chunk plot_data

But there is hidden, two-dimensional structure: plot of chunk plot_ring

Now let’s build the distance matrix.

dmat <- as.matrix(dist(xy))

Implementation

A Stan block

tsne_block <- '
data {
    int N; // number of points
    int n; // input dimension
    int k; // output dimension
    matrix[N,N] dsq;  // distance matrix, squared
}
parameters {
    real<lower=0> sigma_sq; // in kernel for p 
    matrix[N-2,k] y1;
    real<lower=0> y21;
}
transformed parameters {
    matrix[N,k] y;
    y[1,] = rep_row_vector(0.0, k);
    y[3:N,] = y1;
    y[2,] = rep_row_vector(0.0, k);
    y[2,1] = y21;
}
model {
    matrix[N,N] q;
    real dt;
    matrix[N,N] p;
    q[N,N] = 0.0;
    for (i in 1:(N-1)) {
        q[i,i] = 0.0;
        for (j in (i+1):N) {
            q[i,j] = 1 / (1 + squared_distance(y[i], y[j]));
            q[j,i] = q[i,j];
        }
    }
    for (i in 1:N) {
        q[i] = q[i] / sum(q[i]);
    }
    // create p matrix
    p = exp(-dsq/(2*sigma_sq));
    for (i in 1:N) {
        p[i,i] = 0.0;
        p[i] = p[i] / sum(p[i]);
    }
    // compute the target
    for (i in 1:(N-1)) {
        for (j in (i+1):N) {
            target += (-1) * (p[i,j] .* log(p[i,j] ./ q[i,j]));
            target += (-1) * (p[j,i] .* log(p[j,i] ./ q[j,i]));
        }
    }
    sigma_sq ~ normal(0, 10);
}'


tk <- 2
tsne_model <- stan_model(model_code=tsne_block)

Run the model with `optimizing`

runtime <- system.time(tsne_optim <- optimizing(tsne_model,
                                 data=list(N=nrow(xy),
                                           n=ncol(xy),
                                           k=tk,
                                           dsq=(dmat/max(dmat))^2)))
runtime

##    user  system elapsed 
## 117.588   0.180 117.799

It works!

plot of chunk plot_tsne

Another case

High-dimensional random walk

n <- 40
npts <- 1e2
# rw <- colCumsums(matrix(rnorm(n*npts), ncol=n))
rw <- matrix(rnorm(n*npts), ncol=n)
for (i in 2:npts) {
    rw[i,] <- sqrt(.9) * rw[i-1,] + sqrt(.1) * rw[i,]
}

Here’s what the data look like: plot of chunk plot_rw

plot of chunk plot_rw_sub

Now let’s build the distance matrix.

rw_dmat <- as.matrix(dist(rw))

Run again

rw_runtime <- system.time(
     rw_tsne_optim <- optimizing(tsne_model,
                                 data=list(N=nrow(rw),
                                           n=ncol(rw),
                                           k=tk,
                                           dsq=(rw_dmat/max(rw_dmat))^2),
                                 tol_rel_grad=1e5))
rw_runtime

##    user  system elapsed 
##   1.038   0.000   1.038

It works!

plot of chunk plot_rw_tsne

Clustering and dimension reduction

Visualizing expression space

A conceptual model

Nonnegative matrix factorization

… aka “NMF”

A simple NMF model

Relationship to PCA

PCA, in Stan

Stochastic minute

the Dirichlet distribution

“Simplex” parameters

Back to expression space

Testing: compiles?

Testing: runs?

Simulate data

Outline

Easier case

In-class implementation (harder case)

t-SNE

t-SNE

Similarity

Proximity

Neighborhoods

Similiarity of neighborhoods

Kullback-Leibler divergence

Simulate data

A high-dimensional donut

Implementation

A Stan block

Run the model with optimizing

It works!

Another case

High-dimensional random walk

Run again

It works!

Run the model with `optimizing`