
Dimension reduction: t-SNE

Peter Ralph

Advanced Biological Statistics

t-SNE

http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

t-SNE is a new-ish method for dimension reduction (visualization of structure in high-dimensional data).

It makes very nice visualizations.

The generic idea is to find a low dimensional representation (i.e., a picture) in which proximity of points reflects similarity of the corresponding (high dimensional) observations.

Suppose we have \(n\) data points, \(x_1, \ldots, x_n\) and each one has \(p\) variables: \(x_1 = (x_{11}, \ldots, x_{1p})\).

For each, we want to find a low-dimensional point \(y\): \[\begin{aligned} x_i \mapsto y_i \end{aligned}\] so that the similarity of \(x_i\) and \(x_j\) is (somehow) reflected by the proximity of \(y_i\) and \(y_j\).

We need to define:

  1. “low dimension” (usually, \(k=2\) or \(3\))
  2. “similarity” of \(x_i\) and \(x_j\)
  3. “reflected by proximity” of \(y_i\) and \(y_j\)

Similarity

To measure similarity, we just use (Euclidean) distance: \[\begin{aligned} d(x_i, x_j) = \sqrt{\sum_{\ell=1}^p (x_{i\ell} - x_{j\ell})^2} , \end{aligned}\] after first normalizing variables to vary on comparable scales.
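For instance (a minimal sketch, assuming the observations are the rows of a matrix x), the normalized distance matrix could be computed as:

x <- matrix(rnorm(100 * 5), ncol=5)  # hypothetical data: 100 observations, 5 variables
xs <- scale(x)                       # center and scale each variable
d <- as.matrix(dist(xs))             # Euclidean distances between all pairs of rows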

Proximity

A natural choice is to require distances in the new space (\(d(y_i, y_j)\)) to be as close as possible to distances in the original space (\(d(x_i, x_j)\)).

That’s a pairwise condition. t-SNE instead tries to measure how faithfully relative distances are represented in each point’s neighborhood.

Neighborhoods

For each data point \(x_i\), define \[\begin{aligned} p_{ij} = \frac{ e^{-d(x_i, x_j)^2 / (2\sigma^2)} }{ \sum_{\ell \neq i} e^{-d(x_i, x_\ell)^2 / (2 \sigma^2)} } , \end{aligned}\] and \(p_{ii} = 0\).

This is the probability that point \(i\) would pick point \(j\) as a neighbor if these are chosen according to the Gaussian density centered at \(x_i\).
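Here is a minimal sketch of these neighborhood probabilities in R, continuing from the distance matrix d above and assuming a single bandwidth sigma shared by all points (as in the Stan model later on):

sigma <- 1                           # hypothetical bandwidth
p <- exp(- d^2 / (2 * sigma^2))      # Gaussian kernel applied to all pairwise distances
diag(p) <- 0                         # p_{ii} = 0
p <- sweep(p, 1, rowSums(p), "/")    # normalize each row to sum to one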

Exercise:

  1. Draw 9 points in a 10m x 10m square. Label the points 1 to 9.
  2. Draw another square, and try to draw the points 1 to 9 in that one so that the neighborhood of point 1 is similar to the first picture, but the neighborhood of point 9 is not.

Reminder: the “neighborhood” is defined by relative distances: for point \(i\) it is \[\begin{aligned} p_{ij} = \frac{ e^{-d(x_i, x_j)^2 / (2\sigma^2)} }{ \sum_{\ell \neq i} e^{-d(x_i, x_\ell)^2 / (2 \sigma^2)} } , \end{aligned}\] and \(p_{ii} = 0\).

Similarly, in the output space, define \[\begin{aligned} q_{ij} = \frac{ (1 + d(y_i, y_j)^2)^{-1} }{ \sum_{\ell \neq i} (1 + d(y_i, y_\ell)^2)^{-1} } \end{aligned}\] and \(q_{ii} = 0\).

This is the probability that point \(i\) would pick point \(j\) as a neighbor if these are chosen according to the Cauchy density centered at \(y_i\).
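Continuing the sketch, with placeholder low-dimensional coordinates y (one row per point):

y <- matrix(rnorm(2 * nrow(d)), ncol=2)  # placeholder output coordinates
dy <- as.matrix(dist(y))                 # distances between the y's
q <- 1 / (1 + dy^2)                      # Cauchy kernel
diag(q) <- 0                             # q_{ii} = 0
q <- sweep(q, 1, rowSums(q), "/")        # normalize each row to sum to one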

Similarity of neighborhoods

We want neighborhoods to look similar, i.e., choose \(y\) so that \(q\) looks like \(p\).

To do this, we minimize \[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]

What’s KL()?

Kullback-Leibler divergence

\[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]

This is the average log-likelihood ratio between \(p\) and \(q\) for a single observation drawn from \(p\).

It is a measure of how “surprised” you would be by a sequence of samples from \(p\) that you think should be coming from \(q\).

Facts:

  1. \(\text{KL}(p \given q) \ge 0\)

  2. \(\text{KL}(p \given q) = 0\) only if \(p_{ij} = q_{ij}\) for all \(i\) and \(j\).
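Continuing the sketch above, the objective could be computed in R as follows (terms with p_{ij} = 0, including the diagonal, contribute nothing):

kl <- sum(ifelse(p > 0, p * log(p / q), 0))  # KL(p | q), summed over all pairs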

Simulate data

A high-dimensional donut

Let’s first make some data. This will be some points distributed around an ellipse in \(n\) dimensions.

n <- 20
npts <- 1e3
xy <- matrix(rnorm(n*npts), ncol=n)   # npts points of Gaussian noise in n dimensions
theta <- runif(npts) * 2 * pi         # position of each point around the ring
ab <- matrix(rnorm(2*n), nrow=2)      # two random directions in n dimensions
# make the second row orthogonal to the first (Gram-Schmidt), then normalize both rows
ab[2,] <- ab[2,] - ab[1,] * sum(ab[1,] * ab[2,]) / sum(ab[1,]^2)
ab <- sweep(ab, 1, sqrt(rowSums(ab^2)), "/")
for (k in 1:npts) {
    dxy <- 4 * c(cos(theta[k]), sin(theta[k])) %*% ab  # point on a circle of radius 4 in the plane spanned by ab
    xy[k,] <- xy[k,] + dxy
}
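As a quick sanity check, the two rows of ab should now be orthonormal, so the ring really does lie in a two-dimensional plane:

ab %*% t(ab)   # should be (numerically) the 2x2 identity matrix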

Here’s what the data look like: [plot of chunk plot_data]

But there is hidden, two-dimensional structure: [plot of chunk plot_ring]
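One way to see it (a sketch; the plotting code for these figures isn't shown) is to project the points onto the plane spanned by the rows of ab and color them by their position around the ring:

proj <- xy %*% t(ab)                   # npts x 2: coordinates in the plane of the ring
plot(proj, asp=1, xlab="axis 1", ylab="axis 2",
     col=rainbow(64)[cut(theta, 64)])  # color by angle theta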

Now let’s build the distance matrix.

dmat <- as.matrix(dist(xy)) 

Implementation

The quantity to be minimized is the Kullback-Leibler divergence between \(p\) and \(q\), defined as \[\begin{aligned} \text{KL}(p \given q) &= \sum_x p_x \log(p_x/q_x) . \end{aligned}\]

A Stan block

tsne_block <- '
data {
    int N; // number of points
    int n; // input dimension
    int k; // output dimension
    matrix[N,N] dsq;  // distance matrix, squared
}
parameters {
    real<lower=0> sigma_sq; // bandwidth in the kernel for p
    matrix[N-2,k] y1;
    real<lower=0> y21;
}
transformed parameters {
    // pin point 1 at the origin and point 2 on the positive first axis,
    // to remove translation and rotation invariance of the embedding
    matrix[N,k] y;
    y[1] = rep_row_vector(0.0, k);
    y[2] = rep_row_vector(0.0, k);
    y[2,1] = y21;
    y[3:N] = y1;
}
model {
    matrix[N,N] q;
    matrix[N,N] p;
    // neighborhood probabilities in the output space (Cauchy kernel)
    q[N,N] = 0.0;
    for (i in 1:(N-1)) {
        q[i,i] = 0.0;
        for (j in (i+1):N) {
            q[i,j] = 1 / (1 + squared_distance(y[i], y[j]));
            q[j,i] = q[i,j];
        }
    }
    for (i in 1:N) {
        q[i] = q[i] / sum(q[i]);
    }
    // neighborhood probabilities in the input space (Gaussian kernel)
    p = exp(-dsq/(2*sigma_sq));
    for (i in 1:N) {
        p[i,i] = 0.0;
        p[i] = p[i] / sum(p[i]);
    }
    // the target is minus the KL divergence, so maximizing it minimizes KL
    for (i in 1:(N-1)) {
        for (j in (i+1):N) {
            target += (-1) * (p[i,j] * log(p[i,j] / q[i,j]));
            target += (-1) * (p[j,i] * log(p[j,i] / q[j,i]));
        }
    }
    sigma_sq ~ normal(0, 10);
}'


library(rstan)

tk <- 2   # output dimension
tsne_model <- stan_model(model_code=tsne_block)

Run the model with optimizing

runtime <- system.time(tsne_optim <- optimizing(tsne_model,
                                 data=list(N=nrow(xy),
                                           n=ncol(xy),
                                           k=tk,
                                           dsq=(dmat/max(dmat))^2)))
runtime
##    user  system elapsed 
## 132.490   0.144 132.951

It works!

[plot of chunk plot_tsne]
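The plotting code isn't shown; one way to make such a plot (a sketch, assuming the transformed parameter y appears in tsne_optim$par with names like y[1,1], stored in column-major order) is:

y_names <- grep("^y\\[", names(tsne_optim$par), value=TRUE)
y_est <- matrix(tsne_optim$par[y_names], ncol=tk)   # fitted low-dimensional coordinates
plot(y_est, asp=1, xlab="tSNE 1", ylab="tSNE 2",
     col=rainbow(64)[cut(theta, 64)])               # color by position around the ring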

Another case

High-dimensional random walk

n <- 40
npts <- 1e2
# rw <- colCumsums(matrix(rnorm(n*npts), ncol=n))
rw <- matrix(rnorm(n*npts), ncol=n)
# make an autocorrelated path: each row is correlated with the previous one
for (i in 2:npts) {
    rw[i,] <- sqrt(.9) * rw[i-1,] + sqrt(.1) * rw[i,]
}

Here’s what the data look like: [plot of chunk plot_rw]

[plot of chunk plot_rw_sub]

Now let’s build the distance matrix.

rw_dmat <- as.matrix(dist(rw)) 

Run again

rw_runtime <- system.time(
     rw_tsne_optim <- optimizing(tsne_model,
                                 data=list(N=nrow(rw),
                                           n=ncol(rw),
                                           k=tk,
                                           dsq=(rw_dmat/max(rw_dmat))^2),
                                 tol_rel_grad=1e5))
rw_runtime
##    user  system elapsed 
##   1.375   0.000   1.374

It works!

[plot of chunk plot_rw_tsne]