Peter Ralph
Advanced Biological Statistics
t-SNE is a new-ish method for dimension reduction (visualization of structure in high-dimensional data).
The generic idea is to find a low dimensional representation (i.e., a picture) in which proximity of points reflects similarity of the corresponding (high dimensional) observations.
Suppose we have $N$ data points, $x_1, \ldots, x_N$, each of dimension $n$.
For each, we want to find a low-dimensional point $y_i \in \mathbb{R}^k$, with $k \ll n$ (e.g., $k = 2$, so we can plot it).
We need to define: (a) what it means for points to be "similar", and (b) how to measure how well the picture reflects that similarity.
To measure similarity, we just use (Euclidean) distance: $d(x_i, x_j) = \|x_i - x_j\|$.
A natural choice is to require distances in the new space ($\|y_i - y_j\|$) to be close to the corresponding distances in the original space ($\|x_i - x_j\|$).
That’s a pairwise condition. t-SNE instead tries to measure how faithfully relative distances are represented in each point’s neighborhood.
For each data point $i$, define
$$ p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{\ell \neq i} \exp(-\|x_i - x_\ell\|^2 / 2\sigma^2)} . $$
This is the probability that point $i$ would choose point $j$ as its neighbor, if neighbors were picked with probability proportional to a Gaussian kernel of the distance.
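As a concrete sketch of these neighborhood probabilities (here in NumPy for illustration; the variable names are mine, and `dsq` is the matrix of squared pairwise distances):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # 5 points in 3 dimensions
# matrix of squared pairwise distances
dsq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)

sigma_sq = 1.0
p = np.exp(-dsq / (2 * sigma_sq))     # Gaussian kernel of distance
np.fill_diagonal(p, 0.0)              # a point is not its own neighbor
p = p / p.sum(axis=1, keepdims=True)  # each row is a probability distribution
```

Each row of `p` sums to one: row $i$ is the neighbor-choosing distribution for point $i$.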
Exercise: explain why the neighborhood of point 1 is similar to the first picture, but the neighborhood of point 9 is not. (Reminder: the "neighborhood" is defined by relative distances: for point $i$, what matters is the distances to the other points, relative to each other.)
Similarly, in the output space, define
$$ q_{j|i} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{\ell \neq i} (1 + \|y_i - y_\ell\|^2)^{-1}} . $$
This is the probability that point $i$ would choose point $j$ as its neighbor, if neighbors were picked with probability proportional to a Cauchy kernel of the distance.
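The output-space probabilities are built the same way, but with the heavy-tailed Cauchy kernel (a NumPy sketch, mirroring the formula above):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=(5, 2))   # 5 points in k = 2 dimensions
dsq = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)

q = 1.0 / (1.0 + dsq)                 # Cauchy kernel: heavier tails than the Gaussian
np.fill_diagonal(q, 0.0)
q = q / q.sum(axis=1, keepdims=True)  # rows are probability distributions
```

The heavy tails mean that moderately distant pairs can sit quite far apart in the picture without much penalty.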
We want neighborhoods to look similar, i.e., to choose $y$ so that $q_{\cdot|i}$ is close to $p_{\cdot|i}$ for each $i$.
To do this, we minimize
$$ \sum_i \text{KL}(p_{\cdot|i} \,\|\, q_{\cdot|i}) = \sum_i \sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}} . $$
What’s KL()? It is the Kullback-Leibler divergence:
$$ \text{KL}(p \,\|\, q) = \sum_j p_j \log \frac{p_j}{q_j} . $$
This is the average log-likelihood ratio between $p$ and $q$, for samples drawn from $p$.
It is a measure of how "surprised" you would be by a sequence of samples from $p$, if you thought they were coming from $q$.
Facts: $\text{KL}(p \,\|\, q) \ge 0$, with equality only if $p = q$; and it is not symmetric, i.e., in general $\text{KL}(p \,\|\, q) \neq \text{KL}(q \,\|\, p)$.
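These facts are easy to check numerically (a quick NumPy illustration, with distributions I made up):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence: sum_j p_j * log(p_j / q_j)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl(p, q)   # positive: q differs from p
kl(p, p)   # exactly zero
kl(q, p)   # positive, and not equal to kl(p, q)
```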
Let’s first make some data. This will be some points distributed around an ellipse in $n = 20$ dimensions:
n <- 20      # input dimension
npts <- 1e3  # number of points
xy <- matrix(rnorm(n*npts), ncol=n)  # Gaussian noise
theta <- runif(npts) * 2 * pi        # position along the ellipse
# two orthonormal vectors in R^n, spanning the plane of the ellipse
ab <- matrix(rnorm(2*n), nrow=2)
ab[2,] <- ab[2,] - ab[1,] * sum(ab[1,] * ab[2,]) / sum(ab[1,]^2)
ab <- sweep(ab, 1, sqrt(rowSums(ab^2)), "/")
for (k in 1:npts) {
    dxy <- 4 * c(cos(theta[k]), sin(theta[k])) %*% ab
    xy[k,] <- xy[k,] + dxy
}
Here’s what the data look like:
But there is hidden, two-dimensional structure:
Now let’s build the distance matrix (e.g., `dmat <- as.matrix(dist(xy))`).
The quantity to be minimized is the Kullback-Leibler divergence between the $p$ and $q$ neighborhood distributions, summed over points. Here is a Stan model that does this:
tsne_block <- '
data {
    int N; // number of points
    int n; // input dimension
    int k; // output dimension
    matrix[N,N] dsq; // distance matrix, squared
}
parameters {
    real<lower=0> sigma_sq; // in kernel for p
    matrix[N-2,k] y1;
    real<lower=0> y21;
}
transformed parameters {
    // pin down the first two points, to remove
    // translation, rotation, and reflection symmetries
    matrix[N,k] y;
    y[1,] = rep_row_vector(0.0, k);
    y[3:N,] = y1;
    y[2,] = rep_row_vector(0.0, k);
    y[2,1] = y21;
}
model {
    matrix[N,N] q;
    matrix[N,N] p;
    // create q matrix: Cauchy kernel, rows normalized
    q[N,N] = 0.0;
    for (i in 1:(N-1)) {
        q[i,i] = 0.0;
        for (j in (i+1):N) {
            q[i,j] = 1 / (1 + squared_distance(y[i], y[j]));
            q[j,i] = q[i,j];
        }
    }
    for (i in 1:N) {
        q[i] = q[i] / sum(q[i]);
    }
    // create p matrix: Gaussian kernel, rows normalized
    p = exp(-dsq/(2*sigma_sq));
    for (i in 1:N) {
        p[i,i] = 0.0;
        p[i] = p[i] / sum(p[i]);
    }
    // compute the target: minus the summed KL divergence
    for (i in 1:(N-1)) {
        for (j in (i+1):N) {
            target += (-1) * (p[i,j] * log(p[i,j] / q[i,j]));
            target += (-1) * (p[j,i] * log(p[j,i] / q[j,i]));
        }
    }
    sigma_sq ~ normal(0, 10);
}'
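For sanity-checking, the same objective can be written compactly outside of Stan. Here is a NumPy sketch (a hypothetical helper of my own, mirroring the model block above, not part of the course code):

```python
import numpy as np

def tsne_objective(y, dsq, sigma_sq):
    """Sum over points i of KL(p_i || q_i), as in the Stan model block."""
    n = dsq.shape[0]
    p = np.exp(-dsq / (2 * sigma_sq))      # Gaussian kernel in input space
    np.fill_diagonal(p, 0.0)
    p /= p.sum(axis=1, keepdims=True)
    ydsq = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    q = 1.0 / (1.0 + ydsq)                 # Cauchy kernel in output space
    np.fill_diagonal(q, 0.0)
    q /= q.sum(axis=1, keepdims=True)
    mask = ~np.eye(n, dtype=bool)          # skip the diagonal terms
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(2)
x = rng.normal(size=(10, 4))
dsq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
y0 = rng.normal(size=(10, 2))
obj = tsne_objective(y0, dsq, sigma_sq=1.0)  # nonnegative; smaller is better
```

The Stan model adds the *negative* of this quantity to `target`, since Stan maximizes.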
tk <- 2  # output dimension, k
tsne_model <- stan_model(model_code=tsne_block)
Now we find a good map $y$ by running `optimizing()`:
runtime <- system.time(
    tsne_optim <- optimizing(tsne_model,
                             data=list(N=nrow(xy),
                                       n=ncol(xy),
                                       k=tk,
                                       dsq=(dmat/max(dmat))^2)))
runtime
## user system elapsed
## 132.490 0.144 132.951
Now let’s try a second dataset, `rw`, with its distance matrix `rw_dmat` built in the same way. We optimize as before, this time setting the convergence tolerance `tol_rel_grad`:
rw_runtime <- system.time(
    rw_tsne_optim <- optimizing(tsne_model,
                                data=list(N=nrow(rw),
                                          n=ncol(rw),
                                          k=tk,
                                          dsq=(rw_dmat/max(rw_dmat))^2),
                                tol_rel_grad=1e5))
rw_runtime
## user system elapsed
## 1.375 0.000 1.374