I had heard of t-SNE a while back, but I didn’t have a clear reason to learn about it until I started playing with Google’s TensorFlow.  Specifically, TensorBoard is a visualization tool in TensorFlow that is downright beautiful, and t-SNE is one option for visualizing high-dimensional data.  Googlers really like their artsy design choices – check out this presentation from a Google developer, complete with hat and hoodie.  I stuck with some well-trodden ground by creating a neural net to classify MNIST characters.  This is all secondary to the point here, however, because I really want to focus on t-SNE.  Specifically, I want to know the math behind it while using MNIST and TensorBoard for some pretty visuals.  For a purely practical/application perspective on t-SNE, check out this cool article.  The original t-SNE paper is here.

Suppose we have points $\{x_n\}_{n=1}^N$ in $\mathbb{R}^M$.  The first step of t-distributed stochastic neighbor embedding (t-SNE) is to view each datapoint as the center of a Gaussian distribution, and then ask how likely each remaining point would be chosen according to this distribution.  More specifically, centering the Gaussian at say $x_i$, we use the pdf of the Gaussian to define a probability distribution over the remaining points:

$p_{j|i} = c [e^{-||x_i-x_j||^2 / 2\sigma_i^2}]$

where $c$ is the appropriately chosen normalizing scalar.  We also set $p_{i|i}=0$.  It’s completely possible to continue using these conditional probability distributions; however, we will eventually need to perform an optimization based on them, which can be expensive.  It is computationally simpler to work with a single distribution over all pairs of points, so often

$p_{i,j}=\frac{p_{i|j}+p_{j|i}}{2N}$

is considered instead.  We scaled again here to make this a probability distribution.  Note $p_{i|j}$ may not equal $p_{j|i}$ as $\sigma_i$ may not equal $\sigma_j$.  We haven’t actually discussed how to choose the variance for the Gaussian around each point. More on this later.  Regardless, clearly $p_{i,j}=p_{j,i}$, and we have a probability distribution over $(x_i,x_j)_{i,j=1}^N$, call it $P$.
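To make this concrete, here is a minimal NumPy sketch of the two steps above, assuming the per-point bandwidths $\sigma_i$ are handed to us (choosing them via perplexity is discussed below):

```python
import numpy as np

def conditional_p(X, sigmas):
    """Conditional probabilities p_{j|i} from a Gaussian centered at each x_i.

    X: (N, M) data matrix; sigmas: (N,) per-point bandwidths (assumed given
    here -- picking them via perplexity comes later)."""
    # Squared pairwise distances ||x_i - x_j||^2.
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T
    P = np.exp(-D / (2 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)           # p_{i|i} = 0
    P /= P.sum(axis=1, keepdims=True)  # row-normalize: the constant c
    return P

def joint_p(X, sigmas):
    """Symmetrized joint distribution p_{i,j} = (p_{i|j} + p_{j|i}) / (2N)."""
    P_cond = conditional_p(X, sigmas)
    N = X.shape[0]
    return (P_cond + P_cond.T) / (2 * N)
```

Each row of `conditional_p` sums to one, so after symmetrizing and dividing by $2N$ the whole matrix sums to one, as a distribution over pairs should.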

The general idea is then to randomly assign points $\{y_n\}_{n=1}^N$ in a low $d$-dimensional space (usually $d=2$ or $d=3$), define a similar probability distribution over pairs of points, and move the low-dimensional points to make the distributions “close” to one another.  Instead of building from Gaussian distributions as before, a t-distribution with a single degree of freedom is used (hence the “t” in t-SNE).  In other words we set

$q_{i|j}=c [1+||y_i-y_j||^2]^{-1}$

where $c$ is again the appropriate constant.  We set $q_{i|i} = 0$ and define $q_{i,j}$ similarly to $p_{i,j}$:

$q_{i,j}=\frac{q_{i|j} + q_{j|i}}{2N}$.
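The low-dimensional side can be sketched the same way; with one degree of freedom the t-kernel is just $(1+\|y_i-y_j\|^2)^{-1}$:

```python
import numpy as np

def joint_q(Y):
    """Low-dimensional similarities q_{i,j} from a t-distribution with one
    degree of freedom, symmetrized the same way as p_{i,j}.

    Y: (N, d) embedding points."""
    N = Y.shape[0]
    sq = np.sum(Y**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Y @ Y.T
    K = 1.0 / (1.0 + D)                        # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(K, 0.0)                   # q_{i|i} = 0
    Q_cond = K / K.sum(axis=1, keepdims=True)  # row-normalize: q_{j|i}
    return (Q_cond + Q_cond.T) / (2 * N)
```

Note the kernel itself is symmetric in $y_i$ and $y_j$; the conditionals $q_{j|i}$ and $q_{i|j}$ differ only through the row-normalizing constants.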

The reason for using a t-distribution is that the heavy tails keep the embedded points $\{y_n\}_{n=1}^N$ from being too densely concentrated – as far as I can tell, SNE was an earlier technique using Gaussians for the target distribution, and t-SNE is a later variation that produced better results.  There isn’t really a theoretical foundation for using the t-distribution over Gaussians – it’s purely about how the final distribution looks.

Now, if we randomly place $\{y_n\}_{n=1}^N$ in $\mathbb{R}^d$, $q_{i,j}$ defines a distribution over pairs of points $(y_i,y_j)_{i,j=1}^N$, call it $Q$.  Finally, we need to move the points $\{y_n\}_{n=1}^N$ so that $Q$ is “close” to $P$.  The measure of closeness applied is the Kullback-Leibler divergence of $Q$ from $P$.  That is, we minimize

$KL(P||Q) = \sum_{i\neq j} p_{i,j}\log\left(\frac{p_{i,j}}{q_{i,j}}\right)$

via gradient descent.  Note this objective is non-convex, so running t-SNE multiple times on the same data with the same parameters (some of which we haven’t discussed yet) may lead to different results.
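In the original t-SNE paper, $q_{i,j}$ is normalized over all pairs at once rather than via symmetrized conditionals, and under that formulation the gradient has a simple closed form: $\partial KL/\partial y_i = 4\sum_j (p_{i,j}-q_{i,j})(y_i-y_j)(1+||y_i-y_j||^2)^{-1}$.  Here is a sketch of one plain gradient-descent step using it (real implementations add momentum and early exaggeration, which I’m omitting):

```python
import numpy as np

def tsne_step(Y, P, lr=100.0):
    """One gradient-descent step on KL(P || Q), using the closed-form
    gradient from the original t-SNE paper, where Q is normalized over
    all pairs at once:
        dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2).
    """
    sq = np.sum(Y**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Y @ Y.T
    K = 1.0 / (1.0 + D)
    np.fill_diagonal(K, 0.0)
    Q = K / K.sum()          # joint distribution over all pairs
    PQ = (P - Q) * K         # (p_ij - q_ij) * (1 + ||y_i - y_j||^2)^{-1}
    # Gradient in matrix form: 4 * (diag(row sums of PQ) - PQ) @ Y.
    grad = 4.0 * ((np.diag(PQ.sum(axis=1)) - PQ) @ Y)
    return Y - lr * grad
```

Iterating this step from a small random initialization is the whole optimization loop; a small enough learning rate guarantees the KL objective decreases at each step.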

We still haven’t discussed how to pick the variances for the conditional distributions denoted previously as $p_{j|i}$.  These are chosen via a user-selected parameter.  Intuitively, for smaller variance, t-SNE will focus on local structure, and the larger the variance, the more global the t-SNE procedure becomes.  It makes sense that you would also like this to depend on your data.  If the data is dense around $x_i$, you would like $\sigma_i$ to be small, and conversely $\sigma_i$ should be large when the data around $x_i$ is more sparse.  This is accomplished by choosing each $\sigma_i$ so the perplexity of each conditional distribution is some user-defined constant.  Perplexity is a measure of how well a distribution predicts a sample – so this adjusts the variances of the Gaussians around each point so the remaining points fit each chosen Gaussian equally well in some sense.  More succinctly, given prescribed perplexity $\theta$, $\sigma_i$ is chosen so this perplexity is achieved:

$0 = \theta - 2^{-\sum_{j\neq i}p_{j|i}\log_2 p_{j|i}}$.

You can solve this via your favorite root-finding algorithm.
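Since the perplexity is monotonically increasing in $\sigma_i$ (larger bandwidth means a flatter, higher-entropy distribution), bisection works fine.  A minimal sketch for a single $\sigma_i$ (real implementations typically search over the precision $\beta_i = 1/(2\sigma_i^2)$ instead, for numerical stability):

```python
import numpy as np

def sigma_for_perplexity(dists_sq_i, target_perp, tol=1e-6, max_iter=100):
    """Bisect for the sigma_i whose conditional distribution hits a target
    perplexity.

    dists_sq_i: squared distances ||x_i - x_j||^2 from x_i to every other
    point (the i-th entry already removed)."""
    lo, hi = 1e-10, 1e10
    sigma = (lo + hi) / 2.0
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = np.exp(-dists_sq_i / (2 * sigma**2))
        p /= p.sum()
        perp = 2.0 ** (-np.sum(p * np.log2(p)))  # 2^{Shannon entropy}
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:
            hi = sigma  # distribution too flat: shrink the bandwidth
        else:
            lo = sigma
    return sigma
```

Repeating this for each $i$ gives all the variances needed to build $P$.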

…and that’s how t-SNE works.  Most tools I know of for data visualization serve other purposes too, such as dimensionality reduction.  t-SNE is different in that it seems purely suited to data visualization.  Picking the perplexity and interpreting the results can be a bit of an art, but it works very well.

### One thought on “Let’s talk about t-SNE”

1. Dustin G. Mixon December 3, 2018 / 5:03 am

> There isn’t really a theoretical foundation for using the t-distribution over Gaussians – it’s purely about how the final distribution looks.

The reason for using a Gaussian upstairs and a t-distribution downstairs has to do with high-dimensional geometry. You can pack O((r/eps)^d) points in a d-dimensional ball of radius r with pairwise separation eps. Since this grows exponentially with d, nearby points in 10 dimensions aren’t nearly as special as nearby points in 2 dimensions. In order to visualize high-dimensional data in 2 dimensions, we need to be extremely selective when deciding which points are allowed to be close together in the embedding. The disparity between Gaussian and t-distribution accomplishes this. In fact, they take t=1, making the disparity as extreme as possible. Presumably, if the original data is D-dimensional, the “right” distribution to select upstairs is a t-distribution with t=D, but this is essentially Gaussian when D is large.
