Let’s talk about t-SNE

I had heard of t-SNE a while back, but I didn’t have a clear reason to learn about it until I started playing with Google’s TensorFlow.  Specifically, TensorBoard is a visualization tool in TensorFlow that is downright beautiful, and t-SNE is one option it offers for visualizing high-dimensional data.  Googlers really like their artsy design choices – check out this presentation from a Google developer, complete with hat and hoodie.  I stuck to some well-trodden ground by creating a neural net to classify MNIST characters.  This is all secondary to the point here, however, because I really want to focus on t-SNE.  Specifically, I want to work through the math behind it while using MNIST and TensorBoard for some pretty visuals.  For a purely practical/application perspective on t-SNE, check out this cool article.  The original t-SNE paper is here.

t-SNE applied to the 10,000-character MNIST test set.  Left: t-SNE applied to the original 784-dimensional data.  Even on the raw data, t-SNE illustrates the shared structure among similar characters.  The two clusters of yellow are slanty “1”s and vertical “1”s.  Right: t-SNE applied to 10-dimensional feature vectors extracted from a convolutional neural net (1.2% misclass rate).  It’s interesting that t-SNE and my classifier did not group the “1”s tightly, especially since I found them to be grouped well when projected onto a plane.  The smear shows the transition from the slanty “1”s to the vertical “1”s.  Also, for the sake of interpretation: due to the non-linear nature of t-SNE, distances in the embedding do not have a clear connection to Euclidean distances in the original space.

Just for illustration, these are the misclassified digits from the test set.  We had 117 digits misclassified from a test set of size 10,000: misclass rate ~1.2%.

Suppose we have points \{x_n\}_{n=1}^N in \mathbb{R}^M.  The first step of t-distributed stochastic neighbor embedding (t-SNE) is to view each data point as the center of a Gaussian distribution, and then ask how likely each of the remaining points would be to be chosen according to that distribution.  More specifically, centering the Gaussian at, say, x_i, we use the pdf of the Gaussian to define a probability distribution over the remaining points:

p_{j|i} = c [e^{-||x_i-x_j||^2 / 2\sigma_i^2}]

where c is the normalizing constant (one over the sum of the exponentials over j\neq i), chosen so the probabilities sum to one.  We also set p_{i|i}=0.  It’s entirely possible to continue with these conditional probability distributions; however, we will eventually need to perform an optimization based on these distributions, which can be expensive.  It is computationally simpler to work with a single distribution over all pairs of points, so often

p_{i,j}=\frac{p_{j|i} + p_{i|j}}{2N}

is considered instead.  We scaled by 2N here to make this a probability distribution.  Note p_{i|j} may not equal p_{j|i}, since \sigma_i may not equal \sigma_j.  We haven’t actually discussed how to choose the variance for the Gaussian around each point; more on this later.  Regardless, clearly p_{i,j}=p_{j,i}, and we have a probability distribution over pairs (x_i,x_j)_{i,j=1}^N, call it P.
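As a minimal sketch of the construction above (NumPy; the function names and the use of a uniform \sigma in the example are mine), the conditionals and their symmetrization might look like:

```python
import numpy as np

def conditional_probs(X, sigmas):
    """Row i holds p_{j|i}: a Gaussian centered at x_i, normalized over j != i."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # ||x_i - x_j||^2
    logits = -sq_dists / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)  # enforces p_{i|i} = 0 after exponentiation
    P = np.exp(logits - logits.max(axis=1, keepdims=True))  # shift for numerical stability
    return P / P.sum(axis=1, keepdims=True)

def joint_probs(P_cond):
    """Symmetrize the conditionals into the joint p_{i,j} = (p_{j|i} + p_{i|j}) / 2N."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)
```

Each row of `conditional_probs` sums to one, and `joint_probs` returns a symmetric matrix whose entries sum to one, matching the distribution P above.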

The general idea is then to randomly assign points \{y_n\}_{n=1}^N in a low, d-dimensional space (usually d=2 or d=3), define a similar probability distribution over pairs of these points, and move the low-dimensional points to make the two distributions “close” to one another.  Instead of building from Gaussian distributions as before, a t-distribution with a single degree of freedom is used (hence the “t” in t-SNE).  In other words, we set

q_{i|j}=c [1+||y_i-y_j||^2]^{-1}

where c is again the appropriate constant.  We set q_{i|i} = 0 and define q_{i,j} similarly to p_{i,j}:

q_{i,j}=\frac{q_{i|j} + q_{j|i}}{2N}.
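In the original t-SNE paper the joint q_{i,j} is actually normalized over all pairs at once rather than via conditionals (the kernel has no per-point \sigma, so there is less need for the conditional route).  A sketch of that direct version (function name mine):

```python
import numpy as np

def low_dim_affinities(Y):
    """q_{i,j} from the embedding Y, using a t-distribution with one degree of freedom."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)  # heavy-tailed kernel (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(inv, 0.0)    # q_{i,i} = 0
    return inv / inv.sum()        # normalize over all pairs
```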

The reason for using a t-distribution is that its heavy tails keep the final embedding \{y_n\}_{n=1}^N from being too densely concentrated – as far as I can tell, SNE was an earlier technique using Gaussians for the target distribution, and t-SNE is a later variation that produces better results.  There isn’t really a theoretical foundation for using the t-distribution over Gaussians – it’s purely about how the final embedding looks.

Now, if we randomly place \{y_n\}_{n=1}^N in \mathbb{R}^d, q_{i,j} defines a distribution over pairs of points (y_i,y_j)_{i,j=1}^N, call it Q.  Finally, we need to move the points \{y_n\}_{n=1}^N so that Q is “close” to P.  The measure of closeness applied is the Kullback-Leibler divergence of Q from P.  That is, we minimize

KL(P||Q) = \sum_{i\neq j} p_{i,j}\log\left(\frac{p_{i,j}}{q_{i,j}}\right)

via gradient descent.  Note this objective is non-convex, so running t-SNE multiple times with the same data and parameters (parameters we discuss below) may lead to different results.
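The gradient of KL(P||Q) has a closed form, derived in the t-SNE paper: \partial KL/\partial y_i = 4\sum_j (p_{i,j}-q_{i,j})(y_i-y_j)(1+\|y_i-y_j\|^2)^{-1}.  A plain descent step might look like the sketch below (the learning rate is an arbitrary choice of mine; real implementations also add momentum and “early exaggeration”):

```python
import numpy as np

def gradient_step(Y, P, learning_rate=10.0):
    """One plain gradient-descent step on KL(P||Q)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()
    # grad_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2),
    # vectorized as 4 * (diag(row sums of PQ) - PQ) @ Y
    PQ = (P - Q) * inv
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    return Y - learning_rate * grad
```

A quick sanity check on the formula: when P already equals Q, the gradient vanishes and the points do not move.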

We still haven’t discussed how to pick the variances for the conditional distributions denoted previously as p_{j|i}.  These are set via a single user-selected parameter.  Intuitively, for smaller variances t-SNE focuses on local structure, and the larger the variances, the more global the procedure becomes.  It also makes sense for this to depend on the data: if the data is dense around x_i, you would like \sigma_i to be small, and conversely \sigma_i should be large when the data around x_i is sparse.  This is accomplished by choosing each \sigma_i so that the perplexity of the corresponding conditional distribution equals a user-defined constant.  Perplexity is a measure of how well a distribution predicts a sample – so this adjusts the variance of the Gaussian around each point so that the remaining points fit each chosen Gaussian equally well in some sense.  More succinctly, given a prescribed perplexity \theta, \sigma_i is chosen to solve

0 = \theta - 2^{-\sum_{j\neq i}p_{j|i}\log_2 p_{j|i}}.

You can solve this via your favorite root-finding algorithm; since perplexity increases monotonically in \sigma_i, even simple bisection works.
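For instance, a bisection sketch (the function name and bracketing bounds are mine; it takes the squared distances from x_i to every other point):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_i, target, max_iter=64):
    """Bisect on sigma_i so that 2^(entropy of p_{.|i}) hits the target perplexity.
    sq_dists_i holds ||x_i - x_j||^2 for all j != i."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        logits = -sq_dists_i / (2.0 * sigma ** 2)
        p = np.exp(logits - logits.max())  # shift for numerical stability
        p /= p.sum()
        perplexity = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))
        if perplexity > target:
            hi = sigma  # too spread out: shrink sigma
        else:
            lo = sigma  # too concentrated: grow sigma
    return (lo + hi) / 2.0
```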

…and that’s how t-SNE works.  Most data visualization tools that I know of serve other purposes too, such as dimensionality reduction.  t-SNE is different in that it seems purely suited to data visualization.  Picking the perplexity and interpreting the results can be a bit of an art, but it works very well.






One thought on “Let’s talk about t-SNE”

  1. Dustin G. Mixon December 3, 2018 / 5:03 am

    > There isn’t really a theoretical foundation for using the t-distribution over Gaussians – it’s purely about how the final distribution looks.

    The reason for using a Gaussian upstairs and a t-distribution downstairs has to do with high-dimensional geometry. You can pack O((r/eps)^d) points in a d-dimensional ball of radius r with pairwise separation eps. Since this grows exponentially with d, nearby points in 10 dimensions aren’t nearly as special as nearby points in 2 dimensions. In order to visualize high-dimensional data in 2 dimensions, we need to be extremely selective when deciding which points are allowed to be close together in the embedding. The disparity between Gaussian and t-distribution accomplishes this. In fact, they take t=1, making the disparity as extreme as possible. Presumably, if the original data is D-dimensional, the “right” distribution to select upstairs is a t-distribution with t=D, but this is essentially Gaussian when D is large.

