Visualizing Data with t-SNE, part 1
16 Aug 2016

Watching Katie Ledecky beat the field by 11 seconds in the 800M swim at the Olympics has led me to the conclusion that I am out of shape. I will never physically be where Katie Ledecky is, but I can use her work ethic as an inspiration to try a little harder academically. Using the Morning Paper as an example, I am going to try to Ledeckify my life a little bit and work hard at becoming more (intellectually) fit every day.
Today’s paper is ‘Visualizing Data with t-SNE’, and given my lack of expertise, I have decided to split my coverage of it into multiple posts. This post will cover:
Stochastic Neighbor Embedding (SNE)
Which requires us to talk about:
 Data Visualization
 Various Probability Distributions
 The Crowding Problem
 Gradient Descent
 Shannon Entropy
 Kullback-Leibler Divergence
As a refresher, the problem being addressed here is extracting structure from a set of data. Van der Maaten refers to the dimension-reduced data as a map $Y$, while the dimension reduction applied to a particular point in the dataset is referred to as a map point $y_i$. The number of features in a data point can be very high, while its intrinsic dimensionality can be much lower.
Van der Maaten brings up a typical handwritten-digit dataset: a set of size-normalized (m-by-n pixel) images will be a set of points in an $R^{mn}$-dimensional vector space. The intrinsic dimensionality of this data should be 10, reflecting the digits 0 through 9. Extracting this structure is hard for computational algorithms and easy for humans.
For the purpose of visualizing data, it really only makes sense to reduce data to 2 or 3 dimensions. (If you’re a contrarian, you’re thinking about color as an extra dimension, tesseracts, and Tralfamadorians…) In this 2D or 3D representation, we would like similar points to cluster. Running a k-nearest-neighbor classifier on the dimension-reduced data should then ideally predict what a new digit actually is.
How does Stochastic Neighbor Embedding try to solve this problem?
Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between datapoints into conditional probabilities that represent similarities.
This conditional probability is calculated in the standard way:

$$p_{j \lvert i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}$$

In this case, $p_{j \lvert i}$ is the probability that $x_i$ would pick $x_j$ as its neighbor, if neighbors were picked in proportion to a Gaussian centered at $x_i$.
A crucial part is finding the appropriate variance $\sigma_i$ to describe the data.
For the low-dimensional counterparts $y_i$ and $y_j$ of the high-dimensional datapoints $x_i$ and $x_j$, it is possible to compute a similar probability.
Because $\sigma_i$ in the map is just some number, we can set it to $\frac{1}{\sqrt{2}}$ and things simplify to:

$$q_{j \lvert i} = \frac{\exp(-\lVert y_i - y_j \rVert^2)}{\sum_{k \neq i} \exp(-\lVert y_i - y_k \rVert^2)}$$
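To make the two formulas concrete, here is a minimal NumPy sketch. The function names and vectorized layout are my own, not from the paper; row $i$ of each returned matrix holds the conditional distribution given point $i$:

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """p_{j|i}: probability that x_i picks x_j as a neighbor under a
    Gaussian centered at x_i with variance sigma_i^2 (row i of the result)."""
    # Squared Euclidean distances between all pairs of points.
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)  # p_{i|i} is defined to be zero
    return P / P.sum(axis=1, keepdims=True)

def map_probabilities(Y):
    """q_{j|i}: the same construction in the map, with sigma = 1/sqrt(2),
    so the exponent collapses to -||y_i - y_j||^2."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)
```

Note that each row uses its own $\sigma_i$ on the high-dimensional side, while the map side needs no variance at all thanks to the $\frac{1}{\sqrt{2}}$ convention.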
SNE finds a positioning of points in the low-dimensional map that attempts to minimize the difference between the $q$’s and the $p$’s. It does this by minimizing the sum of the Kullback-Leibler divergences using a gradient descent method.
What is Kullback-Leibler divergence, you ask? It is the expectation of the logarithmic difference between two probability distributions. Using it as a cost function is good because it heavily punishes a map that represents points that are close in $X$ as distant in $Y$, but charges little for points that are close in $Y$ but not so in $X$.
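As a sanity check on the cost, here is a tiny hypothetical helper (the name `kl_cost` is mine): the divergence is zero only when the distributions match, and it is not symmetric, which is exactly where the unequal penalties come from.

```python
import numpy as np

def kl_cost(P, Q, eps=1e-12):
    """Sum of per-point KL divergences:
    C = sum_i KL(P_i || Q_i) = sum_i sum_j p_{j|i} * log(p_{j|i} / q_{j|i}).
    eps guards against log(0) where entries are exactly zero."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```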
If you’re confused reading the paper, good, because so am I… Van der Maaten’s (VDM) explanation presents the algorithm out of order in a way that is very confusing to me. The order of the algorithm is:

1. Use a binary search to find, for each point, the $\sigma_i$ such that the perplexity of the conditional distribution $P_i$ equals a value specified by the user (perplexity is $2^{H(P_i)}$, where $H$ is the Shannon entropy).

2. Use gradient descent (with momentum and jitter) to organize the map $Y$.

3. Repeat and choose the optimal result.
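Step 1 is worth sketching, since it ties together perplexity, Shannon entropy, and the binary search. A toy version, assuming squared distances to the other points are precomputed (the function names and search bounds are my guesses, not the paper's):

```python
import numpy as np

def perplexity(p_row, eps=1e-12):
    """Perp(P_i) = 2^{H(P_i)}, with H the Shannon entropy in bits."""
    p = p_row[p_row > eps]
    H = -np.sum(p * np.log2(p))
    return 2.0 ** H

def sigma_for_point(distances_i, target_perp, tol=1e-5, max_iter=100):
    """Binary-search sigma_i until the row's perplexity hits target_perp.
    distances_i: squared distances from x_i to every other point.
    Larger sigma -> flatter distribution -> higher perplexity,
    so the relationship is monotone and bisection works."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = np.exp(-distances_i / (2.0 * sigma ** 2))
        p /= p.sum()
        perp = perplexity(p)
        if abs(perp - target_perp) < tol:
            break
        if perp > target_perp:   # distribution too flat: shrink sigma
            hi = sigma
        else:                    # too peaked: grow sigma
            lo = sigma
    return sigma
```

The user-facing knob is the perplexity, which can be loosely read as the effective number of neighbors each point cares about.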
I can’t speak for you, but the explanation in this paper doesn’t do a great job of explaining the algorithm; I had to seek out the original SNE paper for a linear explanation. In this paper the formula for $p_{j \lvert i}$ is given and $\sigma_i$ is defined. VDM says “The method for determining the value of $\sigma_i$ is presented later in the section.” I feel like if I were writing this paper I’d want to make it explicitly clear that this step is performed prior to the steps mentioned after that quote, but this fact is omitted and expected to be understood by the reader. (Not the way my mind works…)
In this paper, VDM states that the gradient descent minimization samples across ‘reasonable values of the variance of the Gaussian in the high dimensional space $\sigma_i$’. Tomorrow I am going to try to grok some SNE code to get a better idea of how this is done in real life.
The heart of this algorithm is the gradient,

$$\frac{\delta C}{\delta y_i} = 2 \sum_j (p_{j \lvert i} - q_{j \lvert i} + p_{i \lvert j} - q_{i \lvert j})(y_i - y_j),$$

and VDM presents this sickeningly interesting way of thinking about it:
Physically, the gradient may be interpreted as the resultant force created by a set of springs between the map point $y_i$ and other map points $y_j$. All springs exert a force along the direction $(y_i - y_j)$. The spring between $y_i$ and $y_j$ repels or attracts the map points depending on whether the distance between the two in the map is too small or too large to represent the similarities between the two high-dimensional data points. The force exerted by the spring between $y_i$ and $y_j$ is proportional to its length, and also proportional to its stiffness, which is the mismatch $(p_{j \lvert i} - q_{j \lvert i} + p_{i \lvert j} - q_{i \lvert j})$ between the pairwise similarities of the data points and the map points.
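The spring analogy translates almost literally into code. A sketch with my own naming, where row $i$ of `P` and `Q` holds $p_{j \lvert i}$ and $q_{j \lvert i}$ respectively:

```python
import numpy as np

def sne_gradient(P, Q, Y):
    """Gradient of the KL cost with respect to each map point:
    dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) * (y_i - y_j)."""
    S = (P - Q) + (P - Q).T                # spring "stiffness": the mismatch term
    diffs = Y[:, None, :] - Y[None, :, :]  # all pairwise directions (y_i - y_j)
    return 2.0 * np.sum(S[:, :, None] * diffs, axis=1)
```

Two properties fall straight out of the spring picture: the forces cancel pairwise (Newton's third law), and a perfect match between $p$ and $q$ leaves every spring relaxed.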
The function being descended is not convex, so each time we reach a minimum, we don’t know if it’s global. In addition, to avoid settling into the same crappy local minima on every run, a momentum term and perturbative Gaussian noise are added to the gradient updates. The paper continues to be annoyingly vague in this algorithmic description (it isn’t really an SNE paper after all, so I shouldn’t get too mad…). The gradient descent is initialized with map points sampled from an isotropic Gaussian with small variance centered on the origin; this variance has nothing to do with the $\sigma_i$ found earlier.
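Putting the pieces together, here is a toy version of the whole descent, with momentum, decaying Gaussian jitter, and the small isotropic Gaussian initialization just mentioned. All hyperparameter values here are my own guesses for illustration, not the paper's:

```python
import numpy as np

def sne_descent(P, n_iter=300, lr=0.1, momentum=0.8, noise=0.05, seed=0):
    """Toy SNE optimization: momentum gradient descent plus decaying jitter.
    P holds the high-dimensional conditionals p_{j|i} (each row sums to 1)."""
    rng = np.random.RandomState(seed)
    n = P.shape[0]
    Y = rng.randn(n, 2) * 1e-2        # small isotropic Gaussian around origin
    step = np.zeros_like(Y)
    for t in range(n_iter):
        # Low-dimensional similarities q_{j|i}, with sigma fixed at 1/sqrt(2).
        D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        Q = np.exp(-D)
        np.fill_diagonal(Q, 0.0)
        Q /= Q.sum(axis=1, keepdims=True)
        # Spring-force gradient of the summed KL divergences.
        S = (P - Q) + (P - Q).T
        grad = 2.0 * np.sum(S[:, :, None] * (Y[:, None, :] - Y[None, :, :]), axis=1)
        # Momentum update plus annealed Gaussian noise to escape bad minima.
        step = momentum * step - lr * grad
        Y += step + rng.randn(n, 2) * noise * (0.99 ** t)
    return Y
```

In practice you would run this several times with different seeds and keep the map with the lowest cost, matching the "repeat and choose" step.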
That’s it for today… It was a bit of a struggle, and I didn’t even cover all the points I wanted to, but alas, I’m tired. In the next posts, I will address some other parts of SNE, cover my efforts to understand some linear programming and RMSD stuff, and show how to make Cythonized code into a working Python module.
Thoughts, interesting quotes:
“A protein has a lot more ways to be close to another protein than a line has ways to be close to another line”
“A point on a two-dimensional manifold has 3 mutually equidistant points, a point on an m-dimensional manifold has m+1 equidistant points”