# Visualizing Data with t-SNE, part 1

Watching Katie Ledecky beat the field by 11 seconds in the 800M swim at the Olympics has led me to the conclusion that I am out of shape. I will never physically be where Katie Ledecky is, but I can use her work ethic as an inspiration to try a little harder academically. Using the Morning Paper as an example, I am going to try to Ledeckify my life a little bit and work hard at becoming more (intellectually) fit every day.

Today’s paper is ‘Visualizing Data with t-SNE’, and given my lack of expertise, I have decided to split this paper into posts. This post will cover:

## Stochastic Neighbor Embedding (SNE)

Which requires us to talk about:

• Data Visualization
• Various Probability Distributions
• The Crowding Problem
• Shannon Entropy
• Kullback-Leibler Divergence

As a refresher, the problem being addressed here is extracting structure from a set of data. Van der Maaten refers to the dimension-reduced data as a Map $Y$, while the dimension reduction applied to a particular point in the dataset is referred to as a map point $y_i$. The number of features in a map point can be be very high, while its intrinsic dimensionality can be much lower.

Van der Maaten brings up a typical handwritten digit dataset; a set of size-normalized (m-by-n pixel) images will be a set of points in a $R^{mn}$ dimensional vector space. The intrinsic dimensionality of this data should be 10, reflecting the digits 0 through 9. Extracting this structure is hard for computational algorithms and easy for humans.

For the purpose of visualizing data, it really only makes sense to reduce data to 2 or 3 dimensions. (If you’re a contrarian you’re thinking about color as an extra dimension, tesseracts and Trafalmadorians…) In this 2D or 3D representation, we would like points that are similar to be cluster. Using a k-nearest neighbor classifier on the dimension-reduced data should ideally accurately predict the what a new digit actually is.

## How does Stochastic Neighbor Embedding try to solve this problem?

Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between datapoints into conditional probabilities that represent similarities.

This conditional probability is calculated via the standard way:

In this case:

A crucial part is finding the appropriate variance $\sigma_i$ to describe the data.

For the low dimensional counterparts $y_i$ and $y_j$ of the high-dimensional datapoints $x_i$ and $x_j$ it is possible to compute a similar probability.

Because $\sigma_i$ is just some number, we can set it to $\frac{1}{\sqrt{2}}$ and things simplify to:

SNE finds a positioning of points in the low dimensional representation Map that attempts to minimize the difference between $q$ and $p$. It does this by minimizing the sum of the Kullback-Leibler divergences using a gradient descent method.

What is Kullback-Leibler divergence you ask? This is the expectation of the logarithmic distance between two probabilities. Using this as a cost function is good because it punishes Map that misrepresents points that are close in $X$ as distant in $Y$, but doesn’t do much for points that are close in $Y$ but not so in $X$.

If you’re confused reading the paper, good, because so am I… Van der Maaten’s (VDM) explanation explains the algorithm out of order in a way that is very confusing to me. The order of the algorithm is:

1. Use a binary search to find the appropriate $\sigma_i$ such that the probability distribution is equal to a Perplexity specified by the user

2. Use a stochastic gradient descent to organize the map $Y$.

3. Repeat and choose the optimal result

I can’t speak for you, but the explanation in this paper doesn’t do a great job at explaining the algorithm. I had to search out the original paper for a linear explanation. In the paper the formula for $p_{i \lvert j}$ is given and $\sigma_i$ is defined. VDM says “The method for determining the value of $\sigma_i$ is presented later in the section.” I feel like if I was writing this paper I’d want to make it explicitly clear that this step is performed prior to the steps that are mentioned following that quote, but this fact is omitted and expected to be understood by the reader. (Not the way my mind works…)

In this paper, VDM states that the gradient descent minimization samples across ‘reasonable values of the variance of the Gaussian in the high dimensional space $\sigma_i$’, tomorrow I am going to go in and try grok some SNE code to get a better idea of how this is done in real life.

Physically, the gradient may be interpreted as the resultant force created by a set of springs between the map point $y_i$ and other map points $y_j$. All springs exert a force along the direction $(y_i -y_j)$. The spring between $y_i$ and $y_j$ repels or attracts the map points depending on whether the distance between the two in the map is too small or too large to represent the similarities between the two high-dimensional data points. The force exerted by the spring between $y_i$ and $y_j$ is proportional to its length, and also proportional to its stiffness, which is the mismatch $(p_{j \lvert i} −q_{j \lvert i} + p_{i \lvert j} −q_{i \lvert j})$ between the pairwise similarities of the data points and the map points.