# Visualizing Data with t-SNE, part 2

17 Aug 2016Soldiering on!

Yesterday I covered the barebone essentials behind the SNE algorithm. (We haven’t actually arrived at t-SNE yet.)

Before I move on, there were some bullet points that I didn’t cover:

- What does it mean to induce a probability distribution?
- What is Shannon entropy?
- What is a smooth measure?
- What is the crowding problem?

The answer to the first question is pretty simple. Given some $\sigma_i$, this is established a variance for our data. This establishes a set probability for every other point in the set, when every point is accounted for in a discrete set, we have a probability distribution $P_i$.

So I come from a physical chemistry background, and if you’re like me, when you think entropy, you think the Second Law and Third Law of Thermodynamics. Informally, entropy quantifies the order of a molecular system. A perfectly arranged crystal has entropy equal to 0, while the plasma swirling around in the universe has something much higher.

From wikipedia:

Shannon Entropyis a measure of unpredictability of information content.

…

English text, treated as a string of characters, has fairly low entropy, i.e., is fairly predictable. Even if we do not know exactly what is going to come next, we can be fairly certain that, for example, there will be many more e’s than z’s, that the combination ‘qu’ will be much more common than any other combination with a ‘q’ in it, and that the combination ‘th’ will be more common than ‘z’, ‘q’, or ‘qu’. After the first few letters one can often guess the rest of the word.

The Shannon Entropy of our probability distribution $P_i$ over the discrete set
of data is defined as the expectation value of the *Information Content* of X.
The information content is always a logarithm of a probability, for Shannon
Entropy, this is log base 2. This is considered the entropy measured in base 2
bits.

There might be some more formal definition somewhere, but when I think
expectation value of some random variable, in this case the *Information content*
of X, I just think multiply at each point by the probability at that point.
Summing over all points in a discrete set, the Shannon Entropy $H(P_i)$ is:

The *Perplexity* is simply 2 raised to this number, and is considered a ‘smooth
measure of the effective number of neighbors’.

What is a smooth measure? Well, this is a whole can of worms. If you’re a measure theorist I have the utmost respect for you. In my limited exposure from some brief introductions from kind Math professors and Wikipedia, a measure allows you to ascribe and compare sizes of subsets in a set. I think you need to take a few measure theory classes, read some books, maybe even get a Ph.D to have a strong understanding of ‘smoothness’.

Last stop on the SNE train before we get onto part 3 (we’re almost at t-SNE I promise!)

## The crowding problem!

Before I start, I have to define what it means for points to be mutually equidistant. Consider the two dimensional plane, and an equilateral triangle in the plane. For each vertex, the distance to the other two vertices is the same. A square can never have this property, a corner vertex will always be further away from an opposite corner than it will be from an adjacent corner.

It turns out that this a property that extends to n-dimensional Euclidean spaces. Objects with equidistance vertices can be constructed when the number of vertices is less than or equal to n+1.

When reducing the dimensionality of a dataset, this fact is problematic. There are more ways for things to be close in high-dimensions than there are in lower dimensions. When a bunch of data points are close together in some high-dimensional manifold, the area available for points to spread out on is much higher than in any 2D representation. Close points will occupy too much space in 2D representations. (I really wish I had an iPad so I could easily sketch some examples of what I’m talking about, but I think this is supremely cool.)

The notion of a set of springs finds a great use here; crowding will cause non-neighbors to be positioned in spots that are way too far away. This ruins the gradient descent, the ‘spring forces’ in the gradient are proportional to the distance, and the large number of distance points accumulate, pulling all the data into one ugly cluster. (At least that’s how I understand it…)

The next post will cover t-SNE in all its glory, thanks for reading!