
Is it possible to represent high-dimensional data in 2D? More details on t-Distributed Stochastic Neighbor Embedding (t-SNE)


Last week I read an article on Stochastic Neighbor Embedding; it was quite old, and a more recent and longer one was available. I decided to read that one too, to understand a bit better how it is possible to represent in a 2D plot data that has 10 or more dimensions.

This week's article is Visualizing Data using t-SNE by Laurens van der Maaten and Geoffrey Hinton (edited by Yoshua Bengio), which appeared in the Journal of Machine Learning Research; here are the link and the PDF: https://www.jmlr.org/papers/v9/vandermaaten08a.html

Using this technique you can, for instance, take the pictures in the MNIST data set (handwritten digit images) and, starting from the raw pixel data, place each picture on a plane so that all the sevens end up together, close to the nines, and so on… something quite magical.

A zoom on figure 7 from the article

Well, it is possible, but it requires a lot of CPU: the computation is heavy, and it involves optimizing a non-convex function, a painful task that requires trying many parameters before getting a good picture. Anyway, it is quite magical to see the pictures arranged together by visual similarity.
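To get a feeling for what this looks like in practice, here is a minimal sketch that runs t-SNE on the small 8×8 digits data set bundled with scikit-learn (a lightweight stand-in for MNIST). It uses scikit-learn's TSNE rather than the authors' original implementation, and the parameter values are just reasonable defaults I picked, not the ones from the paper.

    # Minimal sketch: t-SNE on scikit-learn's small "digits" data set
    # (8x8 handwritten digits). Assumes scikit-learn and matplotlib are installed.
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    digits = load_digits()                      # 1797 images, 64 pixel values each
    X, y = digits.data, digits.target

    # perplexity ~ how many neighbors each point "expects" (5-50 is the usual range)
    tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
    Y = tsne.fit_transform(X)                   # n x 2 array of map coordinates

    plt.scatter(Y[:, 0], Y[:, 1], c=y, cmap="tab10", s=5)
    plt.colorbar(label="digit")
    plt.show()

Each digit class should show up as its own cluster in the scatter plot, much like in the figure above.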

How does it work? It is all about the probability that one data point picks another one as its neighbor because they are close in the high-dimensional space. Yes, you need to compare each data point with all the others, so the complexity is n², and you will not be able to plot a map of a million data points.

Each high-dimensional point x_i has an associated position y_i in the 2D picture: ideally, when two x points are far apart, the corresponding y points should be far apart too, and vice versa. A phrase in the paper is illuminating: “in 10 dimensions it is possible to have 11 equidistant neighbors, and there is no way to model this faithfully in a two-dimensional map”. So what t-SNE does is privilege keeping similar points close together; the other goal, keeping dissimilar points apart, is given less importance. The idea is that the high-dimensional space contains clusters of data, and you want to see them in the 2D plot.
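The quote about equidistant neighbors is easy to verify. Here is a tiny check of my own (not from the paper, assuming numpy and scipy are available): the 11 standard basis vectors of an 11-dimensional space are mutually equidistant and lie in a 10-dimensional affine subspace, while in 2D at most 3 points can be mutually equidistant.

    # 11 mutually equidistant points living in a 10-dimensional subspace
    import numpy as np
    from scipy.spatial.distance import pdist

    points = np.eye(11)                         # the 11 standard basis vectors
    distances = pdist(points)                   # all 55 pairwise distances
    print(np.allclose(distances, np.sqrt(2)))   # True: every pair is at distance sqrt(2)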

Coming to the math, what is the difference between SNE and t-SNE?

In SNE, p_(j|i) is the probability that point i picks point j as its neighbor, based on a Gaussian centered on x_i:

p_(j|i) = exp(−||x_i − x_j||² / 2σ_i²) / Σ_(k≠i) exp(−||x_i − x_k||² / 2σ_i²)

Notice that p_(j|i) is not equal to p_(i|j). With t-SNE this is simplified by using the symmetrized joint probability:

p_ij = (p_(j|i) + p_(i|j)) / (2n)

where n is the number of data points. This symmetrization also guarantees that every point contributes a non-negligible amount to the cost function (Σ_j p_ij > 1/(2n) for every i), which helps handle cases such as outliers that are far from all the other points. The σ_i is chosen point by point, using a meta-parameter called perplexity, which roughly specifies how many neighbors you want for each point (5–50 seems a good range). σ_i has to vary because the density varies from place to place.
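To make the perplexity mechanism concrete, here is a simplified, unoptimized sketch of the affinity computation: for each point a binary search adjusts σ_i until the perplexity of its conditional distribution matches the target, and the conditionals are then symmetrized into p_ij. The function names are mine, not from the paper or any library.

    import numpy as np

    def conditional_probs(d2_row, sigma):
        # p_(j|i) for one point i, given its squared distances to all other points
        p = np.exp(-d2_row / (2.0 * sigma ** 2))
        return p / p.sum()

    def find_sigma(d2_row, target_perplexity, iters=50):
        # binary search on sigma so that 2^entropy matches the target perplexity
        lo, hi = 1e-10, 1e10
        for _ in range(iters):
            sigma = (lo + hi) / 2.0
            p = conditional_probs(d2_row, sigma)
            entropy = -np.sum(p * np.log2(p + 1e-12))
            if 2.0 ** entropy > target_perplexity:
                hi = sigma      # distribution too flat: shrink sigma
            else:
                lo = sigma      # distribution too peaked: grow sigma
        return sigma

    def joint_p(X, perplexity=30.0):
        n = X.shape[0]
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared distances
        P = np.zeros((n, n))
        for i in range(n):
            row = np.delete(d2[i], i)                                # a point never picks itself
            sigma = find_sigma(row, perplexity)
            P[i, np.arange(n) != i] = conditional_probs(row, sigma)  # p_(j|i)
        return (P + P.T) / (2.0 * n)                                 # symmetrized p_ij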

The other difference between SNE and t-SNE is the probability of two points being close in the 2D map:

q_(j|i) = exp(−||y_i − y_j||²) / Σ_(k≠i) exp(−||y_i − y_k||²)   (SNE)

q_ij = (1 + ||y_i − y_j||²)⁻¹ / Σ_(k≠l) (1 + ||y_k − y_l||²)⁻¹   (t-SNE)

Points i and j have probability p_ij of being neighbors in the high-dimensional space; they are mapped to points y_i and y_j in the 2D map, where they have probability q_ij of being neighbors. The first equation is for SNE and, again, it is asymmetric; the second one is for t-SNE and is obtained from a Student t-distribution with one degree of freedom. The second equation solves various problems:

  • It always keeps a slight repulsion between distant points, which helps scatter the points across the 2D space. SNE suffered from the “crowding” problem: points were too often placed near the center of the map.
  • It forces points that are distant in the high-dimensional space to move away from each other in the 2D space as well. This was not the case with SNE, as shown in Figure 1 of the paper.
  • It makes the gradient used to optimize the 2D point locations faster to compute: there are no exponentials in it (see the sketch after this list).
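Here is the corresponding sketch for the 2D side (again a simplified illustration of mine, not the reference implementation): the Student-t affinities q_ij and the t-SNE gradient only use the 1 / (1 + distance²) kernel, with no exponentials anywhere.

    import numpy as np

    def joint_q(Y):
        d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        inv = 1.0 / (1.0 + d2)            # Student-t kernel, one degree of freedom
        np.fill_diagonal(inv, 0.0)        # a point is not its own neighbor
        return inv / inv.sum(), inv

    def tsne_gradient(P, Y):
        # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2)
        Q, inv = joint_q(Y)
        diff = Y[:, None, :] - Y[None, :, :]                  # y_i - y_j
        weights = (P - Q) * inv                               # (p_ij - q_ij) / (1 + d^2)
        return 4.0 * np.einsum('ij,ijk->ik', weights, diff)   # sum over j

    # One plain gradient-descent step on the map points (the paper also uses
    # momentum and "early exaggeration", omitted here):
    # Y = Y - learning_rate * tsne_gradient(P, Y)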

The paper also presents a comparison between t-SNE and other methods on a few test data sets. t-SNE works better, yet the authors report some issues:

  • As the complexity is quadratic, you can’t apply it directly to huge data sets.
  • The choice of the Student t-distribution may not generalize well to target dimensions other than 2 or 3.
  • Using Euclidean distance in the high-dimensional space can work poorly for many data sets.
  • The cost function to optimize is non-convex: you will need to choose some hyper-parameters and maybe wait hours to see the result.

Written by Giovanni

June 17, 2023 at 5:39 pm

Posted in Varie
