Is it possible to represent high-dimensional data in 2D?
Last week I was reading an article on Semantic Hashing and I found an interesting picture: it showed some clusters of text documents in 2D. Each dot in the picture was one document, some dots were near, some far, and each cluster had dots of a different color. It may not seem like much, but how do you decide where to put a document on a 2D plane? There were no axes at all in the picture! Moreover, in that paper each document was treated as a vector of word counts, with the vector length equal to the size of the dictionary. To sum up, the picture was projecting points from a space with thousands of dimensions onto a 2D surface; a remarkable achievement!

Fortunately the paper included a reference to another one: Stochastic Neighbor Embedding (SNE) by Geoffrey Hinton and Sam Roweis, which appeared in Advances in Neural Information Processing Systems 15 (NIPS 2002). Here is the PDF link: https://papers.nips.cc/paper_files/paper/2002/file/6150ccc6069bea6b5716254057a194ef-Paper.pdf . If you search for Stochastic Neighbor Embedding on Wikipedia you will also find a reference to a more recent and longer paper on t-SNE, an evolution of SNE that uses the Student's t-distribution.
I did not know it, but there are many techniques that try to address the problem of displaying high-dimensional data in 2D/3D. After all, even a painter, using perspective, represents 3D data on a 2D surface; of course here the problem is much harder because you have to deal with hundreds of dimensions. In machine learning problems you often have clusters of data that you want to identify: they live in high-dimensional spaces, but they are somehow restricted to a limited region. Here comes the idea: you want a criterion that maps nearby data points to nearby dots in 2D. In the SNE paper the authors start from the usual Euclidean distance and derive a probabilistic estimate of how likely two data points are to be neighbors.

Here d_ij is the distance between documents i and j, divided by a sigma constant; the x's can be the word-count vectors. Notice that sigma carries the index i: it varies from document to document. The reason for this choice seems to be the varying data density: in some regions the documents are closer together, in others they are much sparser. The sigmas should be chosen so that each point has a good number of neighbors. I understood that this is not easy to do, and also that with this model d_ij will differ from d_ji.
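To make the formula concrete, here is a minimal NumPy sketch of how I would compute those scaled squared distances (my own code, not the authors'). X is assumed to be the matrix of word-count vectors, one row per document, and the sigmas are just placeholders, since the paper actually tunes one per point; the 2*sigma^2 scaling follows the paper's Gaussian form.

```python
import numpy as np

def scaled_sq_distances(X, sigmas):
    """d2[i, j] = ||x_i - x_j||^2 / (2 * sigma_i^2), as I read the formula."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances via the usual expansion.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative rounding errors
    return sq_dists / (2.0 * sigmas[:, None] ** 2)
```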

From the distance they derive the probability that document i picks document j as its neighbor. Notice that the exponential has a minus sign inside, so a large distance gives a very low probability; and again, since d_ij ≠ d_ji, you will also have p_ij ≠ p_ji.
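As a sketch of this step too, each row of probabilities is just a softmax of the negative scaled distances from the previous snippet, with the diagonal zeroed out so a document cannot pick itself as its own neighbor:

```python
import numpy as np

def neighbor_probabilities(scaled_d2):
    """p[i, j] = exp(-d2[i, j]) / sum over k != i of exp(-d2[i, k])."""
    p = np.exp(-scaled_d2)
    np.fill_diagonal(p, 0.0)                 # a point is not its own neighbor
    return p / p.sum(axis=1, keepdims=True)  # normalize each row
```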
Each document x will be mapped to a point y in the Cartesian plane. You can define the neighbor probability in the plane with more or less the same formula used for p_ij. In this case they decided to use a single sigma rather than one per data point. From this they define an optimization problem:

If p_ij is small (i and j are not neighbors), you do not care much what the distance between points i and j in the Cartesian plane turns out to be (q_ij can be small or large, it will not change much). If instead p_ij is high, meaning i and j are close, you want q_ij to be high too: the dots in the Cartesian plane should be close. The usual x and y axes have absolutely no meaning here; only the distances matter.
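Here is my sketch of both pieces: the q_ij computed from the 2D coordinates Y with a single, fixed sigma, and the cost C, which sums p_ij * log(p_ij / q_ij) over all pairs (a Kullback-Leibler divergence per point). The eps is just my guard against taking the log of zero, not something from the paper.

```python
import numpy as np

def low_dim_probabilities(Y):
    """q_ij in the plane: same formula as p_ij, but with one fixed sigma for everyone."""
    sq_norms = np.sum(Y ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T
    q = np.exp(-np.maximum(sq_dists, 0.0))
    np.fill_diagonal(q, 0.0)                 # again, no self-neighboring
    return q / q.sum(axis=1, keepdims=True)

def cost(P, Q, eps=1e-12):
    """C = sum over i, j of p_ij * log(p_ij / q_ij)."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```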
In the end, to obtain the amazing picture presented in the paper, the method starts by assigning a random position in the plane to each document. Then, using gradient descent, the y coordinates are gradually adjusted to reduce the cost function C. This means the method is not super fast, since it has to run for a while to find a good representation. Finding good sigmas also does not seem easy. But this was just the original idea; I have started reading the t-SNE paper and it seems many of these difficulties were addressed. I plan to finish reading it next week and blog about my findings, stay tuned.
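To tie it all together, here is a rough sketch of that optimization loop. It reuses low_dim_probabilities from the snippet above; the gradient expression in the comment is the one reported in the SNE paper, while the learning rate, iteration count, and initial scale are arbitrary placeholders of mine.

```python
import numpy as np

def sne_layout(P, n_iter=1000, learning_rate=0.1, seed=0):
    """Start from random 2D positions and follow the gradient of C downhill."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(P.shape[0], 2))  # random initial placement
    for _ in range(n_iter):
        Q = low_dim_probabilities(Y)                  # from the previous sketch
        # Paper's gradient: dC/dy_i = 2 * sum_j (p_ij - q_ij + p_ji - q_ji) * (y_i - y_j)
        coeff = P - Q + P.T - Q.T
        grad = 2.0 * (np.diag(coeff.sum(axis=1)) - coeff) @ Y
        Y -= learning_rate * grad
    return Y
```

If I remember the paper correctly, the authors also add momentum and some random jitter to avoid poor local minima; even so, a plain loop like this already shows why the method takes a while, since every step recomputes all the pairwise q_ij.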