Giovanni Bricconi

My site on WordPress.com

Archive for June 2023

Is it possible to represent high-dimensional data in 2D? More details on t-Distributed Stochastic Neighbor Embedding (t-SNE)


Last week I read an article on Stochastic Neighbor Embedding; it was quite old, and a more recent and longer one was available. I decided to read that one too, to understand a bit better how it is possible to represent data with 10 or more dimensions in a 2D plot.

This week's article is “Visualizing Data using t-SNE” by Laurens van der Maaten and Geoffrey Hinton (edited by Yoshua Bengio), which appeared in the Journal of Machine Learning Research; here is the link to the page and the PDF: https://www.jmlr.org/papers/v9/vandermaaten08a.html

Using this technique you can, for instance, take the pictures in the MNIST data set (handwritten digit images) and, starting from the raw pixel data, place each picture on a plane so that all the sevens end up together, close to the nines, and so on… something quite magical.

A zoomed-in view of figure 7 from the article

Well, it is possible, but it requires a lot of CPU: the technique is genuinely complex, and it involves optimizing a non-convex function, a painful task that requires trying many parameters before you get a good picture. Anyway, it is quite magical to see the pictures arranged by visual similarity.

How does it work? It is all about the probability that one data point picks another as its neighbor because they are close in the high-dimensional space. Yes, you need to compare each data point with all the others, so the complexity is n², and you will not be able to plot a map of a million data points.

Each high-dimensional point X is associated with a position x in the 2D picture: ideally, when two X points are far apart, the corresponding x points should be far apart too, and vice versa. A phrase in the paper is illuminating: “in 10 dimensions it is possible to have 11 equidistant neighbors, and there is no way to model this faithfully in a two-dimensional map”. So t-SNE privileges keeping similar points close together, while the other goal, keeping dissimilar points apart, is given less weight. The idea is that in the high-dimensional space you have clusters of data, and you want to see them in the 2D plot.

Coming to the math, what is the difference between SNE and t-SNE?

p_{j|i} is the probability that point i picks point j as its neighbor in SNE. Notice that p_{j|i} is not equal to p_{i|j}; t-SNE simplifies this by symmetrizing the two conditional probabilities.
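If I remember the paper's definition correctly, the symmetrized probability is

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}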

Here n is the number of data points. This operation also imposes a minimum value on p_ij, which helps with cases such as outliers that are far from all the other points. Each σ is chosen point by point, using a meta-parameter called perplexity, which more or less specifies how many neighbors you want each point to have (5 to 50 seems a good range); σ has to vary because the data density varies from place to place.
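To get a feeling for how the σ search can work, here is a minimal numpy sketch (my own code, not the authors'): a binary search adjusts σ_i until the entropy of the resulting neighbor distribution matches the requested perplexity.

    import numpy as np

    def p_conditional(sq_dists_i, sigma):
        """Gaussian neighbor probabilities p_{j|i} for one point i, given its sigma_i.
        sq_dists_i holds the squared distances from point i to every other point."""
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p = np.maximum(p, 1e-12)          # avoid dividing by zero for tiny sigmas
        return p / p.sum()

    def sigma_for_perplexity(sq_dists_i, target_perplexity=30.0, tol=1e-4):
        """Binary search for the sigma_i whose distribution has the requested perplexity
        (perplexity = 2**H, with H the Shannon entropy of p_{j|i} in bits)."""
        lo, hi = 1e-10, 1e10
        for _ in range(100):
            sigma = (lo + hi) / 2.0
            p = p_conditional(sq_dists_i, sigma)
            perplexity = 2.0 ** (-np.sum(p * np.log2(p)))
            if abs(perplexity - target_perplexity) < tol:
                break
            if perplexity > target_perplexity:
                hi = sigma                # too many effective neighbors: shrink sigma
            else:
                lo = sigma                # too few: enlarge sigma
        return sigma

Here sq_dists_i would be one row of the n×n squared-distance matrix, with the entry for the point itself removed.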

The other difference between SNE and t-SNE is how the probability that two points are neighbors is defined in the 2D map.
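From memory, the two formulas are, for SNE and t-SNE respectively,

q_{j|i} = \frac{\exp(-\lVert y_i - y_j \rVert^2)}{\sum_{k \neq i} \exp(-\lVert y_i - y_k \rVert^2)}
\qquad
q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}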

Points i and j have probability p_ij of being neighbors in the high-dimensional space; they are mapped to points in the 2D image, where they have probability q_ij of being neighbors. The first equation is the SNE one, and again it is asymmetric; the second one is used by t-SNE and is obtained from the Student t-distribution. The second equation solves several problems:

  • It keeps a slight repulsion between distant points, which helps scatter the points in the 2D space. SNE had a “crowding” problem: points were too often placed near the center of the image.
  • It forces points that are distant in the high-dimensional space to move away from each other in the 2D space as well. This was not the case with SNE, as shown in figure 1 of the paper.
  • It makes the gradient used to optimize the 2D positions faster to compute – no exponentials there (see the sketch right after this list).
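To make the last point concrete, here is a small numpy sketch (mine, not the authors' code) of how the t-Student similarities q_ij and the gradient of the cost can be computed; notice that only divisions appear, no exponentials.

    import numpy as np

    def tsne_grad(P, Y):
        """Gradient of the t-SNE cost with respect to the 2D positions Y (shape n x 2).
        P is the symmetrized affinity matrix p_ij (n x n, zeros on the diagonal)."""
        diff = Y[:, None, :] - Y[None, :, :]     # all pairwise differences y_i - y_j
        sq = np.sum(diff ** 2, axis=-1)          # squared distances in the plane
        inv = 1.0 / (1.0 + sq)                   # Student-t kernel: no exponentials
        np.fill_diagonal(inv, 0.0)
        Q = inv / inv.sum()                      # low-dimensional probabilities q_ij
        return 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)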

The paper also presents a comparison between t-SNE and other methods on a couple of test data sets. t-SNE works better, but the authors still report some issues:

  • since the complexity is quadratic, you cannot apply it directly to huge data sets;
  • the choice of the Student t-distribution may not generalize well to target dimensions other than 2 or 3;
  • using the Euclidean distance in the high-dimensional space can work poorly for many data sets;
  • the cost function to optimize is non-convex: you will need to choose some hyper-parameters and perhaps wait hours to see the result (a minimal usage example follows this list).
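In practice you would not implement all of this yourself: scikit-learn ships an implementation in sklearn.manifold.TSNE. A minimal usage sketch on the small digits data set bundled with scikit-learn (the parameter values are just reasonable defaults, not a recommendation):

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    digits = load_digits()                   # 1797 8x8 digit images, 64 features each
    Y = TSNE(n_components=2,                 # target: a 2D map
             perplexity=30,                  # roughly how many neighbors per point
             random_state=0).fit_transform(digits.data)
    print(Y.shape)                           # (1797, 2): one (x, y) position per digit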

Written by Giovanni

June 17, 2023 at 5:39 pm

Posted in Varie

Is it possible to represent high-dimensional data in 2D?


Last week I was reading an article on Semantic Hashing and I found an interesting picture: it represented some clusters of text documents in 2D. Each dot in the picture was one document; some dots were near each other, some were far apart, and each cluster had dots of a different color. It does not seem much, but how do you decide where to put a document on a 2D plane? There were no axes at all in the picture! Moreover, in that paper each document was treated as a vector of word counts, whose length was the size of the dictionary. To sum up, the picture was projecting points from a space with thousands of dimensions onto a 2D surface; a remarkable achievement!

R. Salakhutdinov, G. Hinton / International Journal of Approximate Reasoning 50 (2009) 969–978

Fortunately the paper contained a reference to another one: “Stochastic Neighbor Embedding” (SNE) by Geoffrey Hinton and Sam Roweis, which appeared in Advances in Neural Information Processing Systems 15 (NIPS 2002). Here is the PDF link: https://papers.nips.cc/paper_files/paper/2002/file/6150ccc6069bea6b5716254057a194ef-Paper.pdf . If you search for Stochastic Neighbor Embedding on Wikipedia you will also find a reference to a more recent and longer paper on t-SNE, an evolution of SNE based on the Student t-distribution.

I did not know it, but there are many techniques that try to address the problem of displaying high-dimensional data in 2D or 3D. After all, even a painter using perspective is representing 3D data on a 2D surface; of course here the problem is much harder, because you have to deal with hundreds of dimensions. In machine learning problems you often have clusters of data that you want to identify; they live in high-dimensional spaces but are somehow confined to a region. Hence the idea: you want a criterion that maps nearby data to nearby dots in 2D. In the SNE paper the authors start from the usual Euclidean distance and derive a probabilistic estimate of the neighbor-ness between pairs of data points.
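If I remember the definition correctly, the scaled dissimilarity between two documents is

d_{ij}^2 = \frac{\lVert x_i - x_j \rVert^2}{2\sigma_i^2}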

Here d is the distance between documents i and j, divided by a σ constant; the x can be word-count vectors. Notice that σ carries the i index: it varies from document to document. The reason for this choice seems to be the varying data density: in some places the documents are close together, in other regions of the space they are much sparser. The σ values should be chosen so that each point gets a good number of neighbors. I understood that this is not easy to do, and also that d_ij will differ from d_ji in this model.

From the distance they derive the probability that document i picks document j as its neighbor. Notice that the exponential carries a minus sign, so a large distance gives a very low probability; and again, since d_ij ≠ d_ji, you will also have p_ij ≠ p_ji.
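In symbols (my reconstruction, using the d_ij defined above):

p_{ij} = \frac{\exp(-d_{ij}^2)}{\sum_{k \neq i} \exp(-d_{ik}^2)}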

Each document x will be mapped to a point y in the Cartesian plane. You can define the neighbor probability q_ij in the plane with more or less the same formula used for p_ij; in this case the authors decided to use a single σ, rather than one per data point. From this they define an optimization problem.
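If I recall it correctly, the cost to minimize is the sum of the Kullback–Leibler divergences between the two distributions:

C = \sum_i \mathrm{KL}(P_i \parallel Q_i) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}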

If p_ij is small (i and j are not neighbors), you do not care much about the distance between points i and j in the Cartesian plane (q_ij can be small or large, it will not change the cost much). If instead p_ij is high – i and j are close – you also want q_ij to be high: the dots in the Cartesian plane should be close. The usual x and y axes have absolutely no meaning here; only the distances matter.

In the end, to obtain the amazing picture presented in the paper, the method starts by assigning a random place in the plane to each document. Then, using gradient descent, the y coordinates are gradually adjusted to reduce the cost function C: this means the method will not be super fast, as it has to work for a while to find a good representation. Finding good σ values also does not seem easy. But this was just the original idea; I have started reading the t-SNE paper and it seems many of these difficulties have been addressed. I plan to finish reading it next week and blog about my findings, stay tuned.
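To make the procedure concrete, here is a rough numpy sketch of that loop (my own reconstruction, using the SNE gradient as reported in the t-SNE paper; the real method also adds momentum and some noise during the optimization):

    import numpy as np

    def q_conditional(Y):
        """Neighbor probabilities q_{j|i} in the plane (a single sigma, as in the paper)."""
        sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        e = np.exp(-sq)
        np.fill_diagonal(e, 0.0)
        return e / e.sum(axis=1, keepdims=True)

    def sne_layout(P, steps=1000, lr=10.0, seed=0):
        """Rough sketch of the optimization: random start, then plain gradient descent on C.
        P[i, j] holds p_{j|i}, the neighbor probabilities in the high-dimensional space."""
        rng = np.random.default_rng(seed)
        Y = rng.normal(scale=1e-2, size=(P.shape[0], 2))   # random initial 2D positions
        for _ in range(steps):
            Q = q_conditional(Y)
            coeff = (P - Q) + (P - Q).T        # (p_{j|i} - q_{j|i}) + (p_{i|j} - q_{i|j})
            diff = Y[:, None, :] - Y[None, :, :]
            Y -= lr * 2.0 * np.sum(coeff[:, :, None] * diff, axis=1)
        return Y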

Written by Giovanni

June 11, 2023 at 6:43 pm

Posted in Varie

What is Semantic Hashing?


Suppose you have a large collection of text documents: how could you identify those that are very similar to a given one? Traditional text hashing does not help, since it assigns completely different values to texts with small differences; you would instead want a magic function that gives similar codes to similar contents. In 2009 Ruslan Salakhutdinov and Geoffrey Hinton published the “Semantic Hashing” paper in the International Journal of Approximate Reasoning, showing how deep neural networks can solve this problem.

Suppose you have a text document: it is composed of words, and these words can be repeated many times. Some words are very similar, some are synonyms, there are inflected forms, and stemming can be useful; to a certain degree of approximation, cat, cats and catty can be considered the same word. A typical way to analyze texts is to build a vocabulary and count, for each document, how many times each term appears. This is the basis of the TF-IDF (Term Frequency – Inverse Document Frequency) technique. Counting words is not enough: rare words – those appearing in fewer documents – are more valuable than common ones, since paleontology makes a document much more specific than cat. Methods based on this concept are the reference in text-retrieval applications.
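As a toy illustration of the idea (my own numbers, nothing from the paper), here is how the inverse-document-frequency weighting plays out on a four-word vocabulary:

    import numpy as np

    vocab = ["cat", "the", "paleontology", "fossil"]
    counts = np.array([[3, 10, 0, 0],      # a document about cats
                       [0, 12, 2, 4],      # a document about paleontology
                       [1,  9, 0, 1]])     # a mixed document
    df = (counts > 0).sum(axis=0)          # in how many documents each word appears
    idf = np.log(counts.shape[0] / df)     # rare words get larger weights
    tfidf = counts * idf                   # "the" appears everywhere, so its weight is 0
    print(dict(zip(vocab, np.round(idf, 2))))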

In my previous posts I described autoencoders and deep belief networks; these concepts are needed to understand how semantic hashing works. Let's keep the vocabulary idea from TF-IDF and represent each document as a vector of numbers v, where each element is the number of times the corresponding word appears in the text. The larger the vocabulary, the longer the vectors v. We use v for visible, and not w for word, because semantic hashing is based on Restricted Boltzmann Machines, where we have visible and hidden variables.

With autoencoders we were interested in starting from a visible input and mapping it to a short representation, with a restricted number of hidden variables, that is good enough to reconstruct the provided input. In our context, we start from word counts, pass through a hidden hash vector h that represents the semantic content, and we want to make sure that h is good enough to recover the word counts provided as input.

To be useful, these h variables must have another property: they must be binary. For instance, we can map every text to a vector of 128 binary values, and this will represent its semantic content. 128 bits is just two memory words on many processors, so with a ridiculously small amount of memory you get a very good representation of the documents. Moreover, CPUs manipulate bits at tremendous speed, so comparing billions of such 128-bit codes can happen in milliseconds.
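To see why this is so cheap, here is a tiny Python sketch of the comparison (on real hardware it boils down to an XOR followed by a population count):

    def hamming(code_a: int, code_b: int) -> int:
        """Number of differing bits between two semantic hashes stored as integers."""
        return bin(code_a ^ code_b).count("1")   # XOR flags the differing bits

    # two 128-bit codes that differ in only two bits are "semantically close"
    doc_a = 0x0123456789ABCDEF0123456789ABCDEF
    doc_b = doc_a ^ (1 << 7) ^ (1 << 90)
    print(hamming(doc_a, doc_b))                 # -> 2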

Even better, if you choose a smaller size, 20 bits for instance, you get 1,048,576 combinations. That is small enough to be used as a memory address: to each value you assign a list containing the documents with that semantic hash. Small changes to an address identify similar semantics, so you can collect all the documents whose hash differs by at most 4 bits and find very similar documents with nearly no effort.
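A sketch of that lookup, with a hypothetical table mapping each 20-bit hash to its list of documents (table and query_hash are placeholders, not names from the paper):

    from itertools import combinations

    def nearby_addresses(code, n_bits=20, max_flips=4):
        """All 20-bit addresses within Hamming distance max_flips of a semantic hash."""
        yield code
        for k in range(1, max_flips + 1):
            for bits in combinations(range(n_bits), k):
                flipped = code
                for b in bits:
                    flipped ^= 1 << b
                yield flipped

    # candidates = [doc for addr in nearby_addresses(query_hash) for doc in table[addr]]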

The actual value of each h element has no human-understandable meaning; they are just cells in the deep belief network's memory. We can map a document to a useful code that we are not able to interpret, but that nevertheless makes the task “given document x, give me the most similar ones” achievable in a very short time.

Now let's have a look at the paper's content. The tests were made on two text corpora: the 20-newsgroups corpus, containing 18,845 postings taken from the Usenet newsgroup collection, and the Reuters Corpus Volume I, an archive of 804,414 newswire stories that have been manually categorized into 103 topics. Each text was converted into a vector of inputs v and used to train a Deep Belief Network like the one below:

The bottom layer receives the word counts, and the layers above perform the non-linear mapping that allows passing from a 2000-word vocabulary to a 128-bit semantic hash. Using just two layers, as in a plain restricted Boltzmann machine, is not enough for a good mapping. The weights in the layers are initialized randomly, but as you can see the architecture is four RBMs stacked one above the other. In this case it is possible to use the layer-by-layer learning procedure to speed up training. Once this is done, the learned parameters are good enough for fine tuning, and the architecture above is converted into an autoencoder:

As before, the original word counts enter from the bottom and are propagated through the neurons up to the 128-bit layer. At this stage some noise is added: this is a trick, since they want h to be made of bits, and the noise forces the sigmoid neurons to work away from the middle of their range, where the transition is roughly linear. The injected noise makes each sigmoid settle either close to 0 or close to 1.
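A minimal numpy sketch of the trick as I understand it (the weight matrix W, the bias b and the noise level are placeholders, not values from the paper):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def code_layer(inputs, W, b, rng, noise_std=4.0, train=True):
        """Sketch of the 128-unit code layer. During training, Gaussian noise is added to
        the input of the sigmoids so that they saturate near 0 or 1 (noise_std is a guess).
        At test time the activations are simply thresholded into bits."""
        pre = inputs @ W + b
        if train:
            return sigmoid(pre + rng.normal(scale=noise_std, size=pre.shape))
        return (sigmoid(pre) > 0.5).astype(np.uint8)   # the 128-bit semantic hash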

In the post on autoencoders I said that the decoder part should be simple, because otherwise the autoencoder can learn an identity function. Here there is a transition from a space of 2000 integers to a 128-bit space and then back to 2000 integers; moreover, the neuron weights W1, W2 and W3 are kept the same for the decoder part (top three layers) and the encoder part (bottom three layers). The results produced are very good.

For the model they use the following formulas.
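I rewrite them here from memory, so take the exact form with a grain of salt:

p(v_i = n \mid h) = \mathrm{Ps}\!\left(n,\; N \,\frac{\exp(\lambda_i + \sum_j h_j w_{ij})}{\sum_k \exp(\lambda_k + \sum_j h_j w_{kj})}\right)
\qquad
p(h_j = 1 \mid v) = \sigma\!\left(b_j + \sum_i w_{ij} v_i\right)

where \mathrm{Ps}(n, \lambda) = e^{-\lambda} \lambda^n / n! is the Poisson distribution and \sigma is the sigmoid.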

The first one is linked to the text-analysis problem: word counts are treated as independent arrival events, so the authors chose the Poisson distribution. They apply a normalization, multiplying by N, the document length; they say this is important to deal with documents of very different lengths. Ps is the Poisson distribution formula. For the h variables they just use the sigmoid, which is commonly used in neural networks and deals well with binary values. When they moved from two layers to four, they used the Poisson model only for the input layer. The λ and b coefficients are the biases; as usual, the w are the neuron weights.

They evaluated the performance of Semantic Hashing using precision and recall, showing that it works very well. Also, by combining Semantic Hashing with TF-IDF it is possible to improve both retrieval quality and retrieval time. The article provides many more details and is not too difficult to read; here is the link if you want to read it: https://www.cs.cmu.edu/~rsalakhu/papers/sdarticle.pdf

Written by Giovanni

June 4, 2023 at 11:28 am

Posted in Varie