What is Semantic Hashing?
Suppose you have a large collection of text documents: how could you identify those that are very similar to a given one? Traditional text hashing does not help, as it assigns a completely different value to texts with small differences; what you want instead is a magic function that gives similar codes to similar contents. In 2008 Ruslan Salakhutdinov and Geoffrey Hinton published the “Semantic Hashing” paper in the International Journal of Approximate Reasoning, showing how deep neural networks can solve this problem.
Suppose you have a text document: it is composed of words, and these words can be repeated many times. Some words are very similar, some are synonyms, case differences abound, and stemming can be useful; to a certain degree of approximation cat, cats and catty can be considered the same word. A typical way to analyze texts is to create a vocabulary and count how many times each term appears in each document. This is the basis of the TF-IDF (Term Frequency – Inverse Document Frequency) technique. Counting the words is not enough: rare words, those appearing in fewer documents, are more valuable than common words; paleontology makes a document much more specific than cat. Methods based on this concept are the reference in text retrieval applications.
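To make the idea concrete, here is a minimal sketch of this weighting on a made-up three-document corpus (the exact TF-IDF formula varies across implementations; this is just one common variant):

```python
import math
from collections import Counter

# Toy sketch of TF-IDF on a made-up three-document corpus.
docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "paleontology studies ancient cat fossils",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = Counter(word for doc in tokenized for word in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # Rare words (low document frequency) get a higher idf weight.
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

print(tfidf(tokenized[2]))
# 'paleontology' gets a high weight; 'cat' gets ~0 because it appears everywhere
```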
In my previous posts I have described Autoencoders and Deep Belief Networks; these concepts are needed to understand how semantic hashing works. Let’s keep the vocabulary concept from TF-IDF and represent each document as a vector of numbers v, where each element represents the number of times that word appears in the text. The larger the vocabulary, the longer the vectors v. Here we use v for visible rather than w for word, because semantic hashing is based on Restricted Boltzmann Machines, where we have visible and hidden variables.
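A minimal sketch of this representation, assuming a tiny made-up vocabulary:

```python
from collections import Counter

# Sketch: turn a document into its count vector v over a fixed vocabulary.
# The tiny vocabulary below is a made-up example.
vocabulary = ["cat", "dog", "mat", "paleontology", "fossil"]

def to_count_vector(text):
    counts = Counter(text.split())
    # v[i] = how many times vocabulary word i appears in the document.
    return [counts.get(w, 0) for w in vocabulary]

v = to_count_vector("the cat sat on the mat with another cat")
print(v)  # [2, 0, 1, 0, 0]
```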
With autoencoders we were interested in starting from a visible input and mapping it to a short representation with a restricted number of hidden variables, one that is good enough to reconstruct the provided input. In our context, we will start from the word counts, pass through a hidden hash vector h that represents the semantic content, and make sure that h is good enough to recover the word counts provided in input.
To be useful, these h variables must also have another property: they must be binary. For instance, we will map every text into a vector of 128 binary values, and this will represent its semantic content. 128 bits is just two memory words on many processors, so with a ridiculously small quantity of memory you get a very good representation of the documents. Moreover, CPUs manipulate bits at tremendous speed, so comparing billions of such 128-bit codes can happen in milliseconds.
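A small sketch of how cheap such a comparison is, using Python integers as 128-bit codes (the code values are invented):

```python
# Sketch: comparing two 128-bit semantic hashes with XOR and a popcount.
# The two codes below are made-up values, just to show the bit arithmetic.
code_a = 0x3F2A9C0011D4E7FF00AA55CC13579BDF
code_b = 0x3F2A9C0011D4E7FF00AA55CC13579B0F

def hamming(a: int, b: int) -> int:
    # Bits that differ survive the XOR; counting the 1s gives the distance.
    return bin(a ^ b).count("1")

print(hamming(code_a, code_b))  # a small distance means semantically close documents
```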
Even better, if you choose smaller sizes, 20 bits for instance, you have 1,048,576 combinations. This is small enough to be used as an address: to each number you assign a list containing the documents associated with that semantic hash. Small bit changes to an address identify similar semantics, so you can decide to collect all the documents whose codes differ by just 4 bits and find very similar documents with nearly no effort.
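A rough sketch of this lookup scheme, with made-up codes and document ids; the neighbor enumeration here is just one straightforward way to probe all addresses within a few bit flips:

```python
from collections import defaultdict
from itertools import combinations

# Sketch: 20-bit semantic hashes used as memory addresses.
# The codes and document ids below are made up for illustration.
index = defaultdict(list)  # address -> list of document ids
index[0b01011100001110100110].append("doc-1")
index[0b01011100001110100111].append("doc-2")  # differs by one bit

def neighbors(code, n_bits=20, radius=4):
    """Yield every address within `radius` bit flips of `code`."""
    yield code
    for r in range(1, radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p
            yield flipped

def lookup(code):
    found = []
    for address in neighbors(code):
        found.extend(index.get(address, []))
    return found

print(lookup(0b01011100001110100110))  # ['doc-1', 'doc-2'], with no full scan
```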
The actual values of the individual h elements have no human-understandable meaning; they are just cells in the deep belief network’s memory. We can map a document to a useful code that we ourselves cannot interpret, but it makes the task “given document x, give me the most similar ones” achievable in a very short time.
Now let’s have a look at the paper content. The tests were made on two text corpora: the 20-newsgroups corpus contains 18,845 postings taken from the Usenet newsgroup collection, while the Reuters Corpus Volume I is an archive of 804,414 newswire stories that have been manually categorized into 103 topics. Each text has been converted into a vector v of inputs and used to train a Deep Belief Network like the one below:
[Figure: a stack of RBMs mapping the word-count vector at the bottom layer to the 128-bit semantic hash at the top]
The bottom layer receives the word counts, and the layers above perform the non-linear mapping that allows passing from a 2000-word vocabulary to a 128-bit semantic hash. Using just two layers, as in a single restricted Boltzmann machine, is not enough to obtain a good mapping. The weights in the layers are initialized randomly, but as you can see the architecture is a stack of four RBMs, one above the other. In this case it is possible to use the layer-by-layer learning procedure to speed up the learning; a rough sketch of it is given below. Once this is done, the learned parameters are good enough for fine-tuning.
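A very rough sketch of this layer-by-layer pre-training, assuming one step of contrastive divergence, plain binary visible units (the paper actually uses a Poisson model for the bottom layer, see further below) and invented layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.05):
    """Train one RBM with a single step of contrastive divergence (CD-1)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: reconstruct the visibles, then the hiddens again.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # Parameter updates, averaged over the batch.
        n = data.shape[0]
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_hid

def pretrain_stack(data, layer_sizes):
    """Greedily train one RBM per layer, feeding its activations upward."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b_hid = train_rbm(x, n_hidden)
        weights.append((W, b_hid))
        x = sigmoid(x @ W + b_hid)  # becomes the input of the next RBM
    return weights

# Toy binary "documents"; the real input would be 2000-dimensional word counts.
docs = (rng.random((64, 2000)) < 0.05).astype(float)
stack = pretrain_stack(docs, layer_sizes=[500, 500, 250, 128])  # sizes are assumptions
```

Once pre-training is done, the above architecture is converted into an autoencoder: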
[Figure: the stacked RBMs unrolled into an autoencoder, with the word counts entering at the bottom, the 128-bit code in the middle, and the reconstructed word counts at the top]
As before, the original word counts enter from the bottom and are propagated through the neurons up to the 128-bit layer. At this stage some noise is added: this is a trick, because they want h to be made of bits, and the noise forces the sigmoid neurons to work far from the value 0, where the transition is roughly linear. The injected noise makes the sigmoid work either close to 0 or close to 1.
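A minimal sketch of this trick, assuming Gaussian noise added to the inputs of the code layer (the standard deviation here is an arbitrary choice, not the paper’s value):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def code_layer(pre_activation, noise_std=4.0, training=True):
    """During training, Gaussian noise pushes the sigmoid into saturation;
    noise_std is an assumed value, not taken from the paper."""
    if training:
        pre_activation = pre_activation + rng.normal(0.0, noise_std, pre_activation.shape)
    probs = sigmoid(pre_activation)
    # At retrieval time the saturated activations are simply thresholded into bits.
    return probs if training else (probs > 0.5).astype(np.uint8)

bits = code_layer(np.array([-2.3, 0.1, 3.7]), training=False)  # -> array([0, 1, 1], dtype=uint8)
```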
In the post on autoencoders I said the decoder part should be simple, because otherwise the autoencoder can learn an identity function. Here there is a transition from a space of 2000 integers to a 128-bit space and then back to 2000 integers, and moreover the neuron coefficients W1, W2 and W3 are kept at the same values for the decoder part (top 3 layers) and the encoder part (bottom 3 layers). The results produced are very good.
For the model they are using these formulas:
[Formulas: the Poisson model for the visible word counts and the sigmoid for the hidden variables h]
The first one is linked to the text analysis problem. Word counts are considered as independent arrival events, so they have chosen to use the Poisson distribution. They apply some normalization, multiplying by N, which is the document length; they say this is important to be able to deal with documents of disparate lengths. Ps is the Poisson distribution formula. For the hidden variables h they just use the sigmoid, which is commonly used in neural networks and deals well with binary values. When they moved from 2 layers to 4 layers, they used the Poisson model just for the input layer. The lambda and b coefficients are the biases; as usual, the w are the neuron weights.
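Putting the description above into symbols, the formulas look roughly like this (a hedged reconstruction from the text above; the authoritative version is in the paper):

```latex
% Poisson model for the visible word counts, normalized by the document length N.
\[
P(v_i = n \mid \mathbf{h}) =
  \mathrm{Ps}\!\left(n,\; N \,
  \frac{\exp\!\big(\lambda_i + \sum_j h_j w_{ij}\big)}
       {\sum_k \exp\!\big(\lambda_k + \sum_j h_j w_{kj}\big)}\right),
\qquad
\mathrm{Ps}(n,\lambda) = \frac{e^{-\lambda}\lambda^{n}}{n!}
\]
% Sigmoid model for the binary hidden variables h.
\[
P(h_j = 1 \mid \mathbf{v}) = \sigma\!\Big(b_j + \sum_i w_{ij} v_i\Big),
\qquad
\sigma(x) = \frac{1}{1+e^{-x}}
\]
```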
They evaluated the performance of Semantic Hashing using precision and recall, showing that it works very well. Also, by combining Semantic Hashing with TF-IDF, it is possible to improve both retrieval quality and retrieval time. The article provides many more details and is not too difficult to read; this is its link if you want to read it: https://www.cs.cmu.edu/~rsalakhu/papers/sdarticle.pdf