Language models based on automatic feature extraction
Can a neural network, analyzing a large text corpus, group together on its own words with similar meanings or roles in a phrase?
Last week I was reading an article on Curriculum Learning, and one of the examples was about predicting the next word in a phrase, knowing the previous five words. The example explained that each word was mapped to an m-dimensional vector, and these vectors were used to predict the next word. I did not know about this technique, so this week I searched for a paper to learn more. Nowadays there are many models already available that do this, but I wanted to understand the origin of the technique, so I read the referenced article:
A Neural Probabilistic Language Model. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin.
Journal of Machine Learning Research 3 (2003) 1137–1155
OK, dating from 2003 it is certainly not the most recent article on the subject, but it explains this technique clearly and step by step. Language corpora are composed of streams of millions of words, taken from news, from Wikipedia, or from other sources. It is possible to analyze these streams and come up with some statistics: how many words are known, which are the rarest or the most common, which words follow one specific word, and so on. Usually the rarest words are dropped, or rather replaced with a special symbol; words with uppercase and lowercase letters are considered different, and punctuation marks are taken into account. With this information, it is possible to build an n-gram model: let V be the vocabulary set, then you can compute
P(w_t = v | w_(t-1), w_(t-2), …, w_(t-n+1)), for every v in V
where w_t is the word at position t: you use the context of the last n-1 words to predict the next one. Of course this is a simplification, and you cannot push n to a large number, because the problem becomes unmanageable and the samples become very sparse. Another important problem is that this model does not generalize: if the corpus contains “the dog bites the sheep”, nothing in the n-gram model tells you that “the dog bites the goat” is also a valid phrase.
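To make the n-gram idea concrete, here is a minimal sketch in Python; the toy corpus, the trigram order and the relative-frequency estimate are just illustrative assumptions, not the setup used in the paper:

```python
from collections import Counter, defaultdict

# Toy corpus; a real corpus would be millions of words from news or Wikipedia.
corpus = "the dog bites the sheep . the dog chases the cat .".split()

n = 3  # trigram model: predict w_t from the previous n-1 = 2 words

# Count how often each word follows each (n-1)-word context.
counts = defaultdict(Counter)
for i in range(len(corpus) - n + 1):
    context = tuple(corpus[i:i + n - 1])
    next_word = corpus[i + n - 1]
    counts[context][next_word] += 1

def p_next(word, context):
    """P(w_t = word | last n-1 words), estimated by relative frequency."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

print(p_next("bites", ("the", "dog")))   # 0.5 in this toy corpus
print(p_next("goat", ("dog", "bites")))  # 0.0 -> the model cannot generalize
```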
For the n-gram model each word is distinct and there is no relation between goat and sheep; we would like a better model, where we can represent a sort of distance between words, so that “the dog bites the sheep” can be represented as “the dog bites X” where X is similar to “sheep”. We know from school that there are grammatical classes such as nouns and verbs, and words can be classified in many other ways, so that some words represent animals, things, concepts, and so on. These classifications come from human experience, and it would be extremely cumbersome to compile a database with all these relations by hand. The core idea presented in “A Neural Probabilistic Language Model” is that:
You can fix in advance a number of dimensions m and let a neural network learn how to place each word in these dimensions. This is achieved by maximizing the probability of predicting the next word in a phrase, given the previous n-1 words. The m dimensions hold real numbers, not discrete values, so it becomes possible to compute word similarities using a distance formula.
You will not end up with clearly interpretable dimensions: there will be no “noun” or “verb” dimension, because those are concepts from our culture, while this m-dimensional space simply comes out of a mathematical optimization. You could think of initializing these m-dimensional vectors with part-of-speech information, but that would certainly require a lot of work.
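Once each word has its m-dimensional vector, the “distance formula” part is straightforward. Here is a small sketch, assuming numpy and a hypothetical vector table filled with random placeholder numbers instead of real learned values:

```python
import numpy as np

# Hypothetical embedding table: one m-dimensional row per vocabulary word.
vocab = ["dog", "bites", "sheep", "goat", "the"]
m = 30
rng = np.random.default_rng(0)
C = rng.normal(size=(len(vocab), m))  # in the real model these rows are trained, not random

def cosine_similarity(word_a, word_b):
    """Similarity between two words as the cosine of the angle between their vectors."""
    a = C[vocab.index(word_a)]
    b = C[vocab.index(word_b)]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# After training, "sheep" and "goat" should score higher than "sheep" and "bites".
print(cosine_similarity("sheep", "goat"))
```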
The paper presents experiments with m=30, m=60 or m=100 dimensions, which is far smaller than the number of words in the vocabulary (for instance 17,000). You can think of it as a table with 17,000 rows, one for each word, and 30 to 100 real-valued columns that map the word into the m-dimensional space. Notice that this mapping does not change with the word's position in the stream; it is a kind of convolutional operation applied to words, so the number of parameters to be trained stays tractable. The next word prediction is then made by taking the m-dimensional vectors of the last n-1 words in the stream and combining them with a shallow neural network, with a softmax layer on top to choose the next word.
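As a rough sketch of how these pieces fit together, this is what the architecture might look like in PyTorch; the vocabulary size, hidden-layer width and context length are illustrative, and the paper's optional direct input-to-output connections are omitted:

```python
import torch
import torch.nn as nn

V = 17_000   # vocabulary size (illustrative)
m = 60       # word feature dimensions
n = 6        # predict the 6th word from the previous n-1 = 5 words
h = 50       # hidden units (illustrative)

class NPLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.C = nn.Embedding(V, m)                 # the |V| x m lookup table, shared across positions
        self.hidden = nn.Linear((n - 1) * m, h)
        self.out = nn.Linear(h, V)                  # one score per vocabulary word

    def forward(self, context):                         # context: (batch, n-1) word indices
        x = self.C(context).view(context.size(0), -1)   # concatenate the n-1 word vectors
        return self.out(torch.tanh(self.hidden(x)))     # logits; softmax turns them into probabilities

model = NPLM()
logits = model(torch.randint(0, V, (4, n - 1)))     # 4 fake contexts of 5 word indices
probs = torch.softmax(logits, dim=-1)               # P(w_t | previous n-1 words)
```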
Notice that, for the training to converge, a regularization term has to be introduced: the learned parameters will not diverge to infinite values because parameters with a large norm are penalized (weight decay).
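In a modern framework this penalty on large-norm parameters is typically expressed as weight decay in the optimizer. A minimal training step, reusing the sketch above with made-up batches and hyperparameters, might look like this:

```python
import torch

# Weight decay penalizes large-norm parameters, keeping training from diverging.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()   # negative log-likelihood of the true next word

contexts = torch.randint(0, V, (32, n - 1))   # fake batch: 32 contexts of n-1 word indices
targets = torch.randint(0, V, (32,))          # fake batch: the word that actually follows

optimizer.zero_grad()
loss = loss_fn(model(contexts), targets)
loss.backward()
optimizer.step()
```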
Another interesting feature is that this technique can also handle new, unknown words, i.e. words that never appear in the training vocabulary V. You ask the model which words could appear in place of the unknown one: say you take the 10 most probable predictions. You then take the m-dimensional vectors of these 10 words and average them, weighted by their predicted probabilities. The result becomes the candidate representation for the new, unknown word, and you can inject this representation back into the model and use it in the future.
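A sketch of that averaging step, again reusing the model above; the context is fake, and the choice of the 10 most probable words is just the example from the text:

```python
import torch

# Suppose an unknown word appears after this (fake) 5-word context.
context = torch.randint(0, V, (1, n - 1))

with torch.no_grad():
    probs = torch.softmax(model(context), dim=-1).squeeze(0)   # P(w_t = v | context) for every v
    top_p, top_idx = probs.topk(10)                            # the 10 most likely known words
    top_p = top_p / top_p.sum()                                # renormalize over the 10 words
    # Probability-weighted average of their feature vectors becomes
    # the candidate m-dimensional representation of the unknown word.
    new_vector = (top_p.unsqueeze(1) * model.C(top_idx)).sum(dim=0)

print(new_vector.shape)   # torch.Size([60]) -> ready to be added as a new row of the lookup table
```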
Nowadays other techniques are used for this prediction problem. Using LSTM or GRU cells in the neural layer allows the network to remember some state for a long time, over a whole long phrase, which improves performance when dealing with masculine/feminine agreement and similar phenomena. It also makes it possible to extend the number n of words taken into account in the prediction.
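This is not the paper's model, but just to give an idea, a recurrent language model of that kind might look like the following sketch (sizes are placeholders):

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=17_000, m=60, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, m)
        self.lstm = nn.LSTM(m, hidden, batch_first=True)  # carries a state across the whole phrase
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, words):                  # words: (batch, any sequence length)
        h, _ = self.lstm(self.embed(words))    # one hidden state per position
        return self.out(h)                     # logits for the next word at every position

model_lstm = LSTMLanguageModel()
logits = model_lstm(torch.randint(0, 17_000, (2, 20)))   # two phrases of 20 word indices
```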