Giovanni Bricconi


Again on Gated Recurrent Units (GRU)


The article and the explanations I found last week were quite clear, but I wanted to read the paper that introduced the GRU:

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

I was expecting a paper listing all the good things a GRU can do, but I was quite surprised to find a paper on automatic language translation. The GRU was introduced in that context and has since become a useful tool in many other scenarios.

It has been a good opportunity to learn something about automatic machine translation, at least how it was done 10 years ago. At the time, the most effective approach was Statistical Machine Translation (SMT): these systems need tens of gigabytes of memory to hold the language models. The paper uses a neural approach instead: neural networks learn how to map a phrase from language A to language B. According to the authors, the memory required by the neural models is much smaller than what SMT requires. The neural translation approach was not introduced in this paper; it had been proposed by other authors a couple of years earlier.

The translation works this way: the input is a sequence of words in a source language (English). The input phrase has a variable length, but it ends with a period, a question mark, etc. A neural encoder module maps this phrase to a fixed-length representation, a vector of size d. This vector is then used by a neural decoder module that maps it to a sequence of words in the target language (French). It is as if all the knowledge in a phrase could be mapped onto a finite set of variables z_1…z_d, and as if this vector could be universal across all languages: a fascinating idea! Of course, if d is small the machine will not be able to handle long and complex phrases. This is what the authors found: the translators are quite good with phrases of 10 to 20 words, then performance starts to decrease. Also, whenever the machine encounters a word it does not know, it will not be able to produce a good translation: the machine is trained on pairs of phrases in the source and target languages, so new words are a problem.
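
To make the idea concrete, here is a minimal sketch (plain numpy, toy random weights, made-up names like `encode`; not the paper's actual model) of an encoder that reads a variable-length sequence of word vectors with a GRU and squeezes it into a single fixed-length vector of size d. A decoder would then generate the target words starting from that vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8      # size of the fixed-length representation
emb = 8    # size of the word embeddings (toy value)

rng = np.random.default_rng(0)
# GRU parameters: update gate z, reset gate r, candidate state
Wz, Uz = rng.normal(size=(d, emb)), rng.normal(size=(d, d))
Wr, Ur = rng.normal(size=(d, emb)), rng.normal(size=(d, d))
Wh, Uh = rng.normal(size=(d, emb)), rng.normal(size=(d, d))

def gru_step(x, h):
    """One GRU step: mix the previous state h with the new input x."""
    z = sigmoid(Wz @ x + Uz @ h)            # update gate
    r = sigmoid(Wr @ x + Ur @ h)            # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h)) # candidate state
    return (1 - z) * h + z * h_cand         # sign conventions vary by paper

def encode(word_vectors):
    """Map a variable-length phrase to one fixed-length vector of size d."""
    h = np.zeros(d)
    for x in word_vectors:
        h = gru_step(x, h)
    return h   # the vector the decoder would start from

# A 5-word "phrase" and a 12-word "phrase" both end up as d-dimensional vectors.
print(encode(rng.normal(size=(5, emb))).shape)   # (8,)
print(encode(rng.normal(size=(12, emb))).shape)  # (8,)
```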

Luckily for English and French speakers, there are good corpora that can be used to train the neural networks: I seriously doubt it would be possible to train a German-to-Greek model this way, or any other pair of European languages that excludes English. This is just another example of cultural bias in AI.

Another interesting aspect is how to evaluate the translation quality: a method called BLEU is the standard in this field. I had a quick look at "understand the BLEU Score" from the Google Cloud documentation. You need a source phrase and a reference translation; the candidate translation is then compared to the reference, looking for matching single words, pairs of words, and so on up to groups of four words. The groups must appear both in the candidate and in the reference phrase; there is also a penalty if the candidate translation is too short. An understandable translation should have a score of 30 to 40, a good one scores above 40. The results in the paper score around 30 when the phrases are not too long.
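
Just to see the mechanics, here is a rough sketch of the idea (a simplified, sentence-level version, not the official BLEU implementation, which works at corpus level and uses smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: clipped n-gram precision for n = 1..4,
    combined with a geometric mean and a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # count only n-grams that also appear in the reference (clipped)
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    precision = math.exp(sum(log_precisions) / max_n)
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * precision

print(sentence_bleu("the cat is on a mat", "the cat is on the mat"))  # ≈ 54
```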

Coming back to the paper, what it actually introduces is the "gated recursive convolutional neural network": so there are gated units, as I expected, but I did not expect to see them composed into a convolutional network. There are two reasons for this: the first is that the input phrases have variable lengths, and to handle this the gated units are composed into a sort of pyramid. The second is that convolutional networks share the same parameters across units, drastically reducing the number of parameters to be learned. Citing the authors: "…another natural approach to dealing with variable-length sequences is to use a recursive convolutional neural network where the parameters at each level are shared through the whole network (see Fig. 2 (a))." In the picture below you can see an example of how the gated units are laid out:

Figure 6a in the paper
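
As a rough illustration (my own toy sketch, not the paper's code), the pyramid can be pictured as repeatedly merging neighbouring vectors with one shared `combine` function until a single vector is left; the gated version of `combine` is sketched after the next paragraph.

```python
import numpy as np

def reduce_pyramid(word_vectors, combine):
    """Collapse a variable-length list of word vectors level by level.
    The same combine(left, right) parameters are reused everywhere,
    which is what keeps the parameter count small."""
    level = list(word_vectors)
    while len(level) > 1:
        # each node of the new level merges two neighbours of the level below
        level = [combine(level[j], level[j + 1]) for j in range(len(level) - 1)]
    return level[0]   # fixed-length representation of the whole phrase

# toy combine, just to show the shapes; the real unit is gated (see below)
d = 8
rng = np.random.default_rng(0)
Wl, Wr = rng.normal(size=(d, d)), rng.normal(size=(d, d))
toy_combine = lambda left, right: np.tanh(Wl @ left + Wr @ right)

sentence = rng.normal(size=(9, d))                    # nine word vectors
print(reduce_pyramid(sentence, toy_combine).shape)    # (8,)
```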

Another interesting aspect is that in the paper the gated units have a left and a right input: the word (or node) on the left and the word (or node) on the right. The hidden state h (or c, as a cell state) is updated by copying the left input, copying the right input, or resetting it to a combination of the two.
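
The update rule, as I reconstruct it from the paper (so take the exact details with a grain of salt), mixes three candidates: keep the left child, keep the right child, or take a fresh combination of the two; three gating coefficients, normalized to sum to one, decide which of the three wins. A toy version of such a combine:

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
Wl, Wr = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # new activation
Gl, Gr = rng.normal(size=(3, d)), rng.normal(size=(3, d))   # gating

def gated_combine(h_left, h_right):
    """Gated unit (my reconstruction): the output is mostly a copy of the
    left child, a copy of the right child, or a fresh combination of both."""
    h_new = np.tanh(Wl @ h_left + Wr @ h_right)
    logits = Gl @ h_left + Gr @ h_right
    w_new, w_left, w_right = np.exp(logits) / np.exp(logits).sum()  # sum to 1
    return w_new * h_new + w_left * h_left + w_right * h_right
```

Plugged into the reduce_pyramid loop above in place of toy_combine, this is (roughly) what builds the tree-like structures shown in the figure.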

The coefficients in the first equation are normalized so that the next state depends mostly on one of the three terms. The index j is a positional counter from left to right, while t indexes, as usual, the time evolution (the level in the pyramid).
As you can see in the pyramid example above, some lines between the gated units are missing: their gating coefficients are close to zero for the current input, so that contribution is not taken into account. In a certain way, the structure is parsing the phrase "Obama is the President of the United States." as "<Obama, <is the President, of the United States.>>". This structure has been derived purely from training, and nothing has been done to teach the system English grammar! Very remarkable.

In the paper nothing is said about the decoder unit, so I don't know whether it is just a plain neural network or a recurrent one. In the end the state vector z should represent the phrase meaning, and there should be no need to recur on the previous state.

Written by Giovanni

February 28, 2023 at 11:44 am

Posted in Varie
