Paying attention: but to what?
One technique for generating translations with neural networks uses two stages, an encoder and a decoder. The encoder compresses the input text into a fixed-size vector, the context, which the decoder then uses to decide the next word of the translation. It is intuitive that the size of this vector limits the translation quality; on the other hand, we do not need to pay attention to the same parts of the input text for every word we produce. The translator has to pay attention to singular-plural agreement, pronouns, and so on. In “Neural Machine Translation by Jointly Learning to Align and Translate”, Dzmitry Bahdanau, Kyung Hyun Cho and Yoshua Bengio show how the decoder can focus on different parts of the source text to improve the translation quality.
The model was trained on a set of English and French sentences of different lengths, 50 words at most in the paper. A comparison plot in the paper shows that the new model translates even 60-word sentences well, with no degradation as the sentences become longer.

How is it possible that the model focuses on different things while writing the translation? Let’s look a bit into the formulas. The input text is composed of tokens x1, x2, x3, … (words or punctuation). The encoder produces a hidden state h1, h2, … associated with each single input token. These h elements have a fixed size. Usually just the last h element is used by the decoder to decide the next output word y, but we can think about using a more elaborate state.
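As a very rough sketch of what the encoder does, here is a toy NumPy example (not the paper’s actual architecture, which uses learned embeddings and gated units): one fixed-size hidden state comes out per input token.

import numpy as np

def encode(x_embeddings, W, U, b):
    # Toy RNN encoder: produces one hidden state h_j per input token x_j.
    h = np.zeros(U.shape[0])
    states = []
    for x in x_embeddings:               # x1, x2, x3, ...
        h = np.tanh(W @ x + U @ h + b)   # simple recurrent update (the paper uses GRUs)
        states.append(h)
    return states                        # h1, h2, ... one fixed-size vector per token

rng = np.random.default_rng(0)
emb_dim, hid = 4, 8
W = rng.normal(size=(hid, emb_dim))
U = rng.normal(size=(hid, hid))
b = np.zeros(hid)
tokens = rng.normal(size=(5, emb_dim))   # stand-in embeddings for a 5-token sentence
hs = encode(tokens, W, U, b)
context = hs[-1]                         # the "usual" approach: keep only the last state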


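In the paper’s notation (reproduced here in LaTeX, with T_x the number of input tokens), the position-dependent context is a weighted sum of all the encoder states:

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j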
Instead of using only the last known h, we can use a function of all the h states; here c stands for the context and i is the index of the current output word y. The context used for the output word yi depends on all the h encoder states, weighted by the alpha weights, where j is the index into the input text x. The important thing is that the context depends on the position i. The decoder, which produces the translation, will use this context ci, the last produced word y and its own internal state s to decide the next output.


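In LaTeX, the decoder update and the output distribution from the paper read as follows (f and g are nonlinear functions, spelled out in the appendix):

s_i = f(s_{i-1}, y_{i-1}, c_i)

p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)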
Here si is the internal decoder state. Looking at the end of the paper we see how the alpha weights can be computed:



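Reproduced in LaTeX, the weights are a softmax over scores e_{ij} produced by an alignment model a:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j) = v^{\top} \tanh(W s_{i-1} + U h_j)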
Where a is the alignment model; v, W and U will be learned by the neural network. For the output word yi the context ci depends on the hidden states of all the inputs, but it singles out some of them as the most prominent through the alpha i,j coefficients. The alpha coefficients depend on the decoder state s and the encoder states h. Notice that they depend only on the last s state, but on all the h states, some of which will be more important than others.
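To make the mechanics concrete, here is a minimal NumPy sketch of one attention step, assuming the encoder states hs and the previous decoder state s_prev are already available; the matrices W and U and the vector v are random stand-ins for the learned parameters.

import numpy as np

def attention_context(s_prev, hs, W, U, v):
    # Additive attention: score each encoder state h_j against the previous
    # decoder state s_{i-1}, softmax the scores, return the weighted sum.
    scores = np.array([v @ np.tanh(W @ s_prev + U @ h) for h in hs])  # e_ij
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                          # alpha_ij
    c_i = sum(a * h for a, h in zip(weights, hs))                     # context c_i
    return c_i, weights

rng = np.random.default_rng(0)
enc_dim, dec_dim, attn_dim = 8, 6, 5
hs = [rng.normal(size=enc_dim) for _ in range(4)]  # encoder states h_1..h_4
s_prev = rng.normal(size=dec_dim)                  # previous decoder state
W = rng.normal(size=(attn_dim, dec_dim))
U = rng.normal(size=(attn_dim, enc_dim))
v = rng.normal(size=attn_dim)
c_i, alphas = attention_context(s_prev, hs, W, U, v)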
Another interesting thing in the paper is that the input text x is read bidirectionally. They do not just compute h left to right, from the first to the last token, but also in the opposite direction, from the end of the text back to the beginning. I suppose this is because not just the words before an input token are important, but also those that follow it.
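In the paper the two directions are simply concatenated, so the annotation for token j combines its forward and backward hidden states:

h_j = \big[ \overrightarrow{h}_j^{\top} ; \overleftarrow{h}_j^{\top} \big]^{\top}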

The paper appendix reports many details about the formulas, showing the activation functions used, the fact that Gated Recurrent Units were used, details on the training procedure and so on. It is useful to understand what the abstract functions f, g and a really are.
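For reference, a standard Gated Recurrent Unit update looks roughly like this (the paper’s appendix uses its own variant and notation, so take this only as a sketch; \sigma is the logistic sigmoid and \odot the element-wise product):

z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1})

\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t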