Giovanni Bricconi


Transformer: Attention is all you need


I have spent some time reading about Transformers, a neural network architecture first described in the paper “Attention is all you need”, presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) by researchers at Google and the University of Toronto. The original paper is just 9 pages long, but many of the concepts it describes require further investigation.

During the past weeks I have been learning about automatic translation: reading papers about LSTMs (long short-term memory) and GRUs (gated recurrent units), which are used to compute an internal state h that a second neural network layer (the decoder) then uses to write the translated text. These systems are effective but serial: they scan the source text from beginning to end, step by step, to compute the intermediate states. This limits their scalability; ideally we would like to process the inputs in parallel to achieve higher computing performance.

For instance, “Convolutional Sequence to Sequence Learning” by Jonas Gehring, Michael Auli, David Grangier, Denis Yarats and Yann N. Dauphin of Facebook AI Research follows a different approach. Convolutional networks can analyze overlapping sub-sequences of the input text. Convolution layers can also be stacked one on top of the other, so that higher layers have access to a wider and wider span of the original inputs. If the first layer considers 3 inputs at a time, and the second layer 3 outputs of the first layer, each unit in the second layer actually has access to 5 input tokens. The memory state computed by a GRU or LSTM is replaced by the ability to look at many inputs simultaneously. The results reported in the paper were obtained with 15 layers in the encoder and 15 in the decoder (see paragraph 5.1), which is certainly needed to process long phrases.
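To convince myself of how quickly the visible span grows, I wrote this tiny sketch (mine, not from the paper) that counts how many input tokens one unit can see after stacking convolution layers of kernel size 3 with stride 1.

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Tokens visible to one unit after stacking `num_layers`
    convolution layers of width `kernel_size` (stride 1)."""
    field = 1
    for _ in range(num_layers):
        field += kernel_size - 1  # each extra layer widens the span by k - 1
    return field

print(receptive_field(1))   # 3  -> one layer sees 3 tokens
print(receptive_field(2))   # 5  -> two layers see 5 tokens, as noted above
print(receptive_field(15))  # 31 -> a 15-layer stack as used in the paper
```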

This second paper also contains two other interesting points: the need for positional encoding and the importance of attention.

Positional embedding example: the x axis is the dimension in the embedding vector, the y axis is the position in the input text. The values run from -1 to +1 as they are generated by sinusoids. (image from Wikipedia)

Convolution does not appropriately take into account the position of a token in a phrase: words at the beginning play a different role than words at the end. To solve this issue they used the following approach: each input word is mapped to an embedding representation, a vector of some length f. For each position in the input sequence they also computed a position vector of the same length f, which is added to the original embedding. In this way the same input vector carries information on both the token and its position. It seems quite strange not to simply keep the position as a new dimension of the input vectors, but according to the authors it is an effective approach (as shown in the following table).
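A minimal sketch of this idea, just my illustration with made-up sizes (not the authors’ code): both the token and its position are looked up in a table of width f, and the two vectors are simply summed.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, f = 1000, 50, 8            # toy sizes, chosen arbitrarily

token_table = rng.normal(size=(vocab_size, f))  # stands in for learned token embeddings
pos_table = rng.normal(size=(max_len, f))       # stands in for learned position embeddings

tokens = np.array([12, 7, 256, 3])              # a toy input phrase as token ids
positions = np.arange(len(tokens))

# Each input vector carries both the token identity and its position.
x = token_table[tokens] + pos_table[positions]
print(x.shape)  # (4, 8): one f-dimensional vector per input token
```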

Without position embeddings the network loses 0.5 BLEU points. Position embeddings can be applied both to the input tokens and to the already generated output tokens.

The first paper goes further and describes how position embeddings can be generated: a really surprising formula based on sinusoids.

Position embeddings used in “Attention is all you need”

Using the relative position in the phrase, or the absolute position, does not work well: the variance of these vectors keeps changing for different lengths, and this is not good for learning. The sinusoids have the advantage of carrying information on both the absolute and relative position while keeping the variance bounded. “A gentle introduction to positional encoding in transformer models” provides many useful pictures and code examples to understand the concept.
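This is my own small sketch of the paper’s formula, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), just to see the bounded values appear:

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Positional encodings as defined in "Attention is all you need"."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # the even dimensions
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=8)
print(pe.min(), pe.max())  # everything stays between -1 and +1
```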

The second important point is the impact of attention. Attention requires a computational effort that is quadratic in the input length; in the case of translation this is not a bottleneck, and the Facebook team reported that learning performance degraded by just 4% when introducing attention in 4 layers. Their table 5 shows that the more attention layers you introduce, the more the translation score improves. With attention in 5 layers they achieve a BLEU score of 21.63, while using only one attention layer gives scores between 20.24 and 21.31.

The Google team decided to introduce an architecture with neither recurrent nor convolutional neural networks: the results they report show that the attention mechanism alone is enough to obtain very good results.

Transformer architecture. Yuening Jia, CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons

The encoder and the decoder are composed of 6 identical layers like those shown above. The inputs and the previous outputs are passed in after adding the positional encoding, the sinusoidal function described a few paragraphs above. Each encoder unit is composed of 2 sublayers: one uses attention with residual connections, the other a feed-forward network, again with residual connections. A residual connection means taking the input and adding it to the new component calculated by the neural network before passing it on; each unit can forward its input as it is, or be trained to change it a little through the neural component. The decoder units are more complicated, because the past outputs have to be processed with attention, and the result is then mixed with the encoder output by a second attention sublayer. This attention over the outputs must also be masked, because the decoder must use only the previous outputs to compute the new value.
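To fix the idea of the residual connections I sketched one encoder unit in Python; it is a simplification of mine (the attention and feed-forward parts are left as placeholder functions, and the real layers also use dropout), not the reference implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector (the "Norm" in the "Add & Norm" boxes)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_unit(x, self_attention, feed_forward):
    """One encoder unit: two sublayers, each wrapped in a residual connection."""
    x = layer_norm(x + self_attention(x))  # sublayer 1: attention, input added back
    x = layer_norm(x + feed_forward(x))    # sublayer 2: feed-forward, input added back
    return x

# The encoder is 6 of these units stacked one after the other.
```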

The multi-head attention component is the core concept, but reading the paper was not enough for me to understand it well. You can compute different attentions in parallel, each one dealing with a different aspect of your data. To understand it, it is better to watch the video “Getting meaning from text: self attention step by step“; the video is very well done and also describes the BERT model. Different attentions are calculated in parallel: it is as if the model could access many different simpler models, and mixing them together makes it possible to achieve outstanding results. For instance, in the paper 8 parallel attention “heads” are used.

Each input word is associated with an embedding vector describing its meaning. Each embedding vector is compared with all the others present in the input phrase, giving a score that is then normalised by the softmax operation. These values are the attention scores, which are then used to create the new embedding vectors used by the next layers.
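This is how I picture that step, as a toy sketch of mine (using the paper’s scaled dot products, with made-up dimensions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 4, 8                  # toy phrase length and embedding size
X = rng.normal(size=(n_tokens, d))  # one embedding vector per input word

scores = X @ X.T / np.sqrt(d)       # compare every word with every other word
weights = softmax(scores, axis=-1)  # attention scores: each row sums to 1
new_X = weights @ X                 # new embeddings passed to the next layer
print(weights.shape, new_X.shape)   # (4, 4) (4, 8)
```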

The paper also uses the terms keys, values and queries when speaking about attention. The origin of these terms is not explained, and it took me some searching to find one. A conversation on Stack Exchange provides some clarity: the terms come from the search engine world, where the query is the input coming from the user and the keys are the indexed elements that allow finding the values – the links shown in the search engine results. In the end this is not really important; it is better to watch the “attention step-by-step” video to get an idea of how it works. When attention is calculated the embeddings for each word are not all the same: there can be a query embedding, a key embedding and a value embedding for each word. This gives the model great flexibility, and since the model has multiple heads, there will be multiple query-key-value vectors for each word in each layer.
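Continuing the toy sketch above (again just my illustration, with arbitrary sizes): each head gets its own projection matrices, so every word ends up with a separate query, key and value vector per head, and the heads are concatenated at the end.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))

heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # query, key, value vectors per word
    A = softmax(Q @ K.T / np.sqrt(d_head))  # this head's attention scores
    heads.append(A @ V)                     # this head's output

# The heads are concatenated and mixed by one more learned matrix.
Wo = rng.normal(size=(d_model, d_model))
output = np.concatenate(heads, axis=-1) @ Wo
print(output.shape)  # (4, 8)
```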

Concluding with the authors’ words: “For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.”

Written by Giovanni

August 19, 2023 at 10:10 am

Posted in Varie

Paying attention: but to what?


One technique for generating language translations with neural networks consists in using two stages, one called the encoder and one the decoder. The encoder synthesizes the input text into a fixed-size vector, the context, which is then used by the decoder to decide the next word to be produced in the translation. It is intuitive that the size of this vector will limit the translation quality, but in fact we do not need to pay attention to the same parts of the input text for each word we are going to produce. The translator will have to pay attention to singular-plural agreement, pronouns, etc. In “Neural Machine Translation by Jointly Learning to Align and Translate”, Dzmitry Bahdanau, Kyung Hyun Cho and Yoshua Bengio show how the decoder can focus on different parts of the source text to improve the translation quality.

The model was trained on a set of English and French phrases of different sizes, 50 words at most in the paper. A comparison plot in the paper shows that the new model translates well even phrases of 60 words, with no degradation as the phrases become longer.

Figure 2 in the paper. The proposed RNNsearch-50 model produces translations of the same quality (BLEU score) regardless of the increasing input phrase length. The RNNenc model does not use the attention mechanism. 50 and 30 represent the maximum phrase length used during training.

How is it possible that the model focuses on different things while writing the translation? Let’s look a bit at the formulas. The input text is composed of tokens x1, x2, x3, … (words or punctuation). The encoder produces a hidden state h1, h2, … associated with each single input token. These h elements are of fixed size. Usually just the last h element is used by the decoder to decide the next output word y, but we can think about using a more elaborate state.

Instead of using the last known h, we can use a function of all the h states: ci = Σj αi,j hj, where c stands for the context and i is the index of the current output word y. The context used for the output word yi will depend on all the h encoder states, weighted with the alpha weights. Here j stands for the index in the input text x. The important thing is that the context depends on the position i. The decoder, which produces the translation, will use this context ci, the last produced word y and its own internal state s to decide the next output.

si is the internal decoder state. Looking at the end of the paper we see how the alpha weights can be computed: αi,j = exp(ei,j) / Σk exp(ei,k), with ei,j = a(si-1, hj) = vᵀ tanh(W si-1 + U hj).

Here a is the alignment model; v, W and U are learned by the neural network. For the output word yi the context ci depends on the hidden states of all the inputs, but some of them are selected as the most prominent through the αi,j. The alpha coefficients depend on the decoder state s and on the encoder states h. Notice that they depend only on the last s state, but on all the h states – some of which will be more important than others.
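Here is a small numerical sketch of mine of these formulas, with toy dimensions and random numbers standing in for the learned parameters: the alignment model scores each encoder state hj against the previous decoder state, the scores are normalised with a softmax into the αi,j, and the context ci is the weighted sum of the hj.

```python
import numpy as np

rng = np.random.default_rng(0)
T_x, n, m = 5, 6, 4            # input length, encoder and decoder state sizes (toy)
H = rng.normal(size=(T_x, n))  # encoder hidden states h_1 ... h_Tx
s_prev = rng.normal(size=(m,)) # previous decoder state s_{i-1}

# Parameters of the alignment model a(s, h) = v^T tanh(W s + U h);
# random here, learned in the real model.
W = rng.normal(size=(n, m))
U = rng.normal(size=(n, n))
v = rng.normal(size=(n,))

e = np.tanh(W @ s_prev + H @ U.T) @ v  # one score e_{i,j} per input position j
alpha = np.exp(e) / np.exp(e).sum()    # softmax: the alpha_{i,j} weights
c_i = alpha @ H                        # context vector for the output word y_i
print(alpha.round(2), c_i.shape)       # the weights sum to 1, c_i has size n
```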

Another interesting thing in the paper is that they process the input text x bidirectionally. They do not just compute h left to right, from the first to the last token, but also consider the opposite direction, from the end of the text to the beginning. I suppose this is because not only the words before an input token are important, but also those that follow it.
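A quick sketch of the idea, using PyTorch’s GRU module as a stand-in for the paper’s recurrent encoder (my choice of tool, not theirs): with a bidirectional RNN each token gets the concatenation of a forward state and a backward state, so hj summarizes both what precedes and what follows it.

```python
import torch
import torch.nn as nn

emb_size, hidden = 16, 8
encoder = nn.GRU(input_size=emb_size, hidden_size=hidden,
                 bidirectional=True, batch_first=True)

x = torch.randn(1, 5, emb_size)  # a toy phrase of 5 embedded tokens
h, _ = encoder(x)                # per-token forward and backward states, concatenated
print(h.shape)                   # torch.Size([1, 5, 16]) -> 2 * hidden per token
```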

The paper’s appendix reports many details about the formulas, showing the activation functions used, the fact that Gated Recurrent Units were used, details on the training procedure, and so on. It is useful for understanding what the abstract functions f, g and a really are.

Written by Giovanni

August 6, 2023 at 8:20 am

Posted in Varie