Giovanni Bricconi


Transformer: Attention is all you need


I have spent some time reading about Transformers, a neural network architecture first described in the paper “Attention is all you need”, presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) by researchers at Google and the University of Toronto. The original paper is just 9 pages long, but many of the concepts it describes require further investigation.

During the past weeks I was learning about automatic translation: reading papers about LSTMs (long short-term memory) and GRUs (gated recurrent units) used to compute an internal state h, which is then used by a second neural network layer (the decoder) to write the translated text. These systems are effective but they are serial: they scan the source text from beginning to end, step by step, to compute intermediate states. This limits their scalability; ideally we would like to process the inputs in parallel to achieve higher computing performance.

For instance in “Convolutional Sequence to Sequence Learning” by Jonas Gehring, Michael Auli, David Grangier, Denis Yarats and Yann N. Dauphin of Facebook AI Research, a different approach is followed. Convolutional networks can analyze overlapping sub-sequences of the input text. Convolution layers can also be stacked one on top of the other, so that higher layers can access information about a wider and wider set of original inputs. If the first layer considers 3 inputs at a time, and the second layer 3 outputs of the first layer, the units in the second layer actually have access to 5 input tokens. The memory state computed by a GRU or LSTM is replaced by the capability of looking at many inputs simultaneously. The results reported in the paper were obtained with 15 layers in the encoder and 15 layers in the decoder (see paragraph 5.1), which is certainly needed to process long sentences.
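As a quick back-of-the-envelope check (assuming simple stride-1 convolutions, which is my simplification), the number of input tokens a unit can “see” grows by kernel_size - 1 with every stacked layer:

```python
# receptive field of stacked stride-1 convolutions: each extra layer with
# kernel size k lets a unit see k-1 more input tokens than the layer below it
def receptive_field(num_layers, kernel_size=3):
    return 1 + num_layers * (kernel_size - 1)

print(receptive_field(1))    # 3  -> one layer sees 3 tokens
print(receptive_field(2))    # 5  -> two stacked layers see 5 tokens, as in the example above
print(receptive_field(15))   # 31 -> 15 layers, as used in the paper's encoder
```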

This second paper also contains two other interesting points: the need for position encoding and the importance of attention.

Positional embedding example: the x axis is a dimension of the embedding vector, the y axis is the position in the input text. The values run from -1 to +1 as they are generated by sinusoids. (image from Wikipedia)

The convolution does not appropriately take into account the position of a token in a phrase: words at the beginning play a different role than words at the end. To solve this issue they used the following approach: each input word is mapped to an embedding representation, a vector of some length f. For each position in the input sequence they also compute a position vector of the same length f, which is added to the original embedding. In this way the same input vector carries information on both the token and its position. It seems quite weird not to just keep the position as a new dimension of the input vectors, but according to the authors it is an effective approach (as shown in the following table).

Without position embeddings the network loses 0.5 BLEU points. Position embeddings can be applied both to the input tokens and to the already generated output tokens.
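Here is a minimal NumPy sketch of this idea as I understand it: random numbers stand in for embeddings that would be learned during training, and the sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, f = 1000, 50, 8   # f = embedding length, small for illustration

token_embedding = rng.normal(size=(vocab_size, f))   # one vector per word (learned in training)
position_embedding = rng.normal(size=(max_len, f))   # one vector per position (learned in training)

tokens = np.array([12, 45, 7, 3])                    # a toy input sentence as word ids
positions = np.arange(len(tokens))

# the two vectors are simply summed, so one vector carries both meaning and position
x = token_embedding[tokens] + position_embedding[positions]
print(x.shape)   # (4, 8)
```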

The first paper further describes how position embeddings can be generated: a really surprising formula based on sinusoids.

Position embeddings used in “Attention is all you need”

Using the relative position in the phrase, or the absolute position, does not work well: the variance of these vectors keeps changing for different lengths, and this is not good for learning. The sinusoids have the advantage of carrying information on both the absolute and relative position, while keeping the variance bounded. “A gentle introduction to positional encoding in transformer models” provides many useful pictures and code examples to understand the concept.
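As far as I understand it, this is a minimal NumPy sketch of the sinusoidal encoding from the paper: even dimensions use a sine, odd dimensions a cosine, and the frequency decreases with the dimension index, so the values stay bounded whatever the sequence length.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))"""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angles = positions / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=512)
print(pe.min(), pe.max())   # always within [-1, 1], regardless of sequence length
```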

The second important point is the impact of attention. Attention requires a computational effort quadratic in the input length; in the case of translation this is not a bottleneck, and the Facebook team reported learning performance degraded by just 4% when introducing attention in 4 layers. Table 5 of their paper shows that the more attention layers you introduce, the more the translation score improves. With attention in 5 layers they achieve a BLEU score of 21.63, while using only one attention layer gives scores between 20.24 and 21.31.

The Google team decided to introduce an architecture without recurrent or convolutional neural networks: the results they report show that the attention mechanism alone is enough to obtain very good results.

Transformer architecture. Yuening Jia, CC BY-SA 3.0 https://creativecommons.org/licenses/by-sa/3.0, via Wikimedia Commons

The encoder and decoder are composed of 6 identical layers like the ones shown above. The inputs and the previous outputs are passed in after adding the positional encoding, the sinusoidal function described a few paragraphs above. Each encoder unit is composed of 2 sublayers: one uses attention with residual connections, the other a feed-forward network, again with residual connections. A residual connection means taking the input and adding it to the new component calculated by the neural network; each unit can forward the input as-is or be trained to change it a bit through the neural component. The decoder units are more complicated because the past outputs have to be processed with attention, and the result is then mixed with the encoded inputs. Also, the output must be masked because the decoder must use only the previous outputs to compute the new value.
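A minimal sketch of how I picture one encoder layer with its residual connections, in NumPy: the sublayer output is added back to its own input and then normalised (the post-norm arrangement of the original paper). The self_attention and feed_forward functions are placeholders here; the attention part is described below.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attention, feed_forward):
    # residual connection: the sublayer's output is *added* to its own input,
    # so the layer can pass x through unchanged or learn a small correction
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x))
    return x

# toy usage: identity "sublayers" that just forward their input
x = np.random.default_rng(0).normal(size=(4, 8))   # 4 tokens, embedding size 8
y = encoder_layer(x, lambda t: t, lambda t: t)
print(y.shape)   # (4, 8)
```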

The multi-head attention component is the core concept, but reading the paper was not enough for me to understand it well. You can compute different attentions in parallel, each one dealing with a different aspect of your data. To understand it, it is better to watch the video “Getting meaning from text self attention step by step”; the video is very well done and also describes the BERT model. Different attentions are calculated in parallel: it is as if the model could access many different simpler models, and mixing them together makes it possible to achieve outstanding results. For instance, in the paper 8 parallel attention “heads” are used.

Each input word is associated with an embedding vector describing its meaning. These embedding vectors are compared with all the others present in the input phrase, giving scores that are then normalised by the softmax operation. These values are the attention scores, which are then used to create the new embedding vectors used by the next layers.
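A minimal NumPy sketch of this computation, in the simplest case where queries, keys and values are all the same input embeddings (the paper’s scaled dot-product attention):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # compare every word with every other word
    weights = softmax(scores, axis=-1)     # normalised attention scores, each row sums to 1
    return weights @ V                     # new embeddings: weighted mixes of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 input words, embedding size 8
out = scaled_dot_product_attention(X, X, X)       # simplest case: Q = K = V = the embeddings
print(out.shape)   # (5, 8)
```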

The paper also uses the terms keys, values and queries when speaking about attention. The origin of these terms is not explained, and I had to search a bit to find an explanation. This conversation on Stack Exchange provides some clarity: the terms come from the search engine world, where the query is the input coming from the user and the keys are indexed elements that allow finding some values – the links shown in the search engine results. In the end this is not really important; it is better to watch the “attention step-by-step” video to get the idea of how it works. When attention is calculated, the embeddings for each word are not all the same: there can be query embeddings, key embeddings and value embeddings for each word. This gives the model great flexibility, and since the model has multiple heads, there will be multiple query-key-value vectors for each word in each layer.
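Again a minimal NumPy sketch of my understanding: each head projects the same input embeddings with its own matrices to obtain per-word queries, keys and values (the random matrices here stand in for weights learned during training), and the outputs of the heads are concatenated and mixed by a final projection.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
d_head = d_model // num_heads
X = rng.normal(size=(seq_len, d_model))    # input embeddings (already position-encoded)

heads = []
for _ in range(num_heads):
    # each head has its own projection matrices, so every word gets its own
    # query, key and value vector per head
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the input positions
    heads.append(weights @ V)

W_o = rng.normal(size=(d_model, d_model))
output = np.concatenate(heads, axis=-1) @ W_o        # mix the heads back together
print(output.shape)   # (5, 16)
```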

Concluding with the authors’ words: “For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.”

Written by Giovanni

August 19, 2023 at 10:10 am

Posted in Varie
