Vision Transformers (ViT)
The Transformer architecture has been very successful in natural language processing, but it is not limited to that domain. With “AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE”, Alexey Dosovitskiy, Lucas Beyer et al. show that transformers can also be applied to image recognition. The paper was presented at ICLR 2021 (International Conference on Learning Representations) and can be found on arXiv.
CNNs (convolutional neural networks) are usually used for image processing, because they can learn local image features that simplify higher-level tasks. The paper reports that, when pre-trained on a large amount of data (14 to 300 million images), transformers can match or outperform CNNs: learning local image features from pixels that are only a few pixels apart turns out to be less important than expected.
In NLP a transformer operates on a sequence of tokens; with images the authors do something similar. Each image is divided into patches, and the patches are fed to the transformer as a sequence. The figure from the paper illustrates clearly what happens.
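To make the idea concrete, here is a minimal sketch (not the authors' code; shapes and names are assumptions) of how an image can be turned into a sequence of flattened patches, for instance with PyTorch:

```python
import torch

def image_to_patches(img: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split an image into non-overlapping patches and flatten each one."""
    # img: (channels, height, width); height and width are assumed to be
    # multiples of patch_size.
    c, h, w = img.shape
    patches = img.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (c, h/ps, w/ps, ps, ps) -> (num_patches, ps * ps * c)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

img = torch.randn(3, 224, 224)   # a 224x224 RGB image
tokens = image_to_patches(img)   # (196, 768): 14x14 patches, each 16x16x3 values
```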

The image on the bottom left is divided into 9 patches that are given to the transformer as input, just like the words of a sentence in NLP. The first token is special: it is not an image patch, but a learnable [class] token that represents the image class to be predicted. The patches pass through a linear projection (learned during training) that transforms them into D-dimensional latent vectors.
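A possible sketch of this step, with assumed class and parameter names, could look like this:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Project flattened patches to D dimensions and prepend a [class] token."""
    def __init__(self, patch_dim: int = 16 * 16 * 3, d_model: int = 768):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)          # learned during training
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim)
        x = self.proj(patches)                              # (batch, num_patches, d_model)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # one [class] token per image
        return torch.cat([cls, x], dim=1)                   # (batch, num_patches + 1, d_model)
```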
Positional encoding is added to the patch embeddings so that the transformer can learn relations between patches. According to the authors, a complex two-dimensional encoding is not needed: a simple 1D scheme is enough. Images of different resolutions can also be handled; the transformer simply has to process a longer sequence of patches (in the paper the pre-trained positional embeddings are interpolated for the new resolution).
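A sketch of such a 1D positional embedding (again with assumed names): one learned vector per sequence position, simply added to the patch embeddings.

```python
import torch
import torch.nn as nn

class PositionalEmbedding1D(nn.Module):
    """One learned embedding per sequence position, added to the input."""
    def __init__(self, seq_len: int = 197, d_model: int = 768):  # 196 patches + [class] token
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); a higher input resolution gives a longer
        # sequence, so the positional embeddings would need to be interpolated.
        return x + self.pos[:, : x.shape[1], :]
```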
Vision Transformers are usually referred to with names like ViT-Base, ViT-Large or ViT-Huge, which indicate the model size. In the paper, ViT-Base has 12 layers, a hidden size D of 768 and 12 attention heads, for a total of 86 million parameters; ViT-Huge has 632 million parameters. Sometimes the size of the input patch is also specified: ViT-L/16 means a ViT-Large model working on 16×16 image patches. Training such big models requires a huge quantity of images, 300 million in the case of ViT-Huge, otherwise their performance is inferior to that of simpler models. Figure 3 in the paper shows how important the dataset size is for achieving good results.
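Summarizing the variants mentioned above, for example as a plain configuration dictionary (values from Table 1 of the paper; ViT-Large is included for completeness):

```python
# Model sizes as reported in Table 1 of the paper.
VIT_CONFIGS = {
    "ViT-Base":  dict(layers=12, d_model=768,  mlp=3072, heads=12),  # ~86M parameters
    "ViT-Large": dict(layers=24, d_model=1024, mlp=4096, heads=16),  # ~307M parameters
    "ViT-Huge":  dict(layers=32, d_model=1280, mlp=5120, heads=16),  # ~632M parameters
}
# "ViT-L/16" would then be the ViT-Large configuration with 16x16 input patches.
```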