Vision Transformers (ViT)
The Transformer architecture has been very successful in natural language processing, but it is not limited to that domain. With “AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE”, Alexey Dosovitskiy, Lucas Beyer et al. show that transformers can also be applied to image recognition. The paper was presented in 2021 at ICLR (the International Conference on Learning Representations) and can be found on arXiv.
Usually CNNs (convolutional neural networks) are used for image processing, because they can learn local image features that simplify higher-level tasks. The paper reports that, when trained with a large amount of data (more than 14 million images), transformers can do as well as or better than CNNs: learning local image features, a few pixels apart, turns out to be less important than expected.
In NLP a transformer operates on a stream of tokens; with images the authors do something similar: each image is divided into patches, and the patches are fed to the transformer. The figure from the paper explains clearly what happens.

The image on the bottom left is divided into 9 patches that are given to the transformer as input, just like a phrase in NLP. The first token is special: it is not an image patch, but a learnable token that represents the image class to be predicted. The patches pass through a linear projection (learned during training) that transforms them into D-sized latent vectors.
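To make the patch-and-project step concrete, here is a minimal PyTorch sketch (my own, not the paper's code). It assumes a 48×48 RGB image cut into 16×16 patches, i.e. the 3×3 grid of 9 patches shown in the figure, and the hidden size D = 768 used by ViT-Base.

import torch
import torch.nn as nn

# Minimal sketch of the ViT input pipeline; sizes are chosen to match the 9-patch example.
patch_size, D = 16, 768
img = torch.randn(1, 3, 48, 48)   # (batch, channels, height, width): a 3x3 grid of 16x16 patches

# Cut the image into non-overlapping patches and flatten each one into a vector of 16*16*3 values.
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 9, 3 * patch_size * patch_size)

# Learned linear projection that turns every flattened patch into a D-sized latent vector.
projection = nn.Linear(3 * patch_size * patch_size, D)
tokens = projection(patches)                      # (1, 9, D)

# The special learnable [class] token is prepended; its output state is used for classification.
cls_token = nn.Parameter(torch.zeros(1, 1, D))
sequence = torch.cat([cls_token, tokens], dim=1)  # (1, 10, D)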
Positional encoding has to be added to the patches, so that the transformer can learn about the relations between patches. According to the authors a complex two-dimensional encoding is not needed: a simple 1D scheme is enough. Images of different resolutions can also be handled; the transformer simply has to deal with a longer stream of patches.
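Continuing the sketch above, the learned 1D positional embedding is just one extra D-sized vector per position, added element-wise to the sequence (again a simplification of mine, not the reference implementation):

import torch
import torch.nn as nn

# One learned D-sized vector per position: 9 patches plus the [class] token.
D, seq_len = 768, 10
pos_embedding = nn.Parameter(torch.zeros(1, seq_len, D))
sequence = torch.randn(1, seq_len, D)   # stands in for the embedded patches + class token from before
sequence = sequence + pos_embedding     # this is what the transformer encoder actually receives

# A higher-resolution image simply produces more patches, hence a longer sequence and a longer
# positional embedding table; no explicit two-dimensional grid structure is encoded.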
Vision Transformers are usually referred to with a notation like ViT-Base, ViT-Large or ViT-Huge, which refers to the model size (in particular the number of layers). In the paper a ViT-Base has 12 layers, a hidden size D of 768 and 12 attention heads, giving 86 million parameters; ViT-Huge instead has 632 million parameters. Sometimes the size of the input patch is also specified: ViT-L/16 means a large vision transformer fed with 16×16 image patches. To train such big models it is important to use a huge quantity of images, 300 million in the case of ViT-Huge, otherwise their performance will be inferior to that of simpler models. Figure 3 in the paper shows that the corpus size is important to achieve good results.
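As a quick reference, the variants can be summarized like this (the Base and Huge numbers are the ones quoted above; the Large row is taken from the same table in the paper):

# (layers, hidden size D, attention heads, approximate parameter count)
vit_variants = {
    "ViT-Base":  (12, 768,  12, "86M"),
    "ViT-Large": (24, 1024, 16, "307M"),
    "ViT-Huge":  (32, 1280, 16, "632M"),
}
# "ViT-L/16" means the ViT-Large configuration fed with 16x16 pixel patches.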
Zero-shot learning
While reading the “Joint-Embedding Predictive Architecture (JEPA)” paper I found that Vision Transformer (ViT) models are used as building blocks. I then wanted to understand what attention is, so I read other papers and summarized them in these two posts: “Paying attention, but to what” and “Transformer: attention is all you need“. This time I searched more about the model implementation and found this article about OpenAI CLIP: a model that can be instructed in natural language to perform a great variety of image classification benchmarks, without directly optimizing for the benchmark’s performance. This is one example taken from the CLIP web page:

In the above example CLIP has been trained on pairs of images and labels taken from the web: for instance it has been given a dog picture paired with the label “Pepper the aussie pup”, together with 400 million other image-text pairs. The learned text-to-picture model can then be used to predict that a new, unseen image belongs to the “a photo of a dog” class rather than the “a photo of a plane” class. Given the phrases “a photo of a dog” and “a photo of a plane”, CLIP creates two internal representations and matches them against the features extracted from the new input image, choosing the phrase with the maximum probability as the correct one. The purpose is to have a system that can be instructed in natural language to perform a wide range of tasks on images.
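The dog-versus-plane example above can be reproduced in a few lines; this sketch is adapted from the example in the openai/CLIP repository and assumes the clip package is installed and that dog.jpg is a local test image:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # note: the image encoder is a ViT

image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a plane"]).to(device)

with torch.no_grad():
    # CLIP embeds the image and the two phrases, then scores every image/phrase pair.
    logits_per_image, logits_per_text = model(image, texts)
    probs = logits_per_image.softmax(dim=-1)

# The phrase with the highest probability is taken as the predicted class.
print(probs)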
The article is very long and it will require reading a lot of references to understand it fully; for the time being it guided me in learning about the Zero-shot learning (ZSL) concept.
In supervised machine learning you have a set of input examples paired with class labels. Creating such input sets is very expensive and has one limitation: you cannot deal with classes unseen at training time, and you deal poorly with classes that have very few instances. With Zero-shot learning you want to use a pre-trained model on a new, different set of classes not known at training time. For instance you want to use a model that has been trained to recognize “horses” and “stripes” to recognize zebras: it should be possible because the system already has some knowledge about the two concepts, and you could define a zebra as “a horse with stripes”. This is called zero-shot learning because the original model has never been given a zebra picture during training. Similarly you can define One-shot and Few-shot problems, where the original model has been exposed to just one or a few pictures of that class. Of course this concept is not limited to image recognition, and you can think of zero-shot problems for text or audio processing models.
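Here is a toy sketch of the zebra idea (purely illustrative, with made-up attribute scores): describe the unseen class using concepts the model already knows, then match the model's output for a new image against those descriptions.

import numpy as np

# Attribute descriptions of classes, over concepts the model was trained on: [horse-like, striped].
class_descriptions = {
    "horse": np.array([1.0, 0.0]),
    "zebra": np.array([1.0, 1.0]),   # "a horse with stripes", never seen during training
}

# Pretend the pretrained model scored a new image for the two known concepts.
predicted_attributes = np.array([0.9, 0.8])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

best = max(class_descriptions, key=lambda name: cosine(predicted_attributes, class_descriptions[name]))
print(best)   # "zebra": the closest description, even without any zebra training image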
For ZSL to be possible the model must work with reusable features that can be composed to describe the new target classes you want to recognize. CLIP uses natural language to convey this kind of feature.
As ZSL uses a different set of classes at training and at classification time, it can be seen as a case of Transfer learning: the knowledge built on one task is re-purposed on another, different task, avoiding an expensive training from scratch. This is especially important in domains, such as image recognition, where training a new model is particularly expensive because you need to process millions of images. Here the classification domains differ between training and run time, so ZSL is a case of Heterogeneous transfer learning.
Usually transfer learning is achieved by taking an original deep-learning model and removing some of the top layers: for instance the fully connected ones, followed by the soft-max stage. These layers are then replaced with new ones that are trained on the new “transfer” task, while all the other layers remain untouched (frozen). The model can finally be fine-tuned to the task, updating all the layer weights as a final training step.
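As a concrete, hypothetical example (assuming a recent torchvision), a pretrained ResNet can be frozen and its fully connected top layer replaced with a new one for a 10-class transfer task:

import torch
import torch.nn as nn
from torchvision import models

# Load a model pretrained on ImageNet and freeze all of its layers.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the fully connected top layer with a new one for the transfer task (here 10 classes).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new head is trained; for fine-tuning, unfreeze everything and use a small learning rate.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)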
If you want to learn more about transfer learning you can read the “transfer learning guide“. If you are interested in ZSL you can read the “zero shot learning guide“, which provides a classification of different ZSL approaches and describes many real applications of Zero-Shot learning.