
Zero-shot learning


While reading the “Joint-Embedding Predictive Architecture (JEPA)” paper I found that Vision Transformer (ViT) models are used as building blocks. I then wanted to understand what attention is, and I read other papers and summarized them in these two posts: “Paying attention, but to what” and “Transformer: attention is all you need“. This time I searched for more about the model implementation and found this article about OpenAI CLIP: a model that can be instructed in natural language to perform a great variety of image classification benchmarks, without directly optimizing for any benchmark’s performance. This is one example taken from the CLIP web page:

In the above example CLIP has been trained using pairs of images and labels taken from the web: for instance it has been given a dog picture paired with the label “Pepper the aussie pup”, together with 400 million other image–text pairs. The learned text–image model can then be used to predict whether a new unseen image belongs to the “a photo of a dog” class or the “a photo of a plane” class. Given the phrases “a photo of a dog” and “a photo of a plane”, CLIP creates two internal representations of these phrases and matches them against the features extracted from the new input image, choosing the phrase with the maximum probability as the correct one. The purpose is to have a system that can be instructed with natural language to perform a wide range of tasks on images.
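To make this concrete, here is a minimal sketch of that zero-shot classification flow using OpenAI’s clip Python package; the image file name is invented for the example, and the two candidate phrases are the ones from the paragraph above:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load a pre-trained CLIP model; ViT-B/32 uses a Vision Transformer backbone
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical unseen input image, plus the candidate class phrases
image = preprocess(Image.open("unseen_image.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of a dog", "a photo of a plane"]).to(device)

with torch.no_grad():
    # The logits measure the similarity between the image and each phrase
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# The phrase with the maximum probability is chosen as the predicted class
print(probs)  # e.g. something like tensor([[0.99, 0.01]]) for a dog picture
```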

The article is very long and fully understanding it will require reading a lot of its references; for the time being it guided me in learning about the Zero-shot learning (ZSL) concept.

In supervised machine learning you have a set of input examples paired with class labels. Creating such input sets is very expensive and has one limitation: you cannot deal with new classes unseen at training time, and you deal poorly with classes that have very few instances. With Zero-shot learning you want to use a pre-trained model on a different set of classes, not known at training time. For instance you want to use a model that has been trained to recognize “horses” and “stripes” to recognize zebras: it should be possible because the system already has some knowledge of the two concepts and you could define a zebra as “a horse with stripes”. This is called zero-shot learning because the original model has never been given a zebra picture as input during training. Similarly you can define One-shot and Few-shot problems, where the original model has been exposed to just one or a few pictures of that class. Of course this concept is not limited to image recognition, and you can think of zero-shot problems for text or audio processing models.
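A toy sketch of the zebra idea, assuming attribute-based zero-shot classification: the attribute names, scores, and class signatures below are all invented for illustration. The pre-trained model only provides detectors for “horse-shaped” and “striped”, and the unseen classes are described by composing those attributes:

```python
import numpy as np

# Unseen classes described as combinations of known attributes:
# [is_horse_shaped, has_stripes]
class_signatures = {
    "horse": np.array([1.0, 0.0]),
    "zebra": np.array([1.0, 1.0]),  # "a horse with stripes"
}

def predict_class(attribute_scores):
    """Pick the unseen class whose attribute signature is most similar
    (by cosine similarity) to the scores from pre-trained detectors."""
    def cosine(c):
        sig = class_signatures[c]
        return np.dot(attribute_scores, sig) / (
            np.linalg.norm(attribute_scores) * np.linalg.norm(sig))
    return max(class_signatures, key=cosine)

# Suppose the detectors score an image 0.9 for "horse-shaped" and
# 0.8 for "striped": the zebra signature matches best.
print(predict_class(np.array([0.9, 0.8])))  # -> "zebra"
```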

For ZSL to be possible the model must deal in reusable features that can be composed to describe the new target classes you want to recognize. CLIP uses natural language to convey this kind of feature.

As ZSL uses different sets of classes at training and at classification time, it can be seen as a case of Transfer learning: the knowledge built on one task is re-purposed for another, different task, avoiding an expensive training from scratch. This is especially important in domains, such as image recognition, where training a new model is particularly expensive because you need to process millions of images. Here the classification domains differ between training and run time, so ZSL is a case of Heterogeneous transfer learning.

Usually transfer learning is achieved by taking an original deep-learning model and removing some of the top layers: for instance those that are fully connected, followed by the soft-max stage. These layers are then replaced with new ones that are trained on the new “transfer” task, while all the other layers remain untouched (frozen). The model can finally be fine-tuned to the task, changing the weights of all layers in a final training step.
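A minimal sketch of this recipe, assuming PyTorch and a torchvision ResNet-18 as the pre-trained model (any backbone would do; the 10-class transfer task is made up for the example):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all the original layers so their weights stay untouched
for param in model.parameters():
    param.requires_grad = False

# Replace the top fully connected layer with a new one sized for the
# transfer task (here, a hypothetical 10 classes); only this layer trains
model.fc = nn.Linear(model.fc.in_features, 10)
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# For the optional final fine-tuning step, unfreeze everything and keep
# training with a small learning rate:
# for param in model.parameters():
#     param.requires_grad = True
```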

If you want to learn more about transfer learning you can read the “transfer learning guide“. If you are interested in ZSL you can read the “zero shot learning guide“, which provides a classification of different ZSL approaches and describes many real applications of Zero-shot learning.

Written by Giovanni

September 3, 2023 at 12:17 pm

Posted in Varie
