Joint-Embedding Predictive Architecture (JEPA): efficient learning of highly semantic image representations?
Some weeks ago I saw, in a Yann LeCun thread on LinkedIn, a note on this new paper from Meta: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture, by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun and Nicolas Ballas, available on the Meta AI site. The paper is about learning highly semantic representations, but what does that actually mean?

First of all, JEPA uses the self-supervised learning approach: it learns to capture relationships between its inputs, or in other terms, to predict one part of the input from another part. More concretely, JEPA is trained on a set of images; each image is split into one context and several targets. Targets are rectangular regions of various sizes that do not overlap with the context (otherwise predicting them would be too easy). Using the context, JEPA learns a context embedding from which it can predict well the embedded representations of its targets. The key point here is that JEPA does not try to predict the target's content pixel by pixel, but rather a new, low-dimensional representation. Working on this embedding representation makes JEPA much faster than other techniques, and more semantic too.
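To make the context/target split more tangible, here is a minimal sketch in plain Python of how a context block and some non-overlapping target blocks could be sampled on a ViT-style patch grid. The grid size, block sizes and number of targets are my own illustrative assumptions, not the paper's exact recipe:
```python
import random

GRID = 14  # e.g. a 224x224 image cut into 16x16 patches gives a 14x14 grid

def sample_block(min_side=3, max_side=6):
    """Return the set of patch indices covered by a random rectangle."""
    h = random.randint(min_side, max_side)
    w = random.randint(min_side, max_side)
    top = random.randint(0, GRID - h)
    left = random.randint(0, GRID - w)
    return {(top + i) * GRID + (left + j) for i in range(h) for j in range(w)}

def sample_masks(num_targets=4):
    # a few rectangular target blocks of various sizes...
    targets = [sample_block() for _ in range(num_targets)]
    # ...and a larger context block, from which every target patch is removed
    # so the context never overlaps the regions it has to predict
    context = sample_block(min_side=9, max_side=12)
    for t in targets:
        context -= t
    return context, targets

context, targets = sample_masks()
print(len(context), [len(t) for t in targets])
```
The important detail is the last step: every patch that belongs to a target is dropped from the context, so the prediction cannot degenerate into simply copying pixels that are already visible.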

The authors included a picture that helps compare JEPA with other existing techniques:

In a joint-embedding architecture, two inputs are encoded into embedding representations s, and the system learns to produce similar representations for similar inputs. Conversely, in a generative architecture, an image x is encoded and then a decoder, conditioned on some control input z, predicts the other image y pixel by pixel. JEPA is the third picture: both the context and the target images are encoded into embeddings s, and a predictor, conditioned on z, learns to predict the target embedding from the context embedding.
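Reading the figure as code may help. The following is a minimal PyTorch sketch, under my own simplifying assumptions (toy MLP encoders, a plain MSE loss, random tensors instead of real patch features), of the JEPA branch: encode the context, encode the target, and train a predictor to map the context embedding onto the target embedding.
```python
import torch
import torch.nn as nn

dim = 128
context_encoder = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
target_encoder  = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
predictor       = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

x_context = torch.randn(8, 768)  # stand-in for the context patch features
y_target  = torch.randn(8, 768)  # stand-in for the target patch features

s_x = context_encoder(x_context)
with torch.no_grad():            # the target branch only provides the regression label
    s_y = target_encoder(y_target)

# predict the target embedding from the context embedding:
# the loss lives in embedding space, never in pixel space
loss = nn.functional.mse_loss(predictor(s_x), s_y)
loss.backward()
```
In the actual paper the encoders are Vision Transformers and the target encoder is updated as an exponential moving average of the context encoder; the toy modules above only show where the loss is computed.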
Other invariance-based methods exist for this kind of task, but they require hand-crafted augmentations that produce multiple similar views of each image for training. As you can see, JEPA just asks for a single image and randomly generates the context and the targets, which is much more manageable.
Working on semantic representations also makes JEPA much faster: in the paper's Figure 1 you can see its performance compared to other approaches such as iBOT, CAE, data2vec, etc. We are in any case still talking about thousands of GPU training hours.
Once JEPA is pretrained, it can be reused as a building block for specific tasks such as object counting in a scene, depth prediction, and so on. It can also be used as the input of a generative model, which makes it possible to visually compare the targets with reconstructions generated from the embedding representation, as in the picture below:

The paper provides many details on how JEPA was implemented and trained, and many references to other approaches, in just 17 pages.
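Coming back to the "building block" idea above: in practice this usually means freezing the pretrained encoder and training only a small task-specific head on top of its embeddings. Here is a minimal sketch of that pattern, where pretrained_encoder is a hypothetical placeholder for a real frozen JEPA backbone and all shapes are made up for illustration:
```python
import torch
import torch.nn as nn

pretrained_encoder = nn.Linear(768, 128)   # placeholder for the frozen pretrained backbone
for p in pretrained_encoder.parameters():
    p.requires_grad = False                # freeze it: only the head is trained

head = nn.Linear(128, 1)                   # small task-specific head (e.g. counting, depth)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

features = torch.randn(16, 768)            # stand-in for image patch features
labels   = torch.randn(16, 1)              # stand-in for task labels

with torch.no_grad():
    embeddings = pretrained_encoder(features)
loss = nn.functional.mse_loss(head(embeddings), labels)
loss.backward()
optimizer.step()
```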