Joint-Embedding Predictive Architecture (JEPA): efficient learning of highly semantic image representations?
Some weeks ago I saw, in a Yann LeCun thread on LinkedIn, a note on this new paper from Meta: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture, by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun and Nicolas Ballas, available on the Meta AI site. The paper is about learning highly semantic representations, but what does that actually mean?

First of all, JEPA uses the self-supervised learning approach: it learns to capture relationships between its inputs, or in other terms, to predict one part of the input from another part. More concretely, JEPA is trained on a set of images; each image is split into one context and several targets. Targets are rectangular regions of various sizes that do not overlap with the context (otherwise predicting them would be too easy). Using the context, JEPA learns a context embedding from which it can predict well the embedded representations of its targets. The key point here is that JEPA does not try to predict the target's content pixel by pixel, but rather a new, low-dimensional representation. Working on this embedding representation makes JEPA much faster than other techniques, and more semantic too.
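To make the context/target split more tangible, here is a minimal sketch in plain Python of how a context block and some non-overlapping target blocks could be sampled on a ViT-style patch grid. The grid size, block sizes and number of targets are my own illustrative assumptions, not the paper's exact recipe:
```python
import random

GRID = 14  # e.g. a 224x224 image cut into 16x16 patches gives a 14x14 grid

def sample_block(min_side=3, max_side=6):
    """Return the set of patch indices covered by a random rectangle."""
    h = random.randint(min_side, max_side)
    w = random.randint(min_side, max_side)
    top = random.randint(0, GRID - h)
    left = random.randint(0, GRID - w)
    return {(top + i) * GRID + (left + j) for i in range(h) for j in range(w)}

def sample_masks(num_targets=4):
    # a few rectangular target blocks of various sizes...
    targets = [sample_block() for _ in range(num_targets)]
    # ...and a larger context block, from which every target patch is removed
    # so the context never overlaps the regions it has to predict
    context = sample_block(min_side=9, max_side=12)
    for t in targets:
        context -= t
    return context, targets

context, targets = sample_masks()
print(len(context), [len(t) for t in targets])
```
The important detail is the last step: every patch that belongs to a target is dropped from the context, so the prediction cannot degenerate into simply copying pixels that are already visible.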

The authors included a picture that helps compare JEPA with other existing techniques:

In a joint-embedding architecture, two inputs are encoded into embedding representations s, and the system learns to produce similar representations for similar inputs. Conversely, in a generative architecture, an image x is encoded and then a decoder, conditioned on some control input z, predicts the other image y pixel by pixel. JEPA is the third picture: both the context and the target images are encoded into embeddings s, and a predictor, conditioned on z, learns to predict the target embedding from the context embedding.
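Reading the figure as code may help. The following is a minimal PyTorch sketch, under my own simplifying assumptions (toy MLP encoders, a plain MSE loss, random tensors instead of real patch features), of the JEPA branch: encode the context, encode the target, and train a predictor to map the context embedding onto the target embedding.
```python
import torch
import torch.nn as nn

dim = 128
context_encoder = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
target_encoder  = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
predictor       = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

x_context = torch.randn(8, 768)  # stand-in for the context patch features
y_target  = torch.randn(8, 768)  # stand-in for the target patch features

s_x = context_encoder(x_context)
with torch.no_grad():            # the target branch only provides the regression label
    s_y = target_encoder(y_target)

# predict the target embedding from the context embedding:
# the loss lives in embedding space, never in pixel space
loss = nn.functional.mse_loss(predictor(s_x), s_y)
loss.backward()
```
In the actual paper the encoders are Vision Transformers and the target encoder is updated as an exponential moving average of the context encoder; the toy modules above only show where the loss is computed.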
Other invariance-based methods exist for this kind of task, but they require hand-crafted augmentations that produce multiple similar views of each image for training. As you can see, JEPA just asks for a single image and randomly generates the context and the targets, which is much more manageable.
Working on semantic representations also makes JEPA much faster: in the paper's Figure 1 you can see its performance compared to other approaches such as iBOT, CAE, data2vec, etc. We are in any case still talking about thousands of GPU training hours.
Once JEPA is pretrained, it can be reused as a building block for specific tasks such as object counting in a scene, depth prediction, and so on. It can also be used as the input of a generative model, which makes it possible to visually compare the targets with reconstructions generated from the embedding representation, as in the picture below:

The paper provides many details on how JEPA was implemented and trained, and many references to other approaches, in just 17 pages.
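Coming back to the "building block" idea above: in practice this usually means freezing the pretrained encoder and training only a small task-specific head on top of its embeddings. Here is a minimal sketch of that pattern, where pretrained_encoder is a hypothetical placeholder for a real frozen JEPA backbone and all shapes are made up for illustration:
```python
import torch
import torch.nn as nn

pretrained_encoder = nn.Linear(768, 128)   # placeholder for the frozen pretrained backbone
for p in pretrained_encoder.parameters():
    p.requires_grad = False                # freeze it: only the head is trained

head = nn.Linear(128, 1)                   # small task-specific head (e.g. counting, depth)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

features = torch.randn(16, 768)            # stand-in for image patch features
labels   = torch.randn(16, 1)              # stand-in for task labels

with torch.no_grad():
    embeddings = pretrained_encoder(features)
loss = nn.functional.mse_loss(head(embeddings), labels)
loss.backward()
optimizer.step()
```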