In an article recently posted to the Meta Research website, researchers explored the effectiveness of feature prediction as a standalone objective for unsupervised learning (UL) of visual representations from video and proposed the video joint-embedding predictive architecture (V-JEPA), a family of vision models trained using only a feature prediction objective.
Background
Humans can map low-level signals originating from the retina into a semantic spatiotemporal understanding of the world, enabling them to synthesize notions such as global motion and objects. Identifying the objectives and principles that guide such UL in humans is an established goal of the machine learning (ML) community. One related hypothesis, based on the predictive feature principle, postulates that representations of temporally adjacent sensory stimuli should be predictive of each other.
The study
In this study, researchers investigated the effectiveness of feature prediction as a standalone objective for UL of visual representations from video using modern tools. They introduced V-JEPA, a group of vision models trained solely with a feature prediction objective, without any source of supervision such as pre-trained image encoders, negative examples, human annotations, pixel-level reconstruction, or text.
A collection of V-JEPA models was pre-trained on two million videos obtained from publicly available datasets by combining a masked modeling prediction task with a JEPA. The trained models were assessed on downstream video and image tasks using both end-to-end fine-tuning and frozen evaluation. The researchers integrated several advances in the field, including larger datasets, JEPAs, query-based feature pooling, the now-standard use of transformer architectures in vision, and the maturing masked autoencoding framework, into the conceptually simple and modern V-JEPA.
V-JEPA methodology
The key concept of a JEPA is to learn by predicting the representation of one input from the representation of another input. The basic architecture consists of an encoder and a predictor: the encoder computes representations of the inputs, and the predictor predicts the representation of an input y from the representation of an input x, conditioned on a variable z that indicates the transformation between x and y. Conditioning on z enables the model to generate distinct predictions for different transformations of x.
The researchers trained the visual encoder to satisfy the constraint that representations computed from one part of a video (y) must be predictable from representations computed from another part (x). The predictor network was trained simultaneously with the encoder and was given the spatiotemporal positions of y through the conditioning variable z.
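To make the prediction objective concrete, below is a minimal, hedged sketch of a JEPA-style training step in PyTorch. The module sizes, the 4-dimensional position encoding standing in for z, and the L1 feature-regression loss are illustrative assumptions; the full V-JEPA recipe involves additional components (for example, a vision transformer encoder operating on spatiotemporal patches and mechanisms to prevent representation collapse) that this toy version omits.

```python
import torch
import torch.nn as nn

dim = 256
# Toy stand-ins for the encoder and predictor networks.
encoder = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim + 4, dim), nn.GELU(), nn.Linear(dim, dim))
params = list(encoder.parameters()) + list(predictor.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)

def jepa_step(x_tokens, y_tokens, z_pos):
    # x_tokens / y_tokens: features of two different parts of the same video;
    # z_pos: spatiotemporal positions of y (the conditioning variable z).
    s_x = encoder(x_tokens)              # representation of the context x
    with torch.no_grad():                # stop-gradient on the target branch
        s_y = encoder(y_tokens)          # representation of the target y
    pred = predictor(torch.cat([s_x, z_pos], dim=-1))
    loss = (pred - s_y).abs().mean()     # regress target features (L1 loss)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random tensors standing in for tokenized video patches.
x = torch.randn(8, 768)
y = torch.randn(8, 768)
z = torch.randn(8, 4)
print(jepa_step(x, y, z))
```

The essential point captured here is that the loss is computed in feature space rather than pixel space, with the predictor conditioned on where in the video the target region lies.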
A masked modeling formulation was used for the feature prediction task, with the y and x regions sampled from the video via masking. Two mask types, short-range masks and long-range masks, were used for this purpose. A vision transformer (ViT) was employed as the video backbone.
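As a rough illustration of this masking scheme, the snippet below samples a rectangular spatial block and masks it in every frame of the clip, which is a common way to build spatiotemporal ("tubelet") masks. The grid size and block sizes are placeholders chosen for this sketch; in this toy version the two mask types differ only in how much area they cover, standing in for the paper's actual short-range and long-range mask definitions.

```python
import numpy as np

def sample_block_mask(t, h, w, block_h, block_w, rng):
    """Return a boolean (t, h, w) mask; True marks patches to be predicted."""
    top = rng.integers(0, h - block_h + 1)
    left = rng.integers(0, w - block_w + 1)
    mask = np.zeros((t, h, w), dtype=bool)
    mask[:, top:top + block_h, left:left + block_w] = True  # same block in every frame
    return mask

rng = np.random.default_rng(0)
# e.g., a 16-frame clip tokenized into an 8 x 14 x 14 grid of spatiotemporal patches
short_range_mask = sample_block_mask(8, 14, 14, 5, 5, rng)    # small masked block
long_range_mask = sample_block_mask(8, 14, 14, 12, 12, rng)   # large masked block
print(short_range_mask.mean(), long_range_mask.mean())        # fraction of patches masked
```

During training, the encoder sees only the unmasked patches (x), while the predictor is asked to predict the features of the masked patches (y).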
Pretraining and evaluation
Videos from multiple public datasets, such as Kinetics-400/600/700 (K710) and HowTo100M (HT), were combined to create an unsupervised video pre-training dataset, VideoMix2M, containing two million videos. A ViT-H/16_384, a ViT-H/16, and a ViT-L/16 transformer model were trained on the VideoMix2M dataset. A batch size of 2400 was used for the ViT-H/16_384 model, while a batch size of 3072 was used for the ViT-H/16 and ViT-L/16 models.
Each model took as input a video clip of 16 frames sampled with a frameskip of 4. The ViT-H/16 and ViT-L/16 processed the video at a spatial resolution of 224, while the ViT-H/16_384 used an input resolution of 384. Pre-trained models were evaluated on both downstream image and video tasks. A subset of the VideoGLUE benchmark was employed to assess different capabilities on video tasks.
Specifically, the researchers investigated action localization, motion classification, and action recognition on AVA, Something-Something-v2 (SSv2), and Kinetics-400, respectively. For static image tasks, they investigated object recognition, scene classification, and fine-grained recognition on ImageNet, Places205, and iNaturalist 2021, respectively.
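For reference, the pre-training configurations described above can be summarized in a small table-like structure; the values below are taken directly from the text, and any setting not mentioned there is left out.

```python
# Pre-training configurations as described above (ViT-H/16_384 denotes the
# ViT-H/16 variant pre-trained at 384 input resolution).
pretraining_configs = {
    "ViT-L/16":     {"batch_size": 3072, "resolution": 224, "frames": 16, "frameskip": 4},
    "ViT-H/16":     {"batch_size": 3072, "resolution": 224, "frames": 16, "frameskip": 4},
    "ViT-H/16_384": {"batch_size": 2400, "resolution": 384, "frames": 16, "frameskip": 4},
}
for name, cfg in pretraining_configs.items():
    print(f"{name}: {cfg}")
```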
Significance of the study
Results demonstrated that learning by predicting video features leads to versatile visual representations that perform well on both appearance- and motion-based tasks without any adaptation of the model's weights, that is, with a frozen backbone. V-JEPA achieved the best performance among all methods considered in this study, including MVD, VideoMAE, OmniMAE, and DINOv2, on the SSv2 task, which requires fine-grained temporal understanding.
V-JEPA was also competitive on Kinetics-400, where appearance-based features are sufficient and the state-of-the-art image model DINOv2 achieved the best performance. The largest model, a ViT-H/16 trained solely on videos, achieved a score of 77.9% on ImageNet1K, 72.2% on SSv2, and 81.9% on Kinetics-400.
Models trained with feature prediction outperformed pixel-prediction approaches under a frozen evaluation protocol with attentive probing, and were competitive with pixel-prediction approaches under full fine-tuning while using substantially shorter training schedules.
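To illustrate what frozen evaluation with attentive probing can look like in code, the sketch below keeps the pre-trained backbone's features fixed and trains only a small cross-attention pooling module (the attentive probe) plus a linear classifier on top of them. The feature dimension, number of attention heads, number of classes, and token count are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Learnable query that cross-attends over frozen patch features, then classifies."""
    def __init__(self, dim=1024, num_classes=174, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frozen_features):
        # frozen_features: (batch, num_patches, dim) produced by the frozen encoder
        q = self.query.expand(frozen_features.size(0), -1, -1)
        pooled, _ = self.attn(q, frozen_features, frozen_features)  # query-based pooling
        return self.head(pooled.squeeze(1))

probe = AttentiveProbe()
feats = torch.randn(2, 1568, 1024)   # stand-in for frozen backbone features
logits = probe(feats)                # only the probe's parameters receive gradients
print(logits.shape)                  # torch.Size([2, 174])
```

Because only the probe's parameters are trained, this protocol measures the quality of the frozen representations themselves rather than how well the backbone can be fine-tuned.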
Moreover, models trained with feature prediction were more label-efficient than pixel-prediction approaches. Specifically, reducing the number of available labeled examples widened the performance gap between V-JEPA and pixel-reconstruction models. Overall, the findings of this study demonstrated that feature prediction can serve as an effective standalone objective for UL from video.
Journal reference:
- Bardes, A., Garrido, Q., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N., & Ponce, J. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. Meta Research website. https://ai.meta.com/research/publications/revisiting-feature-prediction-for-learning-visual-representations-from-video/