Feature Prediction: Key to Effective Unsupervised Learning from Video

In an article recently posted to the Meta Research website, researchers explored the effectiveness of feature prediction as a standalone objective for unsupervised learning (UL) of visual representations from video and proposed the video joint-embedding predictive architecture (V-JEPA), a family of vision models trained using only a feature prediction objective.

Study: Feature Prediction: Key to Effective Unsupervised Learning from Video. Image credit: Quardia/Shutterstock

Background

Humans can map low-level signals originating from the retina into a semantic spatiotemporal understanding of the world, enabling them to synthesize notions such as global motion and objects. Identifying the objectives and principles that guide such UL in humans is a long-standing goal of the machine learning (ML) community. One related hypothesis, based on the predictive feature principle, postulates that representations of temporally adjacent sensory stimuli should be predictive of one another.

The study

In this study, researchers investigated the effectiveness of feature prediction as a standalone objective for UL of visual representations from video using modern tools. They introduced V-JEPA, a group of vision models trained solely with a feature prediction objective, without any source of supervision such as pre-trained image encoders, negative examples, human annotations, pixel-level reconstruction, or text.

A collection of V-JEPA models was pre-trained on two million videos obtained from publicly available datasets by combining a masked modeling prediction task with a JEPA. The trained models were assessed on downstream video and image tasks using frozen evaluation and end-to-end fine-tuning. Researchers integrated several advances in the field, including larger datasets, JEPAs, query-based feature pooling, the now-standard use of transformer architectures in vision, and the maturing masked autoencoding framework, into the conceptually simple and modern V-JEPA.

V-JEPA methodology

The key concept of a JEPA is to learn by predicting the representation of one input from the representation of another. The basic architecture consists of an encoder and a predictor: the encoder computes the representations of the inputs, while the predictor predicts the representation of an input y from the representation of an input x, conditioned on a variable z that indicates the transformation between x and y. Conditioning on z enables the generation of distinct predictions for different transformations of x.

Researchers trained the visual encoder to satisfy the constraint that representations computed from one part of a video (y) must be predictable from representations computed from another part (x). The predictor network was trained jointly with the encoder and was given the spatiotemporal positions of y through the conditioning variable z.
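To make the prediction-in-representation-space idea concrete, the following is a minimal, hypothetical Python/PyTorch sketch of a single JEPA-style training step. The toy MLP encoder and predictor, the per-token pairing of x and y, the stop-gradient on the target features, and the L1 regression loss are illustrative assumptions and do not reproduce the authors' exact implementation.

```python
# Minimal sketch of a JEPA-style training step (hypothetical, not the authors' code).
import torch
import torch.nn as nn

D = 256  # representation dimension (assumed)

encoder = nn.Sequential(nn.Linear(1024, D), nn.GELU(), nn.Linear(D, D))     # f: maps patches to features
predictor = nn.Sequential(nn.Linear(D + 4, D), nn.GELU(), nn.Linear(D, D))  # g: conditioned on z

def jepa_loss(x_tokens, y_tokens, z_positions):
    """x_tokens, y_tokens: (N, 1024) flattened patches from two parts of a video.
    z_positions: (N, 4) spatiotemporal coordinates of y (the conditioning variable z)."""
    s_x = encoder(x_tokens)                      # representation of the visible part x
    with torch.no_grad():                        # assumed: no gradient through the target features
        s_y = encoder(y_tokens)                  # representation of the masked part y
    s_y_hat = predictor(torch.cat([s_x, z_positions], dim=-1))  # predict y's features from x's features and z
    return (s_y_hat - s_y).abs().mean()          # regression loss in feature space (L1, as an example)

# One training step on random data, for illustration only.
opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)
loss = jepa_loss(torch.randn(8, 1024), torch.randn(8, 1024), torch.rand(8, 4))
loss.backward(); opt.step(); opt.zero_grad()
```

Note that the loss is computed entirely in feature space; no pixels are reconstructed, which is the defining difference from masked autoencoding approaches described below.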

A masked modeling formulation was used for the feature prediction task, with the regions x and y sampled from the video via masking. Two mask types, short-range masks and long-range masks, were used for this purpose; a sketch of how such masks might be sampled is given below. A vision transformer (ViT) was employed as the video backbone.
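The snippet below illustrates one plausible way to sample spatial block masks that are repeated across frames, giving spatiotemporal "tubes" to predict. The grid size, block counts, and scales are assumed for illustration and are not the paper's exact hyperparameters.

```python
# Illustrative sampling of short-range and long-range block masks repeated across time
# (hypothetical parameters, not the paper's exact settings).
import numpy as np

def sample_block_mask(grid_h=14, grid_w=14, n_frames=8, n_blocks=4, block_scale=0.2, rng=None):
    """Return a boolean mask of shape (n_frames, grid_h, grid_w); True = masked (to be predicted)."""
    rng = rng or np.random.default_rng()
    mask2d = np.zeros((grid_h, grid_w), dtype=bool)
    for _ in range(n_blocks):
        bh = max(1, int(grid_h * np.sqrt(block_scale)))   # block height from the target area fraction
        bw = max(1, int(grid_w * np.sqrt(block_scale)))
        top = rng.integers(0, grid_h - bh + 1)
        left = rng.integers(0, grid_w - bw + 1)
        mask2d[top:top + bh, left:left + bw] = True
    # The same spatial mask is applied to every frame, producing a spatiotemporal tube.
    return np.repeat(mask2d[None, :, :], n_frames, axis=0)

short_range = sample_block_mask(n_blocks=8, block_scale=0.15)  # many smaller blocks (assumed "short-range")
long_range = sample_block_mask(n_blocks=2, block_scale=0.7)    # few large blocks (assumed "long-range")
print(short_range.mean(), long_range.mean())                   # fraction of tokens masked in each case
```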

Pretraining and evaluation

Videos from multiple public datasets, such as Kinetics-400/600/700 (K710) and HowTo100M (HT), were combined to create an unsupervised video pre-training dataset, VideoMix2M, containing two million videos. A ViT-L/16, a ViT-H/16, and a ViT-H/16₃₈₄ (a ViT-H/16 operating at a higher input resolution) were trained on the VideoMix2M dataset. A batch size of 2400 was used for the ViT-H/16₃₈₄ model, while a batch size of 3072 was used for the ViT-H/16 and ViT-L/16 models.

Each model took as input a video clip of 16 frames sampled with a frame skip of 4. The ViT-L/16 and ViT-H/16 processed the video at a spatial resolution of 224, while the ViT-H/16₃₈₄ used an input resolution of 384. Pre-trained models were evaluated on both downstream video and image tasks. A subset of the VideoGLUE benchmark was employed to assess different capabilities on video tasks.
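To make the input pipeline concrete, the sketch below shows how a 16-frame clip with a frame skip of 4 (spanning 64 source frames) could be sampled, alongside the per-model spatial resolutions described above. The function and variable names are hypothetical.

```python
# Hypothetical clip sampling matching the described inputs: 16 frames, frame skip of 4.
import numpy as np

def sample_clip(video, num_frames=16, frame_skip=4, rng=None):
    """video: array of shape (T, H, W, 3). Returns a 16-frame clip spanning 64 source frames."""
    rng = rng or np.random.default_rng()
    span = num_frames * frame_skip                      # 16 * 4 = 64 source frames per clip
    start = rng.integers(0, max(1, len(video) - span + 1))
    idx = start + np.arange(num_frames) * frame_skip    # take every 4th frame
    return video[idx]

RESOLUTIONS = {"ViT-L/16": 224, "ViT-H/16": 224, "ViT-H/16_384": 384}  # spatial side length per model

video = np.zeros((300, 360, 480, 3), dtype=np.uint8)    # dummy 300-frame video
clip = sample_clip(video)
print(clip.shape, RESOLUTIONS["ViT-H/16_384"])           # (16, 360, 480, 3) 384
```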

Specifically, researchers investigated action localization, motion classification, and action recognition on AVA, Something-Something-v2 (SSv2), and Kinetics-400, respectively. For static image tasks, object recognition, scene classification, and fine-grained recognition were investigated on ImageNet, Places205, and iNaturalist 2021, respectively.

Significance of the study

Results demonstrated that learning by predicting video features led to versatile visual representations that performed well on both appearance- and motion-based tasks without adaptation of the model's weights, that is, using a frozen backbone. V-JEPA achieved the best performance among all methods considered in this study, including MVD, VideoMAE, OmniMAE, and DINOv2, on the SSv2 task, which requires fine-grained temporal understanding.

V-JEPA was also competitive on the Kinetics-400 task, where appearance-based features are sufficient and the state-of-the-art image model DINOv2 achieved the best performance. The largest model, a ViT-H/16 trained solely on videos, achieved 77.9% on ImageNet-1K, 72.2% on SSv2, and 81.9% on Kinetics-400.

Under a frozen evaluation protocol with attentive probing, models trained using feature prediction outperformed pixel prediction approaches, and they remained competitive with pixel prediction approaches under full fine-tuning while using substantially shorter training schedules.
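The frozen evaluation described here can be pictured as training only a small probe on top of fixed features. The sketch below shows one plausible form of query-based attentive pooling followed by a linear classifier; the module design, dimensions, and class count are assumptions for illustration rather than the authors' exact evaluation head.

```python
# Sketch of a frozen-evaluation "attentive probe": the backbone stays frozen and only a small
# cross-attention pooling module plus a linear classifier are trained (assumed design).
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim=1024, num_classes=174, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))            # learnable pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)                      # linear classifier

    def forward(self, tokens):                                       # tokens: (B, N, dim) frozen features
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)                     # cross-attention pooling over tokens
        return self.head(pooled.squeeze(1))

# Usage: features come from the frozen, pre-trained video encoder (not trained here).
with torch.no_grad():
    frozen_features = torch.randn(4, 1568, 1024)                     # placeholder for encoder outputs
probe = AttentiveProbe()
logits = probe(frozen_features)                                       # (4, 174) class scores, e.g., for SSv2
```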

Moreover, models trained using feature prediction were more label-efficient than pixel prediction approaches: as the number of available labeled examples was reduced, the performance gap between V-JEPA and pixel-reconstruction models widened. Overall, the findings of this study demonstrated that feature prediction can serve as an effective standalone objective for UL from video.

