SAM 2 Enhances Real-Time Video Segmentation

In an article recently posted to the Meta Research website, researchers introduced the Segment Anything Model 2 (SAM 2), a transformer-based model with streaming memory for real-time video segmentation. Built with the help of a user-interactive data engine, SAM 2 achieved superior video segmentation accuracy while requiring three times fewer interactions, and performed image segmentation six times faster than its predecessor, SAM. The researchers released the model, the dataset, and an interactive demo to advance video segmentation and related tasks.

Study: SAM 2 Enhances Real-Time Video Segmentation. Image Credit: DC Studio/Shutterstock.com

Related Work

Past work in image segmentation includes SAM, which enabled flexible, zero-shot segmentation from various prompts and has been extended by quality-focused methods such as High-Quality SAM (HQ-SAM) and efficiency-focused variants such as EfficientSAM and MobileSAM.

Interactive video object segmentation (VOS) has evolved through user-input methods that propagate masks across frames, including graph-based optimization and modular designs. Semi-supervised VOS, which relies on a mask supplied for the initial frame, has advanced through neural network-based approaches and vision transformers.

Model Overview

The model extends SAM to handle both image and video inputs by integrating a memory-based approach for real-time processing. SAM 2 accepts point, box, and mask prompts on individual frames to define object boundaries and refines masks iteratively. Unlike SAM, SAM 2 conditions its predictions on memories of past predictions and of prompted frames, which may include frames that come after the current one in the video, to enhance its segmentation capabilities.
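
As an illustration of this prompt-then-propagate workflow, the sketch below follows the general shape of the interface in Meta's released sam2 repository. The configuration file, checkpoint name, and exact method signatures are assumptions and may differ from the actual release.

```python
# Hedged sketch of point-prompting a video and propagating the mask.
# Config/checkpoint names and method signatures are assumptions based on the
# public sam2 repository and may not match the release exactly.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor  # assumed entry point

predictor = build_sam2_video_predictor(
    "sam2_hiera_l.yaml", "sam2_hiera_large.pt")          # assumed files

with torch.inference_mode():
    state = predictor.init_state(video_path="./frames")  # folder of video frames

    # A single positive click (label 1) on frame 0 selects the target object.
    predictor.add_new_points(
        state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32))

    # Memory-conditioned propagation turns the click into a masklet.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```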

The architecture includes an image encoder that generates feature embeddings, a memory attention module that conditions current-frame features on past frames and prompts, and a prompt encoder and mask decoder similar to SAM's but with added components to handle ambiguities and occlusions across frames.
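
To make the memory-conditioning step concrete, here is a minimal, self-contained sketch (not the authors' implementation) of memory attention as self-attention over current-frame features followed by cross-attention to memory features; the dimensions and layer choices are illustrative assumptions.

```python
# Conceptual sketch of memory attention: current-frame tokens attend to
# themselves, then cross-attend to memory tokens from past frames and prompts.
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frame_tokens, memory_tokens):
        # frame_tokens:  (B, N, dim) current-frame image embeddings
        # memory_tokens: (B, M, dim) encoded past predictions and prompted frames
        x = frame_tokens
        q = self.norm1(x)
        x = x + self.self_attn(q, q, q)[0]
        x = x + self.cross_attn(self.norm2(x), memory_tokens, memory_tokens)[0]
        return x  # memory-conditioned features passed on to the mask decoder

feats = torch.randn(1, 64 * 64, 256)       # flattened 64x64 feature map
memory = torch.randn(1, 7 * 64 * 64, 256)  # e.g., memories of 7 stored frames
conditioned = MemoryAttention()(feats, memory)
```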

The memory encoder and memory bank manage past predictions and prompts, while training involves simulating interactive prompting on image and video data to predict ground-truth masklets sequentially.
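
One simple way to picture the memory bank is as a small first-in, first-out queue of recent-frame memories kept alongside memories of prompted frames; the capacity and data layout below are illustrative assumptions rather than the model's actual configuration.

```python
# Toy memory bank: recent-frame memories are capped (FIFO), while memories of
# user-prompted frames are retained. Capacity is an assumed illustrative value.
from collections import deque

class MemoryBank:
    def __init__(self, max_recent=6):
        self.recent = deque(maxlen=max_recent)  # oldest entries drop out
        self.prompted = {}                      # frame_idx -> memory features

    def add(self, frame_idx, memory_feat, is_prompted=False):
        if is_prompted:
            self.prompted[frame_idx] = memory_feat
        else:
            self.recent.append((frame_idx, memory_feat))

    def gather(self):
        # Everything the memory attention layer would cross-attend to.
        return list(self.prompted.values()) + [m for _, m in self.recent]
```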

Data Collection

To enable comprehensive video segmentation, a large and diverse dataset, Segment Anything Video (SA-V), was developed through a multi-phase data engine. Initially, SAM was used for per-frame annotation, yielding high-quality masks but at a slow pace of 37.8 seconds per frame.

The second phase introduced SAM 2 for temporal mask propagation, which sped up annotation to 7.4 seconds per frame and yielded 63.5K masklets. In the final phase, the fully featured SAM 2, integrating diverse prompts and temporal memory, was employed, reducing annotation time to 4.5 seconds per frame and producing 197.0K masklets.
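
A quick back-of-the-envelope check of the reported per-frame times shows how much each phase accelerated annotation relative to per-frame SAM labeling:

```python
# Speedups implied by the reported per-frame annotation times.
phase_times = {
    "Phase 1 (per-frame SAM)": 37.8,
    "Phase 2 (SAM 2 mask propagation)": 7.4,
    "Phase 3 (full SAM 2)": 4.5,
}
baseline = phase_times["Phase 1 (per-frame SAM)"]
for phase, seconds in phase_times.items():
    print(f"{phase}: {seconds} s/frame, {baseline / seconds:.1f}x vs. phase 1")
# Phase 2 is roughly 5.1x and phase 3 roughly 8.4x faster than phase 1.
```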

The data engine phases resulted in a dataset of 50.9K videos with 642.6K masklets featuring diverse scenes and object types. The dataset includes manually annotated and automatically generated masklets to enhance coverage and identify model failures. With significantly more annotations than existing datasets, the SA-V dataset is split into training, validation, and test sets to cover challenging scenarios, ensuring robustness in video segmentation tasks.

SAM 2 Advancements

Evaluated on zero-shot video and image tasks, SAM 2 demonstrated notable improvements over previous methods. For video, it was tested in both offline and online settings for promptable video segmentation, using up to three clicks per frame.

Results showed that SAM 2 outperformed the previous baselines, SAM combined with XMem++ (SAM+XMem++) and SAM combined with Cutie (SAM+Cutie), with significant gains in J&F accuracy, the standard VOS score that averages region similarity (J) and boundary accuracy (F). SAM 2 achieved superior segmentation accuracy across nine datasets, indicating its ability to provide high-quality video segmentation with fewer interactions. SAM 2 again surpassed previous methods in the semi-supervised VOS setting, showing enhanced performance with various prompts, including clicks, bounding boxes, and ground-truth masks.
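
For readers unfamiliar with the metric, the sketch below illustrates a simplified J&F computation for a single frame; it is not the official benchmark evaluation code, and in particular the boundary matching omits the pixel-tolerance band used in standard implementations.

```python
# Simplified J&F: J is mask IoU, F is a (tolerance-free) boundary F-measure.
import numpy as np

def region_similarity_j(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary_f(pred, gt):
    def boundary(mask):
        m = mask.astype(np.uint8)
        eroded = np.zeros_like(m)
        eroded[1:-1, 1:-1] = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                              & m[1:-1, :-2] & m[1:-1, 2:])
        return (m == 1) & (eroded == 0)   # pixels on the mask border
    bp, bg = boundary(pred), boundary(gt)
    tp = np.logical_and(bp, bg).sum()
    precision = tp / bp.sum() if bp.sum() else 1.0
    recall = tp / bg.sum() if bg.sum() else 1.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def j_and_f(pred, gt):
    return 0.5 * (region_similarity_j(pred, gt) + boundary_f(pred, gt))

pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), dtype=bool);   gt[12:42, 12:42] = True
print(round(j_and_f(pred, gt), 3))
```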

On the image segmentation front, SAM 2 was assessed across 37 zero-shot datasets, achieving higher average mean intersection-over-union (mIoU) scores than its predecessor, SAM. This improvement is attributed to the more effective Hiera image encoder in SAM 2, which also yields faster processing.

Integrating SA-1B and video data in training further boosted accuracy, particularly on video benchmarks and new datasets. SAM 2 excelled in interactive and non-interactive settings, reflecting its robust capability in handling a diverse range of segmentation tasks.

Ablation studies on SAM 2's design choices highlight the impact of training data mix, quantity, and quality on performance. Training on diverse datasets, including VOS and SA-1B, yielded the best results. Increasing data quantity consistently improved performance, with optimal results achieved using a mix of all available datasets.

Quality filtering strategies, such as using the most edited masklets, also enhanced performance but did not surpass using the full dataset. These findings underscore the importance of a well-rounded and high-quality training dataset in achieving superior segmentation results.

Conclusion

To sum up, the Segment Anything project evolved into the video domain by extending promptable segmentation to video, integrating memory capabilities into the SAM architecture, and leveraging the diverse SA-V dataset for training and benchmarking. SAM 2 marks a significant advancement in visual perception, with contributions that serve as milestones for further research and applications in the field.


Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

