In an article recently posted to the Meta Research website, researchers introduced the Segment Anything Model 2 (SAM 2), a transformer-based model with streaming memory for real-time video segmentation. By leveraging a user-interactive data engine, SAM 2 achieved better accuracy while using 3× fewer interactions on video tasks and segmented images 6× faster than its predecessor. The researchers released the model, dataset, and an interactive demo to advance video segmentation and related tasks.
Related Work
Past work in image segmentation includes the Segment Anything Model (SAM), which enabled flexible, zero-shot segmentation with various prompts and has been extended by methods such as High-Quality SAM (HQ-SAM) and efficiency-focused variants such as EfficientSAM and MobileSAM.
Interactive video object segmentation (VOS) has evolved around methods that propagate user input across frames, from graph-based optimization to modular designs. Semi-supervised VOS, which relies on a mask provided in the initial frame, has advanced through neural network-based approaches and vision transformers.
Model Overview
The model extends SAM to handle both image and video inputs by integrating a memory-based approach for real-time processing. SAM 2 uses point, box, and mask prompts on individual frames to define the object of interest and refines the predicted masks iteratively. Unlike SAM, SAM 2 leverages memories of past predictions and prompted frames, including frames that lie ahead of the current one, to enhance its segmentation capabilities.
The architecture includes an image encoder for generating feature embeddings, memory attention for conditioning current frame features on past frames and prompts, and a prompt encoder and mask decoder like SAM’s but with added components to handle frame ambiguities and occlusions.
The memory encoder and memory bank store past predictions and prompts. During training, interactive prompting is simulated on image and video data so that the model learns to predict ground-truth masklets sequentially.
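To make this flow concrete, the sketch below mirrors the per-frame pipeline described above: an image encoder produces frame embeddings, memory attention conditions them on a memory bank of earlier frames and prompts, a mask decoder predicts the mask, and a memory encoder writes the result back into the bank. The module sizes, the toy convolutional encoder standing in for Hiera, the linear decoder head, and the six-frame memory window are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the per-frame pipeline described above.
# All module choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class ToyMemoryAttention(nn.Module):
    """Cross-attend current-frame features to memory-bank features."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, C) flattened frame tokens; memory: (B, M, C) memory tokens
        attended, _ = self.attn(frame_feats, memory, memory)
        return self.norm(frame_feats + attended)


class ToySAM2Pipeline(nn.Module):
    """Illustrative per-frame flow: encode, condition on memory, decode, store."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for Hiera
        self.memory_attention = ToyMemoryAttention(dim)
        self.mask_decoder = nn.Linear(dim, 1)                              # stand-in decoder head
        self.memory_encoder = nn.Linear(dim, dim)                          # compresses outputs into memory
        self.memory_bank: list[torch.Tensor] = []                          # rolling store of past-frame memories

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        b, _, h, w = frame.shape
        feats = self.image_encoder(frame).flatten(2).transpose(1, 2)       # (B, N, C)
        if self.memory_bank:                                               # condition on stored memories
            memory = torch.cat(self.memory_bank, dim=1)
            feats = self.memory_attention(feats, memory)
        mask_logits = self.mask_decoder(feats)                             # (B, N, 1) token-level logits
        self.memory_bank.append(self.memory_encoder(feats).detach())       # write new memory
        self.memory_bank = self.memory_bank[-6:]                           # keep a small recent window (assumed size)
        return mask_logits.transpose(1, 2).reshape(b, 1, h // 16, w // 16)


frames = torch.randn(2, 1, 3, 64, 64)  # toy two-frame clip
model = ToySAM2Pipeline()
for t in range(frames.shape[0]):
    print(model(frames[t]).shape)      # low-resolution mask logits per frame
```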
Data Collection
To enable comprehensive video segmentation, a large and diverse dataset, Segment Anything Video (SA-V), was built through a multi-phase data engine. Initially, SAM was used for per-frame annotation, yielding high-quality masks but at a slow pace (37.8 seconds per frame).
The second phase added SAM 2 in a mask-prompt-only configuration, which sped up annotation to 7.4 seconds per frame by temporally propagating masks and yielded 63.5K masklets. In the final phase, the fully featured SAM 2 was employed, integrating various prompts and temporal memory, which reduced annotation time to 4.5 seconds per frame and produced 197.0K masklets.
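Taking the quoted per-frame annotation times at face value, the speedup across the three data-engine phases follows from simple arithmetic; the snippet below (phase labels paraphrased) prints the factors relative to the SAM-only first phase.

```python
# Rough arithmetic on the per-frame annotation times quoted above for the three
# data-engine phases; the speedup factors are simply derived from those values.
phase_times = {
    "phase 1 (SAM per frame)": 37.8,
    "phase 2 (SAM 2 mask propagation)": 7.4,
    "phase 3 (full SAM 2)": 4.5,
}  # seconds per frame

baseline = phase_times["phase 1 (SAM per frame)"]
for phase, seconds in phase_times.items():
    print(f"{phase}: {seconds:.1f} s/frame, {baseline / seconds:.1f}x vs. phase 1")
```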
The data engine phases resulted in a dataset of 50.9K videos with 642.6K masklets featuring diverse scenes and object types. The dataset includes manually annotated and automatically generated masklets to enhance coverage and identify model failures. With significantly more annotations than existing datasets, the SA-V dataset is split into training, validation, and test sets to cover challenging scenarios, ensuring robustness in video segmentation tasks.
SAM 2 Advancements
Evaluations on zero-shot video and image tasks show that SAM 2 delivers notable improvements over previous methods. For video, SAM 2 was tested in both offline and online settings for promptable video segmentation, using up to three clicks per frame.
Results showed that SAM 2 outperformed the previous baselines, SAM+XMem++ and SAM+Cutie, with significant gains in region similarity and contour accuracy (J&F). SAM 2 achieved superior segmentation accuracy across nine datasets, indicating its ability to provide high-quality video segmentation with fewer interactions. SAM 2 again surpassed previous methods in the semi-supervised VOS setting, showing enhanced performance with various prompts, including clicks, bounding boxes, and ground-truth masks.
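The J&F score averages region similarity J (mask intersection-over-union, the Jaccard index) with a boundary accuracy term F. The sketch below computes J exactly and a simplified boundary F-measure on binary masks; the benchmark-standard F uses tolerance-based boundary matching, so treat this as an illustrative approximation rather than the evaluation code used in the paper.

```python
# Simplified J&F computation on binary masks (illustrative approximation).
import numpy as np


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0


def boundary(mask: np.ndarray) -> np.ndarray:
    """Mask pixels that touch a background pixel (4-neighbourhood)."""
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:]) & mask
    return mask & ~interior


def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified boundary F-measure: precision/recall of exact boundary overlap."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 or bg.sum() == 0:
        return 1.0 if bp.sum() == bg.sum() else 0.0
    precision = np.logical_and(bp, bg).sum() / bp.sum()
    recall = np.logical_and(bp, bg).sum() / bg.sum()
    return float(2 * precision * recall / (precision + recall)) if precision + recall else 0.0


pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 2:6] = True   # toy predicted mask
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 3:7] = True     # toy ground-truth mask shifted by one column
j, f = jaccard(pred, gt), boundary_f(pred, gt)
print(f"J = {j:.2f}, F = {f:.2f}, J&F = {(j + f) / 2:.2f}")
```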
On the image segmentation front, SAM 2 was assessed across 37 zero-shot datasets, achieving higher average mIoU scores than its predecessor, SAM. This improvement is attributed to the more capable Hiera image encoder used in SAM 2, which also yields faster processing.
Integrating SA-1B and video data in training further boosted accuracy, particularly on video benchmarks and new datasets. SAM 2 excelled in interactive and non-interactive settings, reflecting its robust capability in handling a diverse range of segmentation tasks.
Ablation studies on SAM 2's design choices highlight the impact of training-data mix, quantity, and quality on performance. Training on a diverse mix, including VOS datasets and SA-1B, yielded the best results. Increasing data quantity consistently improved performance, with optimal results achieved using a combination of all available datasets.
Quality filtering strategies, such as using the most edited masklets, also enhanced performance but did not surpass using the full dataset. These findings underscore the importance of a well-rounded and high-quality training dataset in achieving superior segmentation results.
Conclusion
To sum up, Segment Anything was extended into the video domain by generalizing promptable segmentation to video, integrating memory capabilities into the SAM architecture, and building the diverse SA-V dataset for training and benchmarking. SAM 2 marks a significant advancement in visual perception, and its model, dataset, and benchmark contributions are positioned to propel further research and applications in the field.