In an article recently posted to the Meta Research website, researchers introduced the Segment Anything Model 2 (SAM 2), a transformer-based model with streaming memory for real-time video segmentation. By leveraging a user-interactive data engine, SAM 2 achieved better accuracy while using 3× fewer interactions on video tasks and segmented images 6× faster than its predecessor. The researchers released the model, dataset, and an interactive demo to advance video segmentation and related tasks.
Related Work
Past work in image segmentation includes the Segment Anything Model (SAM), which enabled flexible, zero-shot segmentation with various prompts and has been extended by methods such as High-Quality SAM (HQ-SAM) and efficiency-focused variants such as EfficientSAM and MobileSAM.
Interactive video object segmentation (VOS) has evolved around methods that propagate user input across frames, from graph-based optimization to modular designs. Semi-supervised VOS, which relies on a mask provided in the initial frame, has advanced through neural network-based approaches and vision transformers.
Model Overview
The model extends SAM to handle both image and video inputs by integrating a memory-based approach for real-time processing. SAM 2 uses point, box, and mask prompts on individual frames to define the object of interest and refines the predicted masks iteratively. Unlike SAM, SAM 2 leverages memories of past predictions and prompted frames, including frames that lie ahead of the current one, to enhance its segmentation capabilities.
The architecture includes an image encoder for generating feature embeddings, memory attention for conditioning current frame features on past frames and prompts, and a prompt encoder and mask decoder like SAM’s but with added components to handle frame ambiguities and occlusions.
The memory encoder and memory bank store past predictions and prompts. During training, interactive prompting is simulated on image and video data so that the model learns to predict ground-truth masklets sequentially.
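To make this flow concrete, the sketch below mirrors the per-frame pipeline described above: an image encoder produces frame embeddings, memory attention conditions them on a memory bank of earlier frames and prompts, a mask decoder predicts the mask, and a memory encoder writes the result back into the bank. The module sizes, the toy convolutional encoder standing in for Hiera, the linear decoder head, and the six-frame memory window are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the per-frame pipeline described above.
# All module choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class ToyMemoryAttention(nn.Module):
    """Cross-attend current-frame features to memory-bank features."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, C) flattened frame tokens; memory: (B, M, C) memory tokens
        attended, _ = self.attn(frame_feats, memory, memory)
        return self.norm(frame_feats + attended)


class ToySAM2Pipeline(nn.Module):
    """Illustrative per-frame flow: encode, condition on memory, decode, store."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # stand-in for Hiera
        self.memory_attention = ToyMemoryAttention(dim)
        self.mask_decoder = nn.Linear(dim, 1)                              # stand-in decoder head
        self.memory_encoder = nn.Linear(dim, dim)                          # compresses outputs into memory
        self.memory_bank: list[torch.Tensor] = []                          # rolling store of past-frame memories

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        b, _, h, w = frame.shape
        feats = self.image_encoder(frame).flatten(2).transpose(1, 2)       # (B, N, C)
        if self.memory_bank:                                               # condition on stored memories
            memory = torch.cat(self.memory_bank, dim=1)
            feats = self.memory_attention(feats, memory)
        mask_logits = self.mask_decoder(feats)                             # (B, N, 1) token-level logits
        self.memory_bank.append(self.memory_encoder(feats).detach())       # write new memory
        self.memory_bank = self.memory_bank[-6:]                           # keep a small recent window (assumed size)
        return mask_logits.transpose(1, 2).reshape(b, 1, h // 16, w // 16)


frames = torch.randn(2, 1, 3, 64, 64)  # toy two-frame clip
model = ToySAM2Pipeline()
for t in range(frames.shape[0]):
    print(model(frames[t]).shape)      # low-resolution mask logits per frame
```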
Data Collection
To enable comprehensive video segmentation, a large and diverse dataset, Segment Anything Video (SA-V), was built through a multi-phase data engine. Initially, SAM was used for per-frame annotation, yielding high-quality masks but at a slow pace (37.8 seconds per frame).
The second phase added SAM 2 in a mask-prompt-only configuration, which sped up annotation to 7.4 seconds per frame by temporally propagating masks and yielded 63.5K masklets. In the final phase, the fully featured SAM 2 was employed, integrating various prompts and temporal memory, which reduced annotation time to 4.5 seconds per frame and produced 197.0K masklets.
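Taking the quoted per-frame annotation times at face value, the speedup across the three data-engine phases follows from simple arithmetic; the snippet below (phase labels paraphrased) prints the factors relative to the SAM-only first phase.

```python
# Rough arithmetic on the per-frame annotation times quoted above for the three
# data-engine phases; the speedup factors are simply derived from those values.
phase_times = {
    "phase 1 (SAM per frame)": 37.8,
    "phase 2 (SAM 2 mask propagation)": 7.4,
    "phase 3 (full SAM 2)": 4.5,
}  # seconds per frame

baseline = phase_times["phase 1 (SAM per frame)"]
for phase, seconds in phase_times.items():
    print(f"{phase}: {seconds:.1f} s/frame, {baseline / seconds:.1f}x vs. phase 1")
```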
The data engine phases resulted in a dataset of 50.9K videos with 642.6K masklets featuring diverse scenes and object types. The dataset includes manually annotated and automatically generated masklets to enhance coverage and identify model failures. With significantly more annotations than existing datasets, the SA-V dataset is split into training, validation, and test sets to cover challenging scenarios, ensuring robustness in video segmentation tasks.
SAM 2 Advancements
Evaluations on zero-shot video and image tasks show that SAM 2 delivers notable improvements over previous methods. For video, SAM 2 was tested in both offline and online settings for promptable video segmentation, using up to three clicks per frame.
Results showed that SAM 2 outperformed the previous baselines, SAM+XMem++ and SAM+Cutie, with significant gains in region similarity and contour accuracy (J&F). SAM 2 achieved superior segmentation accuracy across nine datasets, indicating its ability to provide high-quality video segmentation with fewer interactions. SAM 2 again surpassed previous methods in the semi-supervised VOS setting, showing enhanced performance with various prompts, including clicks, bounding boxes, and ground-truth masks.
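The J&F score averages region similarity J (mask intersection-over-union, the Jaccard index) with a boundary accuracy term F. The sketch below computes J exactly and a simplified boundary F-measure on binary masks; the benchmark-standard F uses tolerance-based boundary matching, so treat this as an illustrative approximation rather than the evaluation code used in the paper.

```python
# Simplified J&F computation on binary masks (illustrative approximation).
import numpy as np


def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0


def boundary(mask: np.ndarray) -> np.ndarray:
    """Mask pixels that touch a background pixel (4-neighbourhood)."""
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:]) & mask
    return mask & ~interior


def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified boundary F-measure: precision/recall of exact boundary overlap."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 or bg.sum() == 0:
        return 1.0 if bp.sum() == bg.sum() else 0.0
    precision = np.logical_and(bp, bg).sum() / bp.sum()
    recall = np.logical_and(bp, bg).sum() / bg.sum()
    return float(2 * precision * recall / (precision + recall)) if precision + recall else 0.0


pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 2:6] = True   # toy predicted mask
gt = np.zeros((8, 8), dtype=bool)
gt[2:6, 3:7] = True     # toy ground-truth mask shifted by one column
j, f = jaccard(pred, gt), boundary_f(pred, gt)
print(f"J = {j:.2f}, F = {f:.2f}, J&F = {(j + f) / 2:.2f}")
```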
On the image segmentation front, SAM 2 was assessed across 37 zero-shot datasets, achieving higher average mIoU scores than its predecessor, SAM. This improvement is attributed to the more capable Hiera image encoder used in SAM 2, which also yields faster processing.
Integrating SA-1B and video data in training further boosted accuracy, particularly on video benchmarks and new datasets. SAM 2 excelled in interactive and non-interactive settings, reflecting its robust capability in handling a diverse range of segmentation tasks.
Ablation studies on SAM 2's design choices highlight the impact of training-data mix, quantity, and quality on performance. Training on a diverse mix, including VOS datasets and SA-1B, yielded the best results. Increasing data quantity consistently improved performance, with optimal results achieved using a combination of all available datasets.
Quality filtering strategies, such as using the most edited masklets, also enhanced performance but did not surpass using the full dataset. These findings underscore the importance of a well-rounded and high-quality training dataset in achieving superior segmentation results.
Conclusion
To sum up, Segment Anything was extended into the video domain by generalizing promptable segmentation to video, integrating memory capabilities into the SAM architecture, and building the diverse SA-V dataset for training and benchmarking. SAM 2 marks a significant advancement in visual perception, and its model, dataset, and benchmark contributions are positioned to propel further research and applications in the field.