In a recent paper submitted to the arXiv* server, researchers introduced a novel unsupervised technique called VideoCutLER, which performs remarkable multi-instance segmentation and tracking across video frames.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Unsupervised video instance segmentation is a pivotal component in various computer vision applications such as video surveillance, video editing, and autonomous driving. However, acquiring labeled videos is costly, which motivates solutions that comprehend video content holistically, without labels.
Prior techniques predominantly rely on optical flow networks and off-the-shelf motion estimators. Yet, optical flow falters in cases of occlusion, motion blur, intricate motions, or lighting changes, rendering it unreliable. These scenarios confound models relying heavily on optical flow. To counter this, the authors advocate for a method, VideoCutLER, that divorces itself from optical flow and instead utilizes synthetic video generation.
Proposed VideoCutLER model
The proposed technique, VideoCutLER, follows a three-step cut, synthesize, and learn process. Initially, the researchers used the MaskCut model to generate pseudo-masks for objects in images. MaskCut is a spectral clustering technique for object detection and image instance segmentation, built upon the self-supervised learning framework DINO. It constructs a patch-wise affinity matrix from the key features of the DINO model, and the normalized cut (NCut) algorithm is applied to this affinity matrix to segment a single object within the image. To segment multiple instances, MaskCut is applied iteratively, masking out previously discovered objects, as sketched below.
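To make the affinity-and-cut idea concrete, the following is a minimal sketch (not the authors' implementation) of how patch features could be turned into instance masks. It assumes DINO patch key features are already extracted into a `(num_patches, dim)` NumPy array; the similarity threshold and the mean-split heuristic are illustrative assumptions.

```python
import numpy as np

def ncut_bipartition(feats, tau=0.15):
    """Bipartition patches via a normalized cut on a cosine-affinity graph.

    feats: (N, D) array of DINO patch key features (assumed precomputed).
    tau:   similarity threshold used to sparsify the affinity matrix.
    Returns a boolean foreground assignment over the N patches.
    """
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    A = feats @ feats.T                      # patch-wise cosine affinity
    A = np.where(A > tau, 1.0, 1e-5)         # keep strong edges, damp the rest
    d = A.sum(axis=1)
    L = np.diag(d) - A                       # graph Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # Eigenvectors of the normalized Laplacian; the second-smallest
    # (Fiedler) eigenvector yields the NCut bipartition.
    _, vecs = np.linalg.eigh(D_inv_sqrt @ L @ D_inv_sqrt)
    fiedler = D_inv_sqrt @ vecs[:, 1]
    return fiedler > np.mean(fiedler)        # split into two groups

def maskcut_like(feats, num_instances=3):
    """Iteratively extract instance masks, removing patches already claimed."""
    masks, active = [], np.ones(len(feats), dtype=bool)
    for _ in range(num_instances):
        fg = ncut_bipartition(feats[active])
        mask = np.zeros(len(feats), dtype=bool)
        mask[np.flatnonzero(active)[fg]] = True
        masks.append(mask)
        active &= ~mask                      # mask out discovered patches
        if active.sum() < 2:
            break
    return masks
```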
In the second step, the researchers proposed the ImageCut2Video technique, which generates synthetic videos with corresponding mask trajectories from given images and their MaskCut masks. For unlabeled images, ImageCut2Video synthesizes videos together with pseudo-mask trajectories, allowing for unsupervised multi-task training. Objects, whether static or mobile, are resized, repositioned, and augmented across frames, producing dynamic trajectories, as illustrated below.
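The sketch below illustrates the paste-and-move idea behind ImageCut2Video under simplifying assumptions: it only translates one masked object within its own image across a few frames, whereas the actual method also resizes and augments objects. The function name and parameters are illustrative, not the paper's.

```python
import numpy as np

def imagecut2video_like(image, mask, num_frames=3, max_shift=40, seed=0):
    """Toy stand-in for ImageCut2Video: fake a video by translating a
    masked object across frames, emitting the mask trajectory alongside.

    image: (H, W, 3) uint8 source image.
    mask:  (H, W) boolean pseudo-mask (e.g., from MaskCut).
    """
    rng = np.random.default_rng(seed)
    H, W = mask.shape
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    ys, xs = np.nonzero(mask)
    frames, mask_track = [], []
    for t in range(num_frames):
        # Linearly interpolate the offset so the object "moves" over time.
        f = t / max(num_frames - 1, 1)
        oy, ox = int(dy * f), int(dx * f)
        valid = ((ys + oy >= 0) & (ys + oy < H) &
                 (xs + ox >= 0) & (xs + ox < W))
        ys2, xs2 = ys[valid] + oy, xs[valid] + ox
        frame = image.copy()
        frame[ys2, xs2] = image[ys[valid], xs[valid]]  # paste object pixels
        shifted = np.zeros_like(mask)
        shifted[ys2, xs2] = True                       # per-frame pseudo-mask
        frames.append(frame)
        mask_track.append(shifted)
    return frames, mask_track
```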
Lastly, the authors introduced an unsupervised video instance segmentation (VIS) model named VideoMask2Former, which is trained using these trajectories. This model is built on a residual network (ResNet50) and leverages 3D spatiotemporal features to predict pseudo-mask trajectories.
Training and evaluation of the proposed model
The proposed model was trained on more than one million unlabeled images taken from the ImageNet dataset. It demonstrates zero-shot unsupervised video instance segmentation across benchmarks: YouTube video instance segmentation (YouTubeVIS)-2019 and -2021, densely annotated video segmentation (DAVIS)-2017, and DAVIS-2017-Motion.
First, the Mask2Former (ResNet50) model was trained on ImageNet with MaskCut's pseudo-masks. Then, VideoMask2Former, initialized with these weights, was fine-tuned on the synthetic ImageNet videos, as summarized in the sketch below.
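Putting the pieces together, the two-stage recipe can be summarized as follows. Every function here is a no-op placeholder standing in for the real components (MaskCut, Mask2Former training, ImageCut2Video, fine-tuning); this is a structural sketch of the sequencing, not the authors' implementation.

```python
def maskcut(image):
    """Placeholder: MaskCut pseudo-mask generation for one image."""
    return []

def train_image_segmenter(images, pseudo_masks):
    """Placeholder: train Mask2Former (ResNet50) on MaskCut pseudo-masks."""
    return {"backbone": "resnet50", "stage": "image"}

def imagecut2video(image, masks):
    """Placeholder: synthesize a short video plus mask trajectories."""
    return {"frames": [], "trajectories": []}

def finetune_video_segmenter(videos, init_weights):
    """Placeholder: fine-tune VideoMask2Former from image-stage weights."""
    return dict(init_weights, stage="video")

def train_videocutler_like(images):
    # Stage 1: image-level training on pseudo-masks.
    pseudo = [maskcut(im) for im in images]
    image_weights = train_image_segmenter(images, pseudo)
    # Stage 2: video-level fine-tuning on synthetic trajectories,
    # initialized from the image-stage weights.
    videos = [imagecut2video(im, m) for im, m in zip(images, pseudo)]
    return finetune_video_segmenter(videos, image_weights)
```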
For the YouTubeVIS datasets, average precision (AP) and average recall (AR) were evaluated at ten intersection-over-union (IoU) thresholds ranging from 50 percent to 95 percent. Instance-level IoU is computed jointly over the spatial and temporal domains (see the illustrative computation below). For the DAVIS datasets, region measure (J) and boundary measure (F) scores are employed. These metrics were computed per instance and averaged for the final scores.
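As a concrete reading of the YouTubeVIS metric, the sketch below (an illustrative implementation, not the official evaluator) computes the spatiotemporal IoU between two mask tracks and lists the ten thresholds over which AP is averaged.

```python
import numpy as np

def spatiotemporal_iou(pred_track, gt_track):
    """IoU between two mask tracks, pooled over space and time.

    pred_track, gt_track: (T, H, W) boolean arrays, one mask per frame.
    Intersection and union are summed over all frames before dividing,
    so a prediction must align both spatially and temporally to score high.
    """
    inter = np.logical_and(pred_track, gt_track).sum()
    union = np.logical_or(pred_track, gt_track).sum()
    return inter / union if union > 0 else 0.0

# AP is averaged over ten IoU thresholds, 0.50 to 0.95 in steps of 0.05;
# a predicted track is a true positive at threshold t only if its
# spatiotemporal IoU with a matched ground-truth track exceeds t.
IOU_THRESHOLDS = np.arange(0.50, 1.00, 0.05)
```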
Experiments on YouTubeVIS underscore VideoCutLER's advantage over the optical flow-based methods, object-centric layered representation (OCLR) and motion grouping. On YouTubeVIS-2019, the proposed method achieves between 10x and 18x higher AP than OCLR, depending on the AP variant, and its performance advantage extends to YouTubeVIS-2021. Evaluations on DAVIS-2017 and DAVIS-2017-Motion highlight VideoCutLER's strength in segmenting both static and dynamic objects: despite DAVIS's focus on moving objects, the proposed model outperforms prior methods by approximately four percent in J, F, and J&F scores.
VideoCutLER performs strongly across multi-faceted evaluations on video instance segmentation benchmarks. The proposed model narrows the gap between unsupervised and supervised methods by improving instance discovery and tracking. Its pretraining value also carries over to label-efficient and fully supervised scenarios, where VideoCutLER consistently surpasses DINO by a notable margin, even with limited labeled data.
Hyperparameters and design decisions
The current study delves into the essential hyperparameters and design decisions for the VideoCutLER model. Initially, researchers conducted an ablation study on different factors. It was concluded that employing synthetic videos with three frames is the optimal approach for training an unsupervised video instance segmentation model.
Notably, increasing the frame count does not lead to further performance enhancements, aligning with previously reported observations. Various augmentation techniques, including adjustments in brightness, rotation, contrast, and random cropping, are employed as defaults during model training. The findings indicate that integrating these augmentations yields around a three percent performance improvement compared to using ImageCut2Video without any data augmentations.
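As an illustration, a default augmentation stack of this kind could be assembled with torchvision as below; the specific parameter values are assumptions, not the paper's settings, and for video training the same sampled parameters would need to be shared across all frames of a clip.

```python
import torchvision.transforms as T

# Illustrative defaults; the paper's exact settings are not reproduced here.
augment = T.Compose([
    T.RandomCrop(224, pad_if_needed=True),        # random cropping
    T.RandomRotation(degrees=15),                 # rotation
    T.ColorJitter(brightness=0.4, contrast=0.4),  # brightness and contrast
])

# Usage on a single PIL image:
# augmented = augment(pil_image)
```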
Conclusion
In summary, a simple unsupervised method was introduced for segmenting multiple instances within videos. The approach, VideoCutLER, operates without labels or motion-based cues such as optical flow. Interestingly, VideoCutLER does not require real videos for training; it generates training videos from natural ImageNet-1K images. Despite its simplicity, VideoCutLER surpasses models that use extra cues or real video data, achieving over 10x their performance on benchmarks such as YouTubeVIS. Moreover, VideoCutLER serves as a robust self-supervised pretrained model for supervised video instance segmentation, showing promise for diverse video recognition applications.
Journal reference:
- Preliminary scientific report.
Wang, X., Misra, I., Zeng, Z., Girdhar, R., and Darrell, T. (2023). VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation. arXiv. DOI: https://doi.org/10.48550/arXiv.2308.14710, https://arxiv.org/abs/2308.14710