A groundbreaking AI model, CoTracker3, dramatically improves video point tracking by eliminating complex components and effectively leveraging real-world data, outpacing existing technologies with superior performance and efficiency.
Research: CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint* server, researchers at Meta AI and the Visual Geometry Group at the University of Oxford introduced CoTracker3. This new point-tracking model simplified previous architectures and used a semi-supervised training method to improve performance on real videos. By generating pseudo-labels from off-the-shelf teacher models, it bridged the gap between synthetic and real video data. CoTracker3 achieved better results with significantly less real training data than competing trackers, offering both online and offline tracking options that handled visible and occluded points.
CoTracker3 specifically simplified components such as the global matching stage and used a multi-layer perceptron (MLP) for correlation feature processing, making it faster and more efficient than prior models.
Background
Point tracking is essential for video analysis tasks like three-dimensional (3D) reconstruction and video editing, as it helps recover precise correspondences between frames. Recent advances have been driven by learned trackers such as PIPs and by tracking any point (TAP)-Vid, which introduced a benchmark for the task and spurred improved tracking techniques.
While synthetic videos have been commonly used for training due to the challenges of annotating real videos, this has led to performance gaps between synthetic and real-world data. Bootstrap TAP with iterative refinement (BootsTAPIR) addressed this by training on large collections of unlabelled real videos, but its complex semi-supervised training methods still leave room for improvement.
The paper introduced CoTracker3, a simpler and more data-efficient point-tracking model. The authors removed components of previous trackers that proved unnecessary, such as the global matching stage, simplified correlation processing, and adopted a streamlined training process that significantly reduced the amount of real video data required. CoTracker3 outperformed state-of-the-art trackers, including BootsTAPIR, in both accuracy and data efficiency, addressing the complexity and scalability limitations of previous models.
Point Tracking and Semi-Supervised Training
CoTracker3 was designed for point tracking in videos. The task involved predicting the movement of a query point across video frames, along with its visibility and confidence levels. CoTracker3 improved upon existing models, using a semi-supervised training method that incorporated both synthetic and real unlabelled videos.
Unlike earlier approaches that relied heavily on synthetic data, CoTracker3 used pseudo-labeled real videos, where multiple teacher models (trained on synthetic data) generated labels for real video datasets. These labels trained a student model, benefiting from a larger, more diverse dataset and mitigating issues like distribution shifts between synthetic and real data. The use of random teacher models for each batch during training also helped prevent overfitting and promoted generalization, making the training process more robust.
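The batch-level teacher sampling described above can be illustrated with a minimal sketch. It is not the authors' training code: the teachers here are hypothetical constant-velocity stand-ins for frozen trackers trained on synthetic data, and the names make_constant_velocity_teacher and pseudo_label_batch are invented for illustration. Only the idea of drawing one teacher at random per batch and using its output as the training target comes from the article.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Track = List[Point]  # one (x, y) position per frame
Teacher = Callable[[Sequence[object], Point], Track]  # (frames, query) -> track


def make_constant_velocity_teacher(dx: float, dy: float) -> Teacher:
    """Hypothetical stand-in for a frozen tracker trained on synthetic data."""
    def teacher(video: Sequence[object], query: Point) -> Track:
        x, y = query
        # Move the query point by a fixed offset per frame.
        return [(x + dx * t, y + dy * t) for t in range(len(video))]
    return teacher


def pseudo_label_batch(
    videos: Sequence[Sequence[object]],
    queries: Sequence[Point],
    teachers: Sequence[Teacher],
    rng: random.Random,
) -> Tuple[int, List[Track]]:
    """Draw one teacher at random for this batch and label every video with it."""
    teacher_idx = rng.randrange(len(teachers))
    teacher = teachers[teacher_idx]
    labels = [teacher(video, query) for video, query in zip(videos, queries)]
    return teacher_idx, labels


# Toy usage: two unlabelled "videos" (frame contents are irrelevant here).
rng = random.Random(0)
teachers = [
    make_constant_velocity_teacher(1.0, 0.0),
    make_constant_velocity_teacher(0.0, 2.0),
]
videos = [[None] * 4, [None] * 3]
queries = [(10.0, 5.0), (0.0, 0.0)]
teacher_idx, pseudo_labels = pseudo_label_batch(videos, queries, teachers, rng)
# pseudo_labels would then serve as regression targets for the student tracker.
```

Because a different teacher may be drawn for each batch, the student never sees the systematic errors of a single teacher throughout training, which is one plausible reading of why the article credits this step with better generalization.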
The model had two versions, online and offline. The online version processed videos in a sliding window fashion, while the offline version tracked points in both forward and backward directions. The model used four-dimensional (4D) correlation features to locate the query point and a transformer to iteratively update tracks, confidence, and visibility.
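To make the sliding-window idea concrete, the sketch below splits a video into overlapping frame ranges that an online tracker could process one after another. The window length, the stride, and the function name sliding_windows are assumptions chosen for illustration; the article does not state the values CoTracker3 uses.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges that cover the whole video.

    Consecutive windows overlap by (window - stride) frames, so track
    estimates from one window can initialise the next.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride larger than window would skip frames")
    if num_frames <= 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return windows


# A 10-frame clip with 4-frame windows advanced by 2 frames each step.
print(sliding_windows(10, 4, 2))  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

An online tracker only needs the frames of the current window in memory, which is what allows it to run on a video stream, whereas an offline tracker has access to the full clip at once.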
Simplified compared to prior architectures, CoTracker3 removed the global matching stage and used a multi-layer perceptron (MLP) for feature processing, making it faster and more efficient. It outperformed previous models, even those trained with 1,000 times more data, and performed exceptionally well when fine-tuned with additional real videos.
Experimental Evaluation
The researchers evaluated CoTracker3 under several protocols and benchmarks, comparing it with state-of-the-art trackers and analyzing its behavior on occluded points. They tested on the TAP-Vid benchmark, comprising TAP-Vid-Kinetics, TAP-Vid-DAVIS, and red-green-blue (RGB)-Stacking, which together featured real and synthetic videos with complex camera motion and texture-less regions. The metrics were occlusion accuracy (OA), average Jaccard (AJ), and visible point tracking accuracy (δ_avg^vis). Additionally, the authors evaluated CoTracker3 on the RoboTAP benchmark for robotic manipulation tasks and the DynamicReplica benchmark for occluded point tracking.
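Two of the metrics named above are simple enough to sketch in a few lines. The code below is an illustrative approximation rather than the official evaluation script: it assumes the usual TAP-Vid convention of pixel thresholds 1, 2, 4, 8, and 16 for the visible-point accuracy, treats a point as correct when its error is strictly below the threshold, and omits AJ, which additionally combines position and visibility predictions. The function names are invented for this example.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

# Assumed pixel thresholds, following the TAP-Vid convention.
THRESHOLDS_PX = (1.0, 2.0, 4.0, 8.0, 16.0)


def occlusion_accuracy(pred_visible: Sequence[bool], gt_visible: Sequence[bool]) -> float:
    """Fraction of points whose predicted visibility flag matches the ground truth."""
    if not gt_visible:
        raise ValueError("no points to evaluate")
    correct = sum(p == g for p, g in zip(pred_visible, gt_visible))
    return correct / len(gt_visible)


def delta_avg_visible(
    pred: Sequence[Point],
    gt: Sequence[Point],
    gt_visible: Sequence[bool],
    thresholds: Sequence[float] = THRESHOLDS_PX,
) -> float:
    """Average, over thresholds, of the share of visible points within each threshold."""
    errors = [math.dist(p, g) for p, g, v in zip(pred, gt, gt_visible) if v]
    if not errors:
        raise ValueError("no visible ground-truth points")
    per_threshold = [sum(e < t for e in errors) / len(errors) for t in thresholds]
    return sum(per_threshold) / len(thresholds)


# Toy example: three points, the last one occluded in the ground truth.
pred = [(0.0, 0.0), (3.0, 0.0), (100.0, 100.0)]
gt = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
gt_visible = [True, True, False]
print(delta_avg_visible(pred, gt, gt_visible))                    # 0.8
print(occlusion_accuracy([True, False, False], gt_visible))       # 0.666...
```

In the toy example the occluded point is ignored by the position metric, the exact match passes every threshold, and the point that is 3 pixels off passes only the 4, 8, and 16 pixel thresholds, giving (0.5 + 0.5 + 1 + 1 + 1) / 5 = 0.8.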
CoTracker3 outperformed BootsTAPIR by a significant margin across all benchmarks despite using 1,000 times less data during training, demonstrating its superior data efficiency. On TAP-Vid and RoboTAP, it achieved state-of-the-art results, particularly in occlusion tracking, with the offline model outperforming the online one on specific datasets.
Ablation studies highlighted the benefits of using cross-track attention, which improved occluded point tracking by leveraging visible point positions. The researchers also found that training with pseudo-labeled real videos helped bridge the gap between synthetic and real data, further enhancing the model’s robustness.
The researchers also explored self-training, in which the model served as its own teacher and was refined on its own predictions used as pseudo-labels. Even this setup led to gains, reducing the domain gap between synthetic and real videos. Finally, scaling experiments showed that increasing the number of real videos boosted performance, although it plateaued after 30,000 videos. The findings demonstrated the importance of large-scale pseudo-labeling and cross-track attention in enhancing CoTracker3’s tracking capabilities.
Conclusion
In conclusion, CoTracker3 presented a significant advancement in point tracking by simplifying previous architectures and improving efficiency through semi-supervised training. By using pseudo-labels generated from multiple teacher models, it bridged the gap between synthetic and real data without relying on manually annotated real videos.
CoTracker3 outperformed state-of-the-art models, including those trained on much larger datasets, particularly in handling occluded points. Its ability to significantly outperform trackers like BootsTAPIR, despite using orders of magnitude less training data, underscores its data efficiency and scalability. Its online and offline versions provided flexibility for various tracking tasks, and the model's ability to jointly track points enhanced its utility in complex video analysis tasks such as three-dimensional (3D) tracking and dynamic reconstruction.
Journal reference:
- Preliminary scientific report.
Karaev, N., Makarov, I., Wang, J., Neverova, N., Vedaldi, A., & Rupprecht, C. (2024). CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos. ArXiv. https://arxiv.org/abs/2410.11831