A new era in video analysis: Salesforce's BLIP-3-Video cuts token usage dramatically, matching much larger state-of-the-art models while easing computational demands. Discover how fewer tokens can mean smarter, faster video understanding.
Research: xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
In an article recently submitted to the arXiv preprint* server, Salesforce researchers presented xGen-MM-Vid (BLIP-3-Video), a multimodal language model built on the xGen-MM (BLIP-3) framework and designed to capture temporal information across multiple video frames with high efficiency.
BLIP-3-Video pairs a novel ‘temporal encoder’ with a conventional visual tokenizer, allowing it to represent a video with as few as 32 visual tokens, compared with the thousands used by competing models.
The study explored various temporal encoders, including learnable spatiotemporal pooling and sequential models such as Token Turing Machines (TTMs). Detailed experiments showed that the choice of encoder had a noticeable impact on performance, particularly on complex video scenarios. The results confirmed that BLIP-3-Video achieved video question-answering accuracies comparable to much larger state-of-the-art models while being smaller and more efficient.
Related Work
Past work showed that large vision-language models (VLMs) excel in computer vision through extensive image-text training, with open-source models achieving notable results despite smaller sizes.
Video VLMs extend this to capture temporal information across multiple frames, employing techniques such as spatial/temporal pooling and separate video encoders.
However, approaches that keep all visual tokens from every frame quickly inflate the token count, making longer videos hard to process because the cost of attention grows quadratically with sequence length. BLIP-3-Video addresses this by optimizing token efficiency while maintaining accuracy.
BLIP-3-Video Model Overview
The BLIP-3-Video model architecture builds upon the existing image-based vision-language model, BLIP-3, by incorporating an explicit temporal encoder.
This architecture comprises four key components: a vision encoder (ViT) that processes each frame, a frame-level tokenizer that reduces the number of tokens, a temporal encoder that creates video-level token representations, and an autoregressive large language model (LLM) that generates text from the video tokens and the text prompt tokens.
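To make the dataflow concrete, the sketch below shows how these four stages might be wired together in PyTorch. It is a minimal illustration, not the released implementation: the module names, the injected sub-modules, and the way video tokens are concatenated with the prompt embeddings are all assumptions.

```python
import torch
import torch.nn as nn

class VideoVLMSketch(nn.Module):
    """Minimal sketch of a BLIP-3-Video-style pipeline (illustrative only)."""

    def __init__(self, vision_encoder, frame_tokenizer, temporal_encoder, llm):
        super().__init__()
        self.vision_encoder = vision_encoder      # ViT applied to each frame
        self.frame_tokenizer = frame_tokenizer    # e.g. perceiver-resampler, per frame
        self.temporal_encoder = temporal_encoder  # pools frame tokens into video tokens
        self.llm = llm                            # autoregressive language model

    def forward(self, frames, prompt_embeds):
        # frames: (batch, num_frames, 3, H, W)
        b, t = frames.shape[:2]
        patches = self.vision_encoder(frames.flatten(0, 1))        # (b*t, n_patch, d)
        frame_tokens = self.frame_tokenizer(patches)               # (b*t, n_frame_tok, d)
        frame_tokens = frame_tokens.unflatten(0, (b, t))           # (b, t, n_frame_tok, d)
        video_tokens = self.temporal_encoder(frame_tokens)         # (b, n_video_tok, d)
        # Video tokens are prepended to the text prompt embeddings for the LLM.
        return self.llm(torch.cat([video_tokens, prompt_embeds], dim=1))
```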
The model's efficiency is demonstrated by its ability to work with limited visual tokens, allowing for effective video analysis while maintaining a compact structure.
Initially, a pre-trained SigLIP (sigmoid loss for language-image pre-training) model serves as the vision encoder, processing one image frame at a time. A perceiver-resampler then maps each frame's visual tokens down to 128 tokens, independently per frame.
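A perceiver-resampler of this kind is typically a small stack of cross-attention layers in which a fixed set of learnable query tokens reads from a frame's patch tokens. The sketch below illustrates that idea under assumed sizes (729 patch tokens of width 1152 in, 128 tokens out, two layers); it is not the model's actual code.

```python
import torch
import torch.nn as nn

class PerceiverResamplerSketch(nn.Module):
    """Illustrative per-frame resampler: 128 learnable queries cross-attend to patch tokens."""

    def __init__(self, dim=1152, num_queries=128, num_heads=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, 729, dim) from the vision encoder, one frame at a time
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(q, patch_tokens, patch_tokens)  # queries read from the patch tokens
            q = q + out                                    # residual update of the queries
        return self.norm(q)                                # (batch, 128, dim)

# Example: one 384x384 frame encoded by SigLIP yields 729 tokens of width 1152.
frame_tokens = torch.randn(1, 729, 1152)
print(PerceiverResamplerSketch()(frame_tokens).shape)      # torch.Size([1, 128, 1152])
```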
Once the model has generated visual tokens for multiple frames, these tokens are fed into the temporal encoder, which consolidates the frame-level tokens into a manageable number of video tokens. The temporal encoder can rely on attentional pooling or on sequential memory mechanisms to maintain context, and the study explores several variants of both, which helps the model represent temporal information effectively.
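As a concrete picture of the sequential-model option, the following heavily simplified, Token-Turing-Machine-inspired sketch keeps a small token memory that is re-summarized after every frame, so the final memory acts as the video-level representation. The memory size, the single attention block used for the update, and the layer widths are all simplifications assumed for brevity.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pool a variable-length token set into a fixed number of output tokens."""

    def __init__(self, dim, num_out):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_out, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, tokens):                      # tokens: (batch, n, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out                                  # (batch, num_out, dim)

class SequentialMemoryEncoderSketch(nn.Module):
    """Simplified TTM-style temporal encoder: the memory is re-summarized after each frame."""

    def __init__(self, dim=1152, memory_tokens=32):
        super().__init__()
        self.memory = nn.Parameter(torch.zeros(memory_tokens, dim))   # initial memory
        self.write = AttentionPool(dim, memory_tokens)                # pools memory + new frame

    def forward(self, frame_tokens):                # (batch, num_frames, n_tok, dim)
        b, t = frame_tokens.shape[:2]
        mem = self.memory.unsqueeze(0).expand(b, -1, -1)
        for i in range(t):                          # process frames sequentially
            mem = self.write(torch.cat([mem, frame_tokens[:, i]], dim=1))
        return mem                                  # (batch, memory_tokens, dim) video tokens

video_feats = torch.randn(1, 8, 128, 1152)          # 8 frames x 128 tokens per frame
print(SequentialMemoryEncoderSketch()(video_feats).shape)  # torch.Size([1, 32, 1152])
```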
For computational efficiency, the model uniformly samples eight frames from each video. The vision encoder thus maps a video into 8 × 729 visual tokens, which the perceiver-resampler reduces to 8 × 128 tokens; the temporal encoder then compresses these further into just 16 to 128 video tokens. The Phi-3 LLM backbone processes the video tokens together with the text prompt tokens, allowing the model to generate coherent text outputs from the combined inputs.
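Assuming those figures, a quick back-of-the-envelope calculation shows how much shorter the visual input to the LLM becomes (the 32-token setting is used as the representative case):

```python
frames, patch_tokens, resampled, video_tokens = 8, 729, 128, 32

before_resampler = frames * patch_tokens   # 8 x 729 = 5832 tokens out of the vision encoder
after_resampler = frames * resampled       # 8 x 128 = 1024 tokens after the perceiver-resampler
after_temporal = video_tokens              # 16-128 after the temporal encoder; 32 is typical

print(before_resampler, after_resampler, after_temporal)   # 5832 1024 32
print(f"~{before_resampler / after_temporal:.0f}x fewer tokens reach the LLM")  # ~182x
```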
BLIP-3-Video employs a three-stage curriculum learning approach: image caption pretraining, video caption pretraining, and video instruction tuning.
During training, the vision encoder remains frozen, and the model utilizes pre-trained weights from BLIP-3, randomly initializing the temporal encoder weights.
The training stages draw on a range of video caption datasets and incorporate both open-ended and multiple-choice video question-answering formats to refine the model's capabilities, with hyperparameters adjusted at each stage to optimize performance.
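A minimal sketch of what such a curriculum could look like in training code is shown below, assuming hypothetical stage-wise data loaders and an HF-style model interface that returns a loss; the detail carried over from the article is that the vision encoder stays frozen while the rest of the model, including the randomly initialized temporal encoder, is updated.

```python
import torch

def freeze(module):
    """Keep pre-trained weights fixed (no gradient updates)."""
    for p in module.parameters():
        p.requires_grad = False

def train_blip3_video_sketch(model, stage_loaders):
    # model is assumed to expose .vision_encoder; everything else remains trainable.
    freeze(model.vision_encoder)                         # the ViT stays frozen throughout

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=2e-5)    # hyperparameters are placeholders

    # Three-stage curriculum: image captions -> video captions -> video instruction tuning.
    for stage_name, loader in stage_loaders:             # e.g. [("image_caption", loader1), ...]
        print(f"starting stage: {stage_name}")
        for batch in loader:
            loss = model(**batch).loss                   # assumes a forward() that returns a loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```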
Performance Evaluation and Benchmarks
The BLIP-3-Video model was implemented with several advanced architectural components, including a temporal encoder that enhances its video question-answering capabilities.
This model operates at an input resolution of 384 × 384 and utilizes a SigLIP encoder, which generates 729 tokens per frame, each with a channel size of 1152.
The perceiver-resampler, consisting of multiple cross-attention layers, feeds into the temporal encoder, enabling efficient processing of video data. TokenLearner serves as the spatiotemporal attentional pooling mechanism: it employs a multi-layer perceptron (MLP) for the attention function and scales its inner dimensions with the number of target tokens. When a grouped Token Turing Machine (TTM) is used as the sequential-model temporal encoder, the system maintains a total memory size of 512 tokens. These architectural choices yield substantial gains in processing speed while preserving accuracy relative to previous models.
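TokenLearner-style attentional pooling can be written compactly: a small MLP looks at every input token and predicts, for each of the S output tokens, a soft attention map over all inputs, and the outputs are the attention-weighted averages. The sketch below is an illustrative reimplementation of that idea rather than the paper's code, and the hidden width tied to the number of target tokens is an assumption.

```python
import torch
import torch.nn as nn

class TokenLearnerSketch(nn.Module):
    """Spatiotemporal attentional pooling: N input tokens -> S learned output tokens."""

    def __init__(self, dim=1152, num_out=32, hidden=None):
        super().__init__()
        hidden = hidden or 4 * num_out           # inner width scaled with the target token count
        self.attn_mlp = nn.Sequential(           # MLP predicting one attention map per output token
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_out),
        )

    def forward(self, tokens):
        # tokens: (batch, N, dim) -- here N = frames x tokens-per-frame, e.g. 8 x 128 = 1024
        attn = self.attn_mlp(tokens)             # (batch, N, S)
        attn = attn.softmax(dim=1)               # normalize over the N input positions
        return torch.einsum("bns,bnd->bsd", attn, tokens)   # (batch, S, dim)

pooled = TokenLearnerSketch()(torch.randn(1, 8 * 128, 1152))
print(pooled.shape)                              # torch.Size([1, 32, 1152])
```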
In the evaluation phase, the model's performance was assessed on several public video question-answering benchmarks, including MSVD-QA and NExT-QA. The experiments revealed that BLIP-3-Video achieved accuracy competitive with other state-of-the-art models despite using far fewer tokens.
Specifically, it demonstrated notable performance with 32 or 128 tokens while maintaining or exceeding the accuracy levels of larger models with higher token counts. These results suggest that a well-designed temporal encoder can significantly enhance question-answering capabilities without requiring excessive visual tokens, thereby streamlining the model's efficiency and effectiveness. Further tests highlighted the model's robustness across diverse datasets, indicating its adaptability.
Further analyses involved ablation studies comparing different temporal encoders and pooling strategies, revealing that even with just 32 tokens, the model performed well in video question-answering tasks.
The reduced visual tokens enhanced computational efficiency, allowing the model to process significantly more samples per second, highlighting its potential for practical video analysis applications. This efficiency is particularly crucial for real-time applications and processing longer video content.
Conclusion
To summarize, BLIP-3-Video was introduced as an efficient vision-language model for videos featuring 4 billion parameters. It incorporated a temporal encoder that enabled the abstraction of entire videos using only 16 or 32 tokens.
In contrast to other state-of-the-art video VLMs that required thousands of visual tokens, BLIP-3-Video achieved competitive performance with significantly fewer tokens.
The development of BLIP-3-Video demonstrates that token efficiency does not have to come at the expense of accuracy, offering a new direction for future video-based AI research.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Ryoo, M. S., et al. (2024). xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs. arXiv. DOI: 10.48550/arXiv.2410.16267, https://arxiv.org/abs/2410.16267