Dive into ProLIP's breakthrough approach in vision-language models—where uncertainty adds precision, and new probabilistic techniques unlock a richer, more accurate world of image-text relationships.
Image traversals with ProLIP. For each image, the [ROOT] caption that includes the image most is estimated using Equation (4); this [ROOT] caption is then interpolated with the retrieved caption, and the interpolation is compared against the HierarCaps ground truths. Red denotes cases where the estimated and GT roots differ. Research: Probabilistic Language-Image Pre-Training
In an article recently submitted to the arXiv preprint* server, researchers at Naver AI Lab introduced probabilistic language-image pre-training (ProLIP), a novel vision-language model that employs probabilistic objectives for embedding image-text pairs.
Using an efficient "uncertainty token" strategy, ProLIP estimates uncertainty without additional parameters and includes a new inclusion loss to better align image-text relationships. This efficient architecture avoids the computational overhead common in previous probabilistic models, making ProLIP highly scalable to large datasets, and the model achieved impressive results that demonstrate the efficacy of its probabilistic approach.
Background
Past work focused on developing vision-language models (VLMs) to create a joint embedding space for aligned image-text pairs, using deterministic representations that simplify real-world relationships.
Researchers highlighted the many-to-many nature of image-text matching, where multiple captions can describe a single image.
Previous models, such as contrastive language–image pre-training (CLIP), struggled to capture this diversity as they mapped inputs to deterministic points in Euclidean space. Efforts to enhance VLMs included methods to predict uncertainty and improve interpretability through novel loss functions.
Probabilistic Approach in VLMs
The architecture of ProLIP is designed to model inputs as Gaussian random variables with diagonal covariance, estimating mean and variance vectors from each input. Like CLIP, ProLIP employs separate visual and textual encoders, using a vision transformer (ViT) as the visual encoder and a Transformer as the textual encoder.
Previous probabilistic VLMs (PrVLMs) introduced extra parameters to estimate uncertainty, but this added complexity limited usability. Instead, ProLIP introduces an uncertainty token [UNC] alongside the class token [CLS], which requires negligible additional parameters.
The visual encoder processes [CLS] and [UNC] at the start, while the textual encoder uses [UNC] and [CLS] at the end, ensuring that both tokens align with the existing architecture. The [CLS] output serves as the mean, while the [UNC] output functions as the log of variance, with a linear layer projecting these into the final embedding space.
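To make the token mechanics concrete, the following is a minimal PyTorch sketch, not the authors' implementation, of how the [CLS] and [UNC] outputs of an encoder could be turned into the mean and log-variance of a diagonal Gaussian embedding; the module name, hidden and embedding dimensions, and the two projection layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    """Illustrative head: maps the encoder outputs for the [CLS] and [UNC]
    tokens to the mean and log-variance of a diagonal Gaussian embedding."""

    def __init__(self, hidden_dim: int = 768, embed_dim: int = 768):
        super().__init__()
        self.mean_proj = nn.Linear(hidden_dim, embed_dim)    # [CLS] output -> mean vector
        self.logvar_proj = nn.Linear(hidden_dim, embed_dim)  # [UNC] output -> log-variance vector

    def forward(self, cls_out: torch.Tensor, unc_out: torch.Tensor):
        mu = self.mean_proj(cls_out)         # (batch, embed_dim)
        log_var = self.logvar_proj(unc_out)  # kept in log space for numerical stability
        return mu, log_var

# Toy usage: pretend these are [CLS]/[UNC] outputs from a ViT-B/16 image encoder.
head = ProbabilisticHead()
cls_out, unc_out = torch.randn(4, 768), torch.randn(4, 768)
mu, log_var = head(cls_out, unc_out)
print(mu.shape, log_var.shape)  # torch.Size([4, 768]) torch.Size([4, 768])
```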
The probabilistic pairwise contrastive loss (PPCL) is introduced as the main objective function, improving on the probabilistic matching loss (PML) from PCME++ for stable training by employing a log-sigmoid loss rather than binary cross-entropy, which enhances stability and convergence when scaling up to large datasets.
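To illustrate this style of objective, the sketch below applies a SigLIP-style log-sigmoid loss to a closed-form distance between diagonal Gaussian embeddings, in the spirit of PCME++ (squared distance between means plus the total variance of both distributions). The distance, the scale a, the bias b, and all function names are assumptions for illustration, not the exact PPCL formulation.

```python
import torch
import torch.nn.functional as F

def gaussian_pairwise_distance(mu_img, logvar_img, mu_txt, logvar_txt):
    """Closed-form distance between diagonal Gaussians (PCME++-style):
    squared Euclidean distance between means plus the summed variances of both."""
    sq_dist = torch.cdist(mu_img, mu_txt, p=2) ** 2                               # (B, B)
    var_sum = logvar_img.exp().sum(-1, keepdim=True) + logvar_txt.exp().sum(-1)   # (B, B) via broadcasting
    return sq_dist + var_sum

def pairwise_sigmoid_loss(mu_img, logvar_img, mu_txt, logvar_txt, a=0.01, b=0.0):
    """Log-sigmoid pairwise loss: matched image-text pairs sit on the diagonal."""
    dist = gaussian_pairwise_distance(mu_img, logvar_img, mu_txt, logvar_txt)
    logits = -a * dist + b                                         # higher logit = more likely match
    labels = 2 * torch.eye(dist.size(0), device=dist.device) - 1   # +1 for positives, -1 for negatives
    return -F.logsigmoid(labels * logits).mean()

# Toy usage with random embeddings; a and b would normally be learnable scalars.
B, D = 8, 768
loss = pairwise_sigmoid_loss(torch.randn(B, D), torch.randn(B, D),
                             torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```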
Furthermore, an inclusion loss is proposed to ensure that learned uncertainties align more intuitively with human expectations. This novel objective function enforces that one random variable (Z1) is included within another (Z2) by emphasizing areas of high probability density.
The inclusion loss uses an asymmetrical measure to determine inclusion rather than traditional similarity measures like Kullback–Leibler (KL) divergence, allowing it to better represent human-aligned uncertainty by focusing on inclusion rather than dissimilarity.
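As an example of such an asymmetric measure, the sketch below scores "Z1 is included in Z2" by the expected log-density of Z2 over samples from Z1, which has a closed form for diagonal Gaussians and is high only when Z2 places density over Z1's high-probability region. This is an assumed stand-in for the paper's inclusion measure, and the direction enforced here (each image distribution included in its caption's distribution) follows the figure caption near the top of the article rather than a full specification of the loss.

```python
import math
import torch

def inclusion_score(mu1, logvar1, mu2, logvar2):
    """Asymmetric score for "Z1 is included in Z2": expected log-density of Z2
    under Z1, in closed form for diagonal Gaussians. It measures whether Z2
    covers Z1, rather than how dissimilar the two distributions are."""
    var1, var2 = logvar1.exp(), logvar2.exp()
    per_dim = -0.5 * (math.log(2 * math.pi) + logvar2 + (var1 + (mu1 - mu2) ** 2) / var2)
    return per_dim.sum(-1)  # sum over embedding dimensions

def inclusion_loss(mu_txt, logvar_txt, mu_img, logvar_img):
    """Encourage each image distribution to be included in its matched caption's
    distribution (a caption is generally less specific than the image it describes)."""
    return -inclusion_score(mu_img, logvar_img, mu_txt, logvar_txt).mean()

# Toy usage: narrow "image" Gaussians inside broader "caption" Gaussians.
mu_t, lv_t = torch.zeros(2, 4), torch.zeros(2, 4)
mu_i, lv_i = torch.zeros(2, 4), torch.full((2, 4), -2.0)
print(inclusion_loss(mu_t, lv_t, mu_i, lv_i).item())
```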
ProLIP also implements prompt tuning with uncertainty estimates, recognizing the potential of estimated uncertainty to enhance zero-shot classification (ZSC). By analyzing the suitability of text prompts based on their uncertainty, the model can improve performance through a Bayesian prompt re-weighting (BPRW) strategy, allowing for the optimization of prompt weights to better describe the corresponding image embeddings.
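The precise BPRW procedure is not spelled out in the article, but the idea of favoring low-uncertainty prompts when assembling a zero-shot classifier can be sketched as follows; the precision-style softmax weighting and all names here are illustrative assumptions rather than the paper's actual update rule.

```python
import torch

def certainty_weighted_class_embedding(prompt_means, prompt_logvars):
    """Combine several prompt embeddings for one class into a single class vector,
    down-weighting prompts whose estimated uncertainty (total variance) is high.
    Illustrative weighting scheme, not the paper's exact BPRW optimization."""
    total_var = prompt_logvars.exp().sum(-1)     # (num_prompts,) total variance per prompt
    weights = torch.softmax(-total_var, dim=0)   # more certain prompt -> larger weight
    return (weights.unsqueeze(-1) * prompt_means).sum(0)

# Toy usage: 3 prompt templates for one class (e.g., "a photo of a dog", "a dog", ...).
means, logvars = torch.randn(3, 768), torch.randn(3, 768)
class_embedding = certainty_weighted_class_embedding(means, logvars)
print(class_embedding.shape)  # torch.Size([768])
```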
ProLIP: Enhanced Image-Text Understanding
The experiments utilized the ViT-B/16 model as the image encoder and a 12-layer Transformer as the text encoder, with an embedding dimension of 768 and a context length of 64 tokens. The ProLIP model was implemented based on OpenCLIP and trained on the DataComp-1B dataset, requiring approximately one day on 32 NVIDIA H100 GPUs with bfloat16 precision. Specific bias values and parameters were set to initialize the model, and a portion of the image-text pairs was masked during training to enhance model performance.
The evaluation was conducted across 38 tasks from the DataComp evaluation suite, covering various categories, including ImageNet and VTAB tasks. With 1.28 billion samples seen, ProLIP demonstrated superior performance compared to CLIP across all metrics.
Notably, when trained with 12.8 billion seen samples, ProLIP achieved a high-performing PrVLM backbone, highlighting its effectiveness in zero-shot classification tasks. Detailed results showcase the performance across different datasets and demonstrate the benefits of using multiple prompts for each task.
Further analysis was conducted to understand the learned uncertainty in the model's predictions. Researchers observed that shorter, more general texts tend to carry higher uncertainty than longer, more specific captions, and a clear relationship was identified between text hierarchy levels and uncertainty values. The HierarImgs dataset was also constructed to explore visual uncertainty across different image hierarchies, confirming that lower-level images tended to be more uncertain than their higher-level counterparts.
Conclusion
To sum up, the work introduced ProLIP, a VLM that captures the inherent diversity in image-text relationships through probabilistic mappings and uncertainty estimation via an [UNC] token. The inclusion loss improved interpretability by enforcing distributional inclusion between image-text pairs and inputs. The experiments demonstrated ProLIP's effectiveness in zero-shot classification tasks while providing additional insights into input data uncertainty.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Chun, S., Kim, W., Park, S., & Yun, S. (2024). Probabilistic Language-Image Pre-Training. ArXiv. https://arxiv.org/abs/2410.18857