Supercharging CLIP with LLMs: A New Era for Multimodal AI

With a groundbreaking fine-tuning approach, researchers bridge text and vision models to set a new standard for cross-lingual and long-caption retrieval in multimodal AI.

LLM2CLIP Overview. After applying caption contrastive fine-tuning to the LLM, the increased textual discriminability enables more effective CLIP training. We leverage the open-world knowledge and general capabilities of the LLM to better process dense captions, addressing the previous limitations of the pretrained CLIP visual encoder and providing richer, higher-dimensional textual supervision. Experimental results demonstrate that LLM2CLIP can make any SOTA CLIP model even more SOTA than ever.

In an article recently submitted to the arXiv preprint* server, researchers at Microsoft Corporation and Tongji University explored LLM2CLIP, a novel framework that harnesses large language models (LLMs) to enhance the multimodal representation learning of the popular contrastive language–image pre-training (CLIP) model.

They fine-tuned the LLM in a novel way, applying contrastive learning directly in the caption space and adapting its attention mechanism from causal to bidirectional, which improved textual discriminability and enabled CLIP to handle more complex captions. This caption contrastive (CC) fine-tuning was critical in transforming the LLM into a discriminative text encoder, boosting the retrieval performance of the CLIP-based EVA-02 model by 16.5% and turning CLIP into a state-of-the-art cross-lingual model. Additionally, integrating LLM2CLIP into multimodal training outperformed the standard CLIP across nearly all benchmarks.

Related Work

Past work on CLIP highlighted its impact as a foundational model for multimodal tasks, enabling zero-shot retrieval, detection, and segmentation through rich image-caption alignments. Many researchers attempted to enhance CLIP by incorporating more robust language models, such as JinaCLIP and T5-V, though these prior models faced limitations in visual feature extraction and in handling longer token lengths. By leveraging Meta’s Llama 3, the LLM2CLIP framework addresses these challenges, improving the understanding of long and dense captions without compromising token length and significantly boosting performance.

Optimizing CLIP Performance

The methodology addresses the limitations LLMs face in multimodal representation learning for CLIP models. Initially, LLMs, despite their deep textual comprehension and broad world knowledge, showed poor performance as text encoders due to their generative training approach, which led to a lack of discriminability in their feature outputs.

To tackle this, the authors introduced a metric, the Microsoft Common Objects in Context (MS COCO) caption retrieval accuracy (CRA). They demonstrated that vanilla LLMs achieved only 18.4% CRA, significantly lower than the original CLIP model’s 66%, reinforcing the need for fine-tuning. Consequently, the authors developed the CC fine-tuning method, optimizing the LLM so that captions of the same image align more closely in its output space, enhancing text discriminability.
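To make the CRA idea concrete, the following minimal sketch (not from the paper) computes a top-1 caption-to-caption retrieval accuracy with PyTorch. The function name, embedding sizes, and the assumption of exactly one paired caption per image are illustrative placeholders; the authors’ exact protocol may differ.

```python
# Minimal sketch of a caption-retrieval-accuracy (CRA) style metric.
# Assumption (not from the article): each image i has two captions, whose
# embeddings sit in row i of `queries` and `gallery` respectively.
import torch
import torch.nn.functional as F

def caption_retrieval_accuracy(queries: torch.Tensor, gallery: torch.Tensor) -> float:
    """Top-1 accuracy of retrieving the other caption of the same image."""
    q = F.normalize(queries, dim=-1)   # unit vectors so the dot product is cosine similarity
    g = F.normalize(gallery, dim=-1)
    sims = q @ g.T                     # (N, N) caption-to-caption similarity matrix
    top1 = sims.argmax(dim=-1)         # index of the most similar gallery caption
    targets = torch.arange(q.size(0))  # caption i should retrieve caption i
    return (top1 == targets).float().mean().item()

# Toy usage with random tensors standing in for text-encoder outputs.
queries, gallery = torch.randn(1000, 768), torch.randn(1000, 768)
print(f"CRA (top-1): {caption_retrieval_accuracy(queries, gallery):.3f}")
```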

The fine-tuning method adapted the LLM by modifying its attention mechanism from causal to bidirectional and employing masked next token prediction (MNTP) for a robust initialization. This transformed the LLM into an effective encoder that pulls captions of the same image closer together while pushing apart captions of different images.
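The objective behind this adaptation can be pictured as a symmetric contrastive (InfoNCE-style) loss over paired captions of the same image. The snippet below is a hedged sketch rather than the authors’ code: the encoder that would produce these embeddings, an LLM with its causal mask replaced by bidirectional attention and its token states pooled into one vector per caption, is omitted, and the batch size, dimensions, and temperature are assumptions.

```python
# Hedged sketch of a caption contrastive objective: two captions of the same
# image are positives; captions of other images in the batch act as negatives.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired caption embeddings."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(a.size(0))  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage: random tensors stand in for pooled LLM caption embeddings.
emb_a, emb_b = torch.randn(32, 4096), torch.randn(32, 4096)
print(caption_contrastive_loss(emb_a, emb_b))
```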

Using the ShareCaptioner-modified Conceptual Captions 3 million (CC-3M) dataset alongside other text corpora such as Wikitext-103, they ensured the LLM retained both discriminability and versatile text-processing capabilities. With these adjustments, the CRA score of the fine-tuned Llama 3 jumped to 73%, surpassing state-of-the-art CLIP models trained on larger datasets.
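As a rough illustration of such a data mix, the sketch below interleaves dense caption pairs with plain-text passages so the model sees both kinds of supervision. The class name, sampling ratio, and data formats are hypothetical; the article does not describe the paper’s actual pipeline at this level of detail.

```python
# Highly simplified sketch of mixing dense caption pairs (for the contrastive
# objective) with plain text (to retain general text-processing ability).
import random
from torch.utils.data import Dataset

class MixedCaptionTextDataset(Dataset):
    """Yields either a (caption_a, caption_b) pair or a plain-text passage."""
    def __init__(self, caption_pairs, text_passages, caption_ratio=0.8):
        self.caption_pairs = caption_pairs   # e.g., ShareCaptioner-style dense captions
        self.text_passages = text_passages   # e.g., Wikitext-103 paragraphs
        self.caption_ratio = caption_ratio   # fraction of samples drawn from captions

    def __len__(self):
        return len(self.caption_pairs) + len(self.text_passages)

    def __getitem__(self, idx):
        # Randomly mix the two sources; a real pipeline would likely be deterministic.
        if random.random() < self.caption_ratio:
            a, b = random.choice(self.caption_pairs)
            return {"type": "caption_pair", "text_a": a, "text_b": b}
        return {"type": "plain_text", "text": random.choice(self.text_passages)}

# Toy usage with placeholder strings.
ds = MixedCaptionTextDataset(
    caption_pairs=[("A dog runs on grass.", "A brown dog sprints across a lawn.")],
    text_passages=["Wikitext-style prose used to keep general language skills."],
)
print(ds[0])
```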

Real examples of top-1 results from the caption-to-caption retrieval experiment. Before fine-tuning, Llama3’s results were often completely unrelated.

Enhancing CLIP with LLM2CLIP

The LLM2CLIP framework paired the fine-tuned LLM with CLIP’s visual encoder. The LLM was kept frozen during training to preserve its inherent knowledge and reduce training overhead, and lightweight adapter layers were added to optimize alignment, a design that minimized computational and memory costs. By pre-extracting the frozen LLM’s text features and employing low-rank adaptation (LoRA) training, LLM2CLIP achieved remarkable resource efficiency, even when applied to large models like Mistral-Nemo 12B and EVA ViT. This approach ensured performance gains in cross-modal tasks without significant computational expense.
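One way to picture this setup is a small trainable adapter sitting between cached text features from the frozen LLM and CLIP’s embedding space, trained with a standard CLIP-style contrastive loss while the visual side is updated. The sketch below reflects that reading rather than the authors’ implementation; the adapter architecture, feature dimensions, and variable names are assumptions, and the LoRA machinery is omitted.

```python
# Rough sketch of the alignment stage, under these assumptions (not spelled out
# in the article): LLM caption features are pre-extracted offline and loaded as
# tensors, a small adapter projects them into CLIP's embedding space, and the
# visual encoder is the component being updated with a CLIP-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAdapter(nn.Module):
    """Maps frozen-LLM caption features into the shared CLIP embedding space."""
    def __init__(self, llm_dim: int = 4096, clip_dim: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(llm_dim, clip_dim),
                                  nn.GELU(),
                                  nn.Linear(clip_dim, clip_dim))

    def forward(self, x):
        return self.proj(x)

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# One illustrative step: `image_features` stands in for the trainable visual
# encoder's output, `llm_text_features` for the offline cache of frozen-LLM features.
adapter = TextAdapter()
image_features = torch.randn(32, 1024, requires_grad=True)  # placeholder
llm_text_features = torch.randn(32, 4096)                   # placeholder, no gradient
loss = clip_style_loss(image_features, adapter(llm_text_features))
loss.backward()
```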

Key Training Insights and Evaluation

The experiments utilized several datasets to train and evaluate the LLM2CLIP model, assessing improvements in both image-text retrieval and caption generation tasks. The authors trained on CC-3M with dense captions and tested on datasets such as COCO 2014, Flickr30k (1K test set), and ShareGPT4V for short- and long-text retrieval. Comparisons with standard models such as OpenAI’s CLIP showed significant performance improvements, particularly when CC fine-tuning was applied: models trained with CC fine-tuning achieved higher CRA scores, demonstrating the effectiveness of training with augmented caption datasets.
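Retrieval results on such benchmarks are typically reported as recall@K. The short sketch below (not taken from the paper) shows one common way to compute it from image and text embeddings, assuming a single matching caption per image at the same index; COCO and Flickr30k actually pair several captions with each image, which requires extra bookkeeping.

```python
# Illustrative recall@K computation for image-to-text retrieval.
import torch
import torch.nn.functional as F

def recall_at_k(img_emb, txt_emb, ks=(1, 5, 10)):
    sims = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).T
    ranks = sims.argsort(dim=-1, descending=True)      # (N, N) captions ranked per image
    targets = torch.arange(img_emb.size(0)).unsqueeze(1)
    hits = ranks == targets                            # True where the correct caption sits
    return {k: hits[:, :k].any(dim=-1).float().mean().item() for k in ks}

# Toy usage with placeholder embeddings.
print(recall_at_k(torch.randn(1000, 1024), torch.randn(1000, 1024)))
```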

The study highlighted that directly replacing CLIP’s text encoder with a vanilla LLM, without fine-tuning, caused a significant degradation in retrieval performance, emphasizing the importance of enhancing the discriminability of the output features. With CC fine-tuning, the researchers significantly improved the model’s ability to match captions to images, surpassing existing models.

These results underscore how caption contrastive training and richly captioned data are essential for advancing CLIP-style systems. They confirm that the LLM2CLIP approach transforms pre-trained CLIP models, making them more effective than previous models trained on much larger datasets such as Laion-2B.

Conclusion

To sum up, this paper introduced LLM2CLIP as a pioneering method that leveraged the deep text comprehension of LLMs to enhance CLIP training. By tailoring the fine-tuning to address the LLM’s weak output discriminability, the method demonstrated its potential across a range of benchmarks. LLM2CLIP brings unique advantages, such as compatibility with long texts and the ability to integrate open-world knowledge into CLIP training.

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Journal reference:
  • Preliminary scientific report. Huang, W., Wu, A., Yang, Y., Luo, X., Yang, Y., Hu, L., Dai, Q., Dai, X., Chen, D., Luo, C., & Qiu, L. (2024). LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation. ArXiv. https://arxiv.org/abs/2411.04997

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

