Demystifying Vision-Language Models

In an article recently submitted to the arXiv* preprint server, researchers introduced vision-language models (VLMs), explaining their functionality, training processes, and evaluation methods. They addressed the challenges of mapping high-dimensional visual data to discrete language. The authors aimed to provide a foundational understanding of VLMs for newcomers, with a focus on image-to-language mapping and a discussion of extending these models to handle video data.

Study: Demystifying Vision-Language Models. Image Credit: MiniStocker/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Background

In recent years, large language models (LLMs) such as Llama and the Chat Generative Pre-trained Transformer (ChatGPT) have expanded to incorporate visual inputs, giving rise to VLMs. Despite these advancements, VLMs still encounter challenges in spatial understanding, attribute recognition, and content reliability.
This paper served as an introductory guide to VLMs, aiming to facilitate accessibility for newcomers, particularly students and researchers from other disciplines. It covered VLM functionality, training methodologies, and evaluation techniques, categorizing various training paradigms, exploring data curation strategies, and discussing contrastive and generative methods.

The authors emphasized the importance of robust evaluation, highlighting both strengths and limitations of current benchmarks. Ultimately, the paper sought to foster responsible development and address obstacles in VLM research, aiming to broaden participation and promote advancements in this dynamic field.

The Families of VLMs

VLMs have evolved significantly with various training paradigms and methodologies. Early works adapted transformer architectures like bidirectional encoder representations from transformers (BERT) to process visual data, leading to models like VisualBERT and vision-and-language BERT (ViLBERT), which excelled in associating textual and visual cues through attention mechanisms.

Contrastive-based VLMs, exemplified by contrastive language–image pre-training (CLIP), leveraged contrastive training to map images and captions to similar embedding vectors, achieving strong classification capabilities. Models like sigmoid loss for language image pre-training (SigLIP) and latent language image pre-training (Llip) extended CLIP, improving performance on smaller batch sizes and incorporating caption diversity.
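To make the contrastive objective concrete, the minimal sketch below (an illustration only, not the paper's or CLIP's actual code; the encoder outputs and temperature value are assumptions) shows the symmetric cross-entropy loss that pulls matching image-caption pairs together while pushing apart all other pairings in a batch.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss: matching image/caption pairs are pulled together,
    and every other pairing in the batch serves as a negative."""
    # Normalize so the dot product becomes a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by a temperature
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image in the batch matches the i-th caption
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs
image_embeddings = torch.randn(8, 512)
text_embeddings = torch.randn(8, 512)
print(clip_contrastive_loss(image_embeddings, text_embeddings))
```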

Additionally, masking objectives in VLMs, similar to denoising autoencoders, aimed to predict missing tokens or image patches, as seen in the foundational language and vision alignment model (FLAVA) and MaskVLM. These approaches reduced dependency on pre-trained models and enhanced the model's ability to preserve information through auto-encoding. Moreover, an information-theoretic view of VLM objectives framed them as solving a rate-distortion problem, reducing superfluous information while maximizing predictive information.

Generative-based VLMs focused on generating text and/or images, enabling tasks like image captioning and language-guided image editing. Models such as contrastive captioners (CoCa), CM3leon, and Chameleon demonstrated advancements in multimodal understanding and reasoning.

Generative classifiers like Stable Diffusion and Imagen, originally designed for image generation, could also perform discriminative tasks like classification and caption prediction efficiently.

VLMs leveraging pre-trained backbones, such as Frozen and the MiniGPT series, offered efficient solutions by connecting vision encoders to frozen language models or training only a linear projection layer. These approaches reduced computational costs while achieving strong performance in visual question answering and other tasks.
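As a rough sketch of this backbone-reuse idea (illustrative only; the module names, feature dimensions, and training setup are assumptions rather than the Frozen or MiniGPT code), a single trainable linear projection can map frozen vision features into the embedding space of a frozen language model, so that only the projection's parameters are updated.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Trainable linear layer mapping frozen vision features into the
    token-embedding space expected by a frozen language model."""
    def __init__(self, vision_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features):
        # vision_features: (batch, num_patches, vision_dim)
        return self.proj(vision_features)

# Hypothetical frozen vision-encoder output: ViT-style patch features
projector = VisionToLLMProjector()
vision_features = torch.randn(2, 196, 768)
visual_tokens = projector(vision_features)   # (2, 196, 4096)

# These projected "visual tokens" would be prepended to the text embeddings
# fed to the frozen LLM; only the projector's parameters receive gradients.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
```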

Guide to VLM Training

Training VLMs involved various approaches, each with distinct advantages and limitations. Contrastive models, such as CLIP, matched text and visual representations in a shared space, offering a simple training paradigm and improved grounding. However, they were not suitable for generating images or detailed image descriptions and required large datasets and significant computational resources.

Masking techniques reconstructed masked images and text, allowing individual example consideration and the use of smaller mini-batches without negative examples. While this method eliminated batch dependency, it might introduce inefficiencies due to the need for a decoder.
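A minimal sketch of this masking approach (illustrative only; the masking ratio, the zero-fill stand-in for a mask token, and the reconstruction target are assumptions, not the FLAVA or MaskVLM recipe) hides a random subset of image patches or text tokens and scores the model only on reconstructing the hidden positions, so each example supplies its own training signal without in-batch negatives.

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(patches, model, mask_ratio=0.6):
    """Hide a random subset of patches (or tokens) and measure how well
    the model reconstructs the hidden ones."""
    batch, num_patches, dim = patches.shape

    # Randomly choose which positions to mask for each example
    mask = torch.rand(batch, num_patches, device=patches.device) < mask_ratio

    # Replace masked positions with zeros (a simple stand-in for a [MASK] embedding)
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)

    # The model predicts the original content from the corrupted input
    reconstruction = model(corrupted)

    # The loss is computed only on the masked positions
    return F.mse_loss(reconstruction[mask], patches[mask])

# Toy usage with a linear layer standing in for a real encoder-decoder
model = torch.nn.Linear(128, 128)
patches = torch.randn(4, 196, 128)
print(masked_reconstruction_loss(patches, model))
```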

Generative models employed diffusion or autoregressive methods to generate images from text prompts, aiding in understanding learned representations and joint distribution learning. These models, though, were more computationally expensive than contrastive models.
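For the autoregressive flavor of generative training, the core objective reduces to next-token prediction over caption tokens conditioned on the image, as in the hedged sketch below (shapes and names are assumptions and the decoder is left abstract; this is not any specific model's implementation).

```python
import torch
import torch.nn.functional as F

def captioning_loss(logits, caption_ids):
    """Next-token prediction: logits at position t are scored against the
    ground-truth caption token at position t + 1."""
    # logits: (batch, seq_len, vocab_size) from a decoder that also attends
    # to image features; caption_ids: (batch, seq_len)
    shifted_logits = logits[:, :-1, :]     # predictions for positions 0..T-2
    shifted_targets = caption_ids[:, 1:]   # targets are the next tokens
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_targets.reshape(-1),
    )

# Toy usage with random logits standing in for a decoder's output
vocab_size, batch, seq_len = 1000, 2, 16
logits = torch.randn(batch, seq_len, vocab_size)
caption_ids = torch.randint(0, vocab_size, (batch, seq_len))
print(captioning_loss(logits, caption_ids))
```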

Using pre-trained LLMs as a backbone was efficient when resources were limited, as it only required learning the mapping between text and vision representations. However, this approach could suffer from hallucinations or biases inherent in the pre-trained models.

Improving grounding involved strategies like bounding box annotations, which incorporated box regression and intersection over union (IoU) loss to accurately locate and align visual concepts with text. Negative captioning, similar to contrastive objectives, used negative samples to enhance generalization and discriminative feature learning, improving model performance.
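To illustrate the box-regression side of grounding (a sketch under the assumption of axis-aligned boxes in (x1, y1, x2, y2) format; it is not tied to any particular detector or to the paper's implementation), an IoU-based loss rewards predicted boxes that overlap the annotated region for a phrase.

```python
import torch

def iou(box_a, box_b):
    """Intersection over union for axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1 = torch.max(box_a[..., 0], box_b[..., 0])
    y1 = torch.max(box_a[..., 1], box_b[..., 1])
    x2 = torch.min(box_a[..., 2], box_b[..., 2])
    y2 = torch.min(box_a[..., 3], box_b[..., 3])

    intersection = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[..., 2] - box_a[..., 0]) * (box_a[..., 3] - box_a[..., 1])
    area_b = (box_b[..., 2] - box_b[..., 0]) * (box_b[..., 3] - box_b[..., 1])
    return intersection / (area_a + area_b - intersection + 1e-6)

def iou_loss(pred_boxes, target_boxes):
    # Perfect overlap gives IoU = 1, so the loss is 1 - IoU
    return (1.0 - iou(pred_boxes, target_boxes)).mean()

# Toy usage: one predicted box for a grounded phrase versus its annotation
pred = torch.tensor([[10.0, 10.0, 50.0, 50.0]])
target = torch.tensor([[12.0, 8.0, 48.0, 52.0]])
print(iou_loss(pred, target))
```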

Data quality and curation were crucial for optimal VLM performance. This involved creating diverse, balanced datasets with high-quality captions and reduced duplicates. Data pruning, through heuristics, bootstrapping, and diversity strategies, eliminated low-quality pairs. Synthetic data generation and augmentation further enhanced training data quality. Software tools such as OpenCLIP and transformers facilitated VLM evaluation and comparison. Hyperparameters, including image resolution, visual encoder capacity, and pretraining data, significantly impacted model performance.
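One common pruning heuristic along these lines (sketched below with a hypothetical clip_similarity scorer; the threshold and function names are illustrative assumptions, not values or tools from the paper) is to drop image-caption pairs whose embedding similarity falls below a cutoff and to discard duplicate captions.

```python
def prune_pairs(pairs, clip_similarity, threshold=0.28):
    """Keep image-caption pairs whose similarity clears a quality threshold
    and drop exact duplicate captions. `clip_similarity` is a hypothetical
    callable returning a cosine similarity score."""
    kept, seen_captions = [], set()
    for image, caption in pairs:
        if caption in seen_captions:
            continue  # skip duplicate captions
        if clip_similarity(image, caption) < threshold:
            continue  # drop low-quality or mismatched pairs
        seen_captions.add(caption)
        kept.append((image, caption))
    return kept

# Toy usage with a stand-in similarity function
fake_similarity = lambda image, caption: 0.4 if "cat" in caption else 0.1
pairs = [
    ("img1.jpg", "a cat on a sofa"),
    ("img2.jpg", "zzz"),
    ("img3.jpg", "a cat on a sofa"),  # duplicate caption
]
print(prune_pairs(pairs, fake_similarity))  # only the first pair survives
```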

Approaches for Responsible VLM Evaluation

The evaluation of VLMs encompassed various methods, including image captioning, visual question answering (VQA), and zero-shot image classification. Metrics like bilingual evaluation understudy (BLEU) and recall-oriented understudy for gisting evaluation (ROUGE) assessed caption quality, while VQA tasks tested the model's ability to answer image-related questions. Zero-shot classification measured performance on tasks the model was not explicitly trained for, while benchmarks like Winoground examined visio-linguistic compositional reasoning. Dense captioning tasks offered more detailed image understanding.
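A zero-shot classification evaluation can be sketched as follows (illustrative only; the text encoder is a placeholder callable and the prompt template is an assumption): each class name is embedded via a caption prompt, and the class whose prompt is most similar to the image embedding is predicted.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb, class_names, text_encoder,
                       template="a photo of a {}"):
    """Predict the class whose prompt embedding is closest to the image
    embedding. `text_encoder` is a placeholder mapping a string to a tensor."""
    prompts = [template.format(name) for name in class_names]
    text_embs = torch.stack([text_encoder(p) for p in prompts])

    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)

    similarities = text_embs @ image_emb        # one score per class
    return class_names[similarities.argmax().item()]

# Toy usage with a deterministic stand-in text encoder
def fake_text_encoder(prompt):
    torch.manual_seed(sum(ord(c) for c in prompt))  # deterministic per prompt
    return torch.randn(512)

classes = ["cat", "dog", "car"]
image_embedding = fake_text_encoder("a photo of a cat")  # pretend the encoders align
print(zero_shot_classify(image_embedding, classes, fake_text_encoder))  # -> "cat"
```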

Synthetic data evaluations controlled scene elements, and hallucination assessments aimed to prevent false information generation. Addressing memorization concerns, methods like text randomization reduced overfitting. Red teaming efforts aimed to identify and mitigate undesirable model outputs, focusing on risks like privacy and bias. These evaluations ensured VLMs' reliability across various applications, driving improvements in model performance and safety.

Conclusion

In conclusion, the field of VLMs has rapidly advanced, offering a rich array of approaches from contrastive to generative methods. Challenges persist, including data quality, computational costs, and model reliability. Evaluation benchmarks and responsible training practices are essential for progress. Despite challenges, VLMs hold immense potential for revolutionizing AI-driven applications, making continued exploration and development imperative.


Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python, and has worked on group projects involving Computer Vision, Image Classification, and App Development.
