Unlocking Multimodal Understanding: Qwen-VL Models for Vision-Language Interaction

In a recent paper submitted to the arXiv* server, researchers introduced the Qwen vision-language (Qwen-VL) series, a family of large VL models engineered to jointly perceive and understand both text and images.

Study: Unlocking Multimodal Understanding: Qwen-VL Models for Vision-Language Interaction. Image credit: Blue Planet Studio/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Background

Large language models (LLMs) have gained considerable attention for their impressive text generation and comprehension capabilities. Fine-tuning enables these models to align with user intent, showcasing their capacity as intelligent assistants. However, their limitation lies in handling modalities beyond text, such as images and speech. To overcome this, large vision-language models (LVLMs) have been developed to bridge the gap between textual and visual information.

Advancements in VL Learning

Recent developments have placed a strong focus on vision-language learning. The contrastive captioners (CoCa) model introduces an encoder-decoder structure for image-text tasks, while the OFA model casts diverse tasks in a unified sequence-to-sequence format. Vision-language representation models aim to learn robust joint representations, yet challenges remain in robustness, generalization, and in-context abilities. LVLMs address these issues by building on powerful language models. The proposed Qwen-VL model integrates a wide range of tasks and delivers remarkable performance.

Architecture of Qwen-VL

The Qwen-VL series models are the latest addition to the open-source Qwen series. This series comprises two variants: Qwen-VL and Qwen-VL-Chat. The model Qwen-VL augments the Qwen-7B LLM with visual capabilities through a visual encoder. The resulting model is trained in three stages and gains the ability to understand visual cues across various scales. Furthermore, Qwen-VL-Chat enhances interaction by leveraging alignment mechanisms, supporting multiple image inputs, multi-round dialogues, and localization.
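
Because Qwen-VL-Chat is presented as an interactive model, a brief inference sketch may help make this concrete. It assumes the publicly released Qwen/Qwen-VL-Chat checkpoint on Hugging Face and its custom trust_remote_code chat interface (from_list_format, model.chat); the image path and questions are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released chat variant; the custom multimodal interface lives in the
# repository's remote code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image with a text question; the helper builds the tagged prompt.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},  # placeholder path or URL to an example image
    {"text": "Describe this image and locate the main object."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# A follow-up turn reuses the returned history, giving multi-round dialogue.
response, history = model.chat(tokenizer, "What color is the main object?", history=history)
print(response)
```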

The Qwen-VL network has three components: an LLM, a visual encoder, and a position-aware VL adapter. Qwen-VL employs a large LLM as its foundation, initialized with pre-trained Qwen-7B weights. The visual encoder uses the vision transformer (ViT) architecture, initialized with pre-trained weights from OpenCLIP's ViT-bigG model. Input images are resized to a designated resolution during both training and inference, and the visual encoder divides them into patches using a 14-pixel stride, producing a sequence of image features.
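
As a quick illustration of the patch arithmetic above, the 14-pixel stride determines how many raw image features the encoder emits before any compression; the specific resolutions below are illustrative examples, not the paper's exact settings.

```python
# Patch arithmetic for a ViT-style encoder with a 14-pixel stride (illustrative).
def num_patches(height: int, width: int, patch: int = 14) -> int:
    """Number of non-overlapping patches, and hence raw image features."""
    return (height // patch) * (width // patch)

print(num_patches(224, 224))  # 16 * 16 = 256 raw patch features
print(num_patches(448, 448))  # 32 * 32 = 1024 raw patch features at higher resolution
```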

To tackle efficiency issues arising from long sequences of image features, Qwen-VL introduces a vision-language adapter designed for compression. This adapter consists of a randomly initialized single-layer cross-attention module. It employs trainable vectors as queries and the visual encoder's image features as keys, and the cross-attention condenses the sequence of visual features to a fixed length of 256. To preserve positional information that matters for fine-grained image understanding, 2D absolute positional encodings are incorporated into the cross-attention query-key pairs, mitigating the risk of losing positional detail during compression.
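
A minimal PyTorch sketch of such a position-aware resampler is given below; the hidden size, head count, and the way the 2D positional encodings are produced are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PositionAwareResampler(nn.Module):
    """Compresses a variable-length image-feature sequence to a fixed number of
    tokens via single-layer cross-attention with trainable queries (sketch)."""

    def __init__(self, dim: int = 1024, num_queries: int = 256, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # trainable query vectors
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats, pos_q, pos_k):
        # image_feats: (batch, seq_len, dim) features from the visual encoder
        # pos_q, pos_k: assumed 2D absolute positional encodings for queries and keys
        batch = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1) + pos_q
        k = image_feats + pos_k
        compressed, _ = self.attn(query=q, key=k, value=image_feats)
        return compressed  # (batch, 256, dim), ready to be fed to the LLM
```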

The compressed image-feature sequence, now fixed at a length of 256, is then fed into the LLM. Because the visual encoder and adapter always produce sequences of the same length, image inputs can be handled uniformly. To distinguish image features from text features, special tokens are added at the start and end of the image-feature sequence, marking the beginning and end of the image content.
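
The splicing of image features into the text stream can be pictured as below; the boundary embeddings and argument names are purely illustrative.

```python
import torch

def splice_image_features(text_before, img_start, image_feats, img_end, text_after):
    """Concatenate text embeddings with the fixed-length image features, wrapped in
    start/end boundary embeddings (all arguments: (seq_len, hidden_dim) tensors)."""
    return torch.cat([text_before, img_start, image_feats, img_end, text_after], dim=0)
```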

Training and Evaluation of Qwen-VL

The Qwen-VL model's training process unfolds across three distinct stages: two pre-training phases and a conclusive instruction-guided fine-tuning phase.

The dataset consists of 1.4 billion weakly labeled image-text pairs drawn from web repositories and in-house data, of which around 77 percent are in English and the remainder in Chinese. In the first pre-training phase, the objective is to minimize the loss on text tokens. Input images are resized to a fixed resolution, the LLM is kept frozen, and only the vision encoder and VL adapter are optimized.
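
A simplified sketch of this first stage is given below; the component names, the LLM's interface for accepting image features, and the hyperparameters are placeholders rather than the paper's actual code.

```python
import torch
import torch.nn.functional as F

def configure_stage1(llm, vision_encoder, vl_adapter):
    """Freeze the LLM; only the vision encoder and VL adapter are optimized."""
    for p in llm.parameters():
        p.requires_grad = False
    trainable = list(vision_encoder.parameters()) + list(vl_adapter.parameters())
    return torch.optim.AdamW(trainable, lr=2e-4, weight_decay=0.05)  # illustrative values

def stage1_step(llm, vision_encoder, vl_adapter, optimizer, images, input_ids, labels):
    """One step on a weakly labeled image-text pair: minimize the loss on text tokens."""
    image_feats = vl_adapter(vision_encoder(images))             # compressed image features
    logits = llm(input_ids=input_ids, image_feats=image_feats)   # placeholder LLM interface
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)   # image positions masked with -100
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```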

The input for the second pre-training phase is high-quality VL annotation data with higher-resolution images, together with interleaved image-text data. Qwen-VL is trained on seven tasks concurrently, and the visual encoder's input resolution is increased to mitigate information loss. The entire model is trained with the AdamW optimizer, and model parallelism is employed for the ViT and the LLM.
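
In contrast to the first stage, a second-stage setup would unfreeze everything; the sketch below reuses the placeholder components from above, with illustrative AdamW hyperparameters.

```python
import torch

def configure_stage2(llm, vision_encoder, vl_adapter):
    """Unfreeze the entire model (ViT, adapter, and LLM) for multi-task training."""
    params = []
    for module in (vision_encoder, vl_adapter, llm):
        for p in module.parameters():
            p.requires_grad = True
            params.append(p)
    return torch.optim.AdamW(params, lr=1e-4, betas=(0.9, 0.98),
                             weight_decay=0.05)  # illustrative hyperparameters
```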

The final stage is supervised fine-tuning. Here, the pre-trained Qwen-VL model is fine-tuned via instruction guidance, producing the interactive Qwen-VL-Chat model. Multi-modal instruction-tuning data is sourced from captioning data and LLM self-instructed dialogue. To broaden comprehension and interaction abilities, additional dialogue data is manually annotated, incorporating multi-image comprehension and localization capabilities; the resulting training set encompasses 350k entries. For multi-image dialogue, image identifiers are introduced, and training uses the ChatML format, with statement boundaries marked by special tokens. During this phase, the visual encoder is frozen, and optimization targets the LLM and adapter modules.
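
An illustrative construction of the ChatML-style multi-image format mentioned above is shown below; the exact tag and identifier strings are assumptions based on the ChatML convention the format follows.

```python
# Build an illustrative ChatML-style prompt for a multi-image turn.
images = ["street.jpg", "city_map.png"]  # placeholder image paths

image_block = "\n".join(
    f"Picture {i + 1}: <img>{path}</img>" for i, path in enumerate(images)
)
prompt = (
    "<|im_start|>user\n"                 # special token marking the start of a turn
    f"{image_block}\n"
    "Do these two pictures show the same location?"
    "<|im_end|>\n"                       # special token marking the end of the turn
    "<|im_start|>assistant\n"
)
print(prompt)
```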

The proposed models were comprehensively evaluated across various traditional vision-language tasks such as image captioning, visual question answering (VQA), image understanding, and instruction-following in real-world settings. The models consistently outperform earlier benchmark models, surpassing generalist models with larger parameter counts across a range of tasks.

In summary, the Qwen-VL series, a suite of large-scale multilingual VL models, excels across diverse benchmarks and enables multilingual conversation, multi-image interaction, Chinese grounding, and fine-grained recognition.

Written by

Dr. Sampath Lonka

Dr. Sampath Lonka is a scientific writer based in Bangalore, India, with a strong academic background in Mathematics and extensive experience in content writing. He has a Ph.D. in Mathematics from the University of Hyderabad and is deeply passionate about teaching, writing, and research. Sampath enjoys teaching Mathematics, Statistics, and AI to both undergraduate and postgraduate students. What sets him apart is his unique approach to teaching Mathematics through programming, making the subject more engaging and practical for students.

