In a recent paper submitted to the arXiv* server, researchers introduced the Qwen vision-language (Qwen-VL) series, a family of large-scale VL models engineered to perceive and understand both text and images.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Large language models (LLMs) have gained considerable attention for their impressive text generation and comprehension capabilities. Fine-tuning enables these models to align with user intent, showcasing their capacity as intelligent assistants. However, they are limited to text and cannot handle other modalities, such as images and speech. To overcome this, large vision-language models (LVLMs) have been developed to bridge the gap between textual and visual information.
Advancements in VL Learning
Recent work has placed a strong focus on vision-language learning. The contrastive captioners (CoCa) model introduces an encoder-decoder structure for image-text tasks, while the OFA model recasts tasks in a sequence-to-sequence framework. Vision-language representation models aim to learn robust joint representations, yet challenges remain in robustness, generalization, and in-context ability. LVLMs address these issues by building on powerful language models. The proposed Qwen-VL integrates a wide range of such tasks and delivers strong performance.
Architecture of Qwen-VL
The Qwen-VL series models are the latest addition to the open-source Qwen series. This series comprises two variants: Qwen-VL and Qwen-VL-Chat. The model Qwen-VL augments the Qwen-7B LLM with visual capabilities through a visual encoder. The resulting model is trained in three stages and gains the ability to understand visual cues across various scales. Furthermore, Qwen-VL-Chat enhances interaction by leveraging alignment mechanisms, supporting multiple image inputs, multi-round dialogues, and localization.
The three components of the Qwen-VL network are an LLM, a visual encoder, and a position-aware VL adapter. Qwen-VL employs a large LLM as its foundational component, initialized with pre-trained Qwen-7B weights. The visual encoder uses the vision transformer (ViT) architecture, initialized with pre-trained weights from OpenCLIP's ViT-bigG model. Input images are resized to a designated resolution in both training and inference. The visual encoder divides each image into patches using a 14-pixel stride, producing a sequence of image features.
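To make this concrete, the following is a minimal sketch of ViT-style patch embedding with a 14-pixel stride. The 224-pixel input resolution and the 1,664-dimensional feature width (typical of OpenCLIP ViT-bigG) are illustrative assumptions rather than details quoted from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of ViT-style patch embedding with a 14-pixel stride.
# The 224-pixel input and 1664-dim features (typical of OpenCLIP ViT-bigG)
# are illustrative assumptions, not values quoted from the released code.
class PatchEmbed(nn.Module):
    def __init__(self, patch_size=14, embed_dim=1664):
        super().__init__()
        # A strided convolution cuts the image into non-overlapping
        # 14x14 patches and projects each patch to an embedding vector.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                  # (B, 3, H, W)
        x = self.proj(images)                   # (B, C, H/14, W/14)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, C)

features = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(features.shape)                           # torch.Size([1, 256, 1664])
```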
To tackle the efficiency issues caused by long sequences of image features, Qwen-VL introduces a vision-language adapter that compresses them. The adapter consists of a randomly initialized single-layer cross-attention module. It uses a set of trainable vectors as queries and the image features from the visual encoder as keys. This cross-attention condenses the visual feature sequence to a fixed length of 256. Because positional information is important for fine-grained image understanding, 2D absolute positional encodings are incorporated into the cross-attention query-key pairs, mitigating the loss of positional detail during compression.
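A hedged sketch of such a cross-attention adapter is shown below: 256 learnable queries attend to the visual features and return a fixed-length sequence. The head count and the simple learned positional table are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

# Hedged sketch of the position-aware adapter: 256 learnable queries
# cross-attend to the image features and return a fixed-length sequence.
# The head count and the simple learned positional table are assumptions.
class Resampler(nn.Module):
    def __init__(self, dim=1664, num_queries=256, num_heads=16, grid=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # One absolute positional embedding per location of the visual
        # feature grid (e.g. 32x32 when a 448-pixel image is used).
        self.pos = nn.Parameter(torch.randn(grid * grid, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feats):               # (B, N, dim) encoder output
        b, n, _ = img_feats.shape
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        k = img_feats + self.pos[:n].unsqueeze(0)   # inject positional info into keys
        out, _ = self.attn(query=q, key=k, value=img_feats)
        return out                              # (B, 256, dim), fed to the LLM

compressed = Resampler()(torch.randn(1, 1024, 1664))
print(compressed.shape)                         # torch.Size([1, 256, 1664])
```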
The outcome of this compression is a shortened image feature sequence of length 256, which is then fed into the LLM. Together, the visual encoder and adapter turn every image into a fixed-length sequence of image features. To distinguish image input from text input, special tokens are added at the start and end of the image feature sequence, marking the beginning and end of the image content.
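The sketch below illustrates how such an input sequence might be assembled: the 256 compressed image features are bracketed by start and end markers and spliced in among ordinary text embeddings. The helper function and tensor shapes are hypothetical stand-ins for the real tokenizer and embedding pipeline.

```python
import torch

# Hypothetical helper showing how the compressed image features could be
# spliced into the text embedding sequence between start/end markers.
# All tensors here are dummies standing in for real tokenizer/embedding output.
def build_inputs(text_before, img_start, image_feats, img_end, text_after):
    # Every argument is a (seq_len, dim) embedding tensor; img_start and
    # img_end are single-row embeddings of the special marker tokens.
    return torch.cat([text_before, img_start, image_feats, img_end, text_after], dim=0)

dim = 4096
seq = build_inputs(torch.randn(5, dim),    # "Describe this picture:"
                   torch.randn(1, dim),    # image-start marker
                   torch.randn(256, dim),  # adapter output
                   torch.randn(1, dim),    # image-end marker
                   torch.randn(3, dim))    # trailing text
print(seq.shape)                           # torch.Size([266, 4096])
```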
Training and Evaluation of Qwen-VL
The Qwen-VL model's training process unfolds across three distinct stages: two pre-training phases and a final instruction-guided fine-tuning phase.
The first-stage dataset consists of 1.4 billion weakly labeled image-text pairs drawn from web repositories and in-house data, of which around 77 percent are in English and the remainder in Chinese. In this first pre-training phase, the objective is to minimize the loss over text tokens. Input images are resized to a standard resolution, the LLM is kept frozen, and only the vision encoder and VL adapter are optimized.
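As a rough illustration of this setup, the snippet below freezes the language model, keeps the visual encoder and adapter trainable, and uses the standard cross-entropy loss over text tokens. Module names and the learning rate are placeholders, not values taken from the paper.

```python
import torch

# Rough illustration of the stage-1 setup: the LLM stays frozen while the
# visual encoder and adapter are optimized against the usual cross-entropy
# loss over text tokens. Module names and the learning rate are placeholders.
def configure_stage1(llm, visual_encoder, adapter):
    for p in llm.parameters():
        p.requires_grad = False                        # keep the LLM static
    trainable = list(visual_encoder.parameters()) + list(adapter.parameters())
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # illustrative value
    # Loss is computed only on text tokens; image positions are masked out.
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)
    return optimizer, loss_fn
```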
The second pre-training phase uses high-quality, fine-grained VL annotation data at a higher input resolution, together with interleaved image-text data. In this phase, Qwen-VL is trained on seven tasks concurrently. The visual encoder's input resolution is increased to mitigate information loss, the entire model is trained with the AdamW optimizer, and model parallelism is applied to the ViT and LLM.
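Raising the input resolution changes the number of patches the ViT sees (for example, from a 16x16 grid to a 32x32 grid), so its positional embeddings must be adapted. One common remedy, assumed here rather than quoted from the paper, is bicubic interpolation of the positional table:

```python
import torch
import torch.nn.functional as F

# Going from a 16x16 to a 32x32 patch grid means the ViT's positional
# embeddings no longer match. Bicubic interpolation of the positional table
# is a common remedy, assumed here rather than quoted from the paper.
def resize_pos_embed(pos, old_grid=16, new_grid=32):
    dim = pos.shape[-1]
    pos = pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    pos = F.interpolate(pos, size=(new_grid, new_grid),
                        mode="bicubic", align_corners=False)
    return pos.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)

new_pos = resize_pos_embed(torch.randn(256, 1664))
print(new_pos.shape)                                 # torch.Size([1024, 1664])
```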
The final stage is supervised fine-tuning. Here, the pre-trained Qwen-VL model is fine-tuned with instruction data, producing the interactive Qwen-VL-Chat model. The multi-modal instruction-tuning data is drawn from captioning data and dialogue generated through LLM self-instruction. To broaden comprehension and interaction abilities, additional dialogue data covering multi-image comprehension and localization is manually annotated, bringing the training set to 350,000 entries. For multi-image dialogue, image identifiers are introduced to distinguish the inputs. Training employs the ChatML format, in which special tokens mark the boundaries of each statement, as sketched below. During this phase, the visual encoder is frozen, and optimization targets the LLM and adapter modules.
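For illustration, a multi-image user turn in ChatML style might look like the following. The "Picture N:" identifiers and the <img>...</img> markers follow the paper's description, while the exact special-token spellings are an assumption.

```python
# Illustrative multi-image user turn in ChatML style. The "Picture N:"
# identifiers and <img>...</img> markers follow the paper's description;
# the exact special-token spellings are an assumption.
prompt = (
    "<|im_start|>user\n"
    "Picture 1: <img>cat.jpeg</img>\n"
    "Picture 2: <img>dog.jpeg</img>\n"
    "What is the difference between the two pictures?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(prompt)
```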
The proposed models were comprehensively evaluated on a range of conventional vision-language tasks, including image captioning, visual question answering (VQA), image understanding, and instruction-following grounded in real-world behavior. The models consistently outperform earlier baselines and surpass generalist models with larger parameter counts across a variety of tasks.
In summary, the Qwen-VL series is a suite of large-scale multilingual VL models that excels across diverse benchmarks and enables multilingual conversation, multi-image interaction, Chinese grounding, and fine-grained recognition.