Pixtral 12B integrates advanced vision encoding and text processing to set new benchmarks in multimodal AI, excelling in both image analysis and natural language tasks while maintaining flexibility and high accuracy across various real-world applications.
Complete Pixtral architecture. Pixtral has two components: a vision encoder, which tokenizes images, and a multimodal decoder, which predicts the next text token given a sequence of text and images. Pixtral can take an arbitrary number of images as input, provided they fit within its 128K-token context window.
In an article recently submitted to the arXiv preprint* server, researchers introduced Pixtral 12 billion (B), a 12-B-parameter multimodal language model (MLM) that excels at understanding images and natural language. It outperforms larger models on multimodal (MM) benchmarks and retains strong language capabilities.
Pixtral featured a new vision encoder designed to process images at their native resolution and aspect ratio, using techniques such as break tokens and two-dimensional relative rotary position encodings (ROPE-2D), and it supported long context windows. Additionally, the authors presented an open-source benchmark, the MM multi-turn benchmark (MM-MT-Bench), for evaluating vision-language models.
Background
MLMs have become increasingly important for integrating text and visual data in a unified framework. Previous models in this field, such as Llama-3 and Qwen2-VL, have made strides in combining image and text understanding, particularly in tasks like image captioning and MM question answering.
However, these models often trade text capability against image-processing capability: MM models tend to sacrifice accuracy on text-only tasks. In addition, evaluation protocols have not been standardized across different models.
Pixtral 12B addressed these gaps by introducing a highly capable MLM that excelled in both MM and text-only tasks without sacrificing performance in either. It featured a highly flexible vision encoder capable of processing images at their native resolution and aspect ratio, offering support for variable image sizes depending on task requirements. Break tokens allowed the model to efficiently handle image patches with different aspect ratios, while the ROPE-2D encoding enabled it to seamlessly process images at both high and low resolutions.
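To make the break-token idea concrete, the Python sketch below splits an image at its native resolution into a row-major patch sequence and inserts a marker after each row of patches, so the aspect ratio can be recovered from the flat token sequence. The 16x16 patch size and the [IMG_BREAK]/[IMG_END] marker names are simplifications for illustration, not the model's exact implementation.

```python
import numpy as np

PATCH = 16  # assumed patch size for this illustration


def tokenize_image(image: np.ndarray):
    """Split an image of arbitrary (native) resolution into a flat patch
    sequence, inserting a break marker after each row of patches so the
    original aspect ratio can be reconstructed from the sequence.

    Entries are either patch arrays or the illustrative string markers
    "[IMG_BREAK]" / "[IMG_END]".
    """
    h, w, _ = image.shape
    rows, cols = h // PATCH, w // PATCH  # any remainder is cropped for simplicity
    tokens = []
    for r in range(rows):
        for c in range(cols):
            patch = image[r * PATCH:(r + 1) * PATCH, c * PATCH:(c + 1) * PATCH]
            tokens.append(patch)
        tokens.append("[IMG_BREAK]")  # marks the end of one patch row
    tokens[-1] = "[IMG_END]"          # final marker closes the image sequence
    return tokens


# Example: a 512x256 image yields a 32x16 patch grid plus one marker per row.
dummy = np.zeros((512, 256, 3), dtype=np.uint8)
print(len(tokenize_image(dummy)))  # 32*16 patches + 32 markers = 544
```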
Pixtral 12B outperformed models of similar and larger sizes, demonstrating superior MM reasoning and text-based performance. This paper also highlighted issues with current evaluation metrics and prompts, proposing solutions to standardize evaluation methods and introduce flexible parsing strategies. This approach ensures models are not unfairly penalized for providing substantively correct answers in different formats. Additionally, Pixtral introduced the MM-MT-Bench to assess MM performance in more practical, real-world scenarios, filling a critical gap in MM model evaluations.
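The sketch below shows what such a flexible parsing strategy can look like in practice: a free-form model response is normalized before comparison with the reference answer, so a verbose but substantively correct reply is not marked wrong. The flexible_match helper and its specific normalization rules are illustrative assumptions, not the paper's exact evaluation code.

```python
import re


def flexible_match(model_answer: str, reference: str) -> bool:
    """Accept a response if the normalized reference appears in the
    normalized model answer (an illustrative 'flexible parsing' metric)."""
    def normalise(text: str) -> str:
        text = text.strip().lower()
        text = re.sub(r"^(the answer is|answer:)\s*", "", text)  # drop lead-ins
        text = re.sub(r"[^\w\s.%-]", "", text)                   # drop punctuation
        return text.strip()

    return normalise(reference) in normalise(model_answer)


# A verbose but correct response is accepted under flexible parsing:
print(flexible_match("The answer is (B) 42.", "42"))  # True
print(flexible_match("It is roughly 41.", "42"))      # False
```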
Architecture Details
Pixtral 12B integrated a vision encoder and an MM decoder for high-level reasoning on both images and text. The model was built on Mistral Nemo 12B, a 12-B parameter decoder-only language model.
The vision encoder, Pixtral vision transformer (PixtralViT), was specifically designed to process images of variable resolutions and aspect ratios. It incorporated several novel features, including break tokens to manage images with different aspect ratios, gated feedforward networks, and ROPE-2D for efficient and flexible image processing.
These features enhanced the model's ability to handle MM tasks, outperforming traditional models optimized for standard resolutions like ImageNet. Pixtral's ability to handle variable image sizes sets it apart from other models that are often constrained to fixed resolutions during training.
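As a rough illustration of the ROPE-2D idea, the PyTorch sketch below applies rotary position encoding to image patch tokens using their two-dimensional grid coordinates, rotating one half of the feature dimension by the row index and the other half by the column index. This frequency layout is a simplification and may differ from Pixtral's released implementation.

```python
import torch


def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary position embedding over the last dimension of x.
    x: (..., seq, dim) with dim even; pos: (seq,) integer positions."""
    dim = x.shape[-1]
    freqs = torch.pow(base, -torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos[:, None].float() * freqs[None, :]  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """Encode 2-D patch positions: the first half of the feature dimension
    is rotated by the row index, the second half by the column index."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows),
                      rope_1d(x[..., half:], cols)], dim=-1)


# Example: 512 patch tokens from a 32x16 grid, 64-dimensional features.
grid = torch.stack(torch.meshgrid(torch.arange(32), torch.arange(16),
                                  indexing="ij"), dim=-1).reshape(-1, 2)
tokens = torch.randn(grid.shape[0], 64)
print(rope_2d(tokens, grid[:, 0], grid[:, 1]).shape)  # torch.Size([512, 64])
```

Because the rotation depends only on relative row and column offsets between patches, the same encoding applies whether the image is large or small, which is what allows a single model to handle variable resolutions.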
The vision encoder outputs were transformed and linked to the MM decoder via a two-layer, fully connected network. The architecture allowed for seamless handling of image and text tokens, enabling Pixtral to excel in tasks requiring complex MM reasoning, such as multi-turn and multi-image conversations.
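A minimal sketch of such a projection is shown below. The layer sizes (1024-dimensional vision features, 5120-dimensional decoder embeddings) and the GELU activation are assumptions for illustration, not the released model's exact configuration.

```python
import torch
import torch.nn as nn


class VisionLanguageAdapter(nn.Module):
    """Two-layer fully connected network that maps vision-encoder outputs
    into the decoder's embedding space (dimensions are illustrative)."""

    def __init__(self, vision_dim: int = 1024, text_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (num_patches, vision_dim) -> (num_patches, text_dim); the projected
        # tokens are then interleaved with text-token embeddings in the decoder.
        return self.proj(patch_embeddings)


adapter = VisionLanguageAdapter()
print(adapter(torch.randn(512, 1024)).shape)  # torch.Size([512, 5120])
```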
Benchmarking MM Instruction Models
MM-MT-Bench was designed to evaluate MM models' ability to follow instructions, particularly for real-world applications like extraction, summarization, and reasoning from images.
The benchmark contained 92 conversations involving five categories of images: charts, tables, portable document format (PDF) documents, diagrams, and miscellaneous images. These conversations simulated practical use cases, assessing models' performance in multi-turn dialogues. An independent judge evaluated each model's responses for correctness and completeness on a scale of 1 to 10.
This benchmark was built on the MT-Bench used for text-only models but extended to MM tasks. Each conversation included reference answers for previous turns, allowing for comprehensive assessments across different conversation lengths. Scores on MM-MT-Bench showed a strong correlation with human preferences, achieving a Pearson correlation of 0.91 with LMSys-Vision ELO ratings, which further validated the benchmark's practical utility.
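For reference, the reported agreement is an ordinary Pearson correlation between per-model benchmark scores and ELO ratings. The sketch below computes it on invented numbers purely to show the calculation; the values are not data from the paper.

```python
import numpy as np

# Hypothetical scores for five models: MM-MT-Bench judge ratings (1-10 scale)
# and LMSys-Vision ELO ratings. These numbers are made up for illustration.
mm_mt_bench = np.array([6.1, 5.4, 5.0, 6.7, 4.2])
elo_ratings = np.array([1111, 1050, 1020, 1150, 980])

# Pearson correlation between the two rankings.
pearson_r = np.corrcoef(mm_mt_bench, elo_ratings)[0, 1]
print(f"Pearson correlation: {pearson_r:.2f}")
```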
Evaluations showed that Pixtral 12B significantly outperformed open-source models of similar size and some closed-source models in MM tasks. Its strong performance held even with flexible parsing metrics, proving its reliability in following prompts.
This flexibility in handling explicit and vague prompts was crucial for maintaining high accuracy in tasks requiring both textual and visual understanding. MM-MT-Bench proved valuable for testing models designed for real-world scenarios beyond the simpler multiple-choice question-answering tasks typically used in MM evaluations.
Conclusion
In conclusion, the introduction of Pixtral 12B marked a significant advancement in MLMs, demonstrating exceptional capabilities in both multimodal and text-only tasks.
By integrating a novel vision encoder with an MM decoder, Pixtral 12B excelled in understanding and processing images and natural language simultaneously.
It outperformed both open-source and larger closed-source models, addressing common challenges in MM understanding, including variable image resolutions and long context windows.
The introduction of the MM-MT-Bench benchmark provided a standardized evaluation framework for assessing model performance in real-world scenarios, ensuring reliable results. This effort to standardize evaluation through the use of explicit prompts and flexible parsing made Pixtral 12B a robust and versatile model for practical applications. Qualitative examples showcased Pixtral's effectiveness in complex tasks, such as chart analysis and multi-image instruction following.
With its versatile architecture and strong instruction-following capabilities, Pixtral 12B is poised to enhance various applications and significantly contribute to the field of MM artificial intelligence (AI).
The model is released under the Apache 2.0 license, promoting further research and development.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.