MM-Vet: Benchmarking Multimodal AI with Comprehensive Visual-Language Abilities

In a recent paper submitted to the arXiv* server, researchers introduced MM-Vet, a comprehensive benchmark designed to evaluate complex multimodal tasks using large multimodal models (LMMs).

Study: MM-Vet: Benchmarking Multimodal AI with Comprehensive Visual-Language Abilities. Image credit: Jamie Jin/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Background

The emergence of large language models (LLMs) has introduced a new era of artificial intelligence (AI) capable of addressing a wide array of intricate natural language tasks, often approaching human-level performance. Building on this progress, LMMs seek to enhance general intelligence further by incorporating multimodal inputs into their architectures. Given that more than 80% of human cognition, learning, and activities are influenced by visual information, LMMs are beginning this exploration by integrating visual capabilities. One trajectory of LMM research involves augmenting LLMs with comprehensive visual comprehension through end-to-end fine-tuning.

Another avenue explores the modular fusion of LLMs with image-to-text vision-language (VL) models. The availability of powerful open-source LLMs, such as Large Language Model Meta AI (LLaMA), has led to the creation of several open-sourced LMMs. These investigations demonstrate the potential to address complex multimodal tasks such as commonsense reasoning, open-world recognition, and scene text understanding.

However, despite qualitative demonstrations of LMM capabilities, systematically evaluating these intricate multimodal tasks remains challenging and calls for a quantitative evaluation benchmark. Existing benchmarks focus primarily on simpler VL tasks, whereas MM-Vet assesses complex tasks that require integrating core VL capabilities. To handle diverse question types and answer formats, the study introduces an LLM-based evaluator as the metric for open-ended model outputs, ensuring a comprehensive evaluation that covers both factual accuracy and textual quality.

Related work

Exploring multimodal intelligence, researchers integrate VL models for the joint comprehension and generation of visual and textual signals. Driven by the success of LLMs, they delve into LMMs to address intricate tasks that blend diverse VL capabilities. Some works extend LLMs with multi-sensory abilities, as observed in Frozen, Flamingo, PaLM-E, and generative pre-trained transformers (GPT)-4. Open-sourced LMMs such as OpenFlamingo, the large language and vision assistant (LLaVA), MiniGPT-4, Otter, and InstructBLIP facilitate various studies.

Multimodal agents link vision tools with LLMs for integrated skills. While classic VL benchmarks target specific abilities, modernized benchmarks such as MM-Vet tackle complex multimodal tasks, providing insights beyond rankings. MM-Vet employs an open-ended LLM-based evaluator for various response styles and questions, extending techniques from natural language processing (NLP).

MM-Vet

The primary goal is to construct a comprehensive multimodal benchmark that mirrors real-world scenarios an AI agent could encounter. The benchmark revolves around six core VL capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset comprises 187 images and 205 questions with diverse question types and open-ended response demands. Human-annotated answers are available for 155 questions, while 50 questions use internet-derived responses. Supplementary images from various sources enrich the dataset. Evaluation relies on a GPT-4-based evaluator that uses few-shot prompts to score model responses. Scores range from 0 to 1, enabling a unified metric across capabilities and their integrations while accommodating diverse response styles, question types, and problem sets. GPT-4 generates each score from the input question, the ground-truth answer, and the model output, and aggregated scores offer insights into the performance of each capability or capability integration.
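To make the idea of an LLM-based scorer concrete, the minimal Python sketch below prompts a judge model with the question, the ground-truth answer, and the model's output, and asks it to return a correctness score between 0 and 1. The prompt text, helper name, and use of the OpenAI client are illustrative assumptions for this article, not the exact few-shot prompt or code released with MM-Vet.

```python
# Minimal sketch of an LLM-as-judge scorer in the spirit of MM-Vet's GPT-4 evaluator.
# The prompt below is a placeholder; MM-Vet uses its own few-shot examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Compare the ground truth and the prediction from an AI model, and rate the "
    "prediction's correctness with a single number between 0.0 and 1.0. "
    "Reply with the score only.\n"
)

def score_sample(question: str, ground_truth: str, prediction: str) -> float:
    """Ask the judge LLM for a 0-1 correctness score on one open-ended answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": (
                f"{JUDGE_PROMPT}"
                f"Question: {question}\n"
                f"Ground truth: {ground_truth}\n"
                f"Prediction: {prediction}\n"
                f"Score:"
            ),
        }],
    )
    return float(response.choices[0].message.content.strip())
```

Because the judge returns a continuous score rather than an exact-match verdict, the same metric can handle short factual answers and long free-form explanations alike.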

Study results

Experimental setup: The MM-Vet experiment evaluates two categories of LMMs: models tuned end-to-end (e.g., OpenFlamingo, LLaVA) and methods that equip LLMs with external tools (e.g., multimodal reasoning and action (MM-ReAct), Transformers Agent). The scores generated by GPT-4 range from 0 to 1, and each sample is evaluated five times to account for score variability. The reported performance averages the scores for capabilities and their integrations, taking the variances into consideration.
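This scoring protocol can be summarized in a short, self-contained sketch: each question is judged several times, the repeated judgements are averaged, and a question contributes its average score to every capability it requires. The data layout and function names below are hypothetical conveniences for illustration, not the paper's released evaluation code.

```python
# Illustrative aggregation of per-sample judge scores into capability-level metrics,
# assuming each question is tagged with the capabilities it requires and has been
# scored several times (e.g., five) by the judge LLM.
from collections import defaultdict
from statistics import mean

def aggregate(samples):
    """samples: list of dicts such as
    {"capabilities": ["ocr", "math"], "scores": [0.8, 1.0, 0.9, 1.0, 0.9]}.
    Returns the overall average and a per-capability breakdown, each in [0, 1]."""
    per_sample = [mean(s["scores"]) for s in samples]  # average the repeated judgements
    by_capability = defaultdict(list)
    for sample, avg in zip(samples, per_sample):
        for cap in sample["capabilities"]:
            by_capability[cap].append(avg)  # a sample counts toward every capability it needs
    return mean(per_sample), {cap: mean(v) for cap, v in by_capability.items()}

# Example with made-up scores:
demo = [
    {"capabilities": ["recognition"], "scores": [1.0, 1.0, 0.9, 1.0, 1.0]},
    {"capabilities": ["ocr", "math"], "scores": [0.4, 0.5, 0.4, 0.5, 0.4]},
]
overall, per_capability = aggregate(demo)
print(round(overall, 2), {k: round(v, 2) for k, v in per_capability.items()})
```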

Analysis of results: Results are reported for each method across individual capabilities and their integrations, covering recognition, OCR, knowledge, language generation, spatial awareness, and math.

  • Recognition: LLaVA-13B (LLaMA-2) is the top-performing model in this category, likely owing to its vision model's substantial training data volume and the strength of its language model.
  • OCR: MM-ReAct-GPT-4 performs exceptionally well in OCR, aided by external OCR tools. LLaVA-13B (LLaMA-2) stands out among the tuned models, utilizing CLIP ViT-L/14 and extensive image-OCR data.
  • Knowledge: MM-ReAct-GPT-4 excels in knowledge-related tasks, benefiting from its robust LLM backbone and external knowledge tools.
  • Language Generation: MM-ReAct-GPT-4 and LLaVA-13B (LLaMA-2) exhibit strong performance in language generation, leveraging their powerful language models.
  • Spatial Awareness: MM-ReAct-GPT-4 demonstrates superior spatial awareness capabilities, capitalizing on dense captioning and OCR tools to provide detailed location information.
  • Math: MM-ReAct-GPT-4 stands out in math tasks due to its PAL math tool.
  • Capability Integrations: MM-ReAct-GPT-4 achieves the highest scores in numerous capability integrations. Google Bard outperforms in select capabilities and integrations. The LLM-based evaluation proves effective in assessing LMM predictions, while GPT-4 shows minimal discrepancies compared to human annotations.

Comparison with Bard: Bard achieves high scores across multiple capabilities and integrations, and MM-ReAct-GPT-4 competes closely, highlighting the potential of external tools. Open-source LMMs show promise but still leave considerable room for improvement, and it remains unclear which vision encoder is superior. GPT-4 proves effective for open-ended evaluation, while the current top methods reach only around 50% overall, indicating the need for more capable LMMs or tool-enhanced solutions.

Conclusion

In summary, the current study introduced the MM-Vet benchmark, designed to evaluate the combined vision-language abilities of LMMs. A new multimodal dataset was carefully crafted, emphasizing the integration of various capabilities, and the assessment employs an open-ended, LLM-based evaluator. Evaluating a range of LMMs with MM-Vet highlights the need for improved integrated capabilities, given that current leading models achieve only approximately 50% of the maximum score.


Journal reference:
Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., & Wang, L. (2023). MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv. https://arxiv.org/abs/2308.02490

Written by

Dr. Sampath Lonka

Dr. Sampath Lonka is a scientific writer based in Bangalore, India, with a strong academic background in Mathematics and extensive experience in content writing. He has a Ph.D. in Mathematics from the University of Hyderabad and is deeply passionate about teaching, writing, and research. Sampath enjoys teaching Mathematics, Statistics, and AI to both undergraduate and postgraduate students. What sets him apart is his unique approach to teaching Mathematics through programming, making the subject more engaging and practical for students.

