MM-Vet: Benchmarking Multimodal AI with Comprehensive Visual-Language Abilities

In a recent paper submitted to the arXiv* server, researchers introduced MM-Vet, a comprehensive benchmark designed to evaluate complex multimodal tasks using large multimodal models (LMMs).

Study: MM-Vet: Benchmarking Multimodal AI with Comprehensive Visual-Language Abilities. Image credit: Jamie Jin/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Background

The emergence of large language models (LLMs) has introduced a new era of artificial intelligence (AI) capable of addressing a wide array of intricate natural language tasks, often approaching human-level performance. Building on this progress, LMMs seek to enhance general intelligence further by incorporating multimodal inputs into their architectures. Given that more than 80% of human cognition, learning, and activities are influenced by visual information, LMMs are beginning this exploration by integrating visual capabilities. One trajectory of LMM research involves augmenting LLMs with comprehensive visual comprehension through end-to-end fine-tuning.

Another avenue explores the modular fusion of LLMs with image-to-text vision-language (VL) models. The availability of powerful open-source LLMs, such as Large Language Model Meta AI (LLaMA), has led to the creation of several open-sourced LMMs. These investigations demonstrate the potential to address complex multimodal tasks such as commonsense reasoning, open-world recognition, and scene text understanding.

However, despite qualitative demonstrations of LMM capabilities, systematically evaluating these intricate multimodal tasks remains challenging and calls for a quantitative evaluation benchmark. Existing benchmarks focus primarily on simpler VL tasks, whereas MM-Vet assesses complex tasks that require integrating core VL capabilities. To handle diverse question types and answer formats, the study introduces an LLM-based evaluator as the metric for open-ended model outputs, ensuring a comprehensive evaluation that covers both factual accuracy and textual quality.

Related work

Exploring multimodal intelligence, researchers integrate VL models for the joint comprehension and generation of visual and textual signals. Driven by the success of LLMs, they delve into LMMs to address intricate tasks that blend diverse VL capabilities. Some works extend LLMs with multi-sensory abilities, as observed in Frozen, Flamingo, PaLM-E, and generative pre-trained transformers (GPT)-4. Open-sourced LMMs such as OpenFlamingo, the large language and vision assistant (LLaVA), MiniGPT-4, Otter, and InstructBLIP facilitate various studies.

Multimodal agents link vision tools with LLMs for integrated skills. While classic VL benchmarks target specific abilities, modernized benchmarks such as MM-Vet tackle complex multimodal tasks, providing insights beyond rankings. MM-Vet employs an open-ended LLM-based evaluator for various response styles and questions, extending techniques from natural language processing (NLP).

MM-Vet

The primary goal is to construct a comprehensive multimodal benchmark that mirrors real-world scenarios an AI agent could encounter. The benchmark revolves around six core VL capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset comprises 187 images and 205 questions with diverse question types and open-ended response demands. Human-annotated answers are available for 155 questions, while 50 questions use internet-derived responses. Supplementary images from various sources enrich the dataset. Evaluation relies on a GPT-4-based evaluator that uses few-shot prompts to score model responses. Scores range from 0 to 1, enabling a unified metric across capabilities and their integrations while accommodating diverse response styles, question types, and problem sets. GPT-4 generates each score from the input question, the ground-truth answer, and the model output, and aggregated scores offer insights into the performance of each capability or capability integration.
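To make the idea of an LLM-based scorer concrete, the minimal Python sketch below prompts a judge model with the question, the ground-truth answer, and the model's output, and asks it to return a correctness score between 0 and 1. The prompt text, helper name, and use of the OpenAI client are illustrative assumptions for this article, not the exact few-shot prompt or code released with MM-Vet.

```python
# Minimal sketch of an LLM-as-judge scorer in the spirit of MM-Vet's GPT-4 evaluator.
# The prompt below is a placeholder; MM-Vet uses its own few-shot examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Compare the ground truth and the prediction from an AI model, and rate the "
    "prediction's correctness with a single number between 0.0 and 1.0. "
    "Reply with the score only.\n"
)

def score_sample(question: str, ground_truth: str, prediction: str) -> float:
    """Ask the judge LLM for a 0-1 correctness score on one open-ended answer."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": (
                f"{JUDGE_PROMPT}"
                f"Question: {question}\n"
                f"Ground truth: {ground_truth}\n"
                f"Prediction: {prediction}\n"
                f"Score:"
            ),
        }],
    )
    return float(response.choices[0].message.content.strip())
```

Because the judge returns a continuous score rather than an exact-match verdict, the same metric can handle short factual answers and long free-form explanations alike.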

Study results

Experimental setup: The MM-Vet experiment evaluates two categories of LMMs: models tuned end-to-end (e.g., OpenFlamingo, LLaVA) and methods that equip LLMs with external tools (e.g., multimodal reasoning and action (MM-ReAct), Transformers Agent). The scores generated by GPT-4 range from 0 to 1, and each sample is evaluated five times to account for score variability. The reported performance averages the scores for capabilities and their integrations, taking the variances into consideration.
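This scoring protocol can be summarized in a short, self-contained sketch: each question is judged several times, the repeated judgements are averaged, and a question contributes its average score to every capability it requires. The data layout and function names below are hypothetical conveniences for illustration, not the paper's released evaluation code.

```python
# Illustrative aggregation of per-sample judge scores into capability-level metrics,
# assuming each question is tagged with the capabilities it requires and has been
# scored several times (e.g., five) by the judge LLM.
from collections import defaultdict
from statistics import mean

def aggregate(samples):
    """samples: list of dicts such as
    {"capabilities": ["ocr", "math"], "scores": [0.8, 1.0, 0.9, 1.0, 0.9]}.
    Returns the overall average and a per-capability breakdown, each in [0, 1]."""
    per_sample = [mean(s["scores"]) for s in samples]  # average the repeated judgements
    by_capability = defaultdict(list)
    for sample, avg in zip(samples, per_sample):
        for cap in sample["capabilities"]:
            by_capability[cap].append(avg)  # a sample counts toward every capability it needs
    return mean(per_sample), {cap: mean(v) for cap, v in by_capability.items()}

# Example with made-up scores:
demo = [
    {"capabilities": ["recognition"], "scores": [1.0, 1.0, 0.9, 1.0, 1.0]},
    {"capabilities": ["ocr", "math"], "scores": [0.4, 0.5, 0.4, 0.5, 0.4]},
]
overall, per_capability = aggregate(demo)
print(round(overall, 2), {k: round(v, 2) for k, v in per_capability.items()})
```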

Analysis of results: Results are reported for each method across individual capabilities and their integrations, covering recognition, OCR, knowledge, language generation, spatial awareness, and math.

  • Recognition: LLaVA-13B (LLaMA-2) is the top-performing model in this category, likely owing to its vision model's substantial training data volume and the strength of its language model.
  • OCR: MM-ReAct-GPT-4 performs exceptionally well in OCR, aided by external OCR tools. LLaVA-13B (LLaMA-2) stands out among the tuned models, utilizing CLIP ViT-L/14 and extensive image-OCR data.
  • Knowledge: MM-ReAct-GPT-4 excels in knowledge-related tasks, benefiting from its robust LLM backbone and external knowledge tools.
  • Language Generation: MM-ReAct-GPT-4 and LLaVA-13B (LLaMA-2) exhibit strong performance in language generation, leveraging their powerful language models.
  • Spatial Awareness: MM-ReAct-GPT-4 demonstrates superior spatial awareness capabilities, capitalizing on dense captioning and OCR tools to provide detailed location information.
  • Math: MM-ReAct-GPT-4 stands out in math tasks due to its PAL math tool.
  • Capability Integrations: MM-ReAct-GPT-4 achieves the highest scores in numerous capability integrations. Google Bard outperforms in select capabilities and integrations. The LLM-based evaluation proves effective in assessing LMM predictions, while GPT-4 shows minimal discrepancies compared to human annotations.

Comparison with Bard: Bard achieves high scores across multiple capabilities and integrations, and MM-ReAct-GPT-4 competes closely, highlighting the potential of external tools. Open-source LMMs show promise but still leave considerable room for improvement, and it remains unclear which vision encoder is superior. GPT-4 proves effective for open-ended evaluation, while the current top methods reach only around 50% overall, indicating the need for more capable LMMs or tool-enhanced solutions.

Conclusion

In summary, the current study introduced the MM-Vet benchmark, designed to evaluate the combined vision-language abilities of LMMs. A new multimodal dataset was carefully crafted, emphasizing the integration of various capabilities, and the assessment employs an open-ended, LLM-based evaluator. Evaluating a range of LMMs with MM-Vet highlights the need for improved integrated capabilities, given that current leading models achieve only approximately 50% of the maximum score.


Journal reference:
Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., & Wang, L. (2023). MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv. https://arxiv.org/abs/2308.02490

Written by

Dr. Sampath Lonka

Dr. Sampath Lonka is a scientific writer based in Bangalore, India, with a strong academic background in Mathematics and extensive experience in content writing. He has a Ph.D. in Mathematics from the University of Hyderabad and is deeply passionate about teaching, writing, and research. Sampath enjoys teaching Mathematics, Statistics, and AI to both undergraduate and postgraduate students. What sets him apart is his unique approach to teaching Mathematics through programming, making the subject more engaging and practical for students.

