In a recent paper submitted to the arXiv* preprint server, researchers introduced MM-Vet, a comprehensive benchmark designed to evaluate large multimodal models (LMMs) on complex multimodal tasks.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
The emergence of large language models (LLMs) has introduced a new era of artificial intelligence (AI) capable of addressing a wide array of intricate natural language tasks, often approaching human-level performance. Building on this progress, LMMs seek to enhance general intelligence further by incorporating multimodal inputs into their architectures. Given that more than 80% of human cognition, learning, and activities are influenced by visual information, LMMs are beginning this exploration by integrating visual capabilities. One trajectory of LMM research augments LLMs with comprehensive visual comprehension through end-to-end fine-tuning.
Another avenue explores the modular fusion of LLMs with image-to-text vision-language (VL) models. The availability of powerful open-source LLMs, such as Large Language Model Meta AI (LLaMA), has led to the creation of several open-source LMMs. These investigations demonstrate the potential to address complex multimodal tasks such as commonsense reasoning, open-world recognition, and scene text understanding.
However, despite qualitative demonstrations of LMM capabilities, systematically evaluating these intricate multimodal tasks is challenging, necessitating a quantitative evaluation benchmark. Existing benchmarks focus primarily on simpler VL tasks, whereas MM-Vet assesses complex tasks that require integrating core VL capabilities. To handle diverse question types and answer formats, the study introduces an LLM-based evaluator as the metric for open-ended model outputs. This approach enables a comprehensive evaluation covering both factual accuracy and textual quality.
Related work
Exploring multimodal intelligence, researchers integrate VL models for joint comprehension and generation across vision and language signals. Driven by the success of LLMs, they delve into LMMs to address intricate tasks that seamlessly blend diverse VL capabilities. Some works extend LLMs with multi-sensory abilities, as observed in Frozen, Flamingo, PaLM-E, and generative pre-trained transformer (GPT)-4. Open-source LMMs such as OpenFlamingo, the Large Language and Vision Assistant (LLaVA), MiniGPT-4, Otter, and InstructBLIP facilitate various studies.
Multimodal agents link vision tools with LLMs for integrated skills. While classic VL benchmarks target specific abilities, newer benchmarks such as MM-Vet tackle complex multimodal tasks and provide insights beyond leaderboard rankings. MM-Vet employs an open-ended LLM-based evaluator to handle varied response styles and question types, extending evaluation techniques from natural language processing (NLP).
MM-Vet
The primary goal is to construct a comprehensive multimodal benchmark that mirrors real-world scenarios an AI agent could encounter. The benchmark revolves around six core VL capabilities: recognition, knowledge, optical character recognition (OCR), spatial awareness, language generation, and math. The dataset comprises 187 images and 205 questions with diverse question types and open-ended response demands. Human-annotated answers are available for 155 questions, while 50 questions use internet-derived responses. Supplementary images from various sources enrich the dataset. A GPT-4-based evaluator with few-shot prompts assesses model responses, producing scores from 0 to 1 and enabling a unified metric across capabilities and their integrations. This approach accommodates diverse response styles, question types, and problem sets. GPT-4 generates each score from the input question, the ground truth, and the model output, and aggregated scores offer insight into the performance of each capability or capability integration.
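To make the scoring step concrete, the following is a minimal Python sketch of an LLM-based open-ended evaluator in the spirit of MM-Vet. It assumes the OpenAI chat-completions client; the prompt wording, few-shot examples, and helper names (`score_response`, `FEW_SHOT_EXAMPLES`) are illustrative placeholders, not the paper's exact evaluation prompt.

```python
# Minimal sketch of an LLM-based open-ended evaluator (MM-Vet style).
# Assumes the OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY in the
# environment; the prompt and examples below are illustrative only.
from openai import OpenAI

client = OpenAI()

# A couple of made-up in-context examples showing how partial credit is assigned.
FEW_SHOT_EXAMPLES = """\
Question: What is the price of the apple? | Ground truth: $2.50 | Prediction: around two dollars | Correctness: 0.5
Question: What color is the car? | Ground truth: red | Prediction: red | Correctness: 1.0
"""

def score_response(question: str, ground_truth: str, prediction: str) -> float:
    """Ask GPT-4 to grade one model response with a soft score in [0, 1]."""
    prompt = (
        "Compare the ground truth and the prediction and give a correctness "
        "score between 0.0 and 1.0. Output only the number.\n\n"
        f"{FEW_SHOT_EXAMPLES}"
        f"Question: {question} | Ground truth: {ground_truth} | "
        f"Prediction: {prediction} | Correctness:"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Parse the numeric reply and clamp it to [0, 1] in case it drifts off-range.
    return min(1.0, max(0.0, float(resp.choices[0].message.content.strip())))
```

Because the grader sees the question, the reference answer, and the free-form prediction together, the same routine can handle short factual answers and longer generated text alike.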
Study Results
Experimental setup: The MM-Vet experiment evaluates two categories of large multimodal models (LMMs): models tuned end-to-end (e.g., OpenFlamingo, LLaVA) and methods employing LLM tools (e.g., multimodal reasoning and action (MM-ReAct), Transformers Agent). The scores generated by GPT-4 range from 0 to 1, and each sample is evaluated five times to account for score variability. The reported performance averages scores for capabilities and their integrations, taking variances into consideration.
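A hedged sketch of this aggregation step is shown below. The sample fields (`question`, `answer`, `prediction`, `capabilities`) and the reuse of the `score_response` helper from the earlier sketch are assumptions for illustration, not the authors' released evaluation code.

```python
# Sketch of MM-Vet-style aggregation: each sample is scored several times and a
# capability's score is the mean over all samples tagged with that capability.
from collections import defaultdict
from statistics import mean

def evaluate(samples, n_repeats: int = 5) -> dict[str, float]:
    """samples: iterable of dicts with 'question', 'answer', 'prediction',
    and 'capabilities' (e.g. {'ocr', 'math'})."""
    per_capability = defaultdict(list)
    for s in samples:
        # Average repeated GPT-4 gradings to damp score variability.
        score = mean(
            score_response(s["question"], s["answer"], s["prediction"])
            for _ in range(n_repeats)
        )
        # Credit the sample's score to every capability it integrates.
        for cap in s["capabilities"]:
            per_capability[cap].append(score)
        per_capability["total"].append(score)
    # Report percentages, matching the 0-100 scale used in benchmark tables.
    return {cap: 100 * mean(scores) for cap, scores in per_capability.items()}
```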
Analysis of results: The outcomes for various methods are showcased concerning individual capabilities and their integrations, including recognition, OCR, knowledge, language generation, spatial awareness, and math.
- Recognition: LLaVA-13B (LLaMA-2) is the top-performing model in this category, likely due to its vision model's substantial training data volume and the strength of its language model.
- OCR: MM-ReAct-GPT-4 performs exceptionally well in OCR, aided by external OCR tools. LLaVA-13B (LLaMA-2) stands out among the tuned models, utilizing CLIP ViT-L/14 and extensive image-OCR data.
- Knowledge: MM-ReAct-GPT-4 excels in knowledge-related tasks, benefiting from its robust LLM backbone and external knowledge tools.
- Language Generation: MM-ReAct-GPT-4 and LLaVA-13B (LLaMA-2) exhibit strong performance in language generation, leveraging their powerful language models.
- Spatial Awareness: MM-ReAct-GPT-4 demonstrates superior spatial awareness capabilities, capitalizing on dense captioning and OCR tools to provide detailed location information.
- Math: MM-ReAct-GPT-4 stands out in math tasks due to its PAL math tool.
- Capability Integrations: MM-ReAct-GPT-4 achieves the highest scores in numerous capability integrations. Google Bard outperforms in select capabilities and integrations. The LLM-based evaluation proves effective in assessing LMM predictions, while GPT-4 shows minimal discrepancies compared to human annotations.
Comparison with Bard: Google Bard achieves high scores in multiple capabilities and integrations, and MM-ReAct-GPT-4 competes closely, highlighting the potential of external tools. Open-source LMMs show promise but leave clear room for improvement, and it remains unclear which vision encoder design is superior. GPT-4 proves effective for open-ended evaluation, while the current top methods achieve scores of only around 50%, indicating the need for more capable LMMs or tool-enhanced solutions.
Conclusion
In summary, the current study introduced the MM-Vet benchmark, designed to evaluate the integrated vision-language abilities of LMMs. A new multimodal dataset was carefully crafted, emphasizing the integration of various capabilities, and the assessment employs an open-ended, LLM-based evaluator. Evaluating a range of LMMs with MM-Vet highlights the need for stronger integrated capabilities, given that current leading models achieve only approximately 50% scores.