In a recent submission to the arXiv server*, researchers proposed Woodpecker, a training-free, interpretable, and effective technique for correcting hallucinations in multimodal large language models (MLLMs).
Background
Within the research community, MLLMs are flourishing, representing a significant stride towards artificial general intelligence (AGI). By harnessing the capabilities of large language models (LLMs), these MLLMs merge disparate modalities such as vision and language, enabling them to provide comprehensive descriptions of images. However, these powerful models occasionally produce descriptions that deviate from the input image, a phenomenon referred to as hallucination. Hallucinations at both the object and attribute levels pose substantial challenges to the practical utility of MLLMs.
To mitigate these hallucinations, prior efforts have predominantly explored instruction-tuning methods. However, these approaches are data- and computation-intensive and often lead to trade-offs between hallucination reduction and descriptive detail. In contrast, the proposed framework, Woodpecker, offers a unique approach that corrects hallucinations directly without the need for retraining.
Hallucinations in MLLMs
The issue of hallucinations in MLLMs has garnered significant attention due to its impact on the reliability of these models. Research on MLLM hallucinations primarily revolves around evaluation and mitigation. Evaluation approaches involve training classification models to discern hallucinations or comparing the generated text against ground-truth answers. Mitigation strategies, on the other hand, focus on optimizing data collection and training schemes. In contrast to previous works aiming to reduce hallucinations, the proposed framework's primary objective is to refine MLLM responses by correcting hallucinated portions. Because it is training-free and relies on off-the-shelf models, the framework can be integrated with various MLLMs as a versatile plug-and-play module.
The Woodpecker method
The primary objective of the current study is to identify and rectify hallucinations in responses generated by MLLMs. This process is divided into five distinct subtasks, each contributing to the overarching goal:
Key Concept Extraction: To pinpoint hallucinations, the initial step is to extract key concepts from the generated sentences. This entails identifying the primary objects mentioned in the sentence, which are the most likely sources of visual hallucinations.
Question Formulation: After acquiring key concepts, a series of questions are posed to facilitate hallucination diagnosis. These questions address both attribute-level and object-level hallucinations. To formulate these questions effectively, an LLM is prompted with in-context examples.
Visual Knowledge Validation: In this step, the questions posed previously are answered. Object-level questions are handled by an open-set object detector, which ascertains object existence and counts, while attribute-level questions are addressed by a pre-trained Visual Question Answering (VQA) model.
Visual Claim Generation: The answers to the questions are organized into visual claims, which are structured into a visual knowledge base. This knowledge base comprises object-level claims, providing information about object counts, and attribute-level claims, incorporating specific attributes of objects. These claims serve as grounding evidence for the subsequent correction step.
Hallucination Correction: Informed by the visual claims, an LLM acts as a corrector, rectifying hallucinations in the generated responses. The LLM incorporates the visual knowledge base and original responses into a prompt and then proceeds to refine the responses. This refinement is performed with the inclusion of bounding boxes to enhance interpretability and facilitate correspondence between the entities mentioned in the responses and object instances in the image.
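Taken together, these five subtasks form a post-hoc correction pipeline that can wrap any MLLM. The Python sketch below is a minimal, illustrative outline of that flow, not the authors' implementation; the helper callables `llm_complete`, `detect_objects`, and `answer_vqa` are hypothetical stand-ins for an LLM API, an open-set object detector, and a VQA model, and the prompts are assumptions.

```python
# Illustrative sketch of the five-step correction pipeline described above.
# llm_complete, detect_objects, and answer_vqa are hypothetical stand-ins
# for an LLM API, an open-set object detector, and a VQA model.

def correct_response(image, mllm_response, llm_complete, detect_objects, answer_vqa):
    # 1. Key concept extraction: ask the LLM for the main objects mentioned.
    concepts = llm_complete(
        "List the main objects mentioned in this text, comma-separated:\n" + mllm_response
    ).split(",")
    concepts = [c.strip() for c in concepts if c.strip()]

    # 2. Question formulation: object-level (existence/count) and attribute-level questions.
    object_questions = [f"How many {c} are in the image?" for c in concepts]
    attribute_questions = llm_complete(
        "Write one short question per line about the attributes of these objects "
        f"as described in the text.\nObjects: {', '.join(concepts)}\nText: {mllm_response}"
    ).splitlines()

    # 3. Visual knowledge validation: the detector answers object-level questions,
    #    the VQA model answers attribute-level ones.
    claims = []
    for concept, question in zip(concepts, object_questions):
        boxes = detect_objects(image, concept)  # list of bounding boxes
        claims.append(f"{question} -> {len(boxes)}, boxes: {boxes}")
    for question in attribute_questions:
        if question.strip():
            claims.append(f"{question} -> {answer_vqa(image, question)}")

    # 4. Visual claim generation: assemble the answers into a small knowledge base.
    knowledge_base = "\n".join(claims)

    # 5. Hallucination correction: the LLM rewrites the response, grounded in the claims.
    return llm_complete(
        "Correct any statements in the response that contradict the visual facts, "
        "keep the rest unchanged, and cite bounding boxes where relevant.\n"
        f"Visual facts:\n{knowledge_base}\n\nResponse:\n{mllm_response}"
    )
```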
Experiments and results
Datasets: Researchers evaluated the Woodpecker method on several benchmarks covering different MLLM tasks. They employed the Polling-based Object Probing Evaluation (POPE) benchmark to evaluate hallucinations, the MME benchmark (a comprehensive evaluation benchmark for multimodal large language models) to evaluate perception and cognition abilities, and the Large Language and Vision Assistant (LLaVA)-QA90 dataset to evaluate the detailedness of MLLM responses.
Sampling and Evaluation: For each setting, 50 images were sampled, with six questions created per image, and the ratio of positive to negative samples was balanced at 50 percent each. The evaluation primarily targets object-level hallucinations concerning object existence: MLLMs are prompted to answer whether a given object exists in the image, and the answers are scored with precision, recall, accuracy, and F1-score.
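Because the POPE protocol reduces to scoring balanced yes/no answers, these metrics follow directly from the confusion counts. The snippet below is a small sketch of that bookkeeping, assuming model answers have already been normalized to lowercase "yes"/"no" strings; it is not the benchmark's official scoring script.

```python
# Sketch of POPE-style scoring, treating "yes" as the positive class.
def pope_metrics(predictions, labels):
    tp = sum(p == "yes" and y == "yes" for p, y in zip(predictions, labels))
    fp = sum(p == "yes" and y == "no" for p, y in zip(predictions, labels))
    tn = sum(p == "no" and y == "no" for p, y in zip(predictions, labels))
    fn = sum(p == "no" and y == "yes" for p, y in zip(predictions, labels))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Example: three existence questions about one image.
print(pope_metrics(["yes", "no", "yes"], ["yes", "no", "no"]))
```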
Implementation Details: The Woodpecker correction method is applied to baseline models that follow a "vision encoder-interface-language model" architecture. The framework itself involves three pre-trained models: GPT-3.5-turbo (Generative Pre-trained Transformer) as the LLM, the open-set object detector Grounding DINO, and the Flan-T5-based BLIP-2-FlanT5XXL as the VQA model. Researchers also introduced two measures for handling "yes-or-no" questions when feeding them to the LLM during the correction process.
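As a rough illustration of how such off-the-shelf components can be wired together, the sketch below loads an open-set detector and a BLIP-2 VQA model from Hugging Face and calls an OpenAI chat model; the model identifiers, prompts, and example query are illustrative assumptions rather than the authors' exact configuration.

```python
# Illustrative wiring of off-the-shelf components; model identifiers and
# prompts are assumptions for demonstration, not the paper's exact setup.
import torch
from PIL import Image
from openai import OpenAI
from transformers import (AutoProcessor, GroundingDinoForObjectDetection,
                          Blip2Processor, Blip2ForConditionalGeneration)

image = Image.open("example.jpg")

# Open-set detection for object-level questions (existence / count).
det_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
detector = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-base")
det_inputs = det_processor(images=image, text="a dog.", return_tensors="pt")
with torch.no_grad():
    det_outputs = detector(**det_inputs)
boxes = det_processor.post_process_grounded_object_detection(
    det_outputs, det_inputs.input_ids, target_sizes=[image.size[::-1]]
)[0]["boxes"]

# VQA for attribute-level questions.
vqa_processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
vqa_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")
vqa_inputs = vqa_processor(images=image,
                           text="Question: What color is the dog? Answer:",
                           return_tensors="pt")
answer = vqa_processor.decode(vqa_model.generate(**vqa_inputs)[0],
                              skip_special_tokens=True)

# LLM corrector: rewrite the original response using the gathered visual facts.
client = OpenAI()  # reads OPENAI_API_KEY from the environment
corrected = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": (f"Visual facts: {len(boxes)} dog(s) at {boxes.tolist()}; "
                    f"the dog is {answer}. Correct this response accordingly: ..."),
    }],
).choices[0].message.content
```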
Results: The results on the POPE dataset reveal that Woodpecker effectively corrects object-level hallucinations in MLLM responses. It achieves significant improvements in accuracy over various baselines in the adversarial setting. In the case of the MME dataset, the Woodpecker method effectively addresses both object-level and attribute-level hallucinations. It significantly enhances scores, particularly in count-related queries. Finally, in the case of LLaVA-QA90, the Woodpecker method consistently outperforms baseline MLLMs in terms of accuracy and detailedness when handling description-type queries.
Analysis of Framework Modules: The study also analyzed the contributions of the open-set detector and the VQA model to mitigating hallucinations. It found that the detector plays a crucial role in addressing object-level hallucinations, while the VQA model effectively supplies attribute information; the full model combines the advantages of both and achieves the best results.
Analysis of Correction Performance: The correction method exhibits an accuracy rate of 79.2 percent and maintains relatively low omission and mis-correction rates, indicating that it covers most cases without over-correcting.
Conclusion
In summary, researchers introduced a pioneering, training-free framework for rectifying hallucinations in MLLMs. Because the approach is built from off-the-shelf models, it integrates easily with diverse MLLMs. Extensive experiments across three benchmarks, including direct assessment with GPT-4V, demonstrate its effectiveness. The study aims to inspire further solutions to hallucination issues in MLLMs.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.