Woodpecker: Correcting Hallucinations in Multimodal Large Language Models

In a recent submission to the arXiv server*, researchers proposed Woodpecker, a training-free, interpretable, and effective framework for correcting hallucinations in multimodal large language models (MLLMs).

Study: Woodpecker: Correcting Hallucinations in Multimodal Large Language Models. Image credit: Generated using DALL.E.3

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Background

Within the research community, MLLMs are flourishing, representing a significant stride towards artificial general intelligence (AGI). By harnessing the capabilities of large language models (LLMs), these MLLMs merge disparate modalities such as vision and language, enabling them to provide comprehensive descriptions of images. However, these powerful models occasionally produce descriptions that deviate from the input image, a phenomenon referred to as hallucination. Hallucinations, both at the object and attribute levels, pose substantial challenges to the practical utility of MLLMs.

To mitigate these hallucinations, prior efforts have predominantly explored instruction-tuning methods. However, these approaches are data- and computation-intensive and often lead to trade-offs between hallucination reduction and descriptive detail. In contrast, the proposed framework, Woodpecker, offers a unique approach that corrects hallucinations directly without the need for retraining.

Hallucinations in MLLMs

The issue of hallucinations in MLLMs has garnered significant attention due to its impact on the reliability of these models. Research on MLLM hallucinations primarily revolves around evaluation and mitigation. Evaluation approaches involve training classification models to discern hallucinations or comparing the generated text against ground-truth answers. Mitigation strategies, on the other hand, focus on optimizing data collection and training schemes. In contrast to previous works aiming to reduce hallucinations, the proposed framework's primary objective is to refine MLLM responses by correcting hallucinated portions. This training-free framework, utilizing off-the-shelf models, simplifies integration with various MLLMs as a versatile plug-and-play module.

The Woodpecker method

The primary objective of the current study is to identify and rectify hallucinations in responses generated by MLLMs. This process is divided into five distinct subtasks, each contributing to the overarching goal:

Key Concept Extraction: To pinpoint hallucinations, the initial step is to extract key concepts from the generated sentences. This entails identifying the primary objects mentioned in each sentence, which are the most likely sources of visual hallucinations.
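A minimal sketch of how this step might look in code is shown below; the `call_llm` helper and the prompt wording are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: `call_llm` stands in for any chat-completion client;
# the prompt text is an assumption, not the paper's actual prompt.
EXTRACTION_PROMPT = (
    "List the main objects mentioned in the following sentence, "
    "separated by commas. Ignore abstract or uncountable concepts.\n"
    "Sentence: {sentence}\nObjects:"
)

def extract_key_concepts(sentence: str, call_llm) -> list[str]:
    """Return the primary objects mentioned in a generated sentence."""
    reply = call_llm(EXTRACTION_PROMPT.format(sentence=sentence))
    return [obj.strip() for obj in reply.split(",") if obj.strip()]
```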

Question Formulation: After acquiring key concepts, a series of questions are posed to facilitate hallucination diagnosis. These questions address both attribute-level and object-level hallucinations. To formulate these questions effectively, an LLM is prompted with in-context examples.
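The question-formulation step could be sketched as follows; the in-context example and prompt text are placeholders rather than the prompts used in the paper, and `call_llm` is again a hypothetical helper.

```python
# Illustrative sketch: the few-shot example below is made up for demonstration.
FEW_SHOT_EXAMPLE = (
    "Sentence: A man is riding a red bicycle.\n"
    "Concepts: man, bicycle\n"
    "Questions: Is there a man in the image? How many bicycles are there? "
    "What color is the bicycle?\n"
)

def formulate_questions(sentence: str, concepts: list[str], call_llm) -> list[str]:
    """Prompt an LLM, with an in-context example, to write diagnostic questions."""
    prompt = (
        FEW_SHOT_EXAMPLE
        + f"\nSentence: {sentence}\nConcepts: {', '.join(concepts)}\nQuestions:"
    )
    return [q.strip() + "?" for q in call_llm(prompt).split("?") if q.strip()]
```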

Visual Knowledge Validation: In this step, the questions posed in the previous step are answered. For object-level questions, an open-set object detector is employed to ascertain object existence and count, while attribute-level questions are addressed using a pre-trained Visual Question Answering (VQA) model.
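A hedged sketch of this validation step is shown below; `detect_objects` and `answer_vqa` are hypothetical wrappers around an open-set detector and a VQA model, not the authors' actual interfaces.

```python
# Illustrative sketch: `detect_objects` and `answer_vqa` are hypothetical wrappers
# around an open-set object detector and a pre-trained VQA model, respectively.
def validate(image, concepts, attribute_questions, detect_objects, answer_vqa):
    """Answer object-level questions with the detector and attribute-level ones with VQA."""
    object_facts = {}
    for concept in concepts:
        boxes = detect_objects(image, concept)          # list of bounding boxes
        object_facts[concept] = {"count": len(boxes), "boxes": boxes}
    attribute_facts = [(q, answer_vqa(image, q)) for q in attribute_questions]
    return object_facts, attribute_facts
```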

Visual Claim Generation: The responses to the questions are organized into visual claims, which are structured into a visual knowledge base. This knowledge base comprises object-level claims, providing information about object counts, and attribute-level claims, incorporating specific attributes of objects. These claims serve to mitigate hallucinations effectively.
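One plausible way to organize such a knowledge base is sketched below; the claim wording is illustrative and does not reproduce the paper's templates.

```python
# Illustrative sketch of the visual knowledge base: object-level claims carry counts
# and bounding boxes, attribute-level claims carry question-answer pairs.
def build_visual_claims(object_facts: dict, attribute_facts: list) -> list[str]:
    claims = []
    for name, fact in object_facts.items():
        claims.append(f"There are {fact['count']} {name}(s). Bounding boxes: {fact['boxes']}")
    for question, answer in attribute_facts:
        claims.append(f"{question} {answer}")
    return claims
```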

Hallucination Correction: Informed by the visual claims, an LLM acts as a corrector: the visual knowledge base and the original response are combined into a prompt, and the LLM refines the response accordingly. Bounding boxes are included in the corrected output to enhance interpretability and to link the entities mentioned in the response to object instances in the image.
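The correction step could then be sketched as a single prompt to the corrector LLM; the prompt text here is an assumption, and `call_llm` remains a stand-in for the actual client.

```python
# Illustrative sketch: the corrector LLM receives the visual claims (with bounding
# boxes) and the original response, and returns a corrected response.
def correct_response(original_response: str, claims: list[str], call_llm) -> str:
    prompt = (
        "You are given facts about an image, each with bounding boxes, and a "
        "candidate description. Rewrite the description so it is consistent with "
        "the facts, keeping details that are already correct.\n\n"
        "Facts:\n" + "\n".join(claims)
        + f"\n\nDescription: {original_response}\nCorrected description:"
    )
    return call_llm(prompt)
```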

Experiments and results

Datasets: Researchers evaluated the Woodpecker method on several benchmarks covering different MLLM tasks. They employed the Polling-based Object Probing Evaluation (POPE) benchmark to evaluate object hallucinations, the MME benchmark (a comprehensive evaluation suite for MLLM perception and cognition abilities) to evaluate perception and cognition tasks, and the Large Language and Vision Assistant (LLaVA)-QA90 dataset to evaluate the accuracy and detailedness of open-ended descriptions.

Sampling and Evaluation: For each POPE setting, 50 images are sampled, with six questions created per image and the ratio of positive to negative samples balanced at 50 percent each. The evaluation primarily focuses on object-level hallucinations concerning existence: MLLMs are prompted to answer whether an object exists in the image, and responses are scored with accuracy, precision, recall, and F1-score.
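Because the answers are binary, these metrics follow directly from the confusion counts; a short sketch, treating "yes" as the positive class, is given below.

```python
# Sketch of POPE-style scoring: "yes" is the positive class.
def binary_metrics(predictions: list[str], labels: list[str]) -> dict:
    tp = sum(p == "yes" and y == "yes" for p, y in zip(predictions, labels))
    fp = sum(p == "yes" and y == "no" for p, y in zip(predictions, labels))
    fn = sum(p == "no" and y == "yes" for p, y in zip(predictions, labels))
    tn = sum(p == "no" and y == "no" for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```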

Implementation Details: The Woodpecker correction method is applied to baseline models that follow a "vision encoder-interface-language model" architecture. The framework itself relies on three pre-trained, off-the-shelf models: Generative Pre-trained Transformer (GPT)-3.5-turbo as the LLM, the open-set object detector Grounding DINO, and the Flan-T5-based BLIP-2-FlanT5XXL as the VQA model. Researchers also introduced two measures for handling "yes-or-no" questions when feeding them to the LLM during the correction process.
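For reference, the three off-the-shelf components could be wired together with a simple configuration like the one below; the dictionary structure and variable name are assumptions for illustration, while the model names come from the article.

```python
# Illustrative configuration: model names are taken from the article; the
# dictionary structure itself is an assumption, not the authors' code.
WOODPECKER_COMPONENTS = {
    "llm": "gpt-3.5-turbo",                 # concept extraction, question formulation, correction
    "open_set_detector": "Grounding DINO",  # object existence and counting
    "vqa_model": "BLIP-2-FlanT5-XXL",       # attribute-level questions
}
```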

Results: The results on the POPE dataset reveal that Woodpecker effectively corrects object-level hallucinations in MLLM responses. It achieves significant improvements in accuracy over various baselines in the adversarial setting. In the case of the MME dataset, the Woodpecker method effectively addresses both object-level and attribute-level hallucinations. It significantly enhances scores, particularly in count-related queries. Finally, in the case of LLaVA-QA90, the Woodpecker method consistently outperforms baseline MLLMs in terms of accuracy and detailedness when handling description-type queries.

Analysis of Framework Modules: The study analyzed the contributions of the open-set detector and the VQA model in mitigating hallucinations, finding that the detector plays a crucial role in addressing object-level hallucinations while the VQA model effectively provides attribute information. The full model combines the advantages of both and achieves the best results.

Analysis of Correction Performance: The correction method exhibits an accuracy rate of 79.2 percent and maintains relatively low omission and mis-correction rates. This indicates the method's ability to cover most cases without overconfidence.

Conclusion

In summary, researchers introduced a pioneering, training-free framework to rectify hallucinations in MLLMs. The approach relies on off-the-shelf models, allowing it to be integrated with diverse MLLMs as a plug-and-play module. Extensive experiments across three benchmarks, including a direct assessment using GPT-4V as an evaluator, demonstrate its effectiveness. The authors hope the study will inspire innovative solutions to hallucination issues in MLLMs.


Journal reference:

Written by

Dr. Sampath Lonka

Dr. Sampath Lonka is a scientific writer based in Bangalore, India, with a strong academic background in Mathematics and extensive experience in content writing. He has a Ph.D. in Mathematics from the University of Hyderabad and is deeply passionate about teaching, writing, and research. Sampath enjoys teaching Mathematics, Statistics, and AI to both undergraduate and postgraduate students. What sets him apart is his unique approach to teaching Mathematics through programming, making the subject more engaging and practical for students.

