In an article posted to the arXiv* preprint server, researchers introduced fact-conflicting hallucination detection (FACTCHD), a framework for detecting fact-conflicting hallucinations in large language models (LLMs) such as ChatGPT/GPT-4. They evaluated multiple LLMs, found that existing methods struggled to accurately detect factual errors, and explored ways to enhance the credibility of fact-conflicting hallucination detection.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
The widespread use of LLMs for web content generation has raised concerns about the spread of misinformation, particularly in the form of fact-conflicting hallucination. Hallucination refers to instances where LLMs generate misleading information with a high degree of confidence. Hallucinations are classified into three categories: input-conflicting, context-conflicting, and fact-conflicting. While the first two types are relatively easy to identify, fact-conflicting hallucinations pose a substantial challenge because they disseminate misleading information that conflicts with established factual knowledge.
Hallucination results from outdated or flawed knowledge within LLMs, as well as their limitations in reconciliation and logical reasoning. These hallucinations not only impede the use of artificial intelligence (AI) in critical fields like finance and healthcare but also contribute to the spread of inaccurate information online, posing a significant threat. Effective detection and mitigation of fact-conflicting hallucinations are therefore vital.
Existing fact-verification tasks and hallucination-evaluation benchmarks have limitations; hence, the researchers introduced a more rigorous scenario that requires evaluators to draw on their own knowledge, external resources, and reasoning to make factual judgments and provide explanations. While previous studies examined hallucination broadly in natural language generation settings, this work concentrates specifically on evaluating fact-conflicting hallucination.
About the Study
The research addresses hallucinations in LLMs like ChatGPT and factuality detection in natural language processing tasks. The researchers introduced FACTCHD, a benchmark that provides interpretable data for evaluating the factual accuracy of LLM-generated responses. They also developed the TRUTH-TRIANGULATOR framework, which improves hallucination detection by cross-referencing multiple independent sources or perspectives, making the detection process more reliable.
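The article does not detail the exact decision rule, but in spirit triangulation can be pictured as reconciling verdicts from independent judges and deferring to a further arbiter when they disagree. The sketch below is a simplified Python illustration; the function name, verdict labels, and placeholder arbiter are assumptions for exposition, not the authors' implementation.

```python
# Simplified, illustrative triangulation rule; the real TRUTH-TRIANGULATOR
# combines tool-enhanced LLM judges, which this toy version only imitates
# with pre-computed verdicts. Names and labels are assumptions.
from typing import Callable


def triangulate(expert_verdict: str,
                guardian_verdict: str,
                arbiter: Callable[[str, str], str]) -> str:
    """Return a final FACTUAL/NON-FACTUAL verdict from two judges."""
    if expert_verdict == guardian_verdict:
        return expert_verdict
    # On disagreement, defer to a third perspective (e.g., a judge that
    # weighs both evidence chains before deciding).
    return arbiter(expert_verdict, guardian_verdict)


final = triangulate(
    "NON-FACTUAL",
    "FACTUAL",
    arbiter=lambda a, b: "NON-FACTUAL",  # placeholder tie-breaker
)
print(final)  # -> NON-FACTUAL
```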
The authors introduced a "QUERY-RESPONSE" format for LLM evaluation, considering interpretability in FACTCHD. They provided a unique perspective on evaluating LLMs and their responses, contributing to the development of more accurate and reliable language models with implications for model editing and refinement. Ultimately, the study aims to enhance the trustworthiness and reliability of LLMs, especially in scenarios where factual accuracy is crucial.
First, the researchers defined the task of FACTCHD, constructed the FACTCHD benchmark with diverse reasoning patterns, and identified four categories of factual errors in LLM outputs, curating realistic hallucination data for analysis.
They then utilized knowledge graphs (KGs) from Wikidata and PrimeKG, along with textual knowledge from various datasets, to create a factual backbone for generating hallucinations, enabling both multi-hop reasoning and vanilla inference patterns. To generate "QUERY-RESPONSE" contexts, they defined the system's role, specified the inputs and objectives, and guided the model to produce factuality-related samples, ensuring precise control over the quality of responses. They used ChatGPT to create these scenarios through manual prompts and examples, refining the process iteratively based on prompt sensitivity.
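As a rough sketch of this kind of prompt-driven generation (the actual system prompts, demonstrations, and API calls in the paper differ, and `ask_llm` below is a hypothetical stand-in for a ChatGPT call), a knowledge-graph triple could be templated into a generation instruction:

```python
# Hypothetical sketch of templating a knowledge-graph triple into a
# generation prompt; `ask_llm` is a stand-in for a ChatGPT API call and
# is not part of the paper's released code.
def build_generation_prompt(subject: str, relation: str, obj: str) -> str:
    return (
        "You generate data for fact-conflicting hallucination detection.\n"
        f"Factual backbone (triple): ({subject}, {relation}, {obj}).\n"
        "Write a QUERY about this fact and a RESPONSE that either states it "
        "correctly or introduces a subtle factual conflict, then label the "
        "response as FACTUAL or NON-FACTUAL."
    )


def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your preferred LLM API.")


print(build_generation_prompt("Paris", "capital of", "France"))
```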
An initial set of 100 examples was selected, and five were assessed for similarity to the demonstrative examples. To enhance diversity in these contexts, Sentence-BERT (SBERT) was employed for automated screening, removing highly similar samples; this eliminated 1,538 training samples and 632 test-set samples. The benchmark goes beyond error identification by requiring ChatGPT to construct coherent evidence chains grounded in factual knowledge from sub-graph facts and textual facts, enhancing credibility and user understanding. Filter rules were developed to guide 21 annotators in quality filtering, covering pattern consistency, response factuality, and the logic of evidence chains. The annotators underwent standardized training and used their own judgment and search engines to make informed decisions, resulting in the removal of 565 training samples and 258 test samples via a voting mechanism.
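A minimal sketch of this kind of SBERT-based screening is shown below, assuming the sentence-transformers library; the model name and the 0.9 similarity cutoff are illustrative choices, not the settings reported in the paper.

```python
# Illustrative SBERT near-duplicate filtering; the model name and 0.9
# threshold are assumptions, not the paper's reported settings.
from sentence_transformers import SentenceTransformer, util

queries = [
    "Which element has the atomic number 26?",
    "What element corresponds to atomic number 26?",
    "Who wrote the novel 'Dune'?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(queries, convert_to_tensor=True)
similarity = util.cos_sim(embeddings, embeddings)

kept_indices = []
for i in range(len(queries)):
    # Keep a sample only if it is not highly similar to one already kept.
    if all(similarity[i][j] < 0.9 for j in kept_indices):
        kept_indices.append(i)

deduplicated = [queries[i] for i in kept_indices]
print(deduplicated)
```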
Results
The study examines the influence of model capacity on FACTCHD performance. Transitioning from 7B to 13B models significantly improves detection, particularly in in-context and zero-shot learning scenarios. Alpaca-13B outperformed ChatGPT owing to consistent prompts and tailored adjustments for Alpaca's sensitivity. Once models are fine-tuned on the training data, model capacity has a minimal impact on performance, suggesting that training language models as dedicated hallucination detectors can alleviate the need for larger model capacities.
To enhance detection capabilities, the researchers investigated the role of accurate facts and of the "QUERY-RESPONSE" context built around queries. Using the dataset's intrinsic facts as retrieved evidence leads to a notable increase in the FACTCLS score, highlighting the potential of augmentation with precise facts. Omitting the "query" during fine-tuning results in a 40% decline in the FACTCLS score, underscoring the importance of the full "QUERY-RESPONSE" context. In the analysis, the TRUTH-TRIANGULATOR model demonstrated its broad applicability to real-world hallucination data. It proved capable of making sound judgments, especially when the detection expert and the tool-enhanced ChatGPT disagreed, which strengthens FACTCHD's reliability in genuine, unscripted scenarios.
Conclusion
To sum up, the FACTCHD benchmark is introduced to comprehensively evaluate fact-conflicting hallucinations in LLMs, featuring diverse patterns and strong evidence chains for accurate factuality assessment. A semi-supervised approach leverages knowledge presented as facts for dataset creation. The TRUTH-TRIANGULATOR framework assesses information veracity through triangulation, combining search tools with cross-referencing across generators, especially for uncertain responses.
Future objectives include enhancing knowledge methods for hallucination detection, addressing factual errors from false or outdated prior knowledge, and broadening the scope to evaluate hallucinations in various cultural contexts and modalities.