In a paper published in the journal Nature, researchers discussed how large language model (LLM) systems, such as ChatGPT or Gemini, often generate false and unsubstantiated answers, posing risks in fields like law, journalism, and medicine. Despite efforts to improve truthfulness, these systems remained unreliable.
The researchers proposed an entropy-based uncertainty estimator to detect confabulations (arbitrary and incorrect outputs) by assessing meaning rather than specific word sequences. The method worked across datasets and tasks without task-specific training data, helping users recognize when extra caution is needed with LLM outputs and thereby enhancing their reliability and range of applications.
Related Work
Past work has highlighted that ‘hallucinations’ are a critical problem for natural language generation systems built on LLMs like ChatGPT and Gemini, because they prevent users from trusting the outputs. Hallucinations, defined as content that is nonsensical or unfaithful to the provided source, span various failures of faithfulness and factuality. They erode user trust, which hinders adoption in fields like law, journalism, and medicine, and inaccurate or fabricated information can lead to serious real-world harm.
Semantic Entropy Overview
Semantic entropy builds on probabilistic tools for uncertainty estimation and can be applied directly to any LLM or similar foundation model without requiring architectural modifications. The 'discrete' variant of semantic uncertainty can be used even when the predicted probabilities for the generations are unavailable, for example, because access to the model's internals is limited. The approach applies uncertainty estimation from machine learning to detect confabulations, based on the principle that the model will be uncertain about questions for which its output is going to be arbitrary.
One measure of uncertainty is the predictive entropy of the output distribution, which measures the information one has about the output given the input. A low predictive entropy indicates a heavily concentrated output distribution, whereas a high one suggests that many possible outputs are similarly likely. The analysis does not distinguish between aleatoric and epistemic uncertainty. Because generative LLMs produce sequences of tokens, computing entropies requires the joint probabilities of those sequences, and length normalization is used when comparing the log probabilities of generated sequences.
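To make the length-normalised entropy estimate concrete, here is a minimal Python sketch. It assumes that, for each of N sampled answers, the per-token log-probabilities are available from the model; the Monte Carlo estimate and the variable names are illustrative rather than the authors' exact implementation.

```python
import numpy as np

def length_normalised_logprob(token_logprobs):
    """Mean per-token log-probability of one generated sequence."""
    return float(np.mean(token_logprobs))

def naive_entropy(sequences_token_logprobs):
    """Monte Carlo estimate of predictive entropy over N sampled generations.

    Each element of `sequences_token_logprobs` is the list of token
    log-probabilities the model assigned to one sampled answer.
    """
    seq_logprobs = [length_normalised_logprob(lp) for lp in sequences_token_logprobs]
    # H(Y | x) is approximated by -(1/N) * sum_i log p(s_i | x),
    # with length-normalised log-probabilities for each sequence s_i.
    return -float(np.mean(seq_logprobs))

# Hypothetical token log-probabilities for three sampled answers
samples = [
    [-0.1, -0.3, -0.2],
    [-0.5, -0.4],
    [-1.2, -0.9, -1.1, -0.8],
]
print(naive_entropy(samples))
```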
Detecting confabulations depends on the LLM's uncertainty about meanings rather than about specific word choices. Semantic uncertainty assesses the model's ambiguity regarding the intended meaning of its outputs, by clustering sampled sequences that mutually entail one another (bidirectional entailment) and estimating entropy over the resulting shared-meaning classes.
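A minimal sketch of the clustering step is shown below, assuming an `entails(a, b)` callback that would in practice be backed by a natural-language-inference model; the greedy single-pass grouping is one plausible realisation of bidirectional-entailment clustering, not necessarily the exact procedure used in the paper.

```python
def bidirectional_entailment_clusters(answers, entails):
    """Greedily group answers into shared-meaning classes.

    `entails(a, b)` should return True if sentence `a` entails sentence `b`;
    in practice this would be backed by an NLI model. Two answers share a
    meaning class only if each entails the other.
    """
    clusters = []  # each cluster is a list of answers
    for answer in answers:
        placed = False
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                placed = True
                break
        if not placed:
            clusters.append([answer])
    return clusters

# Crude stand-in for a real NLI model: compare the final word of each answer
toy_entails = lambda a, b: a.lower().split()[-1] == b.lower().split()[-1]

answers = ["Paris", "It is Paris", "Rome"]
print(bidirectional_entailment_clusters(answers, toy_entails))
```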
Once the team identifies the classes of generated sequences that convey equivalent meanings, they estimate the likelihood that a sequence generated by the LLM belongs to a specific class by summing the probabilities of all potential sequences of tokens that could express that same meaning. Because not every possible meaning class can be enumerated, the estimate instead samples from the sequence-generating distribution induced by the model. For scenarios where sequence probabilities are unavailable, 'discrete' semantic entropy approximates the class probabilities by the proportion of sampled answers belonging to each cluster, treating each output as equally probable; this converges to the same estimator as the number of samples grows.
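The following hypothetical sketch shows how the two estimators could be computed once answers have been grouped into meaning clusters: the probability-weighted version sums sequence probabilities within each cluster, while the discrete variant only counts cluster sizes. The data layout and names are assumptions made for illustration.

```python
import numpy as np

def semantic_entropy(cluster_logprobs):
    """Entropy over meaning classes from sequence log-probabilities.

    `cluster_logprobs` maps each cluster id to the (length-normalised)
    log-probabilities of the sampled answers that landed in it; the class
    probability is approximated by summing sequence probabilities within
    the cluster and renormalising over the sampled clusters.
    """
    cluster_p = np.array([np.sum(np.exp(lps)) for lps in cluster_logprobs.values()])
    cluster_p = cluster_p / cluster_p.sum()
    return -float(np.sum(cluster_p * np.log(cluster_p)))

def discrete_semantic_entropy(cluster_sizes):
    """Variant for black-box models: treat each sampled answer as equally
    likely and use the fraction of samples per cluster."""
    p = np.array(cluster_sizes, dtype=float)
    p = p / p.sum()
    return -float(np.sum(p * np.log(p)))

# Hypothetical length-normalised log-probabilities, grouped by meaning cluster
clusters = {0: [-0.2, -0.4], 1: [-1.5]}
print(semantic_entropy(clusters))
print(discrete_semantic_entropy([2, 1]))
```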
Semantic entropy is designed to detect confabulations and can improve model accuracy by refusing to answer questions when semantic uncertainty is high. The method is evaluated on a variety of datasets that exploit LLMs' ability to produce free-form sentences as answers. Evaluation uses metrics such as the area under the receiver operating characteristic curve (AUROC), rejection accuracy, and the area under the rejection accuracy curve (AURAC), with the correctness of sentence-length generations judged automatically using GPT-4.
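As an illustration of the rejection-based metrics, the sketch below computes accuracy on the most-confident fraction of questions (rejection accuracy) and averages it over fractions as a rough proxy for the AURAC; the fraction grid and variable names are assumptions, not the paper's exact evaluation protocol.

```python
import numpy as np

def rejection_accuracy_curve(uncertainties, correct, fractions=np.linspace(0.1, 1.0, 10)):
    """Accuracy on the most-confident subset of questions.

    For each fraction f, keep the f * N questions with the lowest uncertainty
    (e.g. lowest semantic entropy) and measure accuracy on that subset; the
    mean over fractions gives one simple estimate of the AURAC.
    """
    order = np.argsort(uncertainties)                    # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    accuracies = []
    for f in fractions:
        k = max(1, int(round(f * len(correct))))
        accuracies.append(correct[:k].mean())
    return np.array(fractions), np.array(accuracies)

# Hypothetical scores: low entropy should coincide with correct answers
entropies = [0.1, 0.2, 1.5, 0.3, 1.8, 0.4]
is_correct = [1, 1, 0, 1, 0, 1]
fracs, accs = rejection_accuracy_curve(entropies, is_correct)
print(dict(zip(np.round(fracs, 1), np.round(accs, 2))), "AURAC ~", accs.mean())
```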
Detecting Confabulations Effectively
Semantic entropy and its discrete variant outperform existing baselines at detecting confabulations in both sentence-length and paragraph-length generations. Results across multiple datasets show semantic entropy surpassing naive entropy and supervised embedding regression methods in identifying errors indicative of confabulations.
Based on the AUROC and AURAC metrics, the evaluations reveal consistent improvements across different model sizes and datasets, highlighting semantic entropy's effectiveness in discerning when model outputs are likely incorrect because their meanings are inconsistent. Applied to biographical data generated by GPT-4, the discrete variant of semantic entropy continues to excel, achieving higher AUROC and AURAC scores than self-check and adapted P(True) baselines, particularly in rejecting answers prone to confabulations while maintaining answer accuracy.
Conclusion
To sum up, the approach effectively leveraged entropy-based uncertainty estimators to detect confabulations in LLMs like ChatGPT and Gemini. The method enhanced reliability across diverse datasets and tasks by focusing on semantic meaning rather than specific word sequences. This capability addressed the challenge of hallucinatory outputs and opened new avenues for leveraging language models with greater confidence and applicability in real-world settings.