LLMs Encode Truth Better Than They Show, Revealing New Strategies for Error Detection

Discover how large language models secretly encode deeper truthfulness within "exact answer tokens," unlocking potential for enhanced error detection and reducing AI hallucinations in real-world applications.

Study: LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. Image Credit: Piscine26 / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv preprint* server, researchers at Apple, Google Research, and Technion investigated how large language models (LLMs) internally encoded information about the truthfulness of their outputs, particularly focusing on errors or "hallucinations."

The authors revealed that specific "exact answer tokens" carried more truth-related information than previously known and showed that internal model states could predict errors. However, error detection methods lacked generalizability across datasets. The findings highlighted discrepancies between LLMs' internal knowledge and their external outputs, offering insights into improving error analysis and mitigation strategies.

Background

The increasing use of LLMs has brought attention to their limitations, particularly the generation of inaccurate information, often termed “hallucinations.” Previous research has focused on defining and understanding these errors through external analysis, exploring how users perceive such inaccuracies. While these studies have provided valuable insights, they have primarily examined errors from a user’s perspective, neglecting how LLMs internally encode these mistakes.

Another body of work has investigated internal LLM representations, suggesting that models may encode signals about truthfulness. However, these efforts were largely restricted to detecting errors without exploring how internal knowledge could improve our understanding and mitigate hallucinations.

This paper aimed to fill these gaps by showing that truthfulness signals were localized in specific "exact answer tokens" rather than across all tokens or outputs. Through detailed experiments, the authors trained classifiers to predict features related to truthfulness from internal model states.

Their findings revealed that focusing on these exact answer tokens improved error detection. However, error detection did not generalize across datasets, highlighting the complexity of truthfulness encoding and opening new avenues for targeted mitigation strategies based on internal model knowledge.

Enhanced Error Detection

The researchers focused on improving the detection of errors made by LLMs using the models' internal computations, without relying on external resources. The task was framed as predicting whether a generated response is correct or incorrect in a white-box setting, with labels obtained by comparing each response against ground-truth answers.
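As a rough illustration of this setup, correctness labels can be derived by matching each generation against the gold answers; the snippet below is a hypothetical sketch of that labeling step, not the authors' code.

```python
# Hypothetical sketch (not the authors' code) of the labeling step: a generated
# response is marked correct if it contains any of the gold answer aliases.
def label_response(generated: str, gold_answers: list[str]) -> int:
    """Return 1 (correct) if any gold answer appears in the generation, else 0 (incorrect)."""
    normalized = generated.strip().lower()
    return int(any(ans.strip().lower() in normalized for ans in gold_answers))

# Example with a TriviaQA-style alias list for the gold answer.
print(label_response("The capital of France is Paris.", ["Paris", "City of Paris"]))  # -> 1
```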

The authors evaluated four LLMs (Mistral-7b, Mistral-7b-Instruct, Llama3-8b, and Llama3-8b-Instruct) on datasets such as TriviaQA, HotpotQA, Winobias, and Multi-Genre Natural Language Inference (MNLI), using different error-detection methods. They explored techniques such as aggregating output token probabilities or logits, probing classifiers trained on internal states, and a prompting-based method called "P(True)" that measures the model's own confidence in its answer.
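The logit-aggregation family of baselines can be sketched as follows, assuming a Hugging Face causal language model; the model name, prompt format, and mean aggregation are illustrative choices for this sketch rather than the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative logit-based baseline: score an answer by the average log-probability
# the model assigned to its own answer tokens. The model name and mean aggregation
# are assumptions for this sketch, not the paper's exact configuration.
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def answer_confidence(prompt: str, answer: str) -> float:
    """Mean log-probability of the answer tokens given the prompt (higher = more confident)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits                      # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)    # predictions for tokens 1..seq_len-1
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Note: tokenizing prompt and prompt+answer separately is an approximation;
    # boundary tokens may differ slightly for some tokenizers.
    return token_log_probs[0, prompt_len - 1:].mean().item()
```

A threshold tuned on held-out data then converts this confidence score into a binary error prediction; other aggregations, such as the minimum token probability, are common variants of this family.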

A key innovation in this work was the identification and use of "exact answer tokens": the tokens within a generated response that express the answer itself. Previous methods typically relied on the last generated token or an average over all tokens, often missing this concentrated signal.

The authors demonstrated significant improvements in error detection performance by focusing on the exact answer tokens. Their experiments showed that these tokens held stronger signals for truthfulness, particularly when used in probing classifiers.
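The probing setup can be sketched roughly as below. The choice of logistic regression, the use of a single intermediate layer, and the helper names are assumptions for illustration; the key point is that the feature vector is taken at the exact-answer-token position rather than at the last token of the generation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch of a probing classifier on exact-answer-token states.
# Assumes hidden_states[i] is a (seq_len, d_model) array for example i from some
# chosen intermediate layer, answer_token_idx[i] is the position of the exact
# answer token in that sequence, and labels[i] is 1 (correct) or 0 (incorrect).
def train_probe(hidden_states, answer_token_idx, labels):
    X = np.stack([h[idx] for h, idx in zip(hidden_states, answer_token_idx)])
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, np.asarray(labels))
    return probe

# At inference time, probe.predict_proba(state[None, :])[0, 1] gives an estimated
# probability that the generation containing that exact-answer-token state is correct.
```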

The results consistently highlighted that error-detection methods focusing on exact answer tokens outperformed traditional approaches across various datasets and models. The findings suggested that truthfulness in LLM outputs was highly localized, and paying attention to specific token locations enhanced the effectiveness of error detection.

Generalization and Error Detection

Examining how error detection generalized between tasks revealed that these models encoded truthfulness, but the extent of generalization was limited. While probing classifiers could detect errors across various tasks, most of their cross-task performance could be matched by logit-based methods, suggesting that the signal that transfers is largely the one already visible in the output logits rather than a shared internal mechanism.

The authors suggested that LLMs encode multiple task-specific notions of truthfulness rather than a single universal mechanism. Certain skills, such as factual retrieval and common-sense reasoning, generalized better between tasks, but anomalies such as the asymmetric transfer from TriviaQA to math tasks pointed to a more complex picture.

A taxonomy of error types was introduced, categorizing errors by the number of distinct responses the model generated when sampled repeatedly and by how often the correct versus incorrect answers appeared. This classification distinguished, for example, a model that produces the same wrong answer every time from one that scatters across many different responses.
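A rough sketch of such a taxonomy, with illustrative category names rather than the paper's exact labels, can be built by sampling the model several times per question and tallying how many distinct answers appear and how often they are correct.

```python
from collections import Counter

# Illustrative sketch of an error taxonomy built from repeated sampling; the
# category names are approximations, not the paper's exact labels.
def categorize(sampled_answers: list[str], is_correct) -> str:
    """Assign a coarse error type from K sampled answers and a correctness checker."""
    counts = Counter(a.strip().lower() for a in sampled_answers)
    n_total = sum(counts.values())
    n_correct = sum(c for a, c in counts.items() if is_correct(a))
    if n_correct == n_total:
        return "consistently correct"
    if n_correct == 0 and len(counts) == 1:
        return "consistently incorrect"      # the same wrong answer every time
    if n_correct > n_total / 2:
        return "mostly correct"              # the right answer wins a majority vote
    if len(counts) > n_total / 2:
        return "many distinct answers"       # the model scatters across responses
    return "mostly incorrect"                # the correct answer appears, but rarely
```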

Probes trained to detect error types from internal model representations indicated that LLMs encoded not only correctness but also fine-grained error patterns.

Additionally, selecting an answer from multiple sampled candidates using these probes could improve accuracy, particularly in cases where the model generated the correct answer among its candidates but did not prefer it in its final output. These results highlighted a disconnect between the model's internal encoding of truthfulness and its actual output generation, underscoring the need for further research to mitigate errors effectively.
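The answer-selection experiment can be sketched as follows, reusing the hypothetical probe from the earlier sketch: generate several candidate answers, score the hidden state at each candidate's exact answer token, and return the highest-scoring one.

```python
# Sketch of probe-guided answer selection (names are illustrative). `candidates`
# pairs each sampled answer with the hidden state at its exact answer token; the
# probe from the earlier sketch supplies a correctness probability for each.
def select_answer(candidates, probe):
    """Return the candidate whose exact-answer-token state the probe scores highest."""
    best_answer, best_score = None, float("-inf")
    for answer, state in candidates:
        score = probe.predict_proba(state.reshape(1, -1))[0, 1]
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```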

Conclusion

In conclusion, the researchers highlighted the critical role of internal representations in LLMs for understanding and improving error detection. The research enhanced error detection strategies by identifying that truthfulness information was localized in specific "exact answer tokens," although generalization across datasets remained challenging.

The findings underscored discrepancies between LLMs' internal knowledge and their outputs, suggesting the need for tailored mitigation strategies based on specific error types. Overall, this work paved the way for further exploration into leveraging internal knowledge to reduce hallucinations and improve the accuracy of LLMs in real-world applications.


Journal reference:
  • Preliminary scientific report. Orgad, H., Toker, M., Gekhman, Z., Reichart, R., Szpektor, I., Kotek, H., & Belinkov, Y. (2024). LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. arXiv. DOI: 10.48550/arXiv.2410.02707, https://arxiv.org/abs/2410.02707

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python, and has worked on group projects involving Computer Vision, Image Classification, and App Development.

