In an article recently submitted to the arXiv* preprint server, researchers proposed a simple decoding strategy called Decoding by Contrasting Layers (DoLa) to address hallucinations in large language models (LLMs). DoLa reduces the generation of incorrect facts by contrasting the logits obtained from later and earlier transformer layers, without relying on external knowledge or additional fine-tuning. The approach consistently improved truthfulness across a variety of tasks; for instance, it boosted the performance of LLaMA family models on TruthfulQA by 12-17 absolute percentage points, highlighting its potential for enhancing factual accuracy in LLM-generated content.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Despite the progress of LLMs in natural language processing, these models frequently "hallucinate," generating content that diverges from the facts encountered during training. This poses significant challenges, particularly in high-stakes applications where trustworthy text generation is crucial. One possible reason for these hallucinations is the models' training objective, which maximizes the likelihood of the training text and may therefore prioritize linguistic patterns over real-world facts.
Related Work
Prior research has addressed hallucination in LLMs, that is, the generation of content lacking factual accuracy. The problem has been tackled through various mitigation methods, including reinforcement learning, knowledge distillation, and inference-time checks.
The distribution of linguistic knowledge across transformer layers has also been investigated: earlier layers tend to capture lower-level syntactic information, while later layers handle semantics. Recent studies have highlighted the importance of middle and top layers for factual predictions and of specific attention heads for determining truthfulness. Contrastive Decoding (CD), which contrasts the output distributions of a large expert model and a smaller amateur model, relies on a fixed choice of contrasting model rather than adapting to each token.
Proposed Method
The study first outlines the architecture underlying recent language models: an embedding layer, a stack of transformer layers, and a vocabulary prediction head that maps hidden states to next-token probabilities. Instead of applying the vocabulary prediction head solely to the final layer, DoLa contrasts the output distributions obtained from higher and lower layers to estimate the probability of the next token dynamically. This amplifies the factual knowledge that emerges in the model's upper layers and reduces the generation of incorrect facts.
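To make this concrete, the sketch below shows how early-exit next-token logits might be obtained by reusing the final vocabulary head on intermediate hidden states. It assumes a Hugging Face LLaMA-style causal language model; the checkpoint name, candidate layer indices, and the helper name `early_exit_logits` are illustrative choices, not details from the paper's implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: a LLaMA-style causal LM with a shared lm_head and a final RMSNorm.
model_name = "huggyllama/llama-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def early_exit_logits(input_ids, candidate_layers):
    """One forward pass; next-token logits from the final head and each candidate layer."""
    out = model(input_ids, output_hidden_states=True)
    mature = out.logits[:, -1, :]                             # standard final-layer prediction
    premature = {}
    for l in candidate_layers:
        # hidden_states[0] is the embedding output; hidden_states[l] is transformer layer l.
        h = model.model.norm(out.hidden_states[l][:, -1, :])  # final RMSNorm (LLaMA-style)
        premature[l] = model.lm_head(h)                       # reuse the vocabulary prediction head
    return mature, premature

input_ids = tokenizer("The capital of Washington State is", return_tensors="pt").input_ids
mature_logits, premature_logits = early_exit_logits(input_ids, candidate_layers=(16, 20, 24, 28))
```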
Determining the premature layer is a critical aspect of the process: this is the earlier, "early exit" layer whose predictions still lack the factual knowledge that emerges in later layers. It is selected dynamically at each decoding step using a distance measure between output distributions, the Jensen-Shannon divergence. The authors also explain the rationale for contrasting predictions from different layers and introduce a repetition penalty to address sentence repetition that can arise during generation.
The dynamic selection of the premature layer is a pivotal innovation in DoLa. It enables the model to adapt to the difficulty of each token and to draw on knowledge from different layers more effectively. This contrasts with a static choice of premature layer, which requires exhaustive experimentation and can be sensitive to the data distribution. By selecting the premature layer dynamically at every step, DoLa avoids this tuning burden and achieves more robust performance.
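A minimal sketch of this dynamic selection, assuming the `early_exit_logits` helper from the previous snippet and an illustrative bucket of candidate layers, is shown below. Following the paper's criterion, the premature layer is the candidate whose early-exit distribution has the largest Jensen-Shannon divergence from the mature (final-layer) distribution.

```python
import torch
import torch.nn.functional as F

def jsd(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence between two next-token distributions."""
    p, q = F.softmax(p_logits, dim=-1), F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))

def select_premature_layer(mature, premature):
    """Pick the candidate layer whose distribution diverges most from the mature one."""
    return max(premature, key=lambda l: jsd(premature[l], mature).item())

layer = select_premature_layer(mature_logits, premature_logits)
```

Intuitively, easy tokens stabilize in early layers (small divergence), while harder, knowledge-dependent tokens keep changing up to the top layers, which is why a per-token choice of premature layer is useful.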
The methodology further details how the output distribution is constructed by subtracting the log probabilities of the premature layer from those of the mature (final) layer, which sharpens the model's predictions toward factually correct outputs. An adaptive plausibility constraint restricts this contrast to tokens the mature layer already considers reasonably likely, so that implausible tokens are not inadvertently promoted, and a repetition penalty mitigates repetitive sentence generation. Together, these techniques enhance the factual accuracy of language models, which is particularly beneficial in tasks demanding a solid grasp of real-world knowledge.
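The contrastive scoring step and the two safeguards described above might look roughly like the following sketch; `dola_scores` and `apply_repetition_penalty` are hypothetical helper names, and the default values of α and θ are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def dola_scores(mature_logits, premature_logits, alpha: float = 0.1):
    """Contrast mature and premature log-probabilities under the plausibility constraint."""
    log_p_mature = F.log_softmax(mature_logits, dim=-1)
    log_p_premature = F.log_softmax(premature_logits, dim=-1)
    # Adaptive plausibility constraint: keep only tokens whose mature probability
    # is at least alpha times that of the most likely token.
    plausible = log_p_mature >= log_p_mature.max(dim=-1, keepdim=True).values + math.log(alpha)
    scores = log_p_mature - log_p_premature          # subtract premature from mature log-probs
    return scores.masked_fill(~plausible, float("-inf"))

def apply_repetition_penalty(scores, generated_ids, theta: float = 1.2):
    """Standard repetition penalty: discourage tokens that were already generated."""
    for tok in set(generated_ids):
        s = scores[0, tok]
        scores[0, tok] = s * theta if s < 0 else s / theta
    return scores

scores = dola_scores(mature_logits, premature_logits[layer])
next_id = int(scores.argmax(dim=-1))                 # greedy choice of the next token
```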
Experimental Analysis
The researchers investigated two main types of tasks: multiple-choice and open-ended generation. Multiple-choice datasets such as TruthfulQA and FACTOR (news/wiki) were employed to evaluate the model's proficiency in choosing the accurate response from the provided choices. For open-ended generation, they used TruthfulQA along with StrategyQA and GSM8K, tasks that require not only factual knowledge but also chain-of-thought reasoning.
To evaluate the model as a chatbot, they utilized the GPT-4 automatic evaluation proposed by the Vicuna QA benchmark. Various model sizes were employed, ranging from 7 billion to 65 billion parameters. They compared their proposed DoLa approach against baselines, which included original decoding, Contrastive Decoding (CD), and Inference Time Intervention (ITI).
The researchers also tuned hyperparameters such as the adaptive plausibility constraint (α) and the repetition penalty (θ) to optimize the method's performance. By systematically evaluating DoLa across different tasks and model sizes, they demonstrated its effectiveness in enhancing both factual accuracy and reasoning ability while adding little decoding latency.
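Putting the pieces together, a hypothetical greedy decoding loop with these hyperparameters might look like the sketch below. It reuses the helpers from the earlier snippets, and the specific values of α, θ, and the candidate layers are illustrative rather than taken from the authors' code.

```python
@torch.no_grad()
def dola_greedy_generate(prompt: str, max_new_tokens: int = 32,
                         candidate_layers=(16, 20, 24, 28),
                         alpha: float = 0.1, theta: float = 1.2) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    new_tokens = []
    for _ in range(max_new_tokens):
        mature, premature = early_exit_logits(input_ids, candidate_layers)
        layer = select_premature_layer(mature, premature)       # dynamic premature layer
        scores = dola_scores(mature, premature[layer], alpha=alpha)
        scores = apply_repetition_penalty(scores, new_tokens, theta=theta)
        next_id = int(scores.argmax(dim=-1))                    # greedy decoding
        new_tokens.append(next_id)
        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
        if next_id == tokenizer.eos_token_id:
            break
    return tokenizer.decode(new_tokens)

print(dola_greedy_generate("Q: What is the capital of Washington State?\nA:"))
```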
Conclusion
To sum up, DoLa is a novel decoding strategy designed to mitigate hallucinations in LLMs. It leverages the hierarchical structure of transformer models, dynamically selecting and contrasting layers to enhance factuality during decoding. Experimental results demonstrate substantial improvements in truthfulness across various tasks without requiring external information retrieval or model fine-tuning. DoLa is a straightforward decoding approach on its own and could also be integrated with retrieval modules, marking a meaningful stride toward bolstering the autonomy, safety, and reliability of LLMs.