In an article recently posted to the Meta Research website, researchers explored the phenomenon of semantic drift in modern large language models (LLMs). They investigated how LLMs initially generate correct facts but gradually "drift away" and produce incorrect information as generation continues, with the aim of enhancing the reliability and trustworthiness of artificial intelligence (AI)-generated content across various applications.
Background
LLMs have revolutionized natural language processing, demonstrating remarkable capabilities in generating human-like text. Unlike earlier approaches that relied on explicit content planning, modern language models predict token by token without a pre-established text structure. While this approach produces impressive results, it makes it harder to maintain high-level structure throughout generation and to keep the text coherent and well-organized.
Semantic drift refers to the decline in text generation quality as the length of the generated content increases. It is regarded as a type of error or hallucination in which the model begins to produce less accurate information. Previous studies described semantic drift in question generation and story generation, and recent work has also noted a decline in factual accuracy for longer generations.
About the Research
In this paper, the authors focused on demonstrating that modern LLMs initially generate correct facts but later "drift away" to produce incorrect information. To quantify this phenomenon, they introduced a novel metric called "semantic drift score" that measures the degree of separation between correct and incorrect facts in generated texts.
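To make the metric concrete, here is a minimal sketch of a drift-style score, under the assumption (for illustration only, not necessarily the paper's exact formulation) that each generated fact is labeled correct or incorrect in order of appearance, and that the score reflects how cleanly correct facts precede incorrect ones:

```python
def semantic_drift_score(fact_labels):
    """Assumed, illustrative formulation of a drift-style score.

    fact_labels: 1 (correct fact) or 0 (incorrect fact), listed in the
    order the facts appear in the generation. Returns a value in [0, 1];
    1.0 means every correct fact precedes every incorrect one.
    """
    if not fact_labels:
        return 0.0
    best = 0.0
    for split in range(len(fact_labels) + 1):
        before = fact_labels[:split]   # ideally all correct
        after = fact_labels[split:]    # ideally all incorrect
        correct_before = sum(before) / len(before) if before else 1.0
        incorrect_after = (len(after) - sum(after)) / len(after) if after else 1.0
        best = max(best, (correct_before + incorrect_after) / 2)
    return best

print(semantic_drift_score([1, 1, 1, 0, 0]))  # 1.0: clean "correct first, then incorrect" split
print(semantic_drift_score([1, 0, 0, 1, 1]))  # < 1.0: errors interleaved with correct facts
```

A high average score across many generations therefore indicates the "correct first, then drift" pattern the authors describe.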
Using the LLaMa2 model, the study generated Wikipedia-style biographies and assessed factual accuracy using the FActScore task, which labels individual facts as correct or incorrect. This approach enabled a detailed analysis of how accurate and inaccurate information is distributed throughout the generated content.
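That positional analysis can be illustrated with a small, assumed helper (written here in Python with NumPy, not taken from the paper) that bins each labeled fact by its relative position in the generation and reports mean accuracy per bin; a downward trend across bins is the signature of semantic drift:

```python
import numpy as np

def accuracy_by_position(per_generation_labels, num_bins=5):
    """per_generation_labels: one list of 1/0 fact-correctness labels per
    generation, in order of appearance. Returns mean accuracy per position bin."""
    totals = np.zeros(num_bins)
    counts = np.zeros(num_bins)
    for labels in per_generation_labels:
        for i, label in enumerate(labels):
            b = min(int(i / len(labels) * num_bins), num_bins - 1)
            totals[b] += label
            counts[b] += 1
    return totals / np.maximum(counts, 1)

# Toy example: accuracy falls from early to late positions when drift is present.
print(accuracy_by_position([[1, 1, 1, 0, 0], [1, 1, 0, 0, 0]]))  # -> roughly [1.0, 1.0, 0.5, 0.0, 0.0]
```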
To mitigate factual inaccuracies, the researchers explored several strategies based on the observed pattern of correct-then-incorrect information generation. For example, they implemented early-stopping methods that encourage the model to emit the end-of-sequence (EOS) token sooner, preventing it from drifting into incorrect details.
They also tested resample-then-rerank pipelines, generating multiple versions of each sentence and selecting the best one based on similarity measures. In addition, they used calls to an external application programming interface (API) to query outside knowledge sources and steer the model back onto the correct path.
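One simple way to realize such early stopping, sketched below under the assumption of a Hugging Face transformers generation setup (this is not the paper's exact implementation), is a custom logits processor that adds a bias to the EOS logit once the sequence exceeds a chosen token budget:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class EosBoostProcessor(LogitsProcessor):
    """After the sequence grows beyond `start_after` tokens, add `bias`
    to the EOS logit so the model becomes increasingly willing to stop."""

    def __init__(self, eos_token_id, start_after, bias=5.0):
        self.eos_token_id = eos_token_id
        self.start_after = start_after
        self.bias = bias

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if input_ids.shape[-1] > self.start_after:  # prompt + generated tokens so far
            scores[:, self.eos_token_id] = scores[:, self.eos_token_id] + self.bias
        return scores

# Hypothetical usage with an already-loaded model and tokenizer:
# processor = EosBoostProcessor(tokenizer.eos_token_id, start_after=128)
# output = model.generate(**inputs, logits_processor=LogitsProcessorList([processor]))
```

The bias and the token budget control the trade-off between stopping earlier (fewer facts overall) and generating longer (more drift), mirroring the quantity-versus-accuracy trade-off discussed below.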
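A minimal resample-then-rerank step might look like the following sketch, which assumes a consensus-style similarity criterion (the paper's exact similarity measure may differ) and a hypothetical `generate_candidates` sampling helper; it uses sentence-transformers embeddings to pick the candidate most similar to the other samples:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def rerank(candidates):
    """Return the candidate sentence with the highest average cosine
    similarity to the other candidates (a simple consensus heuristic;
    assumes at least two candidates)."""
    embeddings = encoder.encode(candidates, convert_to_tensor=True)
    sims = util.cos_sim(embeddings, embeddings)                 # k x k similarity matrix
    avg_sim = (sims.sum(dim=1) - 1.0) / (len(candidates) - 1)   # drop self-similarity of 1.0
    return candidates[int(avg_sim.argmax())]

# Hypothetical pipeline: sample several continuations for the next sentence,
# keep the consensus pick, append it, and repeat.
# candidates = generate_candidates(prompt, num_samples=5)
# next_sentence = rerank(candidates)
```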
Furthermore, the authors examined the trade-offs between information quantity and factual accuracy for these strategies and extended their investigation to generating Wikipedia-style articles on diverse topics. They conducted statistical tests to assess the significance of their approaches and validated their automated evaluation pipeline with human annotations. Additionally, they explored how various uncertainty metrics correlated with factual accuracy.
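As a brief illustration of the uncertainty analysis, the sketch below (with toy, assumed inputs; the paper evaluates several different metrics) correlates one simple per-sentence uncertainty signal, the mean token log-probability, with the fraction of that sentence's facts labeled correct:

```python
import numpy as np
from scipy.stats import spearmanr

# Toy, assumed inputs: one entry per generated sentence.
mean_token_logprob = np.array([-0.8, -1.1, -1.9, -2.4, -2.6])  # higher = model more confident
fraction_correct   = np.array([1.0,  1.0,  0.5,  0.0,  0.0])   # from per-fact correctness labels

rho, p_value = spearmanr(mean_token_logprob, fraction_correct)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")  # a strong rho would mean uncertainty tracks accuracy
```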
Research Findings
The results showed that LLaMa2 models start with accurate facts but gradually generate incorrect information. For example, LLaMa2-70B exhibited an average semantic drift score of 0.78, which increased to 0.8 when fully correct and fully incorrect samples were filtered out. This pattern of declining factual accuracy over the course of generation highlights the challenge of maintaining consistency and reliability in automated content creation.
The study also assessed several strategies to address semantic drift and enhance factual accuracy. Early-stopping techniques, particularly the oracle method, which stops generation at the onset of semantic drift, proved highly effective: this approach raised factual accuracy to over 70%, compared with the baseline's 44%.
Additionally, the resample-then-rerank strategy showed promise by improving baseline factual accuracy by 8.71% when generating the maximum number of tokens. However, integrating API calls to external knowledge sources did not enhance accuracy, indicating limited effectiveness in correcting the drift trend.
Applications
This research has significant implications for improving the reliability of AI-generated content across various domains. It can help long-form question-answering systems maintain factual accuracy throughout extended answers, and it can make educational materials and encyclopedic entries more trustworthy and credible.
In automated journalism and report writing tools, implementing semantic drift mitigation strategies can enhance precision in factual content. Chatbots and virtual assistants can integrate these techniques to provide more accurate and consistent responses during extended user interactions.
Conclusion
In summary, the paper provided valuable insights into semantic drift in LLMs and offered practical methods for mitigating its effects. It emphasized the need to balance information quantity with factual accuracy in AI-generated content. The proposed methods, such as early stopping and resample-then-rerank approaches, showed promise in improving the reliability of long-form text generation.
Future work could focus on fine-tuning models to stop generation when variability exceeds a threshold, developing models to detect drift points using internal model states, and exploring advanced API integration techniques. Additionally, investigating how these methods apply to different model architectures and languages could enhance their applicability and generalizability.