Despite their power, larger AI models are prone to surprising errors, generating wrong answers with confidence—researchers call for new strategies to improve reliability in critical areas.
Research: Larger and more instructable language models become less reliable
In an article published in the journal Nature, researchers examined the limitations of large language models (LLMs) as they were scaled up and refined. They found that, while these models became more powerful, they often made surprising errors on easy tasks and generated plausible-sounding but incorrect responses to complex questions. The authors highlighted the need for a more robust approach to developing artificial intelligence (AI), especially in critical applications where systematic errors were particularly problematic.
Background
LLMs have gained widespread use across a range of disciplines, including education, medicine, and administration. Despite advances in scaling these models and shaping them with techniques such as human feedback, their reliability remains a concern. Earlier work demonstrated that increasing model size and incorporating techniques such as fine-tuning and reinforcement learning from human feedback improved performance but also introduced new inconsistencies, particularly in user-perceived reliability.
Previous studies have explored issues like prompt sensitivity and task avoidance; however, the underlying causes of model errors and unpredictable behavior remain unclear. This paper addresses this gap by analyzing key factors affecting LLM reliability—specifically difficulty concordance, task avoidance, and prompt stability—across different model families and benchmarks, offering insights into how these elements interact to shape model behavior.
Methodology and Experimental Design
The authors evaluated LLMs using five benchmarks: addition, anagram, locality, science, and transforms. These tasks covered a wide range of numerical, linguistic, geographical, and information-processing skills to assess the models' performance across varied difficulty levels.
Each benchmark targeted a distinct ability: addition involved arithmetic calculation; anagrams tested vocabulary and problem-solving; locality required geographical reasoning; science examined the ability to handle basic and advanced science questions; and transforms simulated real-world data manipulation.
The benchmarks were carefully selected to reflect the real-world challenges LLMs face. For example, addition tasks ranged from simple to complex, while anagram difficulty depended on factors such as letter frequency and word length. The locality benchmark drew from global city data, and the science benchmark included questions from the OpenBookQA (OBQA) and graduate-level Google-proof question-and-answer (GPQA) benchmarks.
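The article does not reproduce the paper's exact difficulty functions, but the proxies it names are straightforward to illustrate. The Python sketch below shows one plausible implementation: counting carry operations for an addition problem, and combining word length with letter rarity for an anagram. The letter-frequency table and the way the two anagram factors are weighted are assumptions for illustration, not the authors' formulas.

```python
def count_carries(a: int, b: int) -> int:
    """Number of carry operations required to add two non-negative integers."""
    carries, carry = 0, 0
    while a > 0 or b > 0:
        carry = 1 if (a % 10 + b % 10 + carry) >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries


# Approximate English letter frequencies (percent), for illustration only.
LETTER_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.10, "z": 0.07,
}


def anagram_difficulty(word: str) -> float:
    """Hypothetical proxy: longer words built from rarer letters score as harder."""
    letters = [c for c in word.lower() if c.isalpha()]
    rarity = sum(1.0 / LETTER_FREQ.get(c, 0.1) for c in letters) / len(letters)
    return len(letters) * rarity


print(count_carries(58, 67))          # 2 carries: units and tens columns
print(anagram_difficulty("quartz"))   # rarer letters -> higher score
```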
Due to the complexity of assessing correctness, relevance, and verbosity, a mix of algorithmic and manual evaluations was used to score LLM responses. To mimic real-world usage, a diverse set of prompt templates was designed to reflect how humans naturally interact with LLMs. The study tested models from several families, including OpenAI's generative pre-trained transformer (GPT) models, Meta's Large Language Model Meta AI (LLaMA), and BigScience's Large Open-science Open-access Multilingual language model (BLOOM), applying different settings and scales to ensure robust analysis across all tasks.
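As a concrete example of the algorithmic side of that scoring, the following sketch classifies a model's answer to an addition question as correct, avoidant, or incorrect, the three outcome types the study distinguishes. The avoidance phrases and the take-the-last-number heuristic are simplifying assumptions; the paper's actual grading rules are more elaborate and were supplemented by manual review.

```python
import re

# Phrases treated here as the model declining to answer; an illustrative list,
# not the detection rules used in the paper.
AVOIDANCE_MARKERS = ("i don't know", "i cannot", "i'm unable", "as an ai")


def grade_addition(response: str, a: int, b: int) -> str:
    """Classify a response to 'What is a + b?' as 'correct', 'avoidant', or 'incorrect'."""
    text = response.strip().lower()
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"
    # Heuristic: take the last number mentioned as the model's final answer.
    numbers = re.findall(r"-?\d[\d,]*", text)
    if numbers and int(numbers[-1].replace(",", "")) == a + b:
        return "correct"
    return "incorrect"


print(grade_addition("The sum of 58 and 67 is 125.", 58, 67))  # correct
print(grade_addition("I cannot compute that.", 58, 67))        # avoidant
```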
Performance Trends by Difficulty Level
The authors presented an analysis of the performance of various models from the GPT and LLaMA families across five domains: addition, anagram, locality, science, and transforms. As models were scaled up and shaped, a steady increase in the proportion of correct responses was observed.
The researchers quantified model performance by examining the average results across 15 prompt templates for each benchmark, revealing that correctness decreased as difficulty increased. This correlation was captured using several proxies for difficulty, including the number of carry operations in addition and human-judged difficulty for science.
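In practice, this kind of analysis amounts to grading each response, assigning each instance a difficulty score, and averaging correctness within difficulty bins. A minimal sketch of that aggregation step is shown below; the equal-width binning and the grade labels are assumptions for illustration, not the paper's exact procedure.

```python
from collections import defaultdict
from statistics import mean


def accuracy_by_difficulty(records, num_bins=5):
    """records: list of (difficulty, grade) pairs, where grade is
    'correct', 'avoidant', or 'incorrect'. Returns mean correctness
    per equal-width difficulty bin, from easiest (0) to hardest."""
    difficulties = [d for d, _ in records]
    lo, hi = min(difficulties), max(difficulties)
    width = (hi - lo) / num_bins or 1.0
    bins = defaultdict(list)
    for d, grade in records:
        idx = min(int((d - lo) / width), num_bins - 1)
        bins[idx].append(1.0 if grade == "correct" else 0.0)
    return {idx: mean(vals) for idx, vals in sorted(bins.items())}


# Illustrative data only: harder instances are graded correct less often.
records = [(1, "correct"), (2, "correct"), (4, "correct"), (5, "incorrect"),
           (7, "incorrect"), (8, "avoidant"), (9, "incorrect"), (10, "incorrect")]
print(accuracy_by_difficulty(records))
```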
Despite the increase in correctness with model scaling, a phenomenon of difficulty discordance arose, where even easy tasks often led to incorrect outputs. Notably, shaped-up models tended to produce more confident yet incorrect answers compared to their raw counterparts, indicating a shift from avoidance to incorrectness. This trend was less pronounced in the LLaMA family.
Prompt sensitivity varied across models, with raw models exhibiting higher sensitivity to prompt variations. While shaped-up models demonstrated increased stability, they still revealed pockets of unreliability. The findings suggested that current methods of user supervision may not effectively mitigate the remaining unreliability of model outputs, emphasizing the need for more sophisticated prompt engineering.
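One simple way to quantify this kind of prompt sensitivity is to measure how much a model's accuracy on the same benchmark varies across prompt templates. The sketch below assumes per-template accuracies are already available; the numbers in the example are illustrative, not results from the paper.

```python
from statistics import mean, pstdev


def prompt_sensitivity(template_accuracies):
    """Return (mean accuracy, spread across templates). A larger spread
    indicates the model's behaviour depends more on how the task is phrased."""
    return mean(template_accuracies), pstdev(template_accuracies)


# Illustrative accuracies for one model on one benchmark under 15 prompt templates.
accs = [0.82, 0.79, 0.85, 0.50, 0.81, 0.78, 0.84, 0.80,
        0.77, 0.83, 0.49, 0.80, 0.82, 0.79, 0.81]
avg, spread = prompt_sensitivity(accs)
print(f"mean accuracy {avg:.2f}, spread across templates {spread:.2f}")
```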
Enhancing Human-AI Verification and Reliability
The researchers conducted two human studies. One examined how well human perceptions of difficulty corresponded to the actual difficulty of the tasks given to the models, and the other investigated whether humans were prone to incorrectly accepting wrong AI outputs as correct. The findings suggested that optimizing difficulty alignment and reducing such verification errors should be considered when training models.
However, limitations were noted, including a predominantly non-expert participant base and the absence of data reflecting real-world prompt frequency. The authors emphasized the importance of addressing reliability issues in LLMs like GPT, LLaMA, and BLOOM while advocating for improved methodologies to shape future LLM development.
Conclusion
In conclusion, the researchers highlighted significant limitations in LLMs as they scaled, revealing their tendency to generate incorrect outputs, even in simple tasks. Despite improvements in performance through fine-tuning and human feedback, reliability remained a critical concern, particularly in high-stakes applications.
The study's insights into difficulty concordance, task avoidance, and prompt sensitivity emphasized the need for new strategies in AI development. By optimizing difficulty alignment and reducing verification errors, future methodologies could enhance the reliability and effectiveness of LLMs, ensuring safer and more trustworthy deployment in essential fields.
Journal reference:
- Zhou, L., Schellaert, W., Martínez-Plumed, F., Moros-Daval, Y., Ferri, C., & Hernández-Orallo, J. (2024). Larger and more instructable language models become less reliable. Nature. DOI: 10.1038/s41586-024-07930-y, https://www.nature.com/articles/s41586-024-07930-y