Despite their power, larger AI models are prone to surprising errors, generating wrong answers with confidence—researchers call for new strategies to improve reliability in critical areas.
Research: Larger and more instructable language models become less reliable
In an article published in the journal Nature, researchers examined the limitations of large language models (LLMs) as they were scaled up and refined. They found that, while these models became more powerful, they often made surprising errors on easy tasks and generated plausible-sounding but incorrect responses to complex questions. The authors highlighted the need for a more robust approach to developing artificial intelligence (AI), especially in critical applications where systematic errors were particularly problematic.
Background
LLMs have gained widespread use across a range of disciplines, including education, medicine, and administration. Despite advances in scaling these models and shaping them with techniques such as human feedback, their reliability remains a concern. Earlier work demonstrated that increasing model size and incorporating techniques such as fine-tuning and reinforcement learning from human feedback improved performance but also introduced new inconsistencies, particularly in user-perceived reliability.
Previous studies have explored issues like prompt sensitivity and task avoidance; however, the underlying causes of model errors and unpredictable behavior remain unclear. This paper addresses this gap by analyzing key factors affecting LLM reliability—specifically difficulty concordance, task avoidance, and prompt stability—across different model families and benchmarks, offering insights into how these elements interact to shape model behavior.
Methodology and Experimental Design
The authors evaluated LLMs using five benchmarks: addition, anagram, locality, science, and transforms. These tasks covered a wide range of numerical, linguistic, geographical, and information-processing skills to assess the models' performance across varied difficulty levels.
Each benchmark targeted a distinct ability: addition involved arithmetic calculation; anagrams tested vocabulary and problem-solving; locality required geographical reasoning; science examined the ability to handle basic and advanced science questions; and transforms simulated real-world data manipulation.
The benchmarks were carefully selected to reflect the real-world challenges LLMs face. For example, addition tasks ranged from simple to complex, while anagram difficulty depended on factors such as letter frequency and word length. The locality benchmark drew from global city data, and the science benchmark included questions from the OpenBookQA (OBQA) and graduate-level Google-proof question-and-answer (GPQA) benchmarks.
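The article does not reproduce the paper's exact difficulty functions, but the proxies it names are straightforward to illustrate. The Python sketch below shows one plausible implementation: counting carry operations for an addition problem, and combining word length with letter rarity for an anagram. The letter-frequency table and the way the two anagram factors are weighted are assumptions for illustration, not the authors' formulas.

```python
def count_carries(a: int, b: int) -> int:
    """Number of carry operations required to add two non-negative integers."""
    carries, carry = 0, 0
    while a > 0 or b > 0:
        carry = 1 if (a % 10 + b % 10 + carry) >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries


# Approximate English letter frequencies (percent), for illustration only.
LETTER_FREQ = {
    "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7, "s": 6.3,
    "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0, "c": 2.8, "u": 2.8, "m": 2.4,
    "w": 2.4, "f": 2.2, "g": 2.0, "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0,
    "k": 0.8, "j": 0.15, "x": 0.15, "q": 0.10, "z": 0.07,
}


def anagram_difficulty(word: str) -> float:
    """Hypothetical proxy: longer words built from rarer letters score as harder."""
    letters = [c for c in word.lower() if c.isalpha()]
    rarity = sum(1.0 / LETTER_FREQ.get(c, 0.1) for c in letters) / len(letters)
    return len(letters) * rarity


print(count_carries(58, 67))          # 2 carries: units and tens columns
print(anagram_difficulty("quartz"))   # rarer letters -> higher score
```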
Due to the complexity of assessing correctness, relevance, and verbosity, a mix of algorithmic and manual evaluations was used to score LLM responses. To mimic real-world usage, a diverse set of prompt templates was designed to reflect how humans naturally interact with LLMs. The study tested models from several families, including OpenAI's generative pre-trained transformer (GPT) models, Meta's Large Language Model Meta AI (LLaMA), and BigScience's Large Open-science Open-access Multilingual language model (BLOOM), applying different settings and scales to ensure robust analysis across all tasks.
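As a concrete example of the algorithmic side of that scoring, the following sketch classifies a model's answer to an addition question as correct, avoidant, or incorrect, the three outcome types the study distinguishes. The avoidance phrases and the take-the-last-number heuristic are simplifying assumptions; the paper's actual grading rules are more elaborate and were supplemented by manual review.

```python
import re

# Phrases treated here as the model declining to answer; an illustrative list,
# not the detection rules used in the paper.
AVOIDANCE_MARKERS = ("i don't know", "i cannot", "i'm unable", "as an ai")


def grade_addition(response: str, a: int, b: int) -> str:
    """Classify a response to 'What is a + b?' as 'correct', 'avoidant', or 'incorrect'."""
    text = response.strip().lower()
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"
    # Heuristic: take the last number mentioned as the model's final answer.
    numbers = re.findall(r"-?\d[\d,]*", text)
    if numbers and int(numbers[-1].replace(",", "")) == a + b:
        return "correct"
    return "incorrect"


print(grade_addition("The sum of 58 and 67 is 125.", 58, 67))  # correct
print(grade_addition("I cannot compute that.", 58, 67))        # avoidant
```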
Performance Trends by Difficulty Level
The authors presented an analysis of the performance of various models from the GPT and LLaMA families across five domains: addition, anagram, locality, science, and transforms. As models were scaled up and shaped, a steady increase in the proportion of correct responses was observed.
The researchers quantified model performance by examining the average results across 15 prompt templates for each benchmark, revealing that correctness decreased as difficulty increased. This correlation was captured using several proxies for difficulty, including the number of carry operations in addition and human-judged difficulty for science.
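In practice, this kind of analysis amounts to grading each response, assigning each instance a difficulty score, and averaging correctness within difficulty bins. A minimal sketch of that aggregation step is shown below; the equal-width binning and the grade labels are assumptions for illustration, not the paper's exact procedure.

```python
from collections import defaultdict
from statistics import mean


def accuracy_by_difficulty(records, num_bins=5):
    """records: list of (difficulty, grade) pairs, where grade is
    'correct', 'avoidant', or 'incorrect'. Returns mean correctness
    per equal-width difficulty bin, from easiest (0) to hardest."""
    difficulties = [d for d, _ in records]
    lo, hi = min(difficulties), max(difficulties)
    width = (hi - lo) / num_bins or 1.0
    bins = defaultdict(list)
    for d, grade in records:
        idx = min(int((d - lo) / width), num_bins - 1)
        bins[idx].append(1.0 if grade == "correct" else 0.0)
    return {idx: mean(vals) for idx, vals in sorted(bins.items())}


# Illustrative data only: harder instances are graded correct less often.
records = [(1, "correct"), (2, "correct"), (4, "correct"), (5, "incorrect"),
           (7, "incorrect"), (8, "avoidant"), (9, "incorrect"), (10, "incorrect")]
print(accuracy_by_difficulty(records))
```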
Despite the increase in correctness with model scaling, a phenomenon of difficulty discordance arose, where even easy tasks often led to incorrect outputs. Notably, shaped-up models tended to produce more confident yet incorrect answers compared to their raw counterparts, indicating a shift from avoidance to incorrectness. This trend was less pronounced in the LLaMA family.
Prompt sensitivity varied across models, with raw models exhibiting higher sensitivity to prompt variations. While shaped-up models demonstrated increased stability, they still revealed pockets of unreliability. The findings suggested that current methods of user supervision may not effectively mitigate the remaining unreliability of model outputs, emphasizing the need for more sophisticated prompt engineering.
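One simple way to quantify this kind of prompt sensitivity is to measure how much a model's accuracy on the same benchmark varies across prompt templates. The sketch below assumes per-template accuracies are already available; the numbers in the example are illustrative, not results from the paper.

```python
from statistics import mean, pstdev


def prompt_sensitivity(template_accuracies):
    """Return (mean accuracy, spread across templates). A larger spread
    indicates the model's behaviour depends more on how the task is phrased."""
    return mean(template_accuracies), pstdev(template_accuracies)


# Illustrative accuracies for one model on one benchmark under 15 prompt templates.
accs = [0.82, 0.79, 0.85, 0.50, 0.81, 0.78, 0.84, 0.80,
        0.77, 0.83, 0.49, 0.80, 0.82, 0.79, 0.81]
avg, spread = prompt_sensitivity(accs)
print(f"mean accuracy {avg:.2f}, spread across templates {spread:.2f}")
```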
Enhancing Human-AI Verification and Reliability
The researchers conducted two human studies. One examined how well human perceptions of difficulty corresponded to the actual difficulty of the tasks given to the models, and the other investigated whether humans were prone to incorrectly accepting wrong AI outputs as correct. The findings suggested that optimizing difficulty alignment and reducing such verification errors should be considered when training models.
However, limitations were noted, including a predominantly non-expert participant base and the absence of data reflecting real-world prompt frequency. The authors emphasized the importance of addressing reliability issues in LLMs like GPT, LLaMA, and BLOOM while advocating for improved methodologies to shape future LLM development.
Conclusion
In conclusion, the researchers highlighted significant limitations in LLMs as they scaled, revealing their tendency to generate incorrect outputs, even in simple tasks. Despite improvements in performance through fine-tuning and human feedback, reliability remained a critical concern, particularly in high-stakes applications.
The study's insights into difficulty concordance, task avoidance, and prompt sensitivity emphasized the need for new strategies in AI development. By optimizing difficulty alignment and reducing verification errors, future methodologies could enhance the reliability and effectiveness of LLMs, ensuring safer and more trustworthy deployment in essential fields.
Journal reference:
- Zhou, L., Schellaert, W., Martínez-Plumed, F., Moros-Daval, Y., Ferri, C., & Hernández-Orallo, J. (2024). Larger and more instructable language models become less reliable. Nature. DOI: 10.1038/s41586-024-07930-y, https://www.nature.com/articles/s41586-024-07930-y