Scaling Large Language Models Makes Them Less Reliable, Producing Confident but Incorrect Answers

Despite their power, larger AI models are prone to surprising errors, generating wrong answers with confidence—researchers call for new strategies to improve reliability in critical areas.

Research: Larger and more instructable language models become less reliable

In an article published in the journal Nature, researchers examined the limitations of large language models (LLMs) as they were scaled up and refined. They found that while these models became more powerful, they often made surprising errors on easy tasks and generated plausible-sounding but incorrect responses to complex questions. The authors highlighted the need for a more robust approach to developing artificial intelligence (AI), especially in critical applications where systematic errors are particularly problematic.

Background

LLMs have gained widespread use across disciplines such as education, medicine, and administration. Despite advancements in scaling these models and shaping them with human feedback, their reliability remains a concern. Earlier work demonstrated that increasing model size and applying fine-tuning and reinforcement learning from human feedback improved performance but also introduced new inconsistencies, particularly in user-perceived reliability.

Previous studies have explored issues like prompt sensitivity and task avoidance; however, the underlying causes of model errors and unpredictable behavior remain unclear. This paper addresses this gap by analyzing key factors affecting LLM reliability—specifically difficulty concordance, task avoidance, and prompt stability—across different model families and benchmarks, offering insights into how these elements interact to shape model behavior.

Methodology and Experimental Design

The authors evaluated LLMs using five benchmarks: addition, anagram, locality, science, and transforms. These tasks covered a wide range of numerical, linguistic, geographical, and information-processing skills to assess the models' performance across varied difficulty levels.

Each benchmark targeted a distinct cognitive ability: addition involved arithmetic calculation; anagram tested vocabulary and letter rearrangement; locality required geographical knowledge; science examined the ability to handle basic and advanced science questions; and transforms simulated real-world data manipulation.

The benchmarks were carefully selected to reflect the real-world challenges LLMs face. For example, addition tasks ranged from simple to complex, while anagram difficulty depended on factors such as letter frequency and word length. The locality benchmark drew on global city data, and the science benchmark included questions from the OpenBookQA (OBQA) and Graduate-Level Google-Proof Q&A (GPQA) benchmarks.
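As a rough, hypothetical illustration of how such difficulty-graded items could be constructed (the generation parameters and helper names below are this article's assumptions, not the authors' code), the following Python sketch builds addition and anagram items and attaches a simple difficulty score to each:

```python
# Illustrative sketch only: task names mirror the paper, but the item
# generation below is an assumption, not the study's actual pipeline.
import random

def count_carries(a: int, b: int) -> int:
    """Number of carry operations when adding a and b (a difficulty proxy for addition)."""
    carries, carry = 0, 0
    while a > 0 or b > 0:
        s = (a % 10) + (b % 10) + carry
        carry = 1 if s >= 10 else 0
        carries += carry
        a, b = a // 10, b // 10
    return carries

def make_addition_item(n_digits: int) -> dict:
    """Create one addition item; longer operands tend to produce more carries."""
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return {"task": "addition", "question": f"{a} + {b}",
            "answer": str(a + b), "difficulty": count_carries(a, b)}

def make_anagram_item(word: str) -> dict:
    """Create one anagram item; word length serves as a crude difficulty proxy."""
    letters = list(word)
    random.shuffle(letters)
    return {"task": "anagram", "question": "".join(letters),
            "answer": word, "difficulty": len(word)}

if __name__ == "__main__":
    print(make_addition_item(5))
    print(make_anagram_item("reliability"))
```

Counting carry operations mirrors the difficulty proxy the paper reports for addition, while word length stands in here for the anagram factors described above.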

Because assessing correctness, relevance, and verbosity is complex, a mix of algorithmic and manual evaluations was used to score LLM responses. To mimic real-world usage, a diverse set of prompt templates was designed to reflect natural human interaction with LLMs. The study tested models from the generative pre-trained transformer (GPT), Large Language Model Meta AI (LLaMA), and BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) families, applying different settings and scales to ensure robust analysis across all tasks.
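The snippet below is a minimal sketch of this evaluation pattern, assuming three made-up prompt templates and a crude token-match scorer in place of the study's fifteen natural-language templates and its combined algorithmic and manual grading:

```python
# Hypothetical sketch: templates and the scoring rule are assumptions
# standing in for the study's prompt set and mixed evaluation.
import re

PROMPT_TEMPLATES = [
    "What is {q}?",
    "Please compute {q} and give only the result.",
    "{q} = ?",
]

def render_prompts(question: str) -> list[str]:
    """Wrap one benchmark question in every template."""
    return [t.format(q=question) for t in PROMPT_TEMPLATES]

def algorithmic_score(response: str, gold: str) -> bool:
    """Crude automatic check: does the response contain the gold answer as a token?"""
    tokens = re.findall(r"[A-Za-z0-9]+", response.lower())
    return gold.lower() in tokens

if __name__ == "__main__":
    print(render_prompts("20753 + 94088"))
    print(algorithmic_score("The sum is 114841.", "114841"))  # True
```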

Performance Trends by Difficulty Level

The authors presented an analysis of the performance of various models from the GPT and LLaMA families across five domains: addition, anagram, locality, science, and transforms. As models were scaled up and shaped up, a steady increase in correct responses was observed, with the gains most evident in the most recent, most heavily shaped models of each family.

The researchers quantified model performance by examining the average results across 15 prompt templates for each benchmark, revealing that correctness decreased as difficulty increased. This correlation was captured using several proxies for difficulty, including the number of carry operations in addition and human-judged difficulty for science.
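To make that aggregation concrete, a hypothetical helper like the one below could bucket graded responses by a difficulty proxy (such as the carry count) and average correctness within each bucket; the record format is an assumption for illustration:

```python
# Assumed record format: one entry per (item, template) response, with a
# difficulty proxy and a correctness flag already attached.
from collections import defaultdict
from statistics import mean

def correctness_by_difficulty(grades: list[dict]) -> dict[int, float]:
    """Average correctness (over items and templates) at each difficulty level."""
    buckets = defaultdict(list)
    for g in grades:
        buckets[g["difficulty"]].append(1.0 if g["correct"] else 0.0)
    return {d: mean(v) for d, v in sorted(buckets.items())}

if __name__ == "__main__":
    toy = [
        {"difficulty": 0, "correct": True}, {"difficulty": 0, "correct": True},
        {"difficulty": 3, "correct": True}, {"difficulty": 3, "correct": False},
        {"difficulty": 7, "correct": False},
    ]
    print(correctness_by_difficulty(toy))  # correctness falls as difficulty rises
```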

Despite the increase in correctness with model scaling, a difficulty discordance emerged: even easy tasks often led to incorrect outputs. Notably, shaped-up models tended to produce more confident yet incorrect answers than their raw counterparts, indicating a shift from avoidance to incorrectness. This trend was less pronounced in the LLaMA family.
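A simple way to picture this three-way split (correct, avoidant, incorrect) is sketched below; the avoidance phrases are illustrative guesses rather than the study's annotation rules:

```python
# Sketch of the correct / avoidant / incorrect outcome split; the marker
# phrases are assumptions for illustration only.
AVOIDANCE_MARKERS = ("i don't know", "i cannot", "i'm not sure", "unable to answer")

def classify(response: str, gold: str) -> str:
    text = response.lower()
    if any(m in text for m in AVOIDANCE_MARKERS):
        return "avoidant"
    return "correct" if gold.lower() in text else "incorrect"

def outcome_rates(responses: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of correct / avoidant / incorrect outcomes for one model."""
    labels = [classify(r, g) for r, g in responses]
    return {k: labels.count(k) / len(labels) for k in ("correct", "avoidant", "incorrect")}

if __name__ == "__main__":
    raw_model = [("I'm not sure about that.", "114841"), ("114841", "114841")]
    shaped_model = [("The answer is 114840.", "114841"), ("114841", "114841")]
    print("raw:", outcome_rates(raw_model))        # avoidance instead of an error
    print("shaped:", outcome_rates(shaped_model))  # avoidance replaced by a confident error
```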

Prompt sensitivity varied across models, with raw models exhibiting higher sensitivity to prompt variations. While shaped-up models demonstrated increased stability, they still revealed pockets of unreliability. The findings suggested that current methods of user supervision may not effectively mitigate the remaining unreliability of model outputs, emphasizing the need for more sophisticated prompt engineering.
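One plausible way to quantify this kind of prompt sensitivity, not necessarily the paper's exact metric, is the spread of accuracy across templates for the same underlying questions:

```python
# Hypothetical prompt-sensitivity measure: spread of per-template accuracy.
from statistics import pstdev

def prompt_sensitivity(per_template_accuracy: dict[str, float]) -> float:
    """Standard deviation of accuracy across prompt templates;
    lower values indicate a more prompt-stable model."""
    return pstdev(per_template_accuracy.values())

if __name__ == "__main__":
    raw = {"t1": 0.62, "t2": 0.35, "t3": 0.48}      # raw model: large swings
    shaped = {"t1": 0.71, "t2": 0.66, "t3": 0.69}   # shaped-up model: more stable
    print(prompt_sensitivity(raw), prompt_sensitivity(shaped))
```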

Enhancing Human-AI Verification and Reliability

The researchers conducted two human studies. One examined whether the difficulty people expected a question to pose matched the difficulty the models actually exhibited, and the other investigated whether human supervisors were prone to accepting incorrect AI outputs as correct. The findings suggested that optimizing difficulty alignment and reducing verification errors should be considered when training models.

However, limitations were noted, including a predominantly non-expert participant base and the absence of data reflecting real-world prompt frequency. The authors emphasized the importance of addressing reliability issues in LLMs like GPT, LLaMA, and BLOOM while advocating for improved methodologies to shape future LLM development.

Conclusion

The researchers highlighted significant limitations in LLMs as they scale, revealing a tendency to generate incorrect outputs even on simple tasks. Despite performance gains from fine-tuning and human feedback, reliability remained a critical concern, particularly in high-stakes applications.

The study's insights into difficulty concordance, task avoidance, and prompt sensitivity emphasized the need for new strategies in AI development. By optimizing difficulty alignment and reducing verification errors, future methodologies could enhance the reliability and effectiveness of LLMs, ensuring safer and more trustworthy deployment in essential fields.

Journal reference:
  • Zhou, L., Schellaert, W., Martínez-Plumed, F., Moros-Daval, Y., Ferri, C., & Hernández-Orallo, J. (2024). Larger and more instructable language models become less reliable. Nature. DOI: 10.1038/s41586-024-07930-y, https://www.nature.com/articles/s41586-024-07930-y

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.

