Despite their fluency, AI language models stumble on basic comprehension tasks, revealing critical gaps compared to human understanding.
Study: Testing AI on language comprehension tasks reveals insensitivity to underlying meaning.
In a paper published in the journal Scientific Reports, researchers assessed seven state-of-the-art large language models (LLMs) on a new benchmark, discovering that these models performed at chance accuracy and produced inconsistent answers. Humans outperformed the models both quantitatively and qualitatively.
The study indicated that current LLMs fall short of human-like language understanding, a shortfall attributed to the absence of a "compositional operator" for effectively handling grammatical and semantic information. This operator is essential for mapping linguistic structures to meanings in a way that generalizes across contexts.
Background
Past work highlighted that LLMs excel in tasks ranging from translation to answering domain-specific queries, such as in law or medicine. Despite this fluency, they often struggle with simpler linguistic tasks, revealing inconsistencies in language understanding compared to humans.
Researchers questioned whether models truly grasp meaning or merely predict tokens based on data patterns, pointing out that errors in comprehension can have significant real-world consequences, such as misleading chatbot interactions that could impact industries like customer service or healthcare.
LLM Comprehension Evaluation
The study evaluated the language comprehension abilities of seven LLMs using a set of 40 comprehension questions designed to minimize grammatical complexity. These prompts included only affirmative sentences, avoided negations, and used common verbs to reduce ambiguity. Each prompt was tested multiple times to assess answer stability, with models responding in open-length and one-word settings. Human performance was compared using the same questions administered to 400 English-speaking participants, equally split by gender, recruited from the Prolific platform.
Each LLM was tested in December 2023 through OpenAI, Google, and HuggingFace interfaces. Models like ChatGPT-3.5, ChatGPT-4, Bard, and Gemini, which leverage reinforcement learning from human feedback (RLHF), were included. Prompts were randomized and presented in both settings to ensure robust comparisons.
Accuracy was coded leniently, favoring the models where possible to gauge their best performance. For instance, ambiguous answers in the open-length condition were marked correct if they contained no clear errors, even if they lacked precision. A total of 1,680 LLM replies were analyzed, with models prompted three times per question to mirror human testing.
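For illustration only, the sketch below mimics the repeated-prompting protocol described above; it is not the authors' code, and the model labels, the `query_model` helper, the instruction wording, and the output file are hypothetical placeholders.

```python
import json

# The study's 40 comprehension questions are not reproduced here.
QUESTIONS: list[str] = []
SETTINGS = {
    "open_length": "Answer the following question.",
    "one_word": "Answer the following question with one word only.",
}
MODELS = ["chatgpt-3.5", "chatgpt-4", "bard", "gemini", "falcon", "llama2", "mixtral"]
N_REPETITIONS = 3  # each prompt is repeated three times, mirroring the human protocol


def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around an OpenAI, Google, or HuggingFace interface."""
    raise NotImplementedError


replies = []
for model in MODELS:
    for setting, instruction in SETTINGS.items():
        for question in QUESTIONS:
            for repetition in range(N_REPETITIONS):
                replies.append({
                    "model": model,
                    "setting": setting,
                    "question": question,
                    "repetition": repetition,
                    "answer": query_model(model, f"{instruction} {question}"),
                })

# 7 models x 2 settings x 40 questions x 3 repetitions = 1,680 replies
with open("llm_replies.json", "w") as handle:
    json.dump(replies, handle, indent=2)
```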
The human study, approved by the ethics committee at Humboldt-Universität zu Berlin, involved 400 participants tested under similar conditions. Each participant answered 20 prompts, each repeated thrice, resulting in 24,000 replies.
Participants were divided into open-length and one-word groups. Questions were administered in random order, alongside two attention checks.
Responses were coded for accuracy and stability, and participants who failed the attention checks were excluded. Human replies were collected using the jsPsych toolkit, with a median experiment completion time of 13.4 minutes.
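A rough sketch of this cleaning and accuracy-coding step (using pandas, with assumed column names rather than the authors' actual data layout) might look as follows.

```python
import pandas as pd

# Assumed columns: participant, group, question, repetition, answer, passed_checks, correct
df = pd.read_csv("human_replies.csv")

# Exclude participants who failed either attention check.
df = df[df.groupby("participant")["passed_checks"].transform("all")]

# Accuracy: share of correct replies per participant and response group.
accuracy = df.groupby(["participant", "group"])["correct"].mean()
print(accuracy.describe())
```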
The researchers also highlighted how systematic testing allowed comparisons between the stability of human and LLM responses under identical conditions, revealing key performance gaps.
Human-LLM Performance Comparison
The study compared the language comprehension of seven LLMs and human participants, focusing on accuracy and stability. Accuracy analyses used generalized linear mixed-effects models (GLMMs) to evaluate model performance. Statistical testing confirmed that LLMs, as a group, performed at chance accuracy, with significant variability between models. ChatGPT-4 emerged as the most accurate LLM, significantly outperforming the others.
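The article does not give the exact model specification, but as an illustrative sketch, a binomial GLMM of per-reply accuracy can be fitted in Python with statsmodels (the column names, the random-effects structure, and the 50% chance level assumed in the binomial test are all assumptions, not details from the study).

```python
import pandas as pd
from scipy.stats import binomtest
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Assumed layout: one row per reply with a binary `correct` column.
df = pd.read_csv("llm_replies_coded.csv")

# Fixed effects for model and response setting; random intercepts per question.
glmm = BinomialBayesMixedGLM.from_formula(
    "correct ~ C(model) + C(setting)",
    {"question": "0 + C(question)"},
    data=df,
)
print(glmm.fit_vb().summary())  # variational Bayes fit

# Simpler sanity check: is overall accuracy distinguishable from chance (0.5 assumed)?
print(binomtest(int(df["correct"].sum()), len(df), p=0.5))
```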
The study revealed higher accuracy in one-word responses than in the open-length setting. Falcon and ChatGPT-4 were noted for consistently providing accurate responses. However, Llama 2 (Large Language Model Meta AI 2) and Mixtral showed chance-level performance, and Bard's accuracy dropped below chance. In contrast, humans performed above chance regardless of response type, indicating robust and contextually grounded comprehension abilities.
Stability assessments measured the consistency of answers across repeated prompts, coded as stable if identical or unstable if varied. Stability varied significantly between models, with Falcon proving to be the most stable. Bard and Mixtral demonstrated lower consistency, while Gemini displayed stability despite providing inaccurate answers.
A setting effect was observed, with responses being more stable in the one-word condition. Overall, LLMs were less stable than humans, especially in open-length settings, highlighting their inability to replicate the inherent consistency of human language comprehension.
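As a minimal sketch of this stability coding (an illustration under assumed data fields, not the published analysis), a prompt counts as stable only if all of its repetitions yield the same answer.

```python
from collections import defaultdict


def stability_rate(replies):
    """replies: dicts with 'model', 'setting', 'question', and 'answer' keys.

    Returns the share of (model, setting, question) triples whose repeated
    answers are identical (after trivial whitespace/case normalization,
    which is an added assumption).
    """
    grouped = defaultdict(list)
    for reply in replies:
        key = (reply["model"], reply["setting"], reply["question"])
        grouped[key].append(reply["answer"].strip().lower())
    flags = [len(set(answers)) == 1 for answers in grouped.values()]
    return sum(flags) / len(flags) if flags else float("nan")
```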
Comparative analyses between humans and LLMs highlighted notable differences. Humans outperformed LLMs in both accuracy and stability, even when the best-performing model, ChatGPT-4, was compared with human participants, who approached ceiling performance. Statistical models revealed that the performance gap between LLMs and humans widened in open-length settings, while the one-word setting narrowed it.
Despite ChatGPT-4's high performance, it did not match the best human participants. The data suggested that humans maintained superior comprehension and stability, even when LLMs benefited from favorable coding rules. For instance, humans provided concise, error-free answers aligned with task instructions, while LLMs frequently added redundant or irrelevant content.
Conclusion
To sum up, the study revealed that while LLMs demonstrated utility in various tasks, they performed at chance accuracy on a language comprehension benchmark. Their responses were inconsistent and included errors unlike those made by humans, suggesting critical limitations in understanding linguistic meaning beyond surface-level patterns.
The results indicated that current AI models lack the compositional operator needed to handle grammatical and semantic information effectively. These findings call for a reevaluation of claims that LLMs have achieved human-like linguistic capabilities, particularly when they are applied in real-world contexts where misinterpretation can have serious consequences.