While children and adults effortlessly solve analogies across familiar and new domains, large language models falter, exposing the limits of artificial intelligence in understanding abstract relationships.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article submitted to the arXiv preprint* server, researchers at the University of Amsterdam, the Netherlands, and the Santa Fe Institute, USA, investigated whether large language models (LLMs) can generalize analogy solving to new domains as humans can.
Children, adults, and LLMs were tested on letter-string analogies in the Latin alphabet, the Greek alphabet, and a list of symbols. While humans easily generalized their knowledge across domains, LLMs struggled with transfer, a difference that highlights the significant challenge LLMs face in achieving human-like analogical reasoning.
Background
Past work has explored whether LLMs can generalize analogical reasoning to new domains as humans do. While children and adults can quickly transfer knowledge across domains, LLMs have difficulty, especially with more abstract or far-transfer analogies.
Letter-string analogies have been used to study this, showing that LLMs perform comparably to humans in familiar domains but face challenges in novel ones. This raises questions about whether LLMs truly understand analogical reasoning or simply mimic patterns.
Comparing Performance Across Models
The study compared the performance of 42 children (aged 7-9), 62 adults, and 54 attempts by each of four prominent LLMs (Claude-3.5, Gemma-2 27B, the generative pre-trained transformer GPT-4o, and Llama-3.1 405B) on a letter-string analogy task.
The analogies involved alphabetic string transformations, where participants had to generalize a transformation rule from one string to another.
The task, presented in three alphabets (Latin, Greek, and Symbol), tested how well participants could transfer learned patterns across familiar and unfamiliar alphabets. It was designed around simple transformations, such as successor and predecessor shifts or letter repetitions, that children were expected to recognize.
The letter-string analogy task was adapted for each alphabet. A series of transformations, such as "abc" changing to "abd," was used for the Latin alphabet and then adapted to Greek (for near transfer) and a unique Symbol alphabet (for far transfer).
The Greek alphabet was chosen because it visually resembles the Latin alphabet but is unfamiliar to children. In contrast, the Symbol alphabet was designed to be an entirely new and abstract domain. The goal was to test how participants generalized the transformation rules across different symbol sets.
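To make the transformation rules concrete, here is a minimal Python sketch of how a successor rule generalizes across alphabets. The Greek ordering and the Symbol set below are illustrative assumptions, not the study's actual materials.

```python
# Illustrative sketch of the letter-string analogy setup; the alphabets and
# Symbol set here are assumptions for demonstration, not the study's materials.

LATIN = list("abcdefghijklmnopqrstuvwxyz")
GREEK = list("αβγδεζηθικλμνξοπρστυφχψω")
SYMBOL = list("✦✧★☆◦●◉")  # hypothetical symbol alphabet

def successor_rule(letters, alphabet):
    """Increment the last element of the string, e.g. 'abc' -> 'abd'."""
    *head, last = letters
    return head + [alphabet[(alphabet.index(last) + 1) % len(alphabet)]]

print("".join(successor_rule(list("abc"), LATIN)))   # abd (source domain)
print("".join(successor_rule(list("αβγ"), GREEK)))   # αβδ (near transfer)
print("".join(successor_rule(list("✦✧★"), SYMBOL)))  # ✦✧☆ (far transfer)
```

The abstract rule ("increment the last element") stays the same; only the ordered set of symbols it operates over changes, which is exactly what the transfer conditions vary.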
Data collection for human participants was conducted online for adults and in person for children. Adults were recruited through Prolific and completed the task in a web browser, while children aged 7-9 were recruited from local Montessori schools and completed the task on tablets.
Both groups were given initial practice items to check understanding before completing the main task, which involved five items for each alphabet. Children were instructed verbally, while adults followed written instructions. In total, 42 children and 62 adults participated, with a few exclusions based on predefined criteria.
For the LLMs, data were collected from six different models, including Claude-3.5, Gemma-2 27B, GPT-4o, and Llama-3.1 405B. The models received the same task conditions as the human participants, including the Greek and Symbol alphabet versions, and were prompted in a zero-shot setting using prompt templates optimized for LLM performance.
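As a rough illustration of a zero-shot setup, a prompt for a single item might be assembled as below. The exact template wording used in the study is not reproduced here, so this phrasing is an assumption.

```python
# Hypothetical zero-shot prompt for one analogy item; the wording of the
# study's actual prompt templates may differ.

def build_prompt(source_pair, target_string):
    a, b = source_pair
    return (
        "Let's complete the pattern.\n"
        f"If {a} changes to {b}, then {target_string} changes to"
    )

print(build_prompt(("[a b c]", "[a b d]"), "[i j k]"))
# Let's complete the pattern.
# If [a b c] changes to [a b d], then [i j k] changes to
```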
Each model's performance was evaluated across task variations to ensure robust comparisons. The results showed that larger models outperformed smaller ones, while others, such as Mistral and Qwen, performed more poorly.
LLMs' Alphabet Performance
The study aimed to compare the performance of adults, children, and LLMs on letter-string analogy problems in different alphabets.
Mixed analyses of variance (ANOVAs) were conducted to evaluate (1) differences in performance between participant groups (adults, children, and LLMs) on the Latin alphabet and (2) the groups' ability to generalize analogy solving across alphabets (Latin, Greek, and Symbol).
The results revealed that, as expected, adults and some LLMs outperformed children in solving analogies with the Latin alphabet. OpenAI's GPT-4o performed similarly to adults, while Meta's Llama-3.1 405B followed closely behind. In contrast, Gemma-2 27B and Claude-3.5 had weaker performances in this domain.
The study found that while adults and children performed consistently across alphabets, LLMs' performance degraded from Latin to Greek and Symbol, particularly in the Symbol domain. LLMs excelled at simple transformations but struggled with more complex transformations, such as second successor rules.
To better understand the LLMs' struggles, a Next-Previous Letter Task was designed, where LLMs were asked to identify the previous and next letters in a sequence.
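A minimal sketch of such a probe might look like the following; the item format and scoring are assumptions for illustration, not the study's code.

```python
# Illustrative Next-Previous Letter probe: look up a letter's neighbors in a
# given alphabet to score a model's answer. Item format is an assumption.

def neighbors(letter, alphabet):
    i = alphabet.index(letter)
    previous = alphabet[i - 1] if i > 0 else None
    nxt = alphabet[i + 1] if i < len(alphabet) - 1 else None
    return previous, nxt

GREEK = list("αβγδεζηθικλμνξοπρστυφχψω")
print(neighbors("δ", GREEK))  # ('γ', 'ε')
```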
The results showed that while LLMs handled simple transformations successfully, they struggled with complex ones, particularly in less familiar alphabets.
Further error analysis revealed that the LLMs often relied on the "Literal rule," copying the final character rather than applying the correct transformation rule.
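For example, given "abc changes to abd; what does ijk change to?", the literal answer is "ijd" (copying the final "d") rather than the rule-based "ijl". A toy classifier along these lines (an illustration, not the paper's analysis code) could separate the two error types:

```python
# Toy classifier distinguishing a rule-based answer from the "literal rule"
# (copying the source answer's final character). Illustrative only.

LATIN = list("abcdefghijklmnopqrstuvwxyz")

def classify(response, target, source_answer, alphabet):
    correct = target[:-1] + alphabet[alphabet.index(target[-1]) + 1]
    literal = target[:-1] + source_answer[-1]
    if response == correct:
        return "correct"
    if response == literal:
        return "literal rule"
    return "other"

# Item: abc -> abd ; ijk -> ?
print(classify("ijl", "ijk", "abd", LATIN))  # correct
print(classify("ijd", "ijk", "abd", LATIN))  # literal rule
```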
Conclusion
To sum up, the study found that while LLMs performed well on letter-string analogies in the familiar Latin alphabet, their performance deteriorated in less familiar alphabets like Greek and Symbol.
The LLMs struggled to generalize abstract rules and were prone to simpler errors when transformations involved unfamiliar symbols. Unlike humans, who can quickly adapt to novel alphabets, the LLMs' rigid abstraction hindered their performance.
These findings highlight LLMs' challenges in transferring analogical reasoning across domains, indicating a fundamental difference between human and artificial general intelligence.
Journal reference:
- Preliminary scientific report.
Stevenson, C. E., Pafford, A., van der Maas, H. L. J., & Mitchell, M. (2024). Can Large Language Models generalize analogy solving like people can? arXiv. DOI: 10.48550/arXiv.2411.02348, https://arxiv.org/abs/2411.02348