In an article recently posted to the arXiv* preprint server, researchers evaluated whether the abstract reasoning skills demonstrated by recent language models (LMs) are general and transferable or specialized to tasks seen during pretraining. Using a framework of counterfactual task variants, the authors found that while LMs exhibited some abstract task-solving skills, their performance degraded substantially on these variants, indicating reliance on narrow, non-transferable procedures.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
The impressive empirical successes of LMs suggest that next-word prediction at scale can effectively distill knowledge from large-scale text corpora into general-purpose interactive agents. LMs have achieved remarkable results across various NLP benchmarks, passed professional exams, and even surpassed human performance on some complex tasks.
Ideally, general-purpose LMs should not only generalize to unseen instances of known tasks but also adapt to novel tasks, similar to human cognitive flexibility. However, past work has primarily focused on instance-level generalization, often complicated by data contamination issues. Less systematic attention has been given to task-level generalizability.
This paper addressed this gap by introducing counterfactual task variants that deviated from standard task conditions, allowing an evaluation of LMs’ general reasoning skills. The study evaluated generative pre-trained transformer (GPT)-4, GPT-3.5, Claude, and pathways language model (PaLM)-2 on 11 counterfactual tasks, revealing that while these models showed some task generalizability, their performance significantly degraded on counterfactual variants, suggesting reliance on non-transferable, task-specific procedures.
Evaluating Language Models with Counterfactual Tasks
Counterfactual tasks assessed LMs' abilities by altering the conditions under which tasks were performed, rather than just varying the inputs. Because traditional evaluations can suffer from data contamination, this approach helped measure a model's ability to generalize beyond default assumptions. The study analyzed the following tasks:
- Arithmetic: Evaluating numerical reasoning in bases other than 10 (base 8, 9, 11, and 16) to test arithmetic generalization (see the first sketch after this list).
- Programming: Testing coding skills in ThonPy, a fictional variant of Python that uses 1-based rather than 0-based indexing (see the second sketch after this list).
- Syntactic reasoning: Identifying subjects and verbs in sentences with different word orders.
- Logical reasoning: Evaluating entailment with premises that contradicted common sense.
- Spatial reasoning: Determining object positions using transformed coordinate systems.
- Drawing: Generating code to draw objects in rotated or flipped orientations.
- Music: Providing correct chord placements for string instruments with altered tunings and retrieving notes from transposed melodies.
- Chess: Checking the legality of chess openings with swapped initial positions of knights and bishops.
- SET game: Identifying cards that completed a set under a modified rule for the number attribute.
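To make the arithmetic counterfactual concrete, here is a minimal Python sketch (not the authors' evaluation code) showing how the same surface-form question receives different answers depending on the assumed base; a model that has only memorized base-10 addition would fail the base-9 variant.

```python
def add_in_base(a: str, b: str, base: int) -> str:
    """Add two digit strings interpreted in `base` and return the sum in that base."""
    total = int(a, base) + int(b, base)      # parse operands in the given base
    digits = []
    while total:
        digits.append("0123456789abcdef"[total % base])
        total //= base
    return "".join(reversed(digits)) or "0"

# The same question, "What is 27 + 36?", has different correct answers
# under the default and counterfactual conditions:
print(add_in_base("27", "36", 10))  # default condition (base 10) -> '63'
print(add_in_base("27", "36", 9))   # counterfactual condition (base 9) -> '64'
```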
Each task included comprehension checks to verify that models understood the specified counterfactual conditions. These evaluations aimed to reveal whether LMs possess robust, generalizable understanding rather than memorized, condition-specific procedures.
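The programming counterfactual can be illustrated in the same spirit. ThonPy exists only as a convention described in the paper's prompts, so the snippet below simply emulates 1-based indexing in ordinary Python (the helper function is hypothetical, not part of the study's materials) to show how changing the indexing convention changes the answer.

```python
letters = ["a", "b", "c", "d"]

# Default condition: Python's 0-based indexing.
print(letters[0])      # 'a' -- index 0 is the first element
print(letters[2])      # 'c'

# Counterfactual condition: ThonPy-style 1-based indexing, emulated here with a
# hypothetical helper that shifts the index before delegating to ordinary Python.
def item_1based(seq, i):
    return seq[i - 1]

print(item_1based(letters, 1))   # 'a' -- index 1 is the first element
print(item_1based(letters, 2))   # 'b' -- the "same" index now refers to a different element
```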
Results and Analysis of Language Model Performance on Counterfactual Tasks
The authors evaluated the performance of four closed-source LMs – GPT-4, GPT-3.5, Claude, and PaLM-2 – on various tasks, both under default conditions and counterfactual scenarios. The models' ability to reason step by step was also examined using zero-shot chain-of-thought prompting.
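Zero-shot chain-of-thought prompting appends a reasoning trigger such as "Let's think step by step." to the query so that the model produces intermediate steps before its final answer. The snippet below is a generic illustration of this prompting pattern; the task framing and exact wording used in the paper may differ.

```python
# Generic zero-shot chain-of-thought prompt construction; the instruction and
# question wording here are illustrative, not taken verbatim from the paper.
task_instruction = "Assume all numbers are written in base 9, using digits 0-8."
question = "What is 27 + 36?"

# Direct prompt: the model is expected to answer immediately.
direct_prompt = f"{task_instruction}\n\n{question}"

# Zero-shot chain-of-thought prompt: the trailing trigger elicits step-by-step reasoning.
cot_prompt = f"{task_instruction}\n\n{question}\nLet's think step by step."

print(cot_prompt)
```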
The results indicated that models consistently performed worse on counterfactual variants than on the corresponding default tasks, with a notable drop in accuracy. However, the models still performed above random chance on the counterfactual tasks, suggesting some level of abstract task-solving ability.
PaLM-2 often produced truncated or malformed code due to its shorter context length. Models performed better under more common or less complex counterfactual conditions, suggesting a memorization-like effect; for example, GPT-4 performed better with familiar alternate guitar tunings such as drop-D. The researchers also explored how task difficulty, proximity to the default conditions, and the number of few-shot demonstrations affected performance. While zero-shot chain-of-thought prompting generally helped, it could hinder performance on simpler tasks by causing overthinking.
Challenges and Insights in Evaluating LMs
Humans may struggle with unfamiliar counterfactual conditions under time constraints but can generalize given ample time, unlike current LMs. LMs, trained largely on text reflecting default task conditions, performed worse on counterfactual variants, pointing to overfitting. While careful prompting could narrow this gap, it could not eliminate it entirely.
Task-specific reasoning is not inherently undesirable, but it limits generalization. Counterfactual tasks offer a way to gauge LMs' reasoning abilities, though task difficulty and a model's familiarity with particular conditions can skew results, so careful prompt design and attention to potential overfitting are crucial. Ultimately, LMs would need to form more general abstractions to perform well across varied conditions.
Conclusion
In conclusion, the researchers illuminated significant challenges in assessing the general reasoning abilities of LMs through counterfactual tasks. While LMs like GPT-4, GPT-3.5, Claude, and PaLM-2 demonstrated some capacity for abstract reasoning, their performance notably declined under unfamiliar task conditions, indicating reliance on specific, non-transferable procedures.
This highlighted the pervasive issue of overfitting to default conditions in LM training, necessitating nuanced evaluation frameworks that disentangled surface-level proficiency from true task generalizability. Future research should explore more grounded LM approaches to enhance adaptability across diverse task settings and improve overall robustness.
Journal reference:
- Preliminary scientific report.
Wu, Z., Qiu, L., Ross, A., Akyürek, E., Chen, B., Wang, B., Kim, N., Andreas, J., & Kim, Y. (2023). Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. arXiv. DOI: 10.48550/arXiv.2307.02477, https://arxiv.org/abs/2307.02477