Despite being optimized for reasoning, OpenAI’s o1 model continues to show sensitivity to probability and task frequency, revealing the deep-rooted impact of autoregressive training even in cutting-edge AI systems.
Research: When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint* server, researchers investigated whether OpenAI's o1, a language model optimized for reasoning, overcame limitations seen in previous large language models (LLMs). The study showed that while o1 performed significantly better, especially on rare tasks, it still exhibited sensitivity to output probability and task frequency, traits inherited from its autoregressive origins. This suggests that while optimizing for reasoning enhances performance, it might not entirely eliminate the probabilistic biases embedded by autoregressive training.
Background
LLMs, such as generative pre-trained transformers (GPT), have traditionally been trained using autoregressive techniques, which predict the next word in a sequence based on prior input. While this method has produced models capable of impressive feats in natural language understanding, research highlights key limitations.
One notable issue is that LLMs are biased toward producing high-probability sequences, leading to challenges in tasks where the expected output is rare or unconventional. These "embers of autoregression" influence performance across various tasks, even those unrelated to basic next-word prediction. In previous research, these trends were evident even when models were used for complex tasks like reasoning.
Earlier findings revealed that LLMs perform better on tasks with high-probability outputs but struggle with less likely sequences, especially in uncommon task variants. These limitations prompted researchers to analyze OpenAI’s o1 model to determine whether optimization for reasoning could address these biases.
Sensitivity to Output Probability and Task Frequency
The researchers assessed the performance of OpenAI’s o1 model on a variety of tasks, examining whether it exhibited sensitivity to the probability of output and the frequency of task types. They tested two primary factors: output probability (how likely the model’s answer is based on common language patterns) and task frequency (how often a particular task variant occurs in training data).
For output probability, the researchers evaluated o1 on four tasks: decoding shift ciphers, decoding Pig Latin messages, article swapping, and reversing word lists. Their results indicated that o1 performed better on high-probability examples compared to low-probability ones.
For instance, o1’s accuracy on the shift cipher task ranged from 47% on low-probability examples to 92% on high-probability examples. In addition to achieving better accuracy on high-probability examples, o1 used fewer tokens in these cases, further highlighting its sensitivity to output probability.
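To make the task concrete (this is an illustration of the task type, not the authors' evaluation code): a shift cipher replaces each letter with the letter a fixed number of positions later in the alphabet, so decoding means shifting each letter back. The key point from the study is that the mechanical operation is identical regardless of the message, yet accuracy differs depending on how probable the decoded output is as English text.

```python
def shift_decode(text: str, shift: int) -> str:
    """Decode a shift cipher by moving each letter back `shift` positions."""
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            # Wrap around the 26-letter alphabet when shifting back
            result.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            result.append(ch)  # leave spaces and punctuation unchanged
    return "".join(result)

# A shift of 13 ("rot-13") is the variant most common in web text,
# which is why it serves as the "common task variant" in such studies.
print(shift_decode("Uryyb jbeyq", 13))  # -> "Hello world"
```

A high-probability example decodes to a natural sentence like the one above; a low-probability example decodes to a grammatical but unlikely word sequence, where o1's accuracy dropped.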
Next, the authors explored whether o1 performed differently on common versus rare task variants. They tested five task types with both common and rare variants, including decoding ciphers, forming acronyms, and sorting lists. The researchers found that o1 outperformed other LLMs, particularly on rare task variants. This suggests that o1 is less sensitive to task frequency compared to earlier models, though some effects remained.
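As a hypothetical sketch of what "common versus rare variants" of a task can mean (our illustration, not the study's exact materials): forming an acronym from each word's first letter is a familiar operation, while applying the same rule to, say, the second letter of each word is structurally identical but rarely seen in training data.

```python
words = ["global", "positioning", "system"]

# Common variant: acronym from each word's first letter
common = "".join(w[0] for w in words).upper()
print(common)  # -> "GPS"

# Rarer variant (hypothetical): the same rule applied to second letters
rare = "".join(w[1] for w in words).upper()
print(rare)  # -> "LOY"
```

Both variants demand the same reasoning steps, which is what makes any accuracy gap between them informative about task-frequency effects rather than task difficulty.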
However, to ensure that these findings weren’t limited by "ceiling effects" (where tasks were too easy for differences to be noticeable), they introduced more challenging variants of some tasks. In these harder cases, o1's performance dropped significantly for rare task variants, reinforcing its sensitivity to task frequency in more difficult scenarios.
For example, when the sorting task was made more challenging by using words with the same first letter, o1 performed significantly better on the common variant (alphabetical sorting) than the rare one (reverse alphabetical sorting). Similarly, in cipher decoding tasks with medium- and low-probability examples, o1 performed better on the common cipher than on the rare one. This performance gap was also reflected in token usage, with o1 consuming more tokens for rare task variants, indicating their increased difficulty.
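The two sorting variants differ only in the direction of comparison, which the following minimal sketch (an illustration, with a hypothetical word list) makes explicit; using words that share a first letter forces comparison of later letters, removing the surface cues that made the original task easy:

```python
# Hypothetical same-first-letter list, as in the harder task variant
words = ["satin", "sand", "sable", "sap", "salt"]

# Common variant: alphabetical sorting
print(sorted(words))                # -> ['sable', 'salt', 'sand', 'sap', 'satin']

# Rare variant: reverse alphabetical sorting, the same comparison reversed
print(sorted(words, reverse=True))  # -> ['satin', 'sap', 'sand', 'salt', 'sable']
```

Since the rare variant is just the common one with `reverse=True`, the reported performance gap reflects how often each ordering appears in training data, not a difference in the underlying computation.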
In essence, while o1 exhibited less sensitivity to task frequency than earlier LLMs, it still showed some dependence on output probability and task frequency in more challenging scenarios. Token usage data corroborated these trends, with o1 using more tokens for low-probability tasks and rare task variants, even when accuracy was comparable. The results highlighted that while o1 represented a substantial improvement over previous models, the influence of probabilistic training objectives remained evident in its behavior.
Conclusion
In conclusion, OpenAI's o1 model demonstrated notable improvements over previous LLMs, particularly in handling rare task variants. Despite these advancements, o1 still exhibited significant sensitivity to output probability and task frequency, echoing patterns observed in earlier LLMs. While the model showed progress in reducing these biases, its performance in more challenging scenarios suggests that probabilistic judgments are deeply ingrained.
The findings suggested that while optimizing for reasoning enhanced performance, the "embers of autoregression" persisted. Future developments may require innovative approaches that reduce reliance on probabilistic judgments to fully address the inherent limitations associated with autoregression, potentially incorporating non-probabilistic components to overcome these biases.
Journal reference:
- Preliminary scientific report.
McCoy, R. T., Yao, S., Friedman, D., Hardy, M. D., & Griffiths, T. L. (2024). When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1. arXiv. https://arxiv.org/abs/2410.01792