Despite being optimized for reasoning, OpenAI’s o1 model continues to show sensitivity to probability and task frequency, revealing the deep-rooted impact of autoregressive training even in cutting-edge AI systems.
Research: When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint* server, researchers investigated whether OpenAI's o1, a language model optimized for reasoning, overcame limitations seen in previous large language models (LLMs). The study showed that while o1 performed significantly better, especially on rare tasks, it still exhibited sensitivity to output probability and task frequency, traits inherited from its autoregressive origins. This suggests that while optimizing for reasoning enhances performance, it might not entirely eliminate the probabilistic biases embedded by autoregressive training.
Background
LLMs, such as generative pre-trained transformers (GPT), have traditionally been trained using autoregressive techniques, which predict the next word in a sequence based on prior input. While this method has produced models capable of impressive feats in natural language understanding, research highlights key limitations.
One notable issue is that LLMs are biased toward producing high-probability sequences, leading to challenges in tasks where the expected output is rare or unconventional. These "embers of autoregression" influence performance across various tasks, even those unrelated to basic next-word prediction. In previous research, these trends were evident even when models were used for complex tasks like reasoning.
Earlier findings revealed that LLMs perform better on tasks with high-probability outputs but struggle with less likely sequences, especially in uncommon task variants. These limitations prompted researchers to analyze OpenAI’s o1 model to determine whether optimization for reasoning could address these biases.
Sensitivity to Output Probability and Task Frequency
The researchers assessed the performance of OpenAI’s o1 model on a variety of tasks, examining whether it exhibited sensitivity to the probability of output and the frequency of task types. They tested two primary factors: output probability (how likely the model’s answer is based on common language patterns) and task frequency (how often a particular task variant occurs in training data).
For output probability, the researchers evaluated o1 on four tasks: decoding shift ciphers, decoding Pig Latin messages, article swapping, and reversing word lists. Their results indicated that o1 performed better on high-probability examples compared to low-probability ones.
For instance, o1’s accuracy on the shift cipher task ranged from 47% on low-probability examples to 92% on high-probability examples. In addition to achieving better accuracy on high-probability examples, o1 used fewer tokens in these cases, further highlighting its sensitivity to output probability.
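To make the task concrete (this is an illustration of the task type, not the authors' evaluation code): a shift cipher replaces each letter with the letter a fixed number of positions later in the alphabet, so decoding means shifting each letter back. The key point from the study is that the mechanical operation is identical regardless of the message, yet accuracy differs depending on how probable the decoded output is as English text.

```python
def shift_decode(text: str, shift: int) -> str:
    """Decode a shift cipher by moving each letter back `shift` positions."""
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            # Wrap around the 26-letter alphabet when shifting back
            result.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            result.append(ch)  # leave spaces and punctuation unchanged
    return "".join(result)

# A shift of 13 ("rot-13") is the variant most common in web text,
# which is why it serves as the "common task variant" in such studies.
print(shift_decode("Uryyb jbeyq", 13))  # -> "Hello world"
```

A high-probability example decodes to a natural sentence like the one above; a low-probability example decodes to a grammatical but unlikely word sequence, where o1's accuracy dropped.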
Next, the authors explored whether o1 performed differently on common versus rare task variants. They tested five task types with both common and rare variants, including decoding ciphers, forming acronyms, and sorting lists. The researchers found that o1 outperformed other LLMs, particularly on rare task variants. This suggests that o1 is less sensitive to task frequency compared to earlier models, though some effects remained.
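As a hypothetical sketch of what "common versus rare variants" of a task can mean (our illustration, not the study's exact materials): forming an acronym from each word's first letter is a familiar operation, while applying the same rule to, say, the second letter of each word is structurally identical but rarely seen in training data.

```python
words = ["global", "positioning", "system"]

# Common variant: acronym from each word's first letter
common = "".join(w[0] for w in words).upper()
print(common)  # -> "GPS"

# Rarer variant (hypothetical): the same rule applied to second letters
rare = "".join(w[1] for w in words).upper()
print(rare)  # -> "LOY"
```

Both variants demand the same reasoning steps, which is what makes any accuracy gap between them informative about task-frequency effects rather than task difficulty.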
However, to ensure that these findings weren’t limited by "ceiling effects" (where tasks were too easy for differences to be noticeable), they introduced more challenging variants of some tasks. In these harder cases, o1's performance dropped significantly for rare task variants, reinforcing its sensitivity to task frequency in more difficult scenarios.
For example, when the sorting task was made more challenging by using words with the same first letter, o1 performed significantly better on the common variant (alphabetical sorting) than the rare one (reverse alphabetical sorting). Similarly, in cipher decoding tasks with medium- and low-probability examples, o1 performed better on the common cipher than on the rare one. This performance gap was also reflected in token usage, with o1 consuming more tokens for rare task variants, indicating their increased difficulty.
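The two sorting variants differ only in the direction of comparison, which the following minimal sketch (an illustration, with a hypothetical word list) makes explicit; using words that share a first letter forces comparison of later letters, removing the surface cues that made the original task easy:

```python
# Hypothetical same-first-letter list, as in the harder task variant
words = ["satin", "sand", "sable", "sap", "salt"]

# Common variant: alphabetical sorting
print(sorted(words))                # -> ['sable', 'salt', 'sand', 'sap', 'satin']

# Rare variant: reverse alphabetical sorting, the same comparison reversed
print(sorted(words, reverse=True))  # -> ['satin', 'sap', 'sand', 'salt', 'sable']
```

Since the rare variant is just the common one with `reverse=True`, the reported performance gap reflects how often each ordering appears in training data, not a difference in the underlying computation.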
In essence, while o1 exhibited less sensitivity to task frequency than earlier LLMs, it still showed some dependence on output probability and task frequency in more challenging scenarios. Token usage data corroborated these trends, with o1 using more tokens for low-probability tasks and rare task variants, even when accuracy was comparable. The results highlighted that while o1 represented a substantial improvement over previous models, the influence of probabilistic training objectives remained evident in its behavior.
Conclusion
In conclusion, OpenAI's o1 model demonstrated notable improvements over previous LLMs, particularly in handling rare task variants. Despite these advancements, o1 still exhibited significant sensitivity to output probability and task frequency, echoing patterns observed in earlier LLMs. While the model showed progress in reducing these biases, its performance in more challenging scenarios suggests that probabilistic judgments are deeply ingrained.
The findings suggested that while optimizing for reasoning enhanced performance, the "embers of autoregression" persisted. Future developments may require innovative approaches that reduce reliance on probabilistic judgments to fully address the inherent limitations associated with autoregression, potentially incorporating non-probabilistic components to overcome these biases.
Journal reference:
- Preliminary scientific report.
McCoy, R. T., Yao, S., Friedman, D., Hardy, M. D., & Griffiths, T. L. (2024). When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1. arXiv. https://arxiv.org/abs/2410.01792