As AI models grow more powerful, they aren’t just getting smarter; they are increasingly making the same mistakes as one another.
Research: Great Models Think Alike and this Undermines AI Oversight. Image Credit: aniqpixel / Shutterstock
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
As language models grow increasingly sophisticated, evaluating and supervising them becomes harder for humans. This has led to a growing reliance on AI oversight, in which AI models are used to assess and refine the outputs of other models. However, this paper highlights a crucial flaw in the approach: as AI models become more powerful, their errors become more correlated, which introduces risks to the oversight process itself.
The authors, affiliated with institutions including the ELLIS Institute Tübingen, the Max Planck Institute for Intelligent Systems, the University of Tübingen, IIIT Hyderabad, Contextual AI, and Stanford University, introduce a new metric, Chance Adjusted Probabilistic Agreement (CAPA, or κp), which accounts for accuracy differences and distinguishes between different types of mistakes by incorporating the models’ probability distributions over answers. This improves on traditional similarity measures such as error consistency and Cohen’s κ and allows a more precise assessment of functional similarity between models. Unlike previous measures that focus on accuracy or internal architecture, CAPA evaluates how models behave when they make mistakes, weighting the likelihood of different kinds of errors. By applying this metric, the researchers uncover systemic issues in AI oversight, particularly in the areas of AI judges and AI-assisted training.
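To make the idea concrete, the sketch below computes a kappa-style, chance-adjusted agreement score from two models’ per-option probabilities on multiple-choice questions. It is an illustration of the general approach rather than the paper’s exact κp definition; the function name, the uniform-distractor chance term, and the toy numbers are assumptions made for the example.

```python
import numpy as np

def capa_like_similarity(p1, p2, correct):
    """Chance-adjusted probabilistic agreement between two models (sketch).

    p1, p2 : (n_questions, n_options) arrays of predicted probabilities.
    correct: (n_questions,) index of the correct option per question.

    Observed agreement is the probability both models sample the same
    option; chance agreement is what two independent models with these
    accuracies would reach if their error mass were spread uniformly over
    the distractors. This mirrors the kappa-style adjustment CAPA builds
    on, but it is not the paper's exact formula.
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    n, k = p1.shape
    rows = np.arange(n)

    # Observed probabilistic agreement on each question, then averaged.
    c_obs = np.mean(np.sum(p1 * p2, axis=1))

    # Accuracy-based chance agreement: both correct, or both on the same
    # distractor under a uniform spread of the remaining probability.
    acc1, acc2 = p1[rows, correct].mean(), p2[rows, correct].mean()
    c_exp = acc1 * acc2 + (1 - acc1) * (1 - acc2) / (k - 1)

    return (c_obs - c_exp) / (1.0 - c_exp)

# Toy example: three 4-option questions with correct options 0, 1, 2.
a = np.array([[0.7, 0.1, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1], [0.4, 0.4, 0.1, 0.1]])
b = np.array([[0.6, 0.2, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.3, 0.5, 0.1, 0.1]])
print(round(capa_like_similarity(a, b, np.array([0, 1, 2])), 3))
```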
AI Judges and the Problem of Affinity Bias
One method of AI oversight is the use of large language models as evaluators of other models. In leaderboards and competitions, these AI judges assess the quality of outputs, ostensibly in an objective manner. However, this study finds that AI judges exhibit a strong preference for models that are similar to themselves, a phenomenon that mirrors affinity bias in human evaluators.
Using CAPA, the authors demonstrate that AI judges consistently rate models that share their error patterns more favorably, even when controlling for overall accuracy. In fact, statistical analysis confirms this bias, with Pearson correlation values averaging 0.84 across different AI judge-model comparisons, indicating a strong and significant preference for similar models. This means that models are not necessarily judged by their absolute performance but rather by how closely they resemble the judging model. This bias raises concerns for AI evaluation systems, as it can create misleading assessments that favor certain model families while penalizing others. The implications extend to real-world applications where AI evaluations play a critical role in selecting models for deployment. If AI judges are inherently biased toward models that “think” like them, the industry may unknowingly reinforce systemic weaknesses.
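For readers who want to see what “controlling for accuracy” can look like in practice, the sketch below computes a partial Pearson correlation between a judge’s scores and judge–model similarity after regressing out each model’s accuracy. The data and function names are placeholders invented for the illustration, not the paper’s evaluation pipeline.

```python
import numpy as np
from scipy import stats

def residualize(y, x):
    """Remove the linear effect of x from y via ordinary least squares."""
    slope, intercept, *_ = stats.linregress(x, y)
    return y - (slope * x + intercept)

def judge_affinity(judge_scores, similarity_to_judge, model_accuracy):
    """Partial Pearson correlation between a judge's scores and the
    judge-model similarity (e.g. CAPA), controlling for model accuracy.

    A positive value means the judge rates look-alike models higher than
    their accuracy alone would explain. One entry per evaluated model;
    all inputs here are toy placeholders."""
    acc = np.asarray(model_accuracy, float)
    s = residualize(np.asarray(judge_scores, float), acc)
    k = residualize(np.asarray(similarity_to_judge, float), acc)
    return stats.pearsonr(s, k)

# Toy example with made-up numbers for five evaluated models.
scores     = [7.1, 6.4, 8.0, 5.9, 7.5]      # judge's average rating
similarity = [0.62, 0.41, 0.70, 0.35, 0.66]  # CAPA to the judge
accuracy   = [0.71, 0.69, 0.74, 0.66, 0.72]  # benchmark accuracy
r, p = judge_affinity(scores, similarity, accuracy)
print(f"partial r = {r:.2f}, p = {p:.3f}")
```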
Training AI with AI: The Limits of Weak-to-Strong Generalization
Another aspect of AI oversight is the process of training stronger models using data generated by weaker models. This approach, known as weak-to-strong generalization, assumes that a more advanced model can refine and improve upon the annotations provided by a weaker one. However, the effectiveness of this method depends on whether the weaker model has complementary knowledge—insights and errors that differ from those of the stronger model.
Through extensive analysis, the researchers show that weak-to-strong training is most effective when the two models are functionally dissimilar. The study identifies two primary mechanisms behind this training approach: (1) elicitation, where a weaker model helps reveal latent knowledge in the stronger model, and (2) complementary knowledge transfer, where the weaker model contributes insights that the stronger model lacks. When weak and strong models share too many similarities, training gains diminish because the strong model simply inherits the same biases and blind spots. The authors confirm this with statistical modeling, showing that similarity between weak and strong models is inversely correlated (r = -0.85) with training improvements. This challenges the common assumption that weak-to-strong training is universally beneficial. Instead, the findings suggest that AI developers should prioritize diverse training sources rather than relying on increasingly similar models to refine their outputs.
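As a concrete illustration of the “complementary knowledge” mechanism described above, the short sketch below measures the fraction of questions on which a weak supervisor is right while the strong student is wrong before training, the kind of headroom that weak-to-strong training can exploit. The function and the toy correctness vectors are assumptions for the example, not the authors’ setup.

```python
import numpy as np

def complementary_knowledge(weak_correct, strong_correct):
    """Fraction of items where the weak supervisor is right but the strong
    student (before weak-to-strong training) is wrong.

    Both inputs are boolean/0-1 arrays over the same items. A higher value
    means the weak model has more to teach beyond what the strong model
    already knows -- the regime in which weak-to-strong gains tend to be
    largest. Illustrative sketch only."""
    weak_correct = np.asarray(weak_correct, bool)
    strong_correct = np.asarray(strong_correct, bool)
    return np.mean(weak_correct & ~strong_correct)

# Toy example: per-question correctness for a weak and a strong model.
weak   = [1, 0, 1, 1, 0, 1, 0, 1]
strong = [1, 1, 0, 1, 0, 0, 1, 1]
print(complementary_knowledge(weak, strong))  # 0.25 -> 2 of 8 items
```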
![Our Main Contributions (figure from the paper).](https://dq8l4o3au0fto.cloudfront.net/images/news/ImageForNews_6273_17391489896131153.png)
Our Main Contributions. We develop a novel probabilistic metric for model similarity, CAPA (κp), which adjusts for chance agreement due to accuracy. Using this, we find: (1) LLM-as-a-judge scores are biased towards more similar models, controlling for the model’s capability; (2) the gain from training strong models on annotations of weak supervisors (weak-to-strong generalization) is higher when the two models are more different; (3) concerningly, model errors are getting more correlated as capabilities increase.
The Growing Risk of Correlated Errors
One of the study’s most alarming discoveries is that as language models become more capable, their mistakes become more similar. While increasing accuracy might seem like an unequivocally positive development, the convergence of error patterns introduces new risks. When all models fail in the same way, AI oversight loses its ability to detect and correct errors effectively.
Analyzing over 130 large language models, the authors find a clear trend: as models improve, their functional similarity increases. This trend is particularly evident in instruction-tuned models, which are optimized to better align with human preferences but also become more homogeneous in their weaknesses. The study finds that these models exhibit a stronger tendency for correlated errors compared to non-instruction-tuned models, raising concerns about their long-term reliability. This means that the diverse range of perspectives and independent failure points that once existed among models is shrinking. In practice, this could lead to systemic AI failures, where a previously unnoticed blind spot in one model becomes a universal flaw across an entire generation of AI systems.
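A back-of-the-envelope way to check for such a trend is to correlate pairwise model similarity with the capability of each pair, as in the hedged sketch below. The numbers are invented placeholders, and the paper’s actual analysis across 130+ models is more involved.

```python
from scipy import stats

def similarity_capability_trend(pair_similarity, pair_mean_accuracy):
    """Spearman rank correlation between pairwise model similarity
    (e.g. CAPA) and the pair's mean benchmark accuracy.

    A positive correlation is the trend described in the paper: more
    capable model pairs make more correlated mistakes. One entry per
    model pair; toy placeholder data below."""
    return stats.spearmanr(pair_mean_accuracy, pair_similarity)

# Made-up illustration: ten model pairs.
mean_acc   = [0.55, 0.58, 0.61, 0.63, 0.66, 0.70, 0.73, 0.77, 0.80, 0.84]
similarity = [0.12, 0.10, 0.18, 0.21, 0.19, 0.27, 0.31, 0.33, 0.38, 0.45]
rho, p = similarity_capability_trend(similarity, mean_acc)
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")
```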
The very techniques designed to make AI more controllable and reliable may thus also be making models more homogeneous in their weaknesses. If this trend continues, AI oversight mechanisms will become less effective over time, as models become less able to independently verify one another’s outputs. In particular, it could undermine AI juries (systems that use multiple models to evaluate AI outputs), since the collective judgment would be shaped by the same underlying biases.
Reevaluating AI Oversight for the Future
Given these findings, the authors call for a fundamental reassessment of AI oversight strategies. AI judges must be carefully selected to avoid reinforcing similarity biases, and evaluation systems should account for functional similarity rather than just accuracy. Instead of relying on AI judges that belong to the same family as the evaluated models, future research should explore how to introduce more diverse assessment mechanisms.
Similarly, training pipelines must be restructured so that weak-to-strong generalization draws on truly complementary knowledge. The study suggests that AI developers incorporate models with distinct error patterns into training data rather than relying solely on scaling up similar architectures, and that human oversight be used to identify areas where AI-generated annotations may be misleading. The risk of correlated errors also underscores the importance of monitoring error convergence: the authors propose that developers track error-similarity trends across model generations and deliberately reintroduce diversity, whether through alternative training methods or variation in model design.
Conclusion
The paper presents a compelling case that AI oversight is not a neutral or infallible process. Instead, it is shaped by the functional similarities and biases inherent in the models themselves. The introduction of CAPA provides a valuable tool for quantifying these similarities and understanding their impact on AI evaluation and training.
By demonstrating that AI judges favor models that resemble themselves and that weak-to-strong training is beneficial only when models are sufficiently different, the authors highlight the urgent need for more nuanced AI oversight strategies. Perhaps most critically, their finding that errors become more correlated as models improve suggests that the AI community must take active steps to prevent systemic failures: left unaddressed, the increasing similarity between high-performing models could undermine the very oversight mechanisms designed to ensure their reliability.
Journal reference:
- Preliminary scientific report.
Goel, S., Struber, J., Auzina, I. A., Chandra, K. K., Kumaraguru, P., Kiela, D., Prabhu, A., Bethge, M., & Geiping, J. (2025). Great Models Think Alike and this Undermines AI Oversight. ArXiv. https://arxiv.org/abs/2502.04313