Researchers from Carnegie Mellon and Harvard put DeepSeek to the test against top LLMs, revealing that while it trails Claude in accuracy, its affordability and strong classification performance make it a rising competitor in AI-driven text analysis.
Research: A Comparison of DeepSeek and Other LLMs. Image Credit: Krot_Studio / Shutterstock
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The rapid evolution of large language models (LLMs) has been marked by breakthroughs in artificial intelligence, with DeepSeek emerging as a key player in recent months. Since its latest version was released on January 20, 2025, DeepSeek has gained attention, particularly in classification tasks, where it performs competitively against models like Claude-3.5-sonnet, Gemini, and GPT-4o-mini. Researchers and industry experts alike have taken an interest in evaluating DeepSeek’s capabilities, especially in predictive text analysis.
This study, conducted by researchers from Carnegie Mellon University and Harvard University, compares DeepSeek to four widely used LLMs: OpenAI’s GPT-4o-mini, Google’s Gemini-1.5-flash, Meta’s Llama-3.1-8b, and Anthropic’s Claude-3.5-sonnet. The authors focus on two key classification tasks: authorship classification, which determines whether a text was written by a human or an AI, and citation classification, which categorizes academic citations based on their significance. The analysis assesses model accuracy, computational efficiency, cost, and output similarity to determine how DeepSeek fares against its competitors.
Authorship Classification
The proliferation of AI-generated text across various digital platforms has raised concerns about misinformation and about whether human writing can still be distinguished from AI output. This study employs a dataset called MADStat, which consists of 83,331 abstracts from statistical journals spanning 1975 to 2015. The authors generate three types of text samples from this dataset (a generation sketch follows the list):
- Human-written abstracts (hum) – Unedited abstracts from the MADStat dataset.
- AI-generated abstracts (AI) – New abstracts produced by GPT-4o-mini based on paper titles.
- AI-edited human abstracts (humAI) – Original abstracts modified using GPT-4o-mini.
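The article does not reproduce the authors' generation pipeline; the minimal Python sketch below shows how the AI and humAI sample types could be produced with GPT-4o-mini through the OpenAI API. The prompt wording, word limits, and client setup are illustrative assumptions, not details taken from the study.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_ai_abstract(title: str) -> str:
    """'AI' sample type: write a brand-new abstract from a paper title (prompt is assumed)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a roughly 150-word abstract for a statistics paper titled: {title}",
        }],
    )
    return response.choices[0].message.content

def edit_human_abstract(abstract: str) -> str:
    """'humAI' sample type: lightly rewrite an existing human abstract (prompt is assumed)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Polish and lightly rewrite this abstract without changing its meaning:\n\n{abstract}",
        }],
    )
    return response.choices[0].message.content
```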
The authors evaluate five LLMs on two classification problems. The first (AC1) distinguishes between human-written and AI-generated texts, while the second (AC2) differentiates between human-written texts and AI-edited versions. The results indicate that Claude-3.5-sonnet achieves the highest classification accuracy in AC1, while DeepSeek-R1 ranks second. However, in AC2, DeepSeek outperforms all models, making it the most effective at detecting AI-edited human text.
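The study's exact prompts are not given in this summary; a zero-shot setup along the lines below is one plausible way to run AC1 or AC2, shown here against GPT-4o-mini only (each of the five models would be queried analogously through its own API). The prompt text and label parsing are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_authorship(text: str) -> str:
    """Zero-shot AC1/AC2-style query: label a text as 'human' or 'ai' (prompt wording assumed)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Was the following abstract written by a human, or generated/edited by an AI? "
                "Answer with exactly one word, 'human' or 'AI'.\n\n" + text
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()

def accuracy(samples) -> float:
    """Accuracy over a labeled set: samples is a list of (text, true_label) pairs,
    with true labels given in lowercase as 'human' or 'ai'."""
    correct = sum(classify_authorship(text) == label for text, label in samples)
    return correct / len(samples)
```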
Interestingly, the models display varying levels of agreement in their classifications. DeepSeek’s predictions align most closely with those of Claude and Gemini, whereas GPT and Llama resemble each other but classify poorly: their error rates in authorship classification are close to random guessing, indicating significant weaknesses in detecting AI-generated content. The study also notes that while DeepSeek performs strongly, its slower processing speed remains a notable drawback.
Citation Classification
Evaluating academic research impact requires more than just citation counts; the context and intent of citations play a crucial role in understanding their significance. The authors introduce a novel dataset, CitaStat, comprising 3,000 manually labeled citation instances extracted from statistical journals. Citations are classified into four categories:
- Fundamental Idea (FI) – Citing work that provides key theoretical insights.
- Technical Basis (TB) – Referencing crucial methodologies or datasets.
- Background (BG) – Citing prior work to provide context or support.
- Comparison (CP) – Referencing studies for comparative analysis.
Two classification tasks are performed. The first (CC1) assigns citations to one of the four categories, while the second (CC2) simplifies the task into two broader classes: Significant (FI and TB) and Incidental (BG and CP). In CC1, DeepSeek ranks fourth, behind Claude, Gemini, and GPT, but in CC2 it performs significantly better, ranking second overall. The picture broadly mirrors authorship classification, with DeepSeek again trailing Claude, staying clearly ahead of Llama, and requiring more computational time.
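The article does not reproduce the authors' citation prompts or post-processing. The sketch below shows one plausible way to run CC1 with GPT-4o-mini and how CC2 follows by collapsing the four labels; the prompt text, label codes, and fallback rule are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CC1_LABELS = {"FI", "TB", "BG", "CP"}
CC2_MAP = {"FI": "significant", "TB": "significant", "BG": "incidental", "CP": "incidental"}

def classify_citation(context: str) -> str:
    """CC1-style query: assign one of the four categories to a citation in context (prompt assumed)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Classify the citation marked in the passage below as FI (fundamental idea), "
                "TB (technical basis), BG (background), or CP (comparison). "
                "Answer with exactly one code.\n\n" + context
            ),
        }],
    )
    label = response.choices[0].message.content.strip().upper()
    return label if label in CC1_LABELS else "BG"  # fall back to the most generic class

def classify_citation_binary(context: str) -> str:
    """CC2 simply collapses the four CC1 categories into Significant vs. Incidental."""
    return CC2_MAP[classify_citation(context)]
```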
The agreement between models is also analyzed, revealing that Claude and Gemini exhibit the highest consistency in classification, while DeepSeek’s predictions are most aligned with Claude and Gemini. Llama consistently performs the worst across all tasks, often approaching random guessing in accuracy.
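The article does not specify how inter-model agreement is measured; a simple pairwise agreement rate over shared test instances, as sketched below with pandas, is one natural way to produce such a comparison.

```python
import pandas as pd

def agreement_matrix(preds: pd.DataFrame) -> pd.DataFrame:
    """Fraction of test instances on which each pair of models outputs the same label.

    `preds` has one column per model (e.g. 'deepseek', 'claude', 'gemini', 'gpt', 'llama')
    and one row per classified instance.
    """
    models = list(preds.columns)
    mat = pd.DataFrame(index=models, columns=models, dtype=float)
    for a in models:
        for b in models:
            mat.loc[a, b] = (preds[a] == preds[b]).mean()
    return mat
```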
Results and Contributions
The study finds that Claude-3.5-sonnet consistently delivers the most accurate classifications, although it comes at a significantly higher cost. For CC1 and CC2 combined, Claude’s processing costs amount to $12.30, whereas DeepSeek, Gemini, and GPT cost no more than $0.30 per task, making DeepSeek a cost-effective alternative despite its slower processing time. DeepSeek demonstrates strong performance, often ranking second in accuracy while maintaining a much lower cost. However, its computational speed is considerably slower, making it less practical for real-time applications.
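Taking the reported totals at face value, a rough per-classification cost can be worked out. The workload assumption below (all 3,000 CitaStat instances for each of the two tasks, i.e. 6,000 calls) is ours; the article does not give per-call token counts.

```python
# Reported totals: Claude ~$12.30 for CC1 + CC2 combined; DeepSeek/Gemini/GPT <= $0.30 per task.
# Assumed workload: 3,000 CitaStat instances x 2 tasks = 6,000 classifications.
calls = 3000 * 2

claude_per_call = 12.30 / calls        # roughly $0.0021 per classification
cheap_per_call = (0.30 * 2) / calls    # at most $0.0001 per classification

print(f"Claude:              ~${claude_per_call:.4f} per classification")
print(f"DeepSeek/Gemini/GPT: <=${cheap_per_call:.4f} per classification")
```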
The study contributes in three major ways:
- Benchmarking DeepSeek against Established LLMs – The comparison provides insights into DeepSeek’s strengths and weaknesses, highlighting its potential in predictive tasks.
- Introducing Citation Classification as a Research Tool – The categorization of citations based on significance opens new avenues for assessing academic impact.
- Providing Public Datasets for Further Research – The CitaStat and MadStatAI datasets offer valuable benchmarks for evaluating AI-generated text and citation classification, facilitating further advancements in AI research.
Discussion
The findings of this study suggest that DeepSeek, while not yet outperforming Claude, is a promising LLM with competitive classification accuracy. Its comparatively low training cost implies room for improvement: with further refinement, it could close the performance gap with more expensive models such as Claude. Its low usage cost also makes it an attractive alternative in scenarios where high accuracy matters but computational speed is less critical, and may make it especially appealing for large-scale academic or enterprise applications where minimizing expenses is a priority.
Future research could expand this analysis to additional domains, such as natural language processing and computer vision, to further assess DeepSeek’s capabilities. Additionally, integrating statistical and machine learning techniques to refine classification prompts could lead to improved accuracy. For example, the study suggests leveraging statistical tools to identify discriminative language patterns in AI-generated versus human-generated text, which could enhance classification precision across different datasets. The datasets introduced in this study can serve as foundational resources for ongoing research into AI-generated content detection and citation analysis.
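As one concrete reading of that suggestion, a smoothed log-odds comparison of word frequencies between human-written and AI-generated abstracts would surface candidate discriminative patterns. The sketch below is a generic illustration of that idea, not a method taken from the paper.

```python
from collections import Counter
import math

def log_odds_words(human_texts, ai_texts, min_count=5):
    """Smoothed log-odds of word usage in human vs. AI abstracts.

    Large positive scores mark words comparatively more frequent in human writing,
    large negative scores mark words favored by the AI-generated texts.
    """
    hum = Counter(w for t in human_texts for w in t.lower().split())
    ai = Counter(w for t in ai_texts for w in t.lower().split())
    n_hum, n_ai = sum(hum.values()), sum(ai.values())
    vocab = set(hum) | set(ai)
    scores = {}
    for w in vocab:
        if hum[w] + ai[w] < min_count:
            continue  # skip rare words with unstable estimates
        p_hum = (hum[w] + 1) / (n_hum + len(vocab))  # add-one smoothing
        p_ai = (ai[w] + 1) / (n_ai + len(vocab))
        scores[w] = math.log(p_hum / p_ai)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```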
Ultimately, this comparison underscores the evolving landscape of LLMs, where new entrants like DeepSeek challenge industry leaders and push the boundaries of AI-driven text analysis. While DeepSeek has room for improvement, its rapid development and cost efficiency position it as a formidable competitor in the LLM space.
Journal reference:
- Preliminary scientific report.
Gao, T., Jin, J., Ke, Z. T., & Moryoussef, G. (2025). A Comparison of DeepSeek and Other LLMs. arXiv. https://arxiv.org/abs/2502.03688