Large language models (LLMs) such as ChatGPT have become ubiquitous for generating text at scale. In a recent paper submitted to the arXiv* server, researchers explored ChatGPT’s performance as a detector for artificial intelligence (AI)-generated text, inspired by its role as a data labeler. The study aimed to evaluate ChatGPT’s zero-shot performance in distinguishing human-written text from AI-generated text, with findings suggesting potential applications in automated detection pipelines.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
LLMs are deep neural networks that model natural language using transformer-based architectures. They fall into two types: autoregressive language models, which predict the next token from the preceding tokens, and masked language models, which are bidirectional and predict masked tokens within a sequence. Autoregressive models excel at language generation, while masked models are better at language understanding owing to their bidirectional attention mechanisms. LLMs are trained on extensive internet-scale data and generate text one token at a time using sampling strategies such as greedy decoding, top-k sampling, and nucleus sampling, as illustrated in the sketch below.
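For intuition, here is a minimal, self-contained Python sketch of how the three sampling strategies choose a next token from a model's output distribution. The six-word vocabulary and its probabilities are invented for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token distribution over an invented six-word vocabulary.
vocab = ["the", "cat", "sat", "on", "a", "mat"]
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

def greedy(probs):
    # Greedy decoding: always pick the single most likely token.
    return int(np.argmax(probs))

def top_k(probs, k=3):
    # Top-k sampling: keep the k most likely tokens, renormalize, sample.
    top = np.argsort(probs)[-k:]
    return int(rng.choice(top, p=probs[top] / probs[top].sum()))

def nucleus(probs, p=0.9):
    # Nucleus (top-p) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches p, renormalize, then sample.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

print(vocab[greedy(probs)])   # always "the"
print(vocab[top_k(probs)])    # one of "the", "cat", "sat"
print(vocab[nucleus(probs)])  # drawn from the 90% probability nucleus
```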
The advancement of LLMs capable of producing high-quality, human-like text has transformed a wide range of tasks by assisting humans. However, the widespread accessibility of such models has raised concerns about misuse by malicious actors, who could use LLMs to spread misinformation, create fake websites, and generate misleading content. Additionally, inexperienced users may overestimate the capabilities of these models and accept flawed outputs, with serious consequences in critical situations.
Related work
Recent language models, including ChatGPT and Generative Pre-trained Transformer 4 (GPT-4), have displayed impressive performance across various natural language processing (NLP) tasks, such as natural language inference, sentiment analysis, and fact-checking. They have also been evaluated as annotators and controllers for AI tasks. However, evidence suggests that ChatGPT might not perform as well in subjective NLP tasks. The increasing prevalence of LLMs and conversational AI assistants has normalized the use of AI-generated text, leading to growing concerns about potential misuse.
Consequently, research on detecting AI-generated text has garnered significant attention, with various computational methods explored, including feature-based approaches, statistical methods, and fine-tuned language models. Commercial AI content detectors, such as the OpenAI and ZeroGPT detectors, have also been introduced. Additionally, watermarking techniques, which embed imperceptible artifacts in generated text to enable later detection, have been studied.
ChatGPT as an AI text-detector
The researchers used ChatGPT as an AI-text detector to determine whether it can distinguish between human-written and AI-generated text. The TuringBench dataset provides AI-generated text from 19 different generators alongside human-written news articles from Cable News Network (CNN) and The Washington Post. The experiments employ ChatGPT and GPT-4 as detectors, using specific classification prompts with the temperature parameter set to zero for stable output. Each article is classified as ‘human-written,’ ‘AI-generated,’ or ‘unclear.’ The experiments are conducted on the test split of the datasets, with GPT-4 limited to 500 samples due to rate constraints.
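The setup described above can be sketched as follows. This is an illustrative reconstruction, not the paper's exact code: the prompt wording, the label-parsing step, and the model identifiers are assumptions.

```python
# Illustrative zero-shot detector using the OpenAI Python client; the exact
# prompt used in the paper may differ from this wording.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

PROMPT = (
    "Decide whether the following article was written by a human or "
    "generated by an AI. Answer with exactly one of: 'human-written', "
    "'AI-generated', or 'unclear'.\n\nArticle:\n{article}"
)

def classify(article: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,    # e.g. "gpt-3.5-turbo" for ChatGPT, "gpt-4" for GPT-4
        temperature=0,  # deterministic decoding for stable, repeatable labels
        messages=[{"role": "user", "content": PROMPT.format(article=article)}],
    )
    answer = response.choices[0].message.content.strip().lower()
    for label in ("human-written", "ai-generated", "unclear"):
        if label in answer:
            return label
    return "unclear"  # fall back when the reply matches no expected label
```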
Study Results
The study's examination of ChatGPT's ability to detect AI-generated text yielded interesting findings. ChatGPT struggled to identify AI-generated text from most generators, performing satisfactorily only on GPT-1 and a few others. As generator model size increased, false negatives also grew, indicating that larger models produce more ‘human-like’ text. Conversely, GPT-4 excelled at flagging AI-generated text but faced challenges differentiating it from human-written text, leading to misclassifications and raising concerns about its reliability.
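In this setting, a false negative is an AI-generated article that the detector labels as human-written. A short sketch of how such per-generator rates are tallied; the predictions below are invented placeholders, not the paper's results:

```python
from collections import Counter

# Hypothetical detector outputs per generator, for illustration only.
predictions = {
    "gpt1": ["ai-generated", "ai-generated", "human-written"],
    "gpt3": ["human-written", "human-written", "ai-generated"],
}

for generator, labels in predictions.items():
    counts = Counter(labels)
    # False-negative rate: share of AI articles labeled 'human-written'.
    fnr = counts["human-written"] / len(labels)
    print(f"{generator}: false-negative rate = {fnr:.2f}")
```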
These results shed light on the complexities of AI text detection and highlight important considerations about the reliability of state-of-the-art language models. While ChatGPT struggles with AI-generated text, GPT-4’s overconfidence in labeling everything as ‘AI-generated’ can lead to misleading outcomes. Ongoing research and vigilance are necessary to address the limitations and challenges posed by advanced language models in distinguishing human and AI-generated text effectively.
Additional experiments assessed ChatGPT's and GPT-4's sensitivity to human-written text styles from various datasets. ChatGPT consistently identified human-written text across different sources, misclassifying only a small fraction as AI-generated. GPT-4's performance varied: it did well on some datasets, such as NeuralNews (The New York Times) and Internet Movie Database (IMDb) texts, but struggled with human-written texts from TuringBench, possibly due to dataset-specific characteristics and noise. This indicates that GPT-4 is less reliable than ChatGPT at identifying human-written text, confirming previous studies' findings.
Performance on ChatGPT-generated text: To evaluate detection of ChatGPT-generated text, the researchers created a new dataset called ChatNews, mimicking TuringBench's news domain. ChatGPT misclassified over 50% of ChatNews articles as human-written, whereas GPT-4 correctly identified around 38%, suggesting GPT-4 has some potential for detecting text from earlier models such as ChatGPT. A plausible construction of such a dataset is sketched below.
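The article does not spell out how ChatNews was built, but a ChatNews-style corpus could plausibly be assembled along these lines; the headlines and prompt here are hypothetical stand-ins, not the paper's procedure.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical headlines standing in for TuringBench's news topics.
headlines = [
    "City council approves new transit plan",
    "Researchers report advance in battery storage",
]

def generate_article(headline: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the model behind ChatGPT at the time
        messages=[{
            "role": "user",
            "content": f"Write a short news article with the headline: {headline}",
        }],
    )
    return response.choices[0].message.content

chatnews = [{"headline": h, "text": generate_article(h)} for h in headlines]
```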
Conclusions
In summary, the researchers explored ChatGPT's ability to detect AI-generated text and found that it struggles with this task while performing well on human-written text. This asymmetry could be exploited to build detectors that focus on identifying human-written text, indirectly solving the AI-generated text detection problem. The experiments showed that ChatGPT outperformed GPT-4 at separating AI-generated from human-written text, and they revealed GPT-4's sensitivity to noise and dataset artifacts, which undermines its reliability. Future work will investigate the reasons behind this performance gap, exploring the impact of training data and leveraging ChatGPT's capabilities in automated detection pipelines through few-shot prompting and ensemble methods.
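As a hint of what the few-shot variant mentioned above might look like, a detection prompt can be prefixed with a handful of labeled examples. The template below is a speculative sketch, not the authors' design.

```python
# Speculative few-shot detection prompt template; the example slots would be
# filled with known human-written and AI-generated articles before use.
FEW_SHOT_PROMPT = """Label each article as 'human-written' or 'AI-generated'.

Article: {human_example}
Label: human-written

Article: {ai_example}
Label: AI-generated

Article: {article}
Label:"""
```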