In a research paper recently submitted to the arXiv* preprint server, the authors addressed the multilingual aspects of machine-generated text (MGT) detection, evaluating how detectors generalize across languages and introducing a benchmark dataset called MULTITuDE, which features texts in 11 languages generated by a variety of Large Language Models (LLMs).
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
MGT has advanced substantially in recent months, driven by LLMs such as ChatGPT and GPT-4. Although various datasets and MGT detection methods have been developed in response to the rise of generative language models, most focus primarily on English text, and only a few include non-English languages.
Current detection methods include stylometric, deep learning-based, statistical, and hybrid approaches. Although statistics-based techniques outperform deep learning models in some respects, they are far from perfect, which highlights the need to combine both approaches for robust, high-performance MGT detection. To address these shortcomings, the present study focuses on multilingual MGT detection, framed as a binary classification task that distinguishes human-written from machine-generated text.
Proposed Solution
The authors introduced a new benchmark dataset called MULTITuDE, aimed at evaluating MGT detection in a multilingual context. It comprises human-written news articles in 11 languages, obtained from the MassiveSumm dataset, whose titles served as prompts for eight state-of-the-art LLMs to generate corresponding machine texts. The dataset includes train and test splits for fine-tuning and evaluating detectors.
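To make the construction step concrete, below is a minimal sketch of title-prompted generation with a HuggingFace causal language model. The checkpoint, decoding parameters, and prompt format are illustrative assumptions; the paper's exact generation setup across its eight LLMs is not reproduced here.

```python
# Hedged sketch of title-as-prompt generation; the checkpoint and decoding
# settings are assumptions, not the paper's exact configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ai-forever/mGPT"  # stand-in generator; the paper used eight different LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_article(title: str, max_new_tokens: int = 300) -> str:
    # The human article's title serves as the generation prompt,
    # mirroring MULTITuDE's dataset-construction idea.
    inputs = tokenizer(title, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_article("Elections bring record turnout across the region"))
```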
Major languages from three different language families were selected as training languages (English, Spanish, and Russian), and related test languages were chosen for each of them. The dataset contains approximately 1,000 human texts per training language, along with corresponding MGT and test data for the other languages. The evaluation and linguistic analysis confirmed the LLMs' ability to generate texts in the requested languages with high success rates, except for LLaMA 65B, which performed slightly worse on Arabic and Chinese. The MGTs are comparable to human texts in sentence and word counts, making them challenging to detect, although some artifacts may remain, as a detailed analysis of the generated texts was not conducted.
Detection Methods
Black-Box Detectors: These are zero-shot methods that reveal little about the underlying model or detection approach. The two black-box detectors evaluated are ZeroGPT and GPTZero, both commercial paid services that claim support for non-English languages. Their training methodologies and the specific data used for detection are undisclosed.
Statistical Detectors: The benchmark covers a range of statistical detectors, from baselines to state-of-the-art models that have shown high performance on English datasets. These detectors, such as Log-Likelihood, Rank, Log-Rank, Entropy, GLTR Test-2, and DetectGPT, distinguish MGT from human-written text by assessing the likelihood of each word in a text. For this multilingual evaluation, mGPT, a multilingual GPT-based model, is used to compute the word probabilities.
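As an illustration of the simplest of these methods, the sketch below scores a text by its average token log-probability under mGPT and thresholds the result. The threshold value is an illustrative assumption, not the paper's calibration.

```python
# Hedged sketch of a log-likelihood statistical detector: score a text by the
# average token log-probability under mGPT. The threshold is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai-forever/mGPT")
model = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT")
model.eval()

def avg_log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        # With labels=input_ids, a causal LM returns the mean cross-entropy
        # (negative log-likelihood) over the sequence.
        loss = model(ids, labels=ids).loss
    return -loss.item()  # higher = more "expected" by the scoring LM

def is_machine_generated(text: str, threshold: float = -3.0) -> bool:
    # MGT tends to be assigned higher probability than human text.
    return avg_log_likelihood(text) > threshold
```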
Fine-Tuned Detectors: Seven popular HuggingFace language models were selected to represent state-of-the-art techniques while accounting for multilinguality. These models were fine-tuned for the MGT detection task using various combinations of source languages and text-generation models from the MULTITuDE dataset. Fine-tuning covered several scenarios, including the individual training languages (English, Spanish, Russian), all training languages combined, and English with three times more training samples, resulting in 315 fine-tuned detection methods in total.
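The following is a minimal sketch of one such fine-tuning run, using XLM-RoBERTa as an assumed stand-in for the seven models the authors evaluated; the toy dataset and hyperparameters are illustrative, not the paper's.

```python
# Hedged sketch of fine-tuning a multilingual detector for binary
# human-vs-machine classification; model choice and settings are assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # label 0 = human, 1 = machine

# Toy stand-in for a MULTITuDE training split; the real dataset provides
# news texts in 11 languages with human/machine labels.
train_ds = Dataset.from_dict({
    "text": ["A human-written news article ...",
             "A machine-generated news article ..."],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="mgt-detector",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds,
        tokenizer=tokenizer).train()
```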
Experiment and Results
The authors conducted experiments to evaluate the performance of MGT detection methods. They aimed to assess the ability of these detectors to identify machine-generated content in various languages, explore the generalization of detectors trained in monolingual settings to other languages, investigate the effectiveness of multilingual training, and analyze how detectors perform across different LLMs.
- Zero-Shot Setting
  - Statistical detectors performed poorly in the multilingual setting, achieving only around a 47% F1 score.
  - Black-box detectors could not reliably distinguish human-written from machine-generated texts across languages.
- Monolingual Generalization
  - Detectors fine-tuned in monolingual settings generalized to other languages, albeit with some performance degradation; the degree of success depended on the linguistic similarity between the training and test languages (a sketch of such a cross-lingual evaluation follows this list).
  - English was an outlier training language, with relatively weak cross-lingual performance compared to the other training languages.
  - Performance was correlated across linguistically similar languages.
- Multilingual Generalization
  - Detectors fine-tuned on multilingual data recognized MGT in unseen languages better than monolingually fine-tuned detectors.
  - The gain was not merely due to more training samples: detectors fine-tuned on English with three times the data still did not match the multilingual detectors.
  - The results suggest that multilingual fine-tuning strengthens detectors' transferability to other languages.
- Cross-Generator Generalization
  - Detectors' performance was influenced by the similarity of the underlying LLMs.
  - LLMs developed by OpenAI (Group 1) showed different performance characteristics from Meta AI-based models (Group 2), while models within each group behaved similarly on the dataset.
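A hedged sketch of how cross-lingual generalization numbers like those above can be computed: a detector fine-tuned on one language is scored on test texts in every language using macro F1. The `predict_fn` callable and the `test_sets` layout are hypothetical conveniences, not the paper's code.

```python
# Illustrative cross-lingual evaluation loop; data layout and detector
# interface are assumptions for the sketch.
from sklearn.metrics import f1_score

def evaluate_cross_lingual(predict_fn, test_sets):
    """predict_fn: maps a list of texts to 0/1 labels (0=human, 1=machine).
    test_sets: {language: (texts, gold_labels)} for each test language."""
    results = {}
    for lang, (texts, gold) in test_sets.items():
        preds = predict_fn(texts)
        # Macro F1 weights the human and machine classes equally.
        results[lang] = f1_score(gold, preds, average="macro")
    return results

# Toy example: a detector trained on English only, checked on a related
# language (real splits would come from MULTITuDE).
toy_sets = {"en": (["text a", "text b"], [0, 1]),
            "es": (["texto a", "texto b"], [0, 1])}
print(evaluate_cross_lingual(lambda texts: [0] * len(texts), toy_sets))
```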
Conclusion
To sum up, the MULTITuDE dataset introduced in this study is a novel benchmark covering 11 languages and eight state-of-the-art language models, enabling comprehensive evaluations of MGT detection methods. The results demonstrate that fine-tuning detectors based on multilingual language models is the most effective approach and that the linguistic similarity between languages significantly influences how well detectors generalize. The authors plan to expand the benchmark with more diverse languages, scripts, and texts from various domains, particularly social media.
Journal reference:
- Preliminary scientific report.
Macko, D., Moro, R., Uchendu, A., Lucas, J. S., Yamashita, M., Pikuliak, M., Srba, I., Le, T., Lee, D., Simko, J., & Bielikova, M. (2023, October 20). MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark. arXiv. https://doi.org/10.48550/arXiv.2310.13606, https://arxiv.org/abs/2310.13606