In an article recently posted to the Meta Research website, researchers introduced BELEBELE, a multiple-choice machine reading comprehension dataset spanning 122 language variants.
This dataset significantly expanded language coverage in natural language understanding benchmarks, enabling the evaluation of text models across high-, medium-, and low-resource languages. It allowed direct comparison of model performance across languages, revealing that smaller multilingual models often outperformed English-centric large language models in understanding diverse languages. BELEBELE opened new opportunities for evaluating multilingual natural language processing (NLP) systems.
Background
Past work in cross-lingual evaluation has produced several natural language understanding (NLU) datasets covering under 30 languages, mostly high- or medium-resource. Despite advancements like MINTAKA for large language models (LLMs) and XLSUM for cross-lingual abstractive summarization, these datasets still lack extensive language coverage.
Challenges include the difficulty of creating parallel datasets for low-resource languages and ensuring consistent quality across diverse linguistic and cultural contexts. Additionally, many existing datasets require complex cross-lingual knowledge transfer, making evaluation across languages even more challenging.
BELEBELE Dataset Overview
The BELEBELE dataset was developed by creating multiple-choice questions and answers in English and translating them into other languages. This approach ensures that samples are comparable across languages, facilitating direct score comparisons. The process involved constructing the dataset in English first, choosing multiple-choice questions to maintain fairness in evaluation, and avoiding tasks that require higher-level reasoning.
The creation of BELEBELE included a rigorous quality assurance process involving both manual and automatic inspections. The team trained annotators to follow detailed guidelines and incorporated feedback through several iterative rounds. To ensure high quality, the dataset was evaluated for potential biases and excessive simplicity, with statistical tests and lexical overlap analyses used to refine the questions.
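As a rough illustration of what such a lexical-overlap check could look like, the Python sketch below flags items whose correct option shares far more vocabulary with the passage than the distractors do. The field names, tokenization, and threshold are illustrative assumptions, not the authors' actual quality-assurance pipeline.

```python
# Rough lexical-overlap check for multiple-choice items. Field names,
# tokenization, and the flagging threshold are illustrative assumptions,
# not the authors' actual quality-assurance pipeline.
import string

def token_set(text):
    """Lowercased, punctuation-stripped whitespace tokens; a crude overlap signal."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def overlap_ratio(option, passage):
    """Fraction of an option's tokens that also appear in the passage."""
    opt, psg = token_set(option), token_set(passage)
    return len(opt & psg) / max(len(opt), 1)

def flags_as_too_easy(passage, options, answer_idx, margin=0.5):
    """Flag an item if the correct option overlaps the passage far more than
    every distractor, i.e. it could be answered by string matching alone."""
    scores = [overlap_ratio(o, passage) for o in options]
    distractors = [s for i, s in enumerate(scores) if i != answer_idx]
    return scores[answer_idx] - max(distractors) > margin

passage = "The river freezes every winter, and locals skate across it."
options = [
    "The river freezes every winter.",   # heavy overlap with the passage
    "The river never gets cold.",
    "Locals prefer swimming in summer.",
    "Ice skating is banned by law.",
]
print(flags_as_too_easy(passage, options, answer_idx=0))  # True -> candidate for rewriting
```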
Experts fluent in English and the target languages translated the text, ensuring alignment with the original passages. This meticulous process, including proofreading and editing, helped maintain question difficulty across languages. BELEBELE features 900 multiple-choice questions across 122 languages, with 488 distinct passages, 29 unique scripts, and 27 language families represented, providing a consistent challenge in text comprehension and facilitating cross-lingual model evaluations.
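For readers who want to inspect the data directly, a minimal loading sketch is shown below. It assumes the dataset is published on the Hugging Face Hub as facebook/belebele with per-language configurations and fields such as flores_passage, question, mc_answer1-4, and correct_answer_num; the exact identifiers may differ from these assumptions.

```python
# Minimal loading sketch. The Hugging Face repository id "facebook/belebele",
# the per-language configuration name "eng_Latn", the "test" split, and the
# field names below are assumptions about the public release and may need
# adjusting against the actual dataset card.
from datasets import load_dataset

belebele = load_dataset("facebook/belebele", "eng_Latn", split="test")

item = belebele[0]
passage = item["flores_passage"]                        # shared passage text
question = item["question"]
options = [item[f"mc_answer{i}"] for i in range(1, 5)]  # four answer choices
answer_idx = int(item["correct_answer_num"]) - 1        # stored as 1-4

print(question)
print("gold answer:", options[answer_idx])
```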
Evaluation Summary
The BELEBELE dataset enables extensive evaluation of various models across 122 languages by comparing their performance on multiple-choice questions. This benchmark allows for a thorough assessment of masked language models (MLMs) and LLMs, focusing primarily on accuracy. Due to the four-choice format of the questions, random guessing yields an expected accuracy of 25%.
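The short sketch below shows how accuracy is computed for this format and why uniform random guessing lands near the 25% chance level; the labels used here are synthetic, purely for demonstration.

```python
# Accuracy for the four-way multiple-choice format. With four options, a
# uniform random guesser is expected to land near 25%; the labels below are
# synthetic, purely to demonstrate the baseline.
import random

def accuracy(predictions, gold):
    """Share of questions where the predicted option index matches the gold one."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

gold = [random.randrange(4) for _ in range(900)]           # 900 questions per language
random_guesses = [random.randrange(4) for _ in range(900)]
print(f"random baseline: {accuracy(random_guesses, gold):.1%}")  # roughly 25%
```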
Several MLMs were assessed in the experiments, including XLM-V, INFOXLM, and XLM-R, all pre-trained on multilingual corpora. These models are designed to perform well across a wide range of languages by adjusting for resource availability in different languages. The evaluation also included LLMs such as generative pre-trained transformer 3.5-turbo (GPT-3.5-TURBO), FALCON, and LLAMA (1 and 2). While GPT-3.5-TURBO excels in high-resource languages, its performance in less common languages falls behind that of models that handle linguistic diversity more effectively.
The evaluation settings included full model fine-tuning in English with cross-lingual transfer, five-shot in-context learning, and zero-shot evaluation. Fine-tuning involved adapting models to answer multiple-choice questions in various languages, while zero-shot settings tested models' capabilities without additional training on the target languages. The results revealed that while LLMs generally outperform MLMs in high-resource languages, MLMs are more robust in handling low-resource languages.
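A minimal sketch of the five-shot in-context setting is given below: five solved examples are prepended to the target question, and the model is expected to continue with the letter of the correct option. The prompt template is an illustrative assumption rather than the paper's exact wording.

```python
# Sketch of a five-shot in-context prompt for the multiple-choice task. The
# template wording is an illustrative assumption, not the paper's exact prompt;
# the model is then expected to continue with the letter of the correct option.

LETTERS = ["A", "B", "C", "D"]

def format_item(passage, question, options, answer_idx=None):
    """Render one item; include the answer letter only for solved demonstrations."""
    lines = [f"Passage: {passage}", f"Question: {question}"]
    lines += [f"{LETTERS[i]}) {option}" for i, option in enumerate(options)]
    lines.append("Answer:" + (f" {LETTERS[answer_idx]}" if answer_idx is not None else ""))
    return "\n".join(lines)

def build_five_shot_prompt(demos, target):
    """Prepend five solved examples (demos) to the unsolved target question."""
    shots = [format_item(d["passage"], d["question"], d["options"], d["answer_idx"])
             for d in demos[:5]]
    query = format_item(target["passage"], target["question"], target["options"])
    return "\n\n".join(shots + [query])
```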
Overall, the dataset's difficulty highlights the challenges faced by current models, with human performance significantly surpassing that of the evaluated models. The findings underscore the impact of pretraining data distribution, model size, and vocabulary on multilingual generalization. Machine translation techniques for zero-shot tasks showed mixed results, and models with larger vocabularies performed better in low-resource languages. Additionally, the impact of script varied, with native scripts yielding better performance than Latin scripts for certain languages.
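One common way to apply machine translation in this kind of evaluation is the translate-test approach, in which a non-English item is translated into English before being scored by an English-centric model. The sketch below illustrates that idea; the translate() helper is hypothetical and stands in for whatever translation system is available, so this is not a claim about the paper's exact setup.

```python
# Illustration of the translate-test idea: convert a non-English item into
# English before scoring it with an English-centric model. The translate()
# hook is hypothetical; it stands in for whatever machine-translation system
# is available and here simply returns its input so the sketch stays runnable.

def translate(text, src_lang, tgt_lang="eng_Latn"):
    """Hypothetical MT hook; replace with a real translation model or API."""
    return text

def translate_test_item(item, src_lang):
    """Return an English rendering of a multiple-choice item; labels are unchanged."""
    return {
        "passage": translate(item["passage"], src_lang),
        "question": translate(item["question"], src_lang),
        "options": [translate(o, src_lang) for o in item["options"]],
        "answer_idx": item["answer_idx"],
    }
```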
Conclusion
To sum up, evaluating language models' performance in low- and moderate-resource languages has been a significant challenge, and annotated benchmarks have been needed. This paper introduced BELEBELE, a comprehensive dataset featuring passages and multiple-choice questions in 122 languages, which facilitated a thorough assessment of reading comprehension across both high- and low-resource languages.
BELEBELE stood out as the first of its kind for many medium- and low-resource languages, offering valuable insights into the multilingual capabilities of language models. Results showed that while large vocabulary sizes and balanced pretraining data were crucial for optimal performance in these languages, even models primarily trained in English could effectively generalize to over 30 languages.
Looking ahead, BELEBELE was anticipated to enable a more in-depth exploration of language models and their capabilities. The dataset provided a valuable resource for examining specific model skills, such as reasoning, and for understanding how these skills relate to multilingual performance. It was believed that BELEBELE would drive further advancements in NLP, enhancing systems' capabilities across a broader range of languages beyond those with abundant resources.