With new benchmarks and a diverse, multilingual dataset, PANGEA sets a new standard for language and cultural inclusivity in AI, offering a fully open foundation for cross-lingual understanding.
Research: Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages. Image Credit: Owlie Productions / Shutterstock
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint* server, researchers at Carnegie Mellon University presented a novel model called PANGEA, a multilingual and multimodal large language model (MLLM) trained on PangeaINS, a dataset of 6 million instruction samples spanning 39 languages, with tasks specifically curated for cultural relevance and cross-lingual inclusivity.
They introduced an evaluation suite, PangeaBench, designed to comprehensively assess model performance across 47 languages, capturing diverse linguistic and cultural contexts. Results showed that PANGEA outperformed other open-source models in both English and multilingual tasks, underscoring the role of strategically balanced multilingual data in model success. All resources were open-sourced to promote wider access to culturally inclusive artificial intelligence (AI) development.
Background
This paper addressed the limitations of current MLLMs, which predominantly focus on English-centric datasets, limiting their effectiveness across diverse languages and cultures. Previous research highlights that such models often underperform in multilingual settings, fail to align with cultural norms, and struggle to recognize objects and imagery specific to different regions.
The authors introduced PANGEA, a multilingual MLLM trained on a newly curated, linguistically diverse dataset called PangeaINS. This dataset spans 39 languages and is structured to include a balanced ratio of English to non-English content to address these gaps. Additionally, the paper presented PangeaBench, a robust evaluation framework consisting of multimodal and text-only tasks designed to assess PANGEA’s understanding across both cross-lingual and cross-cultural dimensions.
Through this approach, PANGEA demonstrated significant performance improvements in multilingual and cross-cultural settings, surpassing other open-source MLLMs. The study provides fully open-source resources to support inclusive, globally accessible AI development.
Building PangeaINS
PangeaINS was a curated multilingual and multicultural dataset of six million samples across 39 languages, designed to enhance large language models (LLMs) in multilingual, multimodal contexts. To address the challenges of limited multilingual datasets, the creation of PangeaINS employed three main strategies: scaling machine-translated, high-quality English instructions into multiple languages, creating culturally diverse and contextually relevant tasks, and integrating existing open-source, multilingual datasets.
The first strategy relied on carefully machine-translated English instructions, balancing quality and scalability. After initially experimenting with various open-source models, the researchers chose the proprietary Gemini 1.5 Pro model, which showed superior accuracy, particularly in complex scenarios. A post-processing pipeline then corrected translation inconsistencies, ensuring alignment across languages.
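The paper's code is not reproduced here, but the translate-then-post-process strategy can be sketched in a few lines of Python. Everything below is illustrative: translate() is a mock standing in for a call to a model such as Gemini 1.5 Pro, and the placeholder check is just one example of the kind of consistency fix a post-processing pipeline might apply.

```python
import re

TARGET_LANGUAGES = ["es", "hi", "sw", "ja"]  # a handful of the 39 target languages


def translate(text: str, target_lang: str) -> str:
    """Mock translator; in practice this would call a model such as Gemini 1.5 Pro."""
    return f"[{target_lang}] {text}"


def post_process(original: str, translated: str) -> str:
    """Toy consistency fix: ensure special tokens like <image> survive translation."""
    for token in re.findall(r"<[a-z_]+>", original):
        if token not in translated:
            translated = f"{token}\n{translated}"
    return translated.strip()


def scale_instructions(english_samples: list[dict]) -> list[dict]:
    """Expand English instruction samples into every target language."""
    multilingual = []
    for sample in english_samples:
        for lang in TARGET_LANGUAGES:
            raw = translate(sample["instruction"], lang)
            multilingual.append({
                "language": lang,
                "instruction": post_process(sample["instruction"], raw),
                "image": sample.get("image"),
            })
    return multilingual
```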
Additionally, PangeaINS included instructions tailored for cultural understanding: the researchers curated 1 million culturally diverse images from the LAION-multi dataset, filtering them and enriching them with detailed captions and instructions that capture cultural and linguistic nuances. Generating captions in each image’s native language enhanced the authenticity and cultural relevance of the data.
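As a rough illustration of this curation step, the following sketch filters a LAION-multi-style metadata stream and attaches a native-language caption to each retained image. The filter criteria and caption_model() are hypothetical stand-ins; the authors' actual pipeline is far more elaborate.

```python
def is_culturally_relevant(entry: dict) -> bool:
    """Toy filter; the real pipeline used much richer selection criteria."""
    return bool(entry.get("lang")) and entry.get("nsfw", "UNLIKELY") == "UNLIKELY"


def caption_model(image_url: str, lang: str) -> str:
    """Mock captioner; stands in for a multimodal model prompted in `lang`."""
    return f"[detailed {lang} caption for {image_url}]"


def curate(entries: list[dict]) -> list[dict]:
    """Keep culturally relevant images and pair each with a native-language caption."""
    return [
        {
            "image": entry["url"],
            "language": entry["lang"],
            "caption": caption_model(entry["url"], entry["lang"]),
        }
        for entry in entries
        if is_culturally_relevant(entry)
    ]
```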
Lastly, PangeaINS incorporated open-source datasets, such as ALLaVA-4V and document visual question-answering (Doc-VQA), broadening the linguistic and contextual coverage. The result was a balanced dataset with a notable 60% of non-English content, promoting effective cross-lingual transfer. By combining diverse language tasks, PangeaINS supported the training of models like PANGEA, which is tailored to understand and interact effectively across global linguistic and cultural landscapes.
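To make the language balance concrete, here is a small sketch, not taken from the released code, of sampling a mix in which roughly 60% of instructions are non-English:

```python
import random


def build_mix(english: list, non_english: list,
              non_english_share: float = 0.6, seed: int = 0) -> list:
    """Sample the largest mix attainable at the requested non-English share."""
    rng = random.Random(seed)
    # The smaller pool, scaled by its target share, caps the total mix size.
    total = int(min(len(english) / (1 - non_english_share),
                    len(non_english) / non_english_share))
    n_non = int(total * non_english_share)
    n_en = total - n_non
    mix = rng.sample(english, n_en) + rng.sample(non_english, n_non)
    rng.shuffle(mix)
    return mix
```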
PangeaBench
PangeaBench was a comprehensive evaluation suite developed to assess the performance of the MLLM PANGEA across various languages, cultures, and tasks. It included both multimodal and text-only tasks to ensure a holistic assessment of PANGEA’s abilities in cross-lingual, cross-cultural, and multimodal understanding. Multimodal tasks covered five areas: multimodal chat, captioning, cultural understanding, multilingual visual question answering (VQA), and multi-subject reasoning.
Each category within PangeaBench used specific datasets, such as xChatBench for multilingual chat, XM100 for image captioning in 36 languages, and CVQA and MaRVL for culturally grounded content. Additional datasets, including xGQA for cross-lingual VQA and MaXM for multilingual VQA, along with xMMMU and M3Exam, assessed reasoning across multiple subjects.
Text-only tasks evaluated PANGEA’s deep linguistic understanding across diverse languages, with TyDiQA for question answering, FLORES-Sub for machine translation, and XStoryCloze and multilingual grade school math (MGSM) for reasoning. By incorporating both multimodal and text-only tasks, PangeaBench established an extensive framework for assessing PANGEA’s competence in diverse linguistic and cultural contexts, offering a robust benchmark for multilingual multimodal models.
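Put schematically, the suite pairs each task category with its datasets. The registry below restates the structure described in this section; the surrounding harness function is a hypothetical sketch, not the released PangeaBench implementation.

```python
from typing import Callable

# Task registry restating the benchmark structure described above.
PANGEABENCH = {
    "multimodal": {
        "multimodal_chat": ["xChatBench"],
        "captioning": ["XM100"],
        "cultural_understanding": ["CVQA", "MaRVL"],
        "multilingual_vqa": ["xGQA", "MaXM"],
        "multi_subject_reasoning": ["xMMMU", "M3Exam"],
    },
    "text_only": {
        "question_answering": ["TyDiQA"],
        "machine_translation": ["FLORES-Sub"],
        "reasoning": ["XStoryCloze", "MGSM"],
    },
}


def evaluate(model, load_dataset: Callable, score: Callable) -> dict:
    """Run every task; load_dataset and score are caller-supplied callables."""
    results = {}
    for modality, categories in PANGEABENCH.items():
        for category, datasets in categories.items():
            for name in datasets:
                results[(modality, category, name)] = score(model, load_dataset(name))
    return results
```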
Experiments and Discussion
The PANGEA project employed a comprehensive setup to evaluate its MLLM, PANGEA-7B, which was trained on the PangeaINS dataset of six million samples across 39 languages. PANGEA-7B was built on the LLaVA-Next architecture with Qwen2-7B-Instruct as its language backbone. Training used a batch size of 512 and a learning rate of 2e-5, and the model was compared against several leading open-source and proprietary models.
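Expressed as a configuration dictionary, the reported setup looks like this (the key names are assumptions; only the values come from the article):

```python
# Hyperparameters as reported in the article; key names are illustrative.
TRAIN_CONFIG = {
    "architecture": "LLaVA-Next",
    "language_backbone": "Qwen2-7B-Instruct",
    "dataset": "PangeaINS",        # ~6M instruction samples, 39 languages
    "global_batch_size": 512,
    "learning_rate": 2e-5,
}
```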
Results showed that PANGEA-7B not only surpassed other open-source models in English and multilingual tasks but also demonstrated balanced capabilities across languages. While it led among open-source models, some gaps remained relative to proprietary models such as GPT-4o. In text-only tasks, PANGEA-7B performed strongly in both comprehension and reasoning, with the math-related instructions included during training contributing to its reasoning results.
The study also examined the impact of instruction quantity and the role of English data in improving multilingual performance. Findings indicated a strategic balance between training samples in various languages and task performance, suggesting optimized sample allocation could improve model outcomes.
Notably, while PANGEA excelled in multimodal chat tasks, challenges persisted in multilingual optical character recognition (OCR). Preliminary OCR training showed potential for improvement, especially in Latin-script languages, and future work aims to further develop this capability.
Conclusion
In conclusion, the PANGEA project presented an innovative MLLM designed to enhance linguistic and cultural understanding across a diverse range of languages. Utilizing the richly curated PangeaINS dataset, comprising six million samples, the model demonstrated significant performance gains over existing open-source models, particularly in cross-lingual and culturally specific tasks.
The comprehensive PangeaBench evaluation suite confirmed PANGEA’s superior performance while also identifying ongoing challenges, such as support for low-resource languages and improvements in multilingual OCR. By open-sourcing all components, including PANGEA models, PangeaINS, and PangeaBench resources, the project encourages further advancements in inclusive AI research and development.
Sources:
Journal reference:
- Preliminary scientific report.
Yue, X., Song, Y., Asai, A., Kim, S., Nyandwi, J. D., Khanuja, S., Kantharuban, A., Sutawika, L., Ramamoorthy, S., & Neubig, G. (2024). Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages. arXiv. DOI: 10.48550/arXiv.2410.16153, https://arxiv.org/abs/2410.16153