Addressing Geographic Disparities in Text-to-Image Models

In an article recently submitted to the arXiv* server, researchers investigated potential biases in text-to-image generative models, focusing on geographic disparities in their outputs. They introduced three innovative indicators to evaluate the generated images across different regions of the world.

Study: Addressing Geographic Bias in Text-to-Image Models. Image Credit: SObeR 9426/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Background

Recent advancements in artificial intelligence have led to the development of powerful text-to-image generative models capable of producing photorealistic images from textual descriptions. Models such as Stable Diffusion, DALL-E 2, Imagen, and Make-a-Scene have revolutionized content creation by offering plug-and-play solutions for generating diverse visual content.

However, the widespread adoption of these systems necessitates a thorough understanding of their potential biases and limitations, particularly regarding their ability to represent the world accurately and fairly. Additionally, there is a growing concern about the ethical implications of these technologies, including issues of privacy and consent when creating realistic images of individuals.

About the Research

In this paper, the authors addressed the lack of comprehensive quantitative benchmarks for evaluating geographic disparities in text-to-image generative models. They proposed three novel indicators: the region indicator, the object-region indicator, and the object consistency indicator. These indicators build on established image generation metrics, namely precision, coverage, and CLIPScore, a reference-free image-text alignment metric, to assess the diversity, realism, and consistency of generated images across different geographic locations.
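To make the consistency component concrete, CLIPScore can be computed as a scaled cosine similarity between CLIP image and text embeddings. The following is a minimal sketch in Python using the Hugging Face transformers library; the checkpoint name and the 2.5 scaling weight follow the original CLIPScore formulation (Hessel et al., 2021) and are assumptions here, not necessarily the study's exact configuration.

```python
# Minimal CLIPScore sketch: score = 2.5 * max(cos(image_emb, text_emb), 0).
# Checkpoint and scaling follow the original CLIPScore paper (assumption),
# not necessarily the exact setup used in the study.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalize embeddings so their dot product is a cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return 2.5 * max((img_emb * txt_emb).sum().item(), 0.0)

# Hypothetical usage: score a generated image against its prompt.
# img = Image.open("generated_car_nigeria.png")
# print(clip_score(img, "a photo of a car in Nigeria"))
```

Averaging such scores per region, and comparing them with and without geographic terms in the prompt, is one way a consistency indicator of this kind could be aggregated.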

The region indicator measures disparities in the diversity and realism of the generated images across various regions, while the object-region indicator focuses on object-specific diversity and realism. The object consistency indicator evaluates how faithfully the generated images reflect the input prompts, especially when geographic information is included.
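To illustrate the realism and diversity components, the sketch below implements the widely used precision and coverage metrics (Naeem et al., 2020) on pre-extracted feature embeddings. The neighborhood size k = 5, the Euclidean distance, and the random placeholder features are assumptions for illustration; the study's exact feature extractor and settings may differ.

```python
# Sketch of precision and coverage (Naeem et al., 2020) over feature embeddings.
# k = 5 and Euclidean distance are common defaults (assumptions here).
import numpy as np

def knn_radii(real: np.ndarray, k: int = 5) -> np.ndarray:
    # Distance from each real sample to its k-th nearest real neighbor.
    d = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the zero self-distance

def precision_coverage(real: np.ndarray, fake: np.ndarray, k: int = 5):
    radii = knn_radii(real, k)  # one k-NN radius per real sample
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    inside = d <= radii[None, :]  # (n_fake, n_real) ball membership
    precision = float(inside.any(axis=1).mean())  # realism of generated samples
    coverage = float(inside.any(axis=0).mean())   # diversity w.r.t. real data
    return precision, coverage

# Hypothetical usage with random features standing in for image embeddings:
real_feats = np.random.randn(200, 64)   # reference images for one region
fake_feats = np.random.randn(200, 64)   # generated images for the same region
print(precision_coverage(real_feats, fake_feats))
```

Computing these values per region, or per object-region pair, and comparing them across regions is the basic mechanism behind the region and object-region indicators.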

Additionally, the researchers evaluated five state-of-the-art image generative models, including different variants of latent diffusion models and a diffusion model utilizing contrastive language-image pre-training (CLIP) image embeddings. They analyzed the models' performance using prompts of different geographic specificity, ranging from simple object descriptions to prompts incorporating region and country information.
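A minimal sketch of how prompts at increasing geographic specificity might be constructed is shown below; the templates and the object, region, and country lists are hypothetical placeholders rather than the study's actual prompt set.

```python
# Hypothetical prompt construction at three levels of geographic specificity.
objects = ["car", "stove", "house"]
regions = {"Africa": ["Nigeria", "Kenya"],
           "Europe": ["France", "Poland"],
           "West Asia": ["Jordan", "Turkey"]}

prompts = []
for obj in objects:
    prompts.append(f"a photo of a {obj}")                       # object only
    for region, countries in regions.items():
        prompts.append(f"a photo of a {obj} in {region}")       # object + region
        for country in countries:
            prompts.append(f"a photo of a {obj} in {country}")  # object + country

print(len(prompts), prompts[:3])
```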

Research Findings

The study identified significant geographic biases in text-to-image generative models. These models produced less realistic and diverse images for locations in West Asia and Africa than for Europe, often relying on stereotypical representations. For example, images of cars in Africa frequently depicted boxy sport utility vehicles (SUVs) in rural or desert-like settings, which do not accurately reflect the diversity of vehicles and environments across the continent.

Adding geographic details to prompts frequently reduced the quality of the generated images. This suggested that while geographic context could provide useful information, it also introduced biases. The models exhibited greater disparities across regions for certain objects, sometimes due to differences in real-world data but more often due to embedded stereotypes. For instance, images of "stoves" appeared more realistic for Europe than for Africa or West Asia, likely influenced by biases in the training data or model inconsistencies.

Among the latent diffusion variants, the earlier publicly accessible model consistently outperformed the newer version in terms of realism, diversity, and consistency. The diffusion model with CLIP latents (DM w/ CLIP Latents) demonstrated notable strength in realism and consistency, whereas guided language-to-image diffusion for generation and editing (GLIDE) struggled with diversity and consistency. This highlighted the significant impact of model design, training data, and available resources on performance and bias.

The study also revealed a concerning trend where improvements in image quality and consistency on standard benchmarks sometimes compromised accurate geographic representation. Despite benefiting from more training data, the newer latent diffusion model scored lower across all three indicators compared to an earlier version. This underscored the importance of carefully considering evaluation metrics and the potential risks of prioritizing image quality alone.

Conclusion

In summary, the paper provided a comprehensive analysis of geographic biases in text-to-image generative models, emphasizing the need for responsible content creation systems. The introduced indicators served as crucial tools for evaluating and benchmarking these models, enabling developers to identify and mitigate potential biases effectively. The study underscored the necessity of employing diverse and representative datasets during model training, alongside continuous monitoring and evaluation of model performance.

By addressing these concerns, developers can help ensure that text-to-image generative models produce accurate, diverse, and unbiased representations of the world, thereby promoting inclusivity and equity for all users. Moving forward, the study acknowledged limitations such as dataset constraints, biases in feature extraction, and the difficulty of fully replacing qualitative evaluations. Future work could explore the impact of training data on perpetuating geographic biases, investigate the role of text encoders in this context, and examine the influence of linguistic diversity on model prompting and performance beyond English.


Journal reference:
  • Preliminary scientific report. Hall, M., et al. DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity. arXiv, 2024, arXiv:2308.06198v3. DOI: 10.48550/arXiv.2308.06198, https://arxiv.org/abs/2308.06198.

Written by

Muhammad Osama

Muhammad Osama is a full-time data analytics consultant and freelance technical writer based in Delhi, India. He specializes in transforming complex technical concepts into accessible content. He has a Bachelor of Technology in Mechanical Engineering with specialization in AI & Robotics from Galgotias University, India, and he has extensive experience in technical content writing, data science and analytics, and artificial intelligence.

