In an article recently submitted to the arXiv* preprint server, researchers addressed the lack of geographic diversity in text-to-image generative models. They introduced an inference-time intervention called contextualized Vendi Score guidance (c-VSG) to enhance the diversity of generated images. Evaluations on geographically representative datasets showed that c-VSG improved image diversity for underrepresented regions while maintaining or enhancing image quality and consistency.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Text-to-image systems have demonstrated remarkable results and are widely used, prompting research into their potential risks and biases. Studies have identified disparities in the demographic traits of people represented in generated images, leading to mitigation strategies like textual interventions, attention-weight modification, and semantic guidance. However, biases are not limited to human-centric representations; they extend to objects and their surroundings globally.
Prior research has shown that improvements in image quality often compromise representation diversity and text-image consistency, particularly affecting regional object diversity and reinforcing geographic stereotypes. Despite these findings, no direct mitigation strategies have targeted geo-diversity in text-to-image systems.
This paper addressed the gap by introducing c-VSG, an inference-time intervention designed to enhance the diversity of images generated by latent diffusion models (LDMs). By leveraging the Vendi Score (VS) to drive diversity and contextualizing generations with real exemplar images, c-VSG aimed to increase object representation diversity while maintaining image quality and text-image consistency.
Evaluations on two geographically diverse datasets, the geographically diverse evaluation dataset for object recognition (GeoDE) and DollarStreet, demonstrated that c-VSG significantly improved diversity and reduced regional performance disparities, outperforming existing methods. This approach highlighted the potential for more accurate and diverse image generation that reflects real-world geographic diversity.
A Methodological Approach with VSG
The researchers focused on enhancing the diversity of LDMs used in text-to-image generation, specifically models sampled with denoising diffusion implicit models (DDIMs). They introduced VSG, a methodology that steers the generation process of LDMs toward more diverse and representative images.
Initially, the authors outlined the foundational concepts of LDMs, emphasizing the backward (denoising) diffusion process and the application of the VS metric. DDIMs operated by iteratively refining a noisy sample toward a denoised version using noise-schedule coefficients and a learned denoising network.
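As a rough illustration (not the authors' implementation), a single deterministic DDIM denoising step can be sketched as below; the function and argument names, the noise schedule, and the conditioning input are assumptions made for the example.

```python
import torch

def ddim_step(x_t, t, t_prev, alphas_cumprod, eps_model, cond):
    """One deterministic DDIM update (eta = 0): refine x_t toward step t_prev.

    alphas_cumprod: 1-D tensor of cumulative noise-schedule coefficients.
    eps_model: learned denoising network predicting the noise present in x_t.
    cond: conditioning information, e.g. a text-prompt embedding.
    """
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = eps_model(x_t, t, cond)                            # predicted noise
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # estimated clean latent
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```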
The key innovation, VSG, adapted the traditional score function of LDMs by incorporating the VS, a metric typically used to evaluate diversity in datasets. VSG modified the generation process to prioritize samples that differed significantly from previously generated images stored in a memory bank, thus enhancing overall image diversity.
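For reference, the VS of a set of samples is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix, and it can be read as an "effective number" of distinct samples. A minimal NumPy sketch, assuming a cosine-similarity kernel over precomputed image features, is shown below.

```python
import numpy as np

def vendi_score(features: np.ndarray) -> float:
    """Vendi Score of n samples from their feature vectors (shape n x d).

    With a cosine-similarity kernel K, the score is exp(entropy of the
    eigenvalues of K / n): 1 means all samples are identical, n means all
    samples are fully distinct.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    k = feats @ feats.T                        # n x n similarity matrix
    eigvals = np.linalg.eigvalsh(k / k.shape[0])
    eigvals = eigvals[eigvals > 1e-12]         # drop numerical zeros
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```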
Moreover, the authors introduced c-VSG, which further refined VSG by integrating a small set of real-world exemplar images. This contextualization ensured that generated samples not only increased diversity but also remained contextually grounded in real-world object representations.
The formulation of c-VSG balanced augmenting diversity with maintaining fidelity to the exemplar images, achieved through a dual guidance process applied during image generation.
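Conceptually, this dual guidance can be pictured as two gradient terms added during denoising: one raising the Vendi Score of the current sample jointly with the memory bank (diversity), and one keeping it close to the real exemplars (context). The PyTorch sketch below is only a plausible rendering of that idea; the kernel, loss weights, feature encoder, and sign conventions are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def vendi(feats, eps=1e-12):
    """Differentiable Vendi Score of a feature matrix (n x d), cosine kernel."""
    feats = F.normalize(feats, dim=1)
    k = feats @ feats.T / feats.shape[0]
    ev = torch.linalg.eigvalsh(k).clamp_min(eps)
    return torch.exp(-(ev * ev.log()).sum())

def c_vsg_gradient(z, memory_feats, exemplar_feats, encoder,
                   lambda_div=1.0, lambda_ctx=1.0):
    """Illustrative guidance gradient for one denoising step.

    The objective increases the Vendi Score of the current sample together
    with the memory bank (more diverse than past generations) while
    decreasing it together with the real exemplars (staying grounded in
    real-world object depictions). Weights and encoder are placeholders.
    """
    z = z.detach().requires_grad_(True)
    f = encoder(z)                                     # features of current sample
    objective = (lambda_div * vendi(torch.cat([memory_feats, f], dim=0))
                 - lambda_ctx * vendi(torch.cat([exemplar_feats, f], dim=0)))
    (grad,) = torch.autograd.grad(objective, z)
    return grad                                        # added to the model's score
```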
Algorithmically, c-VSG was implemented using a controlled application frequency (Gfreq) to optimize computational efficiency while maximizing diversity. This approach significantly improved the diversity of generated images across various datasets, as evaluated through metrics such as VS, image quality assessments, and text-image consistency checks.
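In practice, applying the guidance at every denoising step would be expensive, so it is invoked only every Gfreq steps. The hedged sketch below shows that control flow; the default values and helper names are illustrative rather than taken from the paper.

```python
import torch

def generate_with_guidance(x_T, timesteps, eps_model, cond, guidance_fn,
                           alphas_cumprod, g_freq=5, guidance_scale=0.2):
    """DDIM-style denoising loop with sparsely applied Vendi Score guidance.

    guidance_fn(x_t) returns a guidance gradient (e.g. the c_vsg_gradient
    sketch above); it is applied only every g_freq steps to limit compute.
    """
    x_t = x_T
    for i, (t, t_prev) in enumerate(zip(timesteps[:-1], timesteps[1:])):
        eps = eps_model(x_t, t, cond)
        if i % g_freq == 0:                            # guidance applied sparsely
            eps = eps + guidance_scale * guidance_fn(x_t)
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x_t = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x_t
```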
Comprehensive Evaluation of Diversity and Quality in Image Generation
The researchers evaluated the diversity and quality of image generation using LDMs on two geographically diverse datasets: GeoDE and DollarStreet. They reported precision and recall (which track image quality and diversity, respectively), their F1 combination, and CLIPScore for text-image consistency.
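For context, CLIPScore rates how well an image matches its prompt via the cosine similarity of CLIP image and text embeddings. The sketch below uses the Hugging Face transformers CLIP model; the checkpoint name and the 100x scaling are common conventions assumed for illustration, not details from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Scaled cosine similarity between CLIP embeddings of an image and its prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return max(0.0, 100.0 * float((img * txt).sum()))
```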
The authors compared several baseline methods, both with and without additional information, including synonyms, paraphrasing, semantic guidance, feedback guidance, and textual inversion. The primary method, VSG, and its contextualized version, c-VSG, which used exemplar images, demonstrated significant improvements in diversity and quality over these baselines.
In terms of results, c-VSG showed substantial improvements in both average and worst-region F1 scores on GeoDE and DollarStreet, outperforming other methods by up to 25% and 37.9%, respectively. VSG improved diversity, reflected in higher recall values and in qualitative examples showing varied object attributes and backgrounds. The c-VSG method with exemplar images further enhanced the consistency and quality of generated images, as measured by precision and CLIPScore.
Ablation studies showed that combining VSG with contextualizing images yielded the best diversity results. Adjusting the weight of exemplar images influenced the trade-off between precision and recall, with more exemplar images generally enhancing quality. Finally, selecting exemplar images stratified by region slightly improved worst-region recall compared to random selection.
Conclusion
In conclusion, the researchers effectively demonstrated that c-VSG significantly enhanced geographic diversity in text-to-image generative models. By leveraging exemplar images and the VS, c-VSG improved representation for underrepresented regions while maintaining image quality and consistency.
Evaluations on datasets like GeoDE and DollarStreet showed substantial improvements in diversity and reduced regional disparities, outperforming existing methods. This innovative approach highlighted the potential for more accurate and inclusive image generation that reflects real-world geographic diversity and set a promising direction for future research into mitigating biases in generative models.
Journal reference:
- Preliminary scientific report. Hemmat, R. A., Hall, M., Sun, A., Ross, C., Drozdzal, M., & Romero-Soriano, A. (2024, June 6). Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance. arXiv. DOI: 10.48550/arXiv.2406.04551, https://arxiv.org/abs/2406.04551