In an article recently submitted to the arXiv* preprint server, researchers addressed the lack of geographic diversity in text-to-image generative models. They introduced an inference-time intervention called contextualized Vendi Score guidance (c-VSG) to enhance the diversity of generated images. Evaluations on geographically representative datasets showed that c-VSG improved image diversity for underrepresented regions while maintaining or enhancing image quality and consistency.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Text-to-image systems have demonstrated remarkable results and are widely used, prompting research into their potential risks and biases. Studies have identified disparities in the demographic traits of people represented in generated images, leading to mitigation strategies like textual interventions, attention-weight modification, and semantic guidance. However, biases are not limited to human-centric representations; they extend to objects and their surroundings globally.
Prior research has shown that improvements in image quality often compromise representation diversity and text-image consistency, particularly affecting regional object diversity and reinforcing geographic stereotypes. Despite these findings, no direct mitigation strategies have targeted geo-diversity in text-to-image systems.
This paper addressed the gap by introducing c-VSG, an inference-time intervention designed to enhance the diversity of images generated by latent diffusion models (LDMs). By leveraging the Vendi Score (VS) to drive diversity and contextualizing generations with real exemplar images, c-VSG aimed to increase object representation diversity while maintaining image quality and text-image consistency.
Evaluations on two geographically diverse datasets, the geographically diverse evaluation dataset for object recognition (GeoDE) and DollarStreet, demonstrated that c-VSG significantly improved diversity and reduced regional performance disparities, outperforming existing methods. This approach highlighted the potential for more accurate and diverse image generation that reflects real-world geographic diversity.
A Methodological Approach with VSG
The researchers focused on enhancing the diversity of LDMs used in text-to-image generation, specifically models sampled with denoising diffusion implicit models (DDIMs). They introduced VSG, a methodology that steers the generation process of LDMs toward more diverse and representative images.
Initially, the authors outlined the foundational concepts of LDMs, emphasizing the backward (denoising) diffusion process and the application of the VS metric. DDIMs operated by iteratively refining a noisy sample toward a denoised version using noise-schedule coefficients and a learned denoising network.
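As a rough illustration (not the authors' implementation), a single deterministic DDIM denoising step can be sketched as below; the function and argument names, the noise schedule, and the conditioning input are assumptions made for the example.

```python
import torch

def ddim_step(x_t, t, t_prev, alphas_cumprod, eps_model, cond):
    """One deterministic DDIM update (eta = 0): refine x_t toward step t_prev.

    alphas_cumprod: 1-D tensor of cumulative noise-schedule coefficients.
    eps_model: learned denoising network predicting the noise present in x_t.
    cond: conditioning information, e.g. a text-prompt embedding.
    """
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = eps_model(x_t, t, cond)                            # predicted noise
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # estimated clean latent
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
```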
The key innovation, VSG, adapted the traditional score function of LDMs by incorporating the VS, a metric typically used to evaluate diversity in datasets. VSG modified the generation process to prioritize samples that differed significantly from previously generated images stored in a memory bank, thus enhancing overall image diversity.
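For reference, the VS of a set of samples is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix, and it can be read as an "effective number" of distinct samples. A minimal NumPy sketch, assuming a cosine-similarity kernel over precomputed image features, is shown below.

```python
import numpy as np

def vendi_score(features: np.ndarray) -> float:
    """Vendi Score of n samples from their feature vectors (shape n x d).

    With a cosine-similarity kernel K, the score is exp(entropy of the
    eigenvalues of K / n): 1 means all samples are identical, n means all
    samples are fully distinct.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    k = feats @ feats.T                        # n x n similarity matrix
    eigvals = np.linalg.eigvalsh(k / k.shape[0])
    eigvals = eigvals[eigvals > 1e-12]         # drop numerical zeros
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```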
Moreover, the authors introduced c-VSG, which further refined VSG by integrating a small set of real-world exemplar images. This contextualization ensured that generated samples not only increased diversity but also remained contextually grounded in real-world object representations.
The formulation of c-VSG balanced augmenting diversity with maintaining fidelity to the exemplar images, achieved through a dual guidance process applied during image generation.
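Conceptually, this dual guidance can be pictured as two gradient terms added during denoising: one raising the Vendi Score of the current sample jointly with the memory bank (diversity), and one keeping it close to the real exemplars (context). The PyTorch sketch below is only a plausible rendering of that idea; the kernel, loss weights, feature encoder, and sign conventions are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def vendi(feats, eps=1e-12):
    """Differentiable Vendi Score of a feature matrix (n x d), cosine kernel."""
    feats = F.normalize(feats, dim=1)
    k = feats @ feats.T / feats.shape[0]
    ev = torch.linalg.eigvalsh(k).clamp_min(eps)
    return torch.exp(-(ev * ev.log()).sum())

def c_vsg_gradient(z, memory_feats, exemplar_feats, encoder,
                   lambda_div=1.0, lambda_ctx=1.0):
    """Illustrative guidance gradient for one denoising step.

    The objective increases the Vendi Score of the current sample together
    with the memory bank (more diverse than past generations) while
    decreasing it together with the real exemplars (staying grounded in
    real-world object depictions). Weights and encoder are placeholders.
    """
    z = z.detach().requires_grad_(True)
    f = encoder(z)                                     # features of current sample
    objective = (lambda_div * vendi(torch.cat([memory_feats, f], dim=0))
                 - lambda_ctx * vendi(torch.cat([exemplar_feats, f], dim=0)))
    (grad,) = torch.autograd.grad(objective, z)
    return grad                                        # added to the model's score
```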
Algorithmically, c-VSG was implemented using a controlled application frequency (Gfreq) to optimize computational efficiency while maximizing diversity. This approach significantly improved the diversity of generated images across various datasets, as evaluated through metrics such as VS, image quality assessments, and text-image consistency checks.
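In practice, applying the guidance at every denoising step would be expensive, so it is invoked only every Gfreq steps. The hedged sketch below shows that control flow; the default values and helper names are illustrative rather than taken from the paper.

```python
import torch

def generate_with_guidance(x_T, timesteps, eps_model, cond, guidance_fn,
                           alphas_cumprod, g_freq=5, guidance_scale=0.2):
    """DDIM-style denoising loop with sparsely applied Vendi Score guidance.

    guidance_fn(x_t) returns a guidance gradient (e.g. the c_vsg_gradient
    sketch above); it is applied only every g_freq steps to limit compute.
    """
    x_t = x_T
    for i, (t, t_prev) in enumerate(zip(timesteps[:-1], timesteps[1:])):
        eps = eps_model(x_t, t, cond)
        if i % g_freq == 0:                            # guidance applied sparsely
            eps = eps + guidance_scale * guidance_fn(x_t)
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x_t = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x_t
```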
Comprehensive Evaluation of Diversity and Quality in Image Generation
The researchers evaluated the diversity and quality of image generation using LDMs on two geographically diverse datasets: GeoDE and DollarStreet. They reported precision and recall (which track image quality and diversity, respectively), their F1 combination, and CLIPScore for text-image consistency.
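For context, CLIPScore rates how well an image matches its prompt via the cosine similarity of CLIP image and text embeddings. The sketch below uses the Hugging Face transformers CLIP model; the checkpoint name and the 100x scaling are common conventions assumed for illustration, not details from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Scaled cosine similarity between CLIP embeddings of an image and its prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return max(0.0, 100.0 * float((img * txt).sum()))
```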
The authors compared several baseline methods, both with and without additional information, including synonyms, paraphrasing, semantic guidance, feedback guidance, and textual inversion. The primary method, VSG, and its contextualized version, c-VSG, which used exemplar images, demonstrated significant improvements in diversity and quality over these baselines.
In terms of results, c-VSG showed substantial improvements in both average and worst-region F1 scores on GeoDE and DollarStreet, outperforming other methods by up to 25% and 37.9%, respectively. VSG improved diversity, reflected in higher recall values and in qualitative examples showing varied object attributes and backgrounds. The c-VSG method with exemplar images further enhanced the consistency and quality of generated images, as measured by precision and CLIPScore.
Ablation studies showed that combining VSG with contextualizing images yielded the best diversity results. Adjusting the weight of exemplar images influenced the trade-off between precision and recall, with more exemplar images generally enhancing quality. Finally, selecting exemplar images stratified by region slightly improved worst-region recall compared to random selection.
Conclusion
In conclusion, the researchers effectively demonstrated that c-VSG significantly enhanced geographic diversity in text-to-image generative models. By leveraging exemplar images and the VS, c-VSG improved representation for underrepresented regions while maintaining image quality and consistency.
Evaluations on datasets like GeoDE and DollarStreet showed substantial improvements in diversity and reduced regional disparities, outperforming existing methods. This innovative approach highlighted the potential for more accurate and inclusive image generation that reflects real-world geographic diversity and set a promising direction for future research into mitigating biases in generative models.
Journal reference:
- Preliminary scientific report. Hemmat, R. A., Hall, M., Sun, A., Ross, C., Drozdzal, M., & Romero-Soriano, A. (2024, June 6). Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance. arXiv. DOI: 10.48550/arXiv.2406.04551, https://arxiv.org/abs/2406.04551