In an article recently posted to the Meta Research website, researchers introduced decomposed indicators of disparities in image generation (Decomposed-DIG), a set of metrics for separately measuring geographic disparities in objects and backgrounds of generated images.
An audit of a latent diffusion model showed that generated objects were more realistic than generated backgrounds and that realism and diversity varied significantly across regions, particularly for Africa. The paper also presented a new prompting structure that improved background diversity in generated images.
Background
Recent advancements in text-to-image generative systems have greatly improved visual content creation and downstream discriminative model training. However, these models often exhibit social biases, especially geographic disparities in image realism and representation diversity.
Prior work found that generated images frequently depict regions such as Africa in stereotypical and inaccurate ways. Existing evaluation metrics, which compare generated images to real images holistically, cannot attribute these biases to specific image components such as objects and backgrounds.
To address this gap, the paper introduced Decomposed-DIG, a set of metrics that separately measure disparities in the object and background depictions of generated images. The decomposition allowed a more detailed analysis of geographic biases, revealing that generated objects were typically more realistic than backgrounds and that backgrounds exhibited greater regional disparities. The study also proposed a prompting technique that significantly improved background diversity, a step toward more representative generated images.
Detailed Benchmarking Protocol for Analyzing Geographic Disparities in Image Generation
The process involved three main steps.
- Object and background segmentation: Images were segmented into object and background components using the Segment Anything Model (SAM), accessed through LangSAM. GroundingDINO first detected the object and produced a bounding box, which SAM then used to produce a segmentation mask. Any image region not identified as part of the object was treated as background, giving a clear division for subsequent analysis.
- Decomposed image features: A vision transformer (ViT) was used for feature extraction, restricted to the object-specific or background-specific patches of each segmented image. Because a ViT processes an image as a grid of patches, features can be isolated per patch, which allowed realism and diversity to be measured for objects and backgrounds separately. This contrasted with traditional convolutional neural network (CNN)-based approaches, as it leveraged patch-level attention scores to refine feature extraction.
- Object and background-specific measurements: Using the ViT features, the protocol calculated precision and coverage metrics separately for object-only ("Obj-only") and background-only ("BG-only") settings across geographic regions. This pinpointed disparities more accurately than previous holistic evaluations, which considered entire images without separating objects from backgrounds.
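The second step above can be illustrated with a minimal sketch. It assumes a binary pixel mask from the segmentation step and one feature vector per ViT patch in row-major order. The function names `patch_mask` and `pool_features`, the 0.5 overlap threshold, and the mean pooling are hypothetical choices for illustration. The sketch labels patches by mask overlap only and does not reproduce the attention-score weighting the paper describes.

```python
from statistics import fmean


def patch_mask(pixel_mask, patch_size, threshold=0.5):
    """Downsample a binary pixel mask (list of rows) to one flag per patch.

    A patch counts as "object" when at least `threshold` of its pixels are
    object pixels. The threshold is an assumption, not a value from the paper.
    """
    height, width = len(pixel_mask), len(pixel_mask[0])
    flags = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            cells = [
                pixel_mask[row][col]
                for row in range(top, min(top + patch_size, height))
                for col in range(left, min(left + patch_size, width))
            ]
            flags.append(fmean(cells) >= threshold)
    return flags  # row-major, matching the assumed patch token order


def pool_features(patch_features, flags):
    """Average patch features separately over object and background patches."""

    def mean_vector(rows):
        if not rows:
            return None  # the image has no patch of this kind
        return [fmean(column) for column in zip(*rows)]

    obj_rows = [feat for feat, is_obj in zip(patch_features, flags) if is_obj]
    bg_rows = [feat for feat, is_obj in zip(patch_features, flags) if not is_obj]
    return mean_vector(obj_rows), mean_vector(bg_rows)


# Toy 4x4 image whose left half is the object, split into 2x2 patches.
toy_mask = [[1, 1, 0, 0]] * 4
toy_features = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
obj_feature, bg_feature = pool_features(toy_features, patch_mask(toy_mask, 2))
```

In a real pipeline the patch features would come from a ViT encoder and the mask from SAM. Both are replaced here by toy lists so the sketch stays self-contained.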
By evaluating image components separately, Decomposed-DIG made the assessment of geographic bias more granular. Disparities in realism and representation diversity could be attributed to distinct parts of the image, which in turn allows targeted improvements to generative models.
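For readers unfamiliar with the two metrics, the following is a minimal sketch of the commonly used k-nearest-neighbour definitions of precision (a realism measure) and coverage (a diversity measure). That the paper uses exactly these definitions, and the value of `k`, are assumptions here. The function names are hypothetical, and feature vectors are plain tuples rather than real ViT features.

```python
from math import dist


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    radii = []
    for i, point in enumerate(points):
        distances = sorted(dist(point, other) for j, other in enumerate(points) if j != i)
        radii.append(distances[k - 1])  # requires len(points) > k
    return radii


def precision_and_coverage(real, generated, k=3):
    """Return (precision, coverage) of generated features against real ones.

    Precision: share of generated samples that fall inside the k-NN ball of
    at least one real sample.
    Coverage: share of real samples whose k-NN ball contains at least one
    generated sample.
    """
    radii = knn_radii(real, k)
    inside = sum(
        any(dist(g, r) <= radius for r, radius in zip(real, radii))
        for g in generated
    )
    covered = sum(
        any(dist(g, r) <= radius for g in generated)
        for r, radius in zip(real, radii)
    )
    return inside / len(generated), covered / len(real)


# Toy 2-D features: one generated point sits near the real cluster, one is far away.
real_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
generated_features = [(0.5, 0.0), (50.0, 50.0)]
precision, coverage = precision_and_coverage(real_features, generated_features, k=1)
```

Under this reading, running the function once on object-only features and once on background-only features, per region, would give the Obj-only and BG-only scores that the protocol compares.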
Analysis of Geographic Disparities in Generated Images
The authors applied Decomposed-DIG to analyze geographic biases in the widely used latent diffusion model (LDM) 1.5.3, dissecting the disparities between the object and background components of generated images across geographic regions.
First, objects were generally more realistic than backgrounds, as indicated by higher precision scores in the Obj-only evaluations than in the BG-only evaluations. This suggested that while generated objects aligned more closely with their real counterparts, backgrounds often depicted settings less representative of real-world diversity, such as rural scenes in Africa or historical architecture in Europe.
Second, backgrounds displayed significantly larger geographic disparities than objects. Coverage in the BG-only setting varied notably across regions, indicating broader problems with representation diversity than for objects. The researchers supported these findings with qualitative observations of generation patterns, identifying specific failure modes in which the LDM struggled to depict diverse backgrounds or realistic objects in certain regions. For instance, backgrounds for Africa lacked diversity in neutral scenes, while objects such as modern vehicles were inadequately represented.
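One simple way to summarize such per-region scores is sketched below. It assumes a disparity is read as the gap between the best and worst region; this summary is an illustrative choice rather than the paper's exact indicator. The function name `summarize_disparity` is hypothetical, and the region labels and numbers are placeholders, not results from the study.

```python
def summarize_disparity(scores_by_region):
    """Summarize per-region metric scores (e.g. BG-only coverage)."""
    worst_region = min(scores_by_region, key=scores_by_region.get)
    best_region = max(scores_by_region, key=scores_by_region.get)
    values = list(scores_by_region.values())
    return {
        "worst_region": worst_region,
        "best_region": best_region,
        # Gap between best and worst region: larger means more disparity.
        "gap": scores_by_region[best_region] - scores_by_region[worst_region],
        "average": sum(values) / len(values),
    }


# Placeholder numbers for illustration only; they are NOT results from the paper.
example_bg_coverage = {"Region A": 0.25, "Region B": 0.5, "Region C": 0.75}
summary = summarize_disparity(example_bg_coverage)
```

Computing the same summary for Obj-only and BG-only scores side by side would show which component contributes more to the regional gap.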
Early Mitigations via New Prompt Template
To address regional disparities in generated images, the researchers tested prompts that name the region with an adjective, such as “European bag”, instead of a noun phrase, such as “bag in Europe”. The new prompting template improved background diversity by 52% for the worst-performing region and by 20% on average, with minimal impact on object realism and diversity. Adjective-based prompts produced more varied and neutral backgrounds, reducing stereotypical representations. The approach also led to a slight improvement in background realism for the worst-performing group and to overall improvements in object depiction.
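The two prompt styles can be sketched as simple templates. Only the “bag in Europe” and “European bag” examples come from the article. The function names and the region-to-adjective table are assumptions for illustration, and the exact wording of the paper's prompts may differ.

```python
# Hypothetical region-to-adjective table; only the "Europe" entry is
# backed by an example in the article.
REGION_ADJECTIVES = {
    "Europe": "European",
    "Africa": "African",
}


def noun_prompt(obj, region):
    """Baseline template: the region appears as a noun phrase."""
    return f"{obj} in {region}"


def adjective_prompt(obj, region):
    """Alternative template: the region appears as an adjective."""
    return f"{REGION_ADJECTIVES[region]} {obj}"


for region in REGION_ADJECTIVES:
    print(noun_prompt("bag", region), "->", adjective_prompt("bag", region))
```

Generating images with both templates and comparing BG-only coverage per region would reproduce the kind of comparison described above.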
Conclusion
The researchers introduced Decomposed-DIG as a benchmarking tool for uncovering geographic disparities in text-to-image models at the level of object and background components. They showed that backgrounds exhibit larger regional disparities than objects, affecting both the realism and the diversity of generated images.
The authors identified specific model shortcomings, such as inadequate depiction of object diversity in Africa and unrealistic backgrounds in Europe. With a new adjective-based prompting strategy, the study demonstrated significant improvements in background diversity without compromising object realism. These findings underscore the importance of detailed evaluation metrics and targeted mitigations for improving the accuracy and inclusivity of generated visual content.