Evaluating Text-to-Image Generative Models: A Global Perspective

In an article recently submitted to the arXiv* server, researchers emphasized the importance of evaluating text-to-image generative models for realism, diversity, and cultural relevance. They found that automated metrics were insufficient to capture diverse human preferences across regions such as Africa, Europe, and Southeast Asia. The study gathered more than 65,000 image annotations and 20 surveys, highlighting regional disparities in perceptions of geographic representation and visual appeal and contrasting human assessments with automated metrics.

Study: Evaluating Text-to-Image Generative Models: A Global Perspective. Image Credit: SObeR 9426/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Background

Past work has made significant advances in generative models for visual content creation, achieving photorealistic image generation. However, it is crucial to assess whether these images truly reflect real-world diversity, capturing various people, objects, and scenes across different regions while remaining visually appealing and consistent with the text prompt.

Despite their utility, automated metrics like Fréchet inception distance (FID) and contrastive language-image pre-training score (CLIPScore) face challenges such as reliance on pre-trained extractors and constrained representations that fail to encompass human preferences. Human evaluations, though the gold standard, also grapple with regional and cultural subjectivity, and varying task designs impact results.

Evaluating Text-to-Image Models

The evaluation criteria for text-to-image models include geographic representation, visual appeal, and object consistency. Geographic representation, which encompasses realism and diversity, measures how well generated images reflect the real-world variability and nuances of different regions. Realism refers to how closely generated images resemble the real world, while diversity indicates the extent of variability captured.

Automatic metrics such as precision and coverage quantify these aspects. Precision measures the proportion of generated images that fall within the manifold of real photos, while coverage assesses the breadth of variation captured by the generated images. Similarity, which is foundational to both realism and diversity, evaluates whether two images closely resemble each other; it relies on feature extractors such as Inceptionv3, CLIP ViT-B/32, and DINOv2 ViT-L/14 to approximate human perceptions.
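The article does not say which estimator of precision and coverage the study uses. As a rough illustration, the sketch below assumes one common k-nearest-neighbor formulation: each real image gets a ball whose radius is the distance to its k-th nearest real neighbor, precision is the share of generated images that land in at least one ball, and coverage is the share of real images whose ball contains at least one generated image. The function names and the toy 2-D "features" are hypothetical; in practice the points would be embeddings from a feature extractor.

```python
import math


def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor among the others."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def precision_and_coverage(real, generated, k=1):
    """Assumed k-NN estimator; not necessarily the one used in the study."""
    radii = knn_radii(real, k)
    # Precision: generated points that fall inside at least one real point's ball.
    inside = sum(
        any(math.dist(g, r) <= rad for r, rad in zip(real, radii))
        for g in generated
    )
    # Coverage: real points whose ball contains at least one generated point.
    covered = sum(
        any(math.dist(g, r) <= rad for g in generated)
        for r, rad in zip(real, radii)
    )
    return inside / len(generated), covered / len(real)


# Toy features: two clusters of real images, generated images near only one.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
generated = [(0.2, 0.1), (0.5, 0.5), (10.0, 10.0)]
precision, coverage = precision_and_coverage(real, generated, k=1)
print(f"precision={precision:.2f}, coverage={coverage:.2f}")
```

In this toy setup, two of the three generated points sit on the real manifold, but the second real cluster is never reached, so precision and coverage tell different stories about realism and diversity.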

Visual appeal relates to how attractive or interesting an image is; some works measure it with precision metrics that compare generated images to a manifold of real images. Object consistency checks whether an image includes all components of the prompt, ensuring visual concreteness and meaningful representation. The study measures it with CLIPScore, which evaluates how consistent the objects in a generated image are with the given prompt. Because the text embeddings used correspond to the intended objects in the images, the resulting measure is termed "Object-CLIPScore."
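As a minimal sketch of the idea behind an object-focused CLIPScore, the code below scores an image embedding against a text embedding of the intended object using a rescaled, clipped cosine similarity. The scaling factor w = 2.5 follows the original CLIPScore definition rather than anything stated in the article, and the three-dimensional vectors are made-up stand-ins for real CLIP embeddings, which would come from a CLIP image and text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, text_emb, w=2.5):
    """w * max(cos, 0); w = 2.5 is assumed from the original CLIPScore paper."""
    return w * max(cosine(image_emb, text_emb), 0.0)


# Hypothetical embeddings, not outputs of a real CLIP model.
image_emb = [0.6, 0.8, 0.0]
object_text_emb = [1.0, 0.0, 0.0]      # stands in for the object word, e.g. "car"
unrelated_text_emb = [0.0, 0.0, 1.0]   # stands in for an unrelated word

score_object = clip_score(image_emb, object_text_emb)
score_unrelated = clip_score(image_emb, unrelated_text_emb)
print(score_object, score_unrelated)
```

A higher score for the object text than for unrelated text suggests that the object named in the prompt is reflected in the image, though how well this tracks human judgments is exactly what the study examines.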

The researchers collected annotations to understand regional variations in human perceptions of the evaluation criteria for text-to-image models. Real images from the GeoDE dataset and generated images from models such as "DM w/ CLIP" and a latent diffusion model ("LDM 2.1") were used, focusing on objects like bags, cars, cooking pots, dogs, plates of food, and storefronts across Africa, Europe, and Southeast Asia.

Task 1 involved image comparisons and object consistency checks using triplets of images, asking annotators to assess object presence, similarity, and visual appeal. Task 2 focused on geographic representation and object consistency by asking annotators to identify the region to which the object and background in a single image could belong.
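One way similarity judgments on image triplets can be compared with a feature extractor is sketched below. It assumes, hypothetically, that each triplet consists of a reference image and two candidates, and that the annotator picks the candidate that looks more similar to the reference; the article does not describe the exact task layout. The extractor's "choice" is the candidate with the higher cosine similarity to the reference, and the agreement rate is the fraction of triplets where that choice matches the human one. All embeddings and labels are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def extractor_choice(ref, cand_a, cand_b):
    """Pick the candidate whose embedding is closer to the reference."""
    return "A" if cosine(ref, cand_a) >= cosine(ref, cand_b) else "B"


def agreement_rate(triplets):
    """Share of triplets where the extractor matches the human choice."""
    matches = sum(
        extractor_choice(t["ref"], t["a"], t["b"]) == t["human"] for t in triplets
    )
    return matches / len(triplets)


# Hypothetical triplets: embeddings plus the annotator's pick ("A" or "B").
triplets = [
    {"ref": [1.0, 0.0], "a": [0.9, 0.1], "b": [0.0, 1.0], "human": "A"},
    {"ref": [0.0, 1.0], "a": [1.0, 0.0], "b": [0.1, 0.9], "human": "B"},
    {"ref": [1.0, 1.0], "a": [1.0, 0.0], "b": [1.0, 0.9], "human": "A"},
]
rate = agreement_rate(triplets)
print(f"extractor-human agreement: {rate:.2f}")
```

Running the same comparison with embeddings from different extractors would give one agreement rate per extractor, which is the kind of figure that lets extractors be ranked against human perception.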

Annotators from Africa, Europe, and Southeast Asia were engaged, ensuring diverse perspectives. Each task was annotated by individuals from each region, and the annotations passed through a rigorous quality-check filtering process. A voluntary survey was also conducted to gather insights into how annotators interpreted questions about geographic representation, including how they judged an object's similarity to its regional counterparts and how they used background context.

Responses were categorized using an inductive coding approach, resulting in descriptive codes that were validated for consistency. This comprehensive annotation and survey process allowed for analyzing human perceptions of text-to-image model outputs across different geographic regions.

Critical Evaluation Insights

In the results section, analyses corresponding to human interpretations leverage only annotation and survey data. The study discusses the interaction between human and automatic metrics in evaluating text-to-image models across geographic representation, visual appeal, and object consistency. Geographic representation involves assessing how well images capture real-world nuances across different regions.

Annotator perceptions highlight discrepancies between in- and out-of-region perspectives, particularly in identifying stereotypical features in generated images. These insights underscore the need for balanced geographical representation in model evaluation, ensuring robustness against regional biases.
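A small sketch can show why in-region and out-of-region views are worth separating. It assumes a hypothetical record format in which each annotation pairs the annotator's region with a yes/no judgment on whether an image represents a target region; the data are invented and do not come from the study. The function reports the share of "yes" votes for in-region annotators, out-of-region annotators, and everyone pooled together.

```python
from collections import defaultdict


def vote_shares(annotations, image_region):
    """Share of 'yes' votes for in-region, out-of-region, and pooled annotators."""
    counts = defaultdict(lambda: [0, 0])  # group -> [yes votes, total votes]
    for annotator_region, says_representative in annotations:
        group = "in_region" if annotator_region == image_region else "out_of_region"
        for key in (group, "pooled"):
            counts[key][0] += int(says_representative)
            counts[key][1] += 1
    return {key: yes / total for key, (yes, total) in counts.items()}


# Hypothetical judgments on one image generated for the prompt region "Africa".
annotations = [
    ("Africa", False), ("Africa", False), ("Africa", True),
    ("Europe", True), ("Europe", True), ("Europe", True),
    ("Southeast Asia", True), ("Southeast Asia", True), ("Southeast Asia", False),
]
shares = vote_shares(annotations, image_region="Africa")
for group, share in sorted(shares.items()):
    print(f"{group}: {share:.2f}")
```

In this made-up case the pooled majority accepts the image as representative while most in-region annotators reject it, so a single pooled vote would hide the in-region disagreement.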

Visual appeal assessments reveal varying levels of agreement among annotators, influenced by individual preferences and regional perspectives. Object consistency evaluations demonstrate challenges in maintaining accuracy across different datasets, with models often prioritizing geographic fidelity over object-specific details. Recommendations emphasize the integration of diverse geographical viewpoints and refinement of metric methodologies to foster more equitable evaluations in text-to-image generation.

Conclusion

To sum up, the study explored regional variations in annotators' perceptions of text-to-image model evaluation criteria and assessed the effectiveness of automatic metrics. Recommendations stressed inclusive geographic annotations and highlighted CLIP and DINOv2's superiority over Inceptionv3 in image similarity evaluation. The team noted the challenges in interpreting visual appeal and object consistency, advocating for nuanced evaluation approaches beyond majority-vote aggregations to ensure broader geographic inclusion in future studies.

Journal reference:
Silpaja Chandrasekar

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Chandrasekar, Silpaja. (2024, June 25). Evaluating Text-to-Image Generative Models: A Global Perspective. AZoAi. Retrieved on January 07, 2025 from https://www.azoai.com/news/20240625/Evaluating-Text-to-Image-Generative-Models-A-Global-Perspective.aspx.

  • MLA

    Chandrasekar, Silpaja. "Evaluating Text-to-Image Generative Models: A Global Perspective". AZoAi. 07 January 2025. <https://www.azoai.com/news/20240625/Evaluating-Text-to-Image-Generative-Models-A-Global-Perspective.aspx>.

  • Chicago

    Chandrasekar, Silpaja. "Evaluating Text-to-Image Generative Models: A Global Perspective". AZoAi. https://www.azoai.com/news/20240625/Evaluating-Text-to-Image-Generative-Models-A-Global-Perspective.aspx. (accessed January 07, 2025).

  • Harvard

    Chandrasekar, Silpaja. 2024. Evaluating Text-to-Image Generative Models: A Global Perspective. AZoAi, viewed 07 January 2025, https://www.azoai.com/news/20240625/Evaluating-Text-to-Image-Generative-Models-A-Global-Perspective.aspx.
