Tackling Text-to-Image AI Flaws

Researchers tackle distorted human figures in AI-generated images with innovative models and datasets, paving the way for more accurate and realistic text-to-image technology.

Research: Detecting Human Artifacts from Text-to-Image Models


*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv preprint* server, researchers at Adobe Research focused on detecting and correcting human artifacts, such as distorted or missing body parts, in text-to-image generated outputs. They introduced the Human Artifact Dataset (HAD), comprising 37,554 images annotated with 84,852 labeled instances, and developed Human Artifact Detection Models (HADM) to identify and localize these artifacts. HADM's predictions provided feedback for fine-tuning generators and enabled artifact correction through inpainting, significantly reducing artifacts in advanced diffusion models and improving image coherence across generative domains.

Background

Text-to-image generation has advanced significantly, enabling applications in image editing, representation learning, and creative content production. State-of-the-art diffusion models, trained on large proprietary datasets, have improved overall image quality. However, these models often fail to accurately generate human figures, resulting in artifacts such as distorted, missing, or extra body parts. These issues compromise human structural coherence, undermining the fidelity of generated images.

While previous works have addressed general artifacts or focused on specific issues like hand generation, they lack a framework to detect and localize the full range of human artifacts. This study addressed these gaps by introducing HAD, the first large-scale dataset specifically curated to detect and localize diverse human artifacts, and by developing HADM, which guides improvements in generative models through fine-tuning and automated inpainting workflows.

Example annotations from different generators in the Human Artifact Dataset.

Human Feedback for Human Artifacts

The study introduced HAD, an extensive dataset of 37,554 images with 84,852 labeled instances spanning local and global artifact categories. HAD was built from 4,426 prompts generated with GPT-4 and images created by Stable Diffusion XL (SDXL), DALLE-2, DALLE-3, and Midjourney. Artifacts were grouped into 12 classes, divided into local artifacts (such as poorly rendered body parts) and global artifacts (such as missing or extra limbs), and each instance was annotated with a bounding box for precise localization.
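To make the dataset structure concrete, the sketch below shows what a single HAD annotation record might look like under a COCO-style layout. The schema, field names, and class names are illustrative assumptions; only the local/global split, the 12-class design, and the bounding-box annotations come from the article.

```python
# Hypothetical sketch of one HAD annotation record, assuming a COCO-style
# layout. Field and class names are illustrative; only the local/global
# split and bounding-box format follow the article's description.
had_categories = {
    1: "distorted_face",   # local artifact (name is an assumption)
    2: "distorted_hand",   # local artifact (name is an assumption)
    7: "missing_limb",     # global artifact (name is an assumption)
    8: "extra_limb",       # global artifact (name is an assumption)
}

annotation = {
    "image_id": 1024,
    "generator": "SDXL",                  # which model produced the image
    "prompt": "a chef plating a dish",    # example GPT-4-generated prompt
    "category_id": 2,                     # local artifact: distorted hand
    "bbox": [412.0, 318.5, 96.0, 88.0],   # [x, y, width, height] in pixels
}
```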

Examples of predictions from HADM considered mistakes during evaluation on SDXL (a), DALLE-3 (b), DALLE-2 (c), and Midjourney (d). FP: false positive; FN: false negative. Red bounding boxes mark the detected artifact with the top prediction score; blue bounding boxes mark other detections with the same class label.


The researchers developed two detection models, HADM-Local (HADM-L) for local artifacts and HADM-Global (HADM-G) for global artifacts, using a ViTDet-based architecture with Cascade R-CNN detection heads. The models were trained on HAD, augmented with real human datasets to improve robustness. Evaluation showed that HADM outperformed existing methods in detecting diverse human artifacts, with average precision at an IoU threshold of 0.5 (AP50) confirming its advantage over state-of-the-art baselines.
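Since the article names the architecture (a ViTDet backbone with Cascade R-CNN heads), a minimal detectron2 inference sketch under that architecture is shown below. The config path points to detectron2's stock ViTDet Cascade R-CNN recipe, and the checkpoint name is hypothetical; the actual HADM configs and weights may differ.

```python
import torch
from detectron2.config import LazyConfig, instantiate
from detectron2.checkpoint import DetectionCheckpointer

# Stock ViTDet + Cascade R-CNN recipe from the detectron2 repository;
# HADM's actual config (class count, heads) would differ.
cfg = LazyConfig.load(
    "projects/ViTDet/configs/COCO/cascade_mask_rcnn_vitdet_b_100ep.py"
)
model = instantiate(cfg.model)
DetectionCheckpointer(model).load("hadm_l.pth")  # hypothetical checkpoint
model.eval()

# detectron2 expects CHW float tensors in the 0-255 range.
image = torch.rand(3, 1024, 1024) * 255
with torch.no_grad():
    outputs = model([{"image": image, "height": 1024, "width": 1024}])

# Keep only confident artifact detections.
instances = outputs[0]["instances"]
instances = instances[instances.scores > 0.5]
print(instances.pred_boxes, instances.pred_classes)
```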

Additionally, the authors used HADM to guide the fine-tuning of diffusion models, integrating artifact predictions into the training pipeline. By training low-rank adaptation (LoRA) weights while freezing the variational autoencoder (VAE), the fine-tuned models showed a notable reduction in generated artifacts. Iterative inpainting workflows guided by HADM further improved image quality, correcting artifacts with high precision.
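A minimal sketch of that fine-tuning setup, using the diffusers and peft libraries, might look like the following. The rank, learning rate, and target modules are illustrative choices, not the paper's reported settings.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from peft import LoraConfig

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# Freeze everything except the LoRA adapters added below.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
pipe.text_encoder_2.requires_grad_(False)
pipe.unet.requires_grad_(False)

# Attach trainable LoRA weights to the UNet's attention projections.
# Rank and target modules are assumptions, not the paper's settings.
pipe.unet.add_adapter(LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
))

trainable = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```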

(a), (b): Top predictions of HADM on PixArt-Σ. (c), (d): Top predictions of HADM on FLUX.1-dev.

Experiments and Discussion

The authors evaluated HADM on identifying artifacts in images of humans across diverse domains, including the in-domain HAD validation set and out-of-domain sources such as Stable Diffusion 1.4 (SD1.4), PixArt-Σ, FLUX.1-dev, and the real-image 300W dataset. Performance was measured with the AP50 metric, revealing consistent detection accuracy across these scenarios.
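For readers unfamiliar with the metric, AP50 is the average precision computed at an intersection-over-union (IoU) threshold of 0.5. A small sketch using torchmetrics, with dummy boxes rather than HAD data, is shown below.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# With a single IoU threshold of 0.5, the returned "map" is exactly AP50.
metric = MeanAveragePrecision(box_format="xyxy", iou_thresholds=[0.5])

preds = [{
    "boxes": torch.tensor([[100.0, 100.0, 200.0, 220.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([2]),  # e.g., a hand-artifact class (assumed id)
}]
targets = [{
    "boxes": torch.tensor([[105.0, 95.0, 198.0, 225.0]]),
    "labels": torch.tensor([2]),
}]

metric.update(preds, targets)
print(metric.compute()["map"])  # AP50 for this toy example
```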

Advanced generators such as DALLE-3 and Midjourney produced fewer and subtler artifacts, making detection more challenging, whereas detection accuracy was higher on models such as SDXL and DALLE-2, which frequently produced human artifacts. Remaining challenges included annotation ambiguities and false positives on out-of-domain datasets. HADM's robustness was validated by its ability to detect nuanced issues, such as subtle abnormalities in images from unseen generators or real datasets.

More examples of annotations from the Human Artifact Dataset for global human artifacts. First row: SDXL. Second row: DALLE-2. Third row: DALLE-3. Last row: Midjourney.


Refining Human Artifact Detection

The researchers reduced human artifacts by fine-tuning diffusion models for 80,000 iterations on diversified prompts generated with advanced language models. Fine-tuning involved freezing the variational autoencoder (VAE) and text encoder, training LoRA weights, and using HADM's top-confidence artifact predictions as a feedback signal. A user preference study with 15 participants confirmed the improvement, with participants favoring the fine-tuned model's artifact-reduced images in 55% of cases.

Additionally, HADM-L was employed for artifact correction through an iterative inpainting workflow that samples multiple random seeds and keeps the least artifact-prone result. This approach effectively mitigated local and global artifacts across diverse scenarios, showcasing its potential to refine even advanced diffusion models.

Workflow illustrating the reduction of human artifacts through iterative inpainting. Starting with the initial image (left), we identify artifacts using HADM-L. These artifacts are iteratively corrected by applying inpainting to the top predictions within the corresponding bounding boxes. For each bounding box, multiple inpainting operations are performed in parallel using different random seeds. From these results, HADM-L is reapplied, and the sample with the lower confidence score is selected for each result (middle). This iterative process integrates HADM-L into the inpainting pipeline, automating artifact correction and producing a refined final image (right).

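The loop in the caption can be sketched as follows. Here `detect_artifacts` is a hypothetical stand-in for HADM-L that returns lists of boxes and confidence scores, and `boxes_to_mask` is a hypothetical helper that rasterizes boxes into an inpainting mask; neither is part of a released API.

```python
import torch
from diffusers import AutoPipelineForInpainting

pipe = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

def refine(image, prompt, detect_artifacts, boxes_to_mask,
           seeds=(0, 1, 2), max_iters=3):
    """Iteratively inpaint regions flagged by the artifact detector."""
    for _ in range(max_iters):
        boxes, scores = detect_artifacts(image)
        if not boxes:
            break  # no artifacts left to correct
        mask = boxes_to_mask(boxes, image.size)
        # Inpaint the flagged regions with several random seeds...
        candidates = [
            pipe(prompt=prompt, image=image, mask_image=mask,
                 generator=torch.Generator().manual_seed(s)).images[0]
            for s in seeds
        ]
        # ...then keep the candidate that the detector scores as least
        # artifact-like (lowest maximum artifact confidence).
        image = min(candidates,
                    key=lambda c: max(detect_artifacts(c)[1], default=0.0))
    return image
```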

Conclusion

In conclusion, the researchers presented a comprehensive approach to detecting and correcting human artifacts in text-to-image generated outputs. They introduced HAD, the first large-scale dataset dedicated to localizing and classifying human artifacts, and trained HADM to identify such flaws with high precision. By integrating HADM's predictions, diffusion models were fine-tuned to produce fewer artifacts and more coherent images, and HADM-guided workflows demonstrated strong artifact correction through iterative inpainting. Together, these contributions establish a robust framework for improving the structural coherence and overall quality of generated humans.


Journal reference:
  • Preliminary scientific report. Wang, K., Zhang, L., & Zhang, J. (2024). Detecting Human Artifacts from Text-to-Image Models. arXiv. https://arxiv.org/abs/2411.13842

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in data analytics, machine learning, and Python, and has worked on group projects involving computer vision, image classification, and app development.

