Researchers tackle distorted human figures in AI-generated images with a new dataset and detection models, paving the way for more accurate and realistic text-to-image technology.
Research: Detecting Human Artifacts from Text-to-Image Models
In an article recently submitted to the arXiv preprint* server, researchers at Adobe Research focused on detecting and correcting human artifacts, such as distorted or missing body parts, in text-to-image generated outputs. They introduced the Human Artifact Dataset (HAD), comprising 37,554 images annotated with 84,852 labeled instances, and developed Human Artifact Detection Models (HADM) to identify and localize these artifacts. Feedback from the detectors improved generator accuracy and enabled artifact correction through inpainting, significantly reducing artifacts in advanced diffusion models and enhancing image coherence across generative domains.
Background
Text-to-image generation has advanced significantly, enabling applications in image editing, representation learning, and creative content production. State-of-the-art diffusion models, trained on large proprietary datasets, have improved overall image quality. However, these models often fail to accurately generate human figures, resulting in artifacts such as distorted, missing, or extra body parts. These issues compromise human structural coherence, undermining the fidelity of generated images.
While previous work has addressed general artifacts or focused on specific issues such as hand generation, it lacked a framework for detecting and localizing the full range of human artifacts. This study addressed these gaps by introducing HAD, the first large-scale dataset specifically curated for detecting and localizing diverse human artifacts, and by developing HADM, which guides improvements in generative models through fine-tuning and automated inpainting workflows.
Example annotations from different generators in the Human Artifact Dataset.
Human Feedback for Human Artifacts
This study introduced HAD, an extensive dataset featuring 37,554 images and 84,852 labeled instances across local and global artifact categories. HAD was built from 4,426 prompts generated with GPT-4 and images created by models such as Stable Diffusion XL (SDXL), DALLE-2, DALLE-3, and Midjourney. Artifacts were grouped into 12 classes, divided into local artifacts (such as poorly rendered body parts) and global artifacts (such as missing or extra limbs), and annotated with bounding boxes for precise localization.
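To make the dataset's structure concrete, the sketch below shows how a single HAD image and its bounding-box annotations could be represented in a COCO-style record. The field names and category labels are illustrative assumptions, not the released schema.

```python
# Hypothetical COCO-style record for one HAD image. Field names,
# category labels, and values are illustrative assumptions, not the
# released schema.
had_record = {
    "image": {
        "id": 1024,
        "file_name": "sdxl_000123.png",  # assumed naming convention
        "generator": "SDXL",
        "prompt": "a chef juggling three pans in a busy kitchen",
    },
    "annotations": [
        {   # local artifact: a poorly rendered body part
            "bbox": [412, 305, 96, 88],  # [x, y, width, height] in pixels
            "category": "local/distorted_hand",
        },
        {   # global artifact: a structural error such as an extra limb
            "bbox": [220, 150, 310, 400],
            "category": "global/extra_arm",
        },
    ],
}
```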
Examples of predictions from HADM counted as mistakes during evaluation on SDXL (a), DALLE-3 (b), DALLE-2 (c), and Midjourney (d). FP: false positive; FN: false negative. Red bounding boxes mark the detected artifact with the top prediction score; blue bounding boxes mark other detections with the same class label.
The researchers developed two detection models, HADM-L for local artifacts and HADM-G for global artifacts, using a ViTDet-based architecture with Cascade R-CNN detection heads. These models were trained on HAD and augmented with real human datasets to enhance robustness. In evaluation, HADM outperformed existing methods in detecting diverse human artifacts, with average precision at an intersection-over-union (IoU) threshold of 0.5 (AP50) demonstrating its superiority over state-of-the-art baselines.
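For a feel of how such a detector is driven in practice, the following is a minimal inference sketch assuming a detectron2-style ViTDet + Cascade R-CNN setup. The config and checkpoint paths are hypothetical placeholders, and the official release may expose a different interface.

```python
import cv2
import torch
from detectron2.config import LazyConfig, instantiate
from detectron2.checkpoint import DetectionCheckpointer

# Hypothetical config/weights paths; the actual release may differ.
cfg = LazyConfig.load("configs/hadm_l_vitdet_cascade.py")
model = instantiate(cfg.model)
DetectionCheckpointer(model).load("weights/hadm_l.pth")
model.eval()

# detectron2 models take a list of dicts with a CHW image tensor.
img = cv2.imread("generated.png")  # BGR, HWC
inputs = [{
    "image": torch.as_tensor(img.transpose(2, 0, 1).copy()),
    "height": img.shape[0],
    "width": img.shape[1],
}]
with torch.no_grad():
    instances = model(inputs)[0]["instances"]

# Keep confident artifact detections (the threshold is an arbitrary choice).
keep = instances.scores > 0.5
print(instances.pred_boxes[keep], instances.pred_classes[keep])
```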
Additionally, the authors used HADM to guide the fine-tuning of diffusion models, integrating artifact predictions into the training pipeline. By training low-rank adaptation (LoRA) weights while freezing the variational autoencoder (VAE), the fine-tuned models showed a notable reduction in artifact generation. Iterative inpainting workflows guided by HADM further improved image quality, correcting issues with high precision.
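The article does not name the exact training stack, but the freezing-plus-LoRA recipe it describes maps naturally onto the Hugging Face diffusers/PEFT APIs. The sketch below is a minimal illustration under that assumption, with the rank and learning rate chosen arbitrarily.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from peft import LoraConfig

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float32
)

# Freeze the VAE, both text encoders, and the base UNet weights.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
pipe.text_encoder_2.requires_grad_(False)
pipe.unet.requires_grad_(False)

# Attach trainable low-rank adapters to the UNet attention projections.
pipe.unet.add_adapter(LoraConfig(
    r=16, lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
))

# Only the LoRA parameters receive gradients during fine-tuning.
trainable = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```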
(a), (b): Top predictions of HADM on PixArt-Σ. (c), (d): Top predictions of HADM on FLUX.1-dev.
Experiments and Discussion
The authors evaluated HADM on identifying artifacts in images of humans across diverse domains, including in-domain data (the HAD validation set) and out-of-domain sources such as SD1.4, PixArt-Σ, FLUX.1-dev, and the real-image 300W dataset. Performance was measured with the AP50 metric, revealing consistent detection accuracy across scenarios.
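As a quick illustration (not the paper's evaluation code), AP50 can be computed with torchmetrics by restricting evaluation to a single IoU threshold of 0.5; the boxes, scores, and class id below are toy values.

```python
import torch
from torchmetrics.detection import MeanAveragePrecision

# Restricting the IoU thresholds to 0.5 yields the AP50 metric.
metric = MeanAveragePrecision(iou_thresholds=[0.5])

# Toy prediction/ground-truth pair; boxes are [x1, y1, x2, y2].
preds = [{
    "boxes": torch.tensor([[412.0, 305.0, 508.0, 393.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([3]),   # e.g. a "distorted hand" class id
}]
targets = [{
    "boxes": torch.tensor([[400.0, 300.0, 500.0, 390.0]]),
    "labels": torch.tensor([3]),
}]

metric.update(preds, targets)
print(metric.compute()["map"])  # mean AP across classes at IoU 0.5
```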
Advanced generators such as DALLE-3 and Midjourney exhibited fewer and subtler artifacts, making detection more challenging, whereas on models such as SDXL and DALLE-2, which frequently produce human artifacts, HADM achieved higher detection accuracy. Challenges included annotation ambiguities and false positives on out-of-domain data. HADM's robustness was validated by its ability to detect nuanced issues, such as subtle abnormalities in images from unseen generators or in real datasets.
More examples of annotations from the Human Artifact Dataset for global human artifacts. First row: SDXL. Second row: DALLE-2. Third row: DALLE-3. Last row: Midjourney.
Reducing and Correcting Human Artifacts
The researchers reduced artifact generation by fine-tuning diffusion models for 80,000 iterations, using diversified prompts generated with advanced language models. Fine-tuning involved freezing the VAE and text encoder, training LoRA weights, and incorporating HADM's top-confidence predictions as feedback. A user preference study with 15 participants supported the fine-tuned model, whose artifact-reduced images were favored in 55% of cases.
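The article does not spell out how HADM's top-confidence predictions feed back into fine-tuning. One plausible reading, sketched below purely as an illustration under that assumption, is to score candidate training images with the detectors and keep only those below an artifact-confidence threshold; `detect` is a hypothetical wrapper around HADM-L/HADM-G.

```python
def filter_for_finetuning(samples, detect, max_conf=0.3):
    """Keep generated (image, prompt) pairs whose top HADM artifact
    confidence stays below `max_conf`. `detect(image)` is assumed to
    return a list of (bbox, confidence) pairs; both the wrapper and
    the threshold are hypothetical illustrations."""
    kept = []
    for image, prompt in samples:
        top = max((conf for _, conf in detect(image)), default=0.0)
        if top < max_conf:
            kept.append((image, prompt))
    return kept
```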
Additionally, HADM-L was employed for artifact correction through inpainting workflows, running inpainting with multiple random seeds and keeping the candidate that HADM scored as least artifact-prone. This approach effectively mitigated local and global artifacts across diverse scenarios, showcasing the potential to refine advanced diffusion models.
Workflow illustrating the reduction of human artifacts through iterative inpainting. Starting with the initial image (left), artifacts are identified using HADM-L. They are then corrected iteratively by inpainting the top predictions within their bounding boxes. For each bounding box, multiple inpainting operations run in parallel with different random seeds; HADM-L is reapplied to the results, and the candidate with the lowest artifact confidence is selected (middle). This iterative process integrates HADM-L into the inpainting pipeline, automating artifact correction and producing a refined final image (right).
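A minimal sketch of such a loop is shown below, assuming a diffusers inpainting pipeline and a hypothetical `detect` wrapper around HADM-L that returns (box, confidence) pairs; the model id, seeds, and selection rule are illustrative choices, not the authors' exact pipeline.

```python
import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image, ImageDraw

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

def bbox_mask(size, box):
    """White rectangle over the artifact region, black elsewhere."""
    mask = Image.new("L", size, 0)
    ImageDraw.Draw(mask).rectangle(box, fill=255)
    return mask

def correct_artifacts(image, prompt, detect, seeds=(0, 1, 2, 3), rounds=3):
    """Iteratively inpaint the top-confidence artifact, trying several
    seeds and keeping the candidate HADM-L scores least artifact-like."""
    for _ in range(rounds):
        detections = detect(image)          # [(box, confidence), ...]
        if not detections:
            break
        box, _ = max(detections, key=lambda d: d[1])
        mask = bbox_mask(image.size, box)
        candidates = [
            pipe(prompt=prompt, image=image, mask_image=mask,
                 generator=torch.Generator("cuda").manual_seed(s)).images[0]
            for s in seeds
        ]
        # Select the candidate whose worst remaining artifact score is lowest.
        image = min(candidates,
                    key=lambda c: max((s for _, s in detect(c)), default=0.0))
    return image
```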
Conclusion
In summary, the researchers presented a comprehensive approach to detecting and correcting human artifacts in text-to-image outputs. They introduced HAD, the first dataset dedicated to localizing and classifying human artifacts, and trained HADM to identify such flaws with high precision. By integrating HADM predictions, diffusion models were fine-tuned to reduce artifact generation and improve image coherence, and HADM-guided workflows demonstrated strong artifact correction through iterative inpainting. These findings establish a robust framework for addressing structural errors in text-to-image models, significantly enhancing generative image quality and coherence.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Wang, K., Zhang, L., & Zhang, J. (2024). Detecting Human Artifacts from Text-to-Image Models. arXiv. https://arxiv.org/abs/2411.13842