By drastically reducing the data needed for training, EvolveDirector makes advanced text-to-image generation more accessible and efficient, and the model it trains outperforms industry-leading systems in generating high-quality, diverse images.
Research: EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
In an article recently submitted to the arXiv preprint* server, researchers at the Show Lab of the National University of Singapore and Alibaba Group explored the possibility of training a high-quality text-to-image (T2I) generation model using publicly available resources. They proposed a framework called EvolveDirector that interacted with advanced models through their application programming interfaces (APIs) to collect data and train a base model.
To overcome the need for massive training datasets, the authors introduced a method that used pre-trained vision-language models (VLMs) to guide the training process and refine the dataset. This approach reduced the required data volume from 11 million samples to just 100,000 while maintaining high performance. The final model, Edgen, outperformed the advanced models it learned from, such as PixArt-α.
Background
Artificial intelligence (AI)-generated content has advanced significantly in recent years, with models like DALL·E 3, Midjourney, and Stable Diffusion producing realistic and creative images. These models typically rely on high-quality proprietary datasets and are commercially successful, but their parameters remain private, which limits reproducibility.
Prior efforts to train open-source models with public datasets have encountered challenges such as high computational costs and inefficient data usage. Despite using large-scale benchmarks like JourneyDB, these models remain less accessible due to the expenses involved in collecting massive datasets.
This paper addressed these gaps by introducing EvolveDirector, a framework that utilized VLMs to dynamically curate training data for a base model. EvolveDirector significantly reduced the required data by guiding the selection and refinement of training samples, cutting the dataset size by orders of magnitude. The framework refined training samples through operations such as discrimination, expansion, and mutation, making the training process more efficient.
Experimental results showed that the proposed model, Edgen, achieved performance comparable to advanced models such as Stable Diffusion 3 and DeepFloyd IF with far fewer data samples. Furthermore, Edgen surpassed even the most advanced models by leveraging VLMs to select high-quality training samples. This approach democratized T2I generation by making it more accessible and resource-efficient.
Dynamic Training and Data Curation
The EvolveDirector framework was designed to efficiently train T2I models using limited, high-value data. It consisted of three main components: interaction with advanced T2I models, maintenance of a dynamic training set, and training of a base model. First, EvolveDirector interacted with APIs from advanced T2I models to generate images from text prompts.
These images were evaluated by a VLM, which selected the best match and updated the dynamic training set. The VLM continuously evaluated the base model's performance, retaining high-value samples where the base model underperformed and discarding samples where it performed comparably to advanced models. This process ensured efficient learning by focusing on data that the base model needed to improve.
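To make this curation loop concrete, the sketch below walks through one step of it in Python. Everything here is an illustrative stand-in: the class and function names (TrainingSet, vlm_pick_best, vlm_base_is_worse, vlm_mutate, and so on) are assumptions made for this article and do not reflect the authors' actual code or the APIs they used.

```python
"""Illustrative sketch of one EvolveDirector-style curation step.

All names here (TrainingSet, vlm_pick_best, etc.) are hypothetical stand-ins
for this article; they do not mirror the authors' code or the APIs they used.
"""
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TrainingSet:
    """Dynamic training set of (prompt, image) pairs kept for the base model."""
    samples: List[Tuple[str, object]] = field(default_factory=list)
    pending_prompts: List[str] = field(default_factory=list)

    def add(self, prompt: str, image: object) -> None:
        self.samples.append((prompt, image))

    def remove(self, prompt: str) -> None:
        self.samples = [(p, img) for p, img in self.samples if p != prompt]

    def queue(self, prompt: str) -> None:
        self.pending_prompts.append(prompt)


def curate_step(
    prompt: str,
    advanced_generators: List[Callable[[str], object]],       # API wrappers around advanced T2I models
    base_generator: Callable[[str], object],                   # the base model being trained
    vlm_pick_best: Callable[[str, List[object]], object],      # VLM picks the image best matching the prompt
    vlm_base_is_worse: Callable[[str, object, object], bool],  # VLM discrimination step
    vlm_mutate: Callable[[str], List[str]],                    # VLM generates prompt variations
    training_set: TrainingSet,
) -> None:
    # 1. Query each advanced model's API with the same text prompt.
    candidates = [generate(prompt) for generate in advanced_generators]

    # 2. The VLM selects the candidate image that best matches the prompt.
    best_image = vlm_pick_best(prompt, candidates)

    # 3. The base model produces its own attempt for comparison.
    base_image = base_generator(prompt)

    # 4. Discrimination: retain the sample only where the base model still
    #    underperforms the advanced models; otherwise discard it, since it
    #    adds little further training value.
    if vlm_base_is_worse(prompt, base_image, best_image):
        training_set.add(prompt, best_image)
    else:
        training_set.remove(prompt)

    # 5. Mutation/expansion: queue prompt variations to diversify the set
    #    and avoid redundant samples.
    for new_prompt in vlm_mutate(prompt):
        training_set.queue(new_prompt)
```

The key design point, as described in the paper, is the discrimination step: samples are kept only where the base model still lags behind the advanced models, which is what keeps the dynamic training set small and focused on what the base model has yet to learn.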
The VLM also generated variations of text prompts to diversify the dataset and avoid data redundancy. To further stabilize training, layer normalization techniques were applied to the multi-head cross-attention blocks within the diffusion transformer (DiT) base model. Finally, a multi-scale training strategy was employed, allowing the base model to handle images of various resolutions and aspect ratios.
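The stabilization detail can be pictured with a small PyTorch-style sketch of a multi-head cross-attention block that applies layer normalization to its inputs. This is a generic, hedged illustration: the hidden size, head count, and exact placement of the normalization layers inside Edgen's DiT blocks are assumptions, not the authors' reported architecture.

```python
import torch
import torch.nn as nn


class NormalizedCrossAttention(nn.Module):
    """Multi-head cross-attention with layer normalization on its inputs.

    A generic sketch: image tokens attend to text-encoder tokens, with
    LayerNorm applied to both streams before attention to stabilize training.
    The exact normalization placement in Edgen's DiT blocks may differ.
    """

    def __init__(self, dim: int = 1152, num_heads: int = 16):
        super().__init__()
        self.norm_img = nn.LayerNorm(dim)   # normalize image (query) tokens
        self.norm_txt = nn.LayerNorm(dim)   # normalize text (key/value) tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        q = self.norm_img(img_tokens)
        kv = self.norm_txt(txt_tokens)
        out, _ = self.attn(query=q, key=kv, value=kv)
        return img_tokens + out             # residual connection around attention


# Example shapes: a batch of 2 images (1024 latent tokens) attending to 120 text tokens.
block = NormalizedCrossAttention()
img = torch.randn(2, 1024, 1152)
txt = torch.randn(2, 120, 1152)
print(block(img, txt).shape)  # torch.Size([2, 1024, 1152])
```

Normalizing both the image (query) and text (key/value) streams and keeping a residual connection are common choices for keeping cross-attention activations well-scaled; the article states only that layer normalization was added to the cross-attention blocks, not where.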
Experimental Setup and Results
The researchers outlined the experimental setup and results for training a base model using the EvolveDirector framework and comparing it with several advanced models. The base model was trained on 16 A100 graphics processing units (GPUs) for 240 GPU days, utilizing a batch size of 128 for 512-pixel (px) images and 32 for 1024px images. The training involved both open-source and closed-source models, with EvolveDirector interacting via APIs.
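The resolution-dependent batch sizes reported above pair naturally with the multi-scale training strategy mentioned earlier. The sketch below shows one plausible way to organize such training as aspect-ratio bucketing; only the 512px and 1024px batch sizes come from the article, while the bucket shapes and the grouping logic are illustrative assumptions.

```python
import random

# Batch sizes reported in the article; the bucket shapes below are
# illustrative assumptions, not values taken from the paper.
BATCH_SIZE_BY_RESOLUTION = {512: 128, 1024: 32}

# Hypothetical aspect-ratio buckets: (height, width) pairs whose pixel count
# roughly matches the target resolution squared.
ASPECT_BUCKETS = {
    512: [(512, 512), (448, 576), (576, 448)],
    1024: [(1024, 1024), (896, 1152), (1152, 896)],
}


def pick_bucket(image_height: int, image_width: int, resolution: int):
    """Assign an image to the aspect-ratio bucket closest to its own ratio."""
    ratio = image_height / image_width
    return min(ASPECT_BUCKETS[resolution], key=lambda hw: abs(hw[0] / hw[1] - ratio))


def make_batches(images, resolution: int):
    """Group images by bucket so every batch shares one (height, width) shape."""
    batch_size = BATCH_SIZE_BY_RESOLUTION[resolution]
    buckets = {hw: [] for hw in ASPECT_BUCKETS[resolution]}
    for h, w in images:                      # images given as (height, width) metadata
        buckets[pick_bucket(h, w, resolution)].append((h, w))
    batches = []
    for hw, members in buckets.items():
        for i in range(0, len(members), batch_size):
            batches.append((hw, members[i:i + batch_size]))
    return batches


# Example: 300 images with random shapes batched at the 512px scale.
images = [(random.choice([512, 640, 768]), random.choice([512, 640, 768])) for _ in range(300)]
for target_shape, batch in make_batches(images, resolution=512)[:3]:
    print(target_shape, len(batch))
```

Grouping images so that each batch shares a single target shape is a standard way to train on mixed resolutions and aspect ratios without excessive padding or cropping.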
The candidate VLMs were evaluated by human raters across multiple criteria, such as discrimination and diversity. LLaVA-NeXT and generative pre-trained transformer 4 with vision (GPT-4V) achieved the highest alignment with human preferences, and LLaVA-NeXT was selected for EvolveDirector because it combined strong performance with free access.
Ablation studies were conducted to assess different configurations of the EvolveDirector framework. The results indicated that models trained with EvolveDirector on a dynamic dataset of 100,000 samples achieved performance comparable to models trained on 10 million samples. Moreover, models trained with the VLM-guided expansion and mutation functions outperformed others, underscoring the framework's efficiency in reducing training data while maintaining high performance.
Qualitative and quantitative comparisons demonstrated that Edgen surpassed advanced models such as Stable Diffusion 3, DeepFloyd IF, and PixArt-α in generating diverse, high-quality images, particularly in tasks involving complex text, human generation, and multi-object generation.
Conclusion
In conclusion, the researchers introduced EvolveDirector, a framework for training a high-quality T2I model, Edgen, using publicly available resources. By leveraging VLMs to refine training datasets through API interactions with advanced models, EvolveDirector achieved a remarkable reduction in required training data, lowering it from 11 million to 100,000 samples while enhancing performance.
Experimental results showed that Edgen outperformed established models like PixArt-α, Stable Diffusion 3, and DeepFloyd IF, demonstrating its effectiveness in generating diverse, high-quality images with fewer training samples. This approach not only democratized T2I generation but also made it more resource-efficient, paving the way for broader accessibility and further advancements in AI-generated content.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Zhao, R., Yuan, H., Wei, Y., Zhang, S., Gu, Y., Ran, L., Wang, X., Wu, Z., Zhang, J., Zhang, Y., & Shou, M. Z. (2024). EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models. DOI:10.48550/arXiv.2410.07133, https://arxiv.org/abs/2410.07133v1.