EvolveDirector Revolutionizes AI Image Generation with Less Data and Superior Results

By drastically reducing the data needed for training, EvolveDirector makes AI-generated content more accessible and efficient while outperforming industry-leading models at generating high-quality, diverse images.

Research: EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv preprint* server, researchers at the Show Lab at the National University of Singapore and Alibaba Group explored the possibility of training a high-quality text-to-image (T2I) generation model using publicly available resources. They proposed a framework called EvolveDirector that interacted with advanced models through their application programming interfaces (APIs) to collect data and train a base model.

To overcome the need for massive training datasets, the authors introduced a method that uses pre-trained vision-language models (VLMs) to guide the training process and refine the dataset. This approach reduced the required data volume from 11 million samples to just 100,000 while maintaining high performance. The final model, Edgen, outperformed the advanced models it was trained on, such as PixArt-α.

Background

Artificial intelligence (AI)-generated content has advanced significantly in recent years, with models such as DALL·E 3, Midjourney, and Stable Diffusion producing realistic and creative images. These models typically rely on high-quality proprietary datasets and keep their parameters private, which makes them commercially successful but limits reproducibility.

Prior efforts to train open-source models on public datasets have encountered challenges such as high computational costs and inefficient data usage. Even with large-scale datasets like JourneyDB, these models remain less accessible because of the expense of collecting massive datasets.

This paper addressed these gaps by introducing EvolveDirector, a framework that used VLMs to dynamically curate training data for a base model. By guiding the selection and refinement of training samples, EvolveDirector cut the required data volume by orders of magnitude. The framework refined training samples through operations such as discrimination, expansion, and mutation, making the training process more efficient.

Experimental results showed that the proposed model, Edgen, achieved performance comparable to advanced models such as Stable Diffusion 3 and DeepFloyd IF with far fewer data samples. Furthermore, Edgen surpassed even the most advanced models by leveraging VLMs to select high-quality training samples. This approach democratized T2I generation by making it more accessible and resource-efficient.

Dynamic Training and Data Curation

The EvolveDirector framework was designed to efficiently train T2I models using limited, high-value data. It consisted of three main components: interaction with advanced T2I models, maintenance of a dynamic training set, and training of a base model. First, EvolveDirector interacted with APIs from advanced T2I models to generate images from text prompts.

These images were evaluated by a VLM, which selected the best match and updated the dynamic training set. The VLM continuously evaluated the base model's performance, retaining high-value samples where the base model underperformed and discarding samples where it performed comparably to advanced models. This process ensured efficient learning by focusing on data that the base model needed to improve.
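To make this loop concrete, below is a minimal Python sketch of one curation cycle, combining the selection and pruning steps described above with the prompt-mutation step discussed next. All interfaces (api.generate, vlm.pick_best, vlm.judge_comparable, vlm.mutate_prompt) are hypothetical placeholders; the paper's implementation is not public, so this illustrates the idea rather than the authors' code.

```python
# Minimal sketch of EvolveDirector-style dynamic data curation.
# All method names below are hypothetical stand-ins for the paper's
# (non-public) API and VLM interfaces.

def curate_step(prompt, advanced_apis, base_model, vlm, training_set):
    """Run one generation/selection/pruning cycle for a single prompt."""
    # 1. Query each advanced T2I model's API with the same text prompt.
    candidates = [api.generate(prompt) for api in advanced_apis]

    # 2. Discrimination: the VLM picks the image that best matches the
    #    prompt, and that pair enters the dynamic training set.
    best = vlm.pick_best(prompt, candidates)
    training_set.add(prompt, best)

    # 3. Deletion: once the base model's own output is judged comparable
    #    to the advanced model's, the sample carries little further
    #    training value and is discarded from the dynamic set.
    ours = base_model.generate(prompt)
    if vlm.judge_comparable(prompt, ours, best):
        training_set.discard(prompt)

    # 4. Expansion/mutation: the VLM proposes prompt variations so the
    #    dataset keeps diversifying instead of accumulating redundancy.
    return vlm.mutate_prompt(prompt)
```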

The VLM also generated variations of text prompts to diversify the dataset and avoid data redundancy. To further stabilize training, layer normalization techniques were applied to the multi-head cross-attention blocks within the diffusion transformer (DiT) base model. Finally, a multi-scale training strategy was employed, allowing the base model to handle images of various resolutions and aspect ratios.
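The article notes only that layer normalization was applied to the multi-head cross-attention blocks; one common realization of this idea is to normalize the query and key projections before the attention product (often called QK-norm). The PyTorch sketch below assumes that placement, which may differ from the paper's exact design.

```python
# A minimal PyTorch sketch of a cross-attention block with layer
# normalization on queries and keys (QK-norm). This placement is an
# assumption; the paper's exact design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim * 2)
        self.to_out = nn.Linear(dim, dim)
        # LayerNorm over each head's channels keeps attention logits
        # in a stable range as training progresses.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        h = self.num_heads
        q = self.to_q(x).view(b, n, h, self.head_dim).transpose(1, 2)
        k, v = self.to_kv(context).chunk(2, dim=-1)
        k = k.view(b, -1, h, self.head_dim).transpose(1, 2)
        v = v.view(b, -1, h, self.head_dim).transpose(1, 2)
        # Normalize queries and keys before computing attention.
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```

Bounding the scale of the query and key activations in this way limits how large the attention logits can grow, which is one reason this variant is often used to keep diffusion-model training stable.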

Experimental Setup and Results

The researchers outlined the experimental setup and results for training a base model using the EvolveDirector framework and comparing it with several advanced models. The base model was trained on 16 A100 graphics processing units (GPUs) for 240 GPU days, utilizing a batch size of 128 for 512-pixel (px) images and 32 for 1024px images. The training involved both open-source and closed-source models, with EvolveDirector interacting via APIs.

The VLMs were evaluated by human raters across multiple criteria, such as discrimination and diversity. LLaVA-Next and GPT-4V (generative pre-trained transformer 4 with vision) achieved the highest alignment with human preferences, and LLaVA-Next was ultimately selected for EvolveDirector because of its strong performance and free access.

Ablation studies were conducted to assess different configurations of the EvolveDirector framework. The results indicated that models trained with EvolveDirector on a dynamic dataset of 100,000 (100K) samples matched the performance of models trained on 10 million (10M) samples. Moreover, models trained with VLM-guided expansion and mutation functions outperformed the alternatives, underscoring the framework's efficiency in reducing training data while maintaining high performance.

Qualitative and quantitative comparisons demonstrated that Edgen surpassed advanced models such as Stable Diffusion 3, DeepFloyd IF, and PixArt-α in generating diverse, high-quality images, particularly in tasks involving complex text, human generation, and multi-object generation.

Conclusion

In conclusion, the researchers introduced EvolveDirector, a framework for training a high-quality T2I model, Edgen, using publicly available resources. By leveraging VLMs to refine training datasets through API interactions with advanced models, EvolveDirector achieved a remarkable reduction in required training data, lowering it from 11 million to 100,000 samples while enhancing performance.

Experimental results showed that Edgen outperformed established models like PixArt-α, Stable Diffusion 3, and DeepFloyd IF, demonstrating its effectiveness in generating diverse, high-quality images with fewer training samples. This approach not only democratized T2I generation but also made it more resource-efficient, paving the way for broader accessibility and further advancements in AI-generated content.

Journal reference:
  • Preliminary scientific report. Zhao, R., Yuan, H., Wei, Y., Zhang, S., Gu, Y., Ran, L., Wang, X., Wu, Z., Zhang, J., Zhang, Y., & Shou, M. Z. (2024). EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models. DOI:10.48550/arXiv.2410.07133, https://arxiv.org/abs/2410.07133v1.

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.
