Introducing CAR: A novel framework that enhances image generation by incorporating multi-scale control into pre-trained AR models, delivering improved image quality, more precise control, and faster inference.
Controllable generation using CAR under various conditions. Results are 512 × 512.
In an article submitted to the arXiv preprint* server, researchers at Peking University, Southern University of Science and Technology, Tencent, and the University of Washington introduced controllable autoregressive modeling (CAR), a new framework for integrating conditional control into pre-trained visual autoregressive (AR) models.
CAR captures and progressively refines control representations, which are injected into each autoregressive step of the pre-trained model to guide the generation process, delivering improved controllability and image quality.
The approach also demonstrated strong generalization with fewer training resources, using less than 10% of the data required for pre-training while maintaining high performance. This study was the first to propose a control framework for pre-trained autoregressive visual models.
Related Work
Past work on diffusion models highlighted their success in generating high-fidelity images through iterative noise reduction, with models like the denoising diffusion probabilistic model (DDPM) setting new standards in image synthesis.
Inspired by language models, AR models provided a scalable alternative for image generation, offering greater efficiency but lacking advanced controllability.
Some approaches, such as controllable visual AR (ControlVAR), attempted to address this but struggled with flexibility and efficiency, requiring substantial fine-tuning. Research on controllable generation has mostly focused on diffusion models, leaving control mechanisms for AR models underexplored.
Controllable Image Generation Framework
In this work, the researchers proposed CAR to explore controllable image generation within AR models. The task involves generating an image that adheres to a given conditional control image, which amounts to modeling the conditional distribution of images given that control. They built on the "next-scale prediction" paradigm, where the model generates multi-scale token maps instead of predicting individual tokens, progressively refining the image structure across scales while incorporating control signals.
The CAR framework operates by factorizing the conditional distribution into multi-scale probabilities, allowing each token map to be generated based on previous token maps and the corresponding control information. This process leverages Bayesian inference to approximate the posterior distribution of the image token maps given the control signals, enabling precise alignment of control and image features across multiple scales.
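Concretely, writing $r_k$ for the token map at scale $k$, $K$ for the number of scales, and $c$ for the control signal (notation assumed here for illustration; the paper's exact symbols may differ), this multi-scale factorization can be sketched as:

```latex
p(r_1, r_2, \ldots, r_K \mid c) \;=\; \prod_{k=1}^{K} p\!\left(r_k \mid r_1, \ldots, r_{k-1},\, c\right)
```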
The generated image token map is fused with the control map at each scale to implement control, producing a combined representation. This fused representation is processed through generative pre-trained transformer 2 (GPT-2)-style transformer blocks and normalized using LayerNorm, ensuring that the model incorporates control information continuously and that the generated images conform to the specified visual conditions at each scale.
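As an illustration, here is a minimal PyTorch sketch of what such a per-scale fusion step could look like; the module name, the additive fusion, and the single pre-norm transformer block are assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ScaleFusionBlock(nn.Module):
    """Hypothetical per-scale fusion of image tokens with control features:
    fuse the control features into the token stream, run one GPT-2-style
    (pre-norm) transformer block, then LayerNorm the combined representation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.ln_out = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # tokens, control: (batch, seq_len, dim) for one scale's token map.
        x = tokens + control                                # fuse control signal
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # pre-norm self-attention
        x = x + self.mlp(self.ln2(x))                       # pre-norm feed-forward
        return self.ln_out(x)                               # normalized fused output
```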
The network is optimized by minimizing the Kullback-Leibler (KL) divergence between the model's conditional distribution and the true data distribution. This approach ensures that the generated images align with the control conditions, maintain quality, exhibit coherence across scales, and accurately reflect the injected control priors.
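Since the true data distribution is fixed, minimizing this KL divergence is equivalent to maximizing the conditional log-likelihood of the training token maps. With $q$ denoting the data distribution and $\theta$ the model parameters (notation assumed as above), the objective can be sketched as:

```latex
\min_{\theta}\; D_{\mathrm{KL}}\big(q(r_{1:K} \mid c)\,\big\|\,p_{\theta}(r_{1:K} \mid c)\big)
\;\Longleftrightarrow\;
\max_{\theta}\; \mathbb{E}_{q}\Big[\textstyle\sum_{k=1}^{K} \log p_{\theta}(r_k \mid r_{<k},\, c)\Big]
```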
CAR Boosts Controllability
A learnable convolutional encoder was used to extract semantic features from the control input and integrate them with the base model input. The control branch used GPT-2-style transformer blocks at half the depth of the pre-trained model, while control information was concatenated with the base model output and normalized using LayerNorm.
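A minimal sketch of what such a learnable convolutional control encoder could look like follows; the layer count, channel widths, and downsampling schedule are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ControlEncoder(nn.Module):
    """Hypothetical convolutional encoder for the control image.
    Downsamples the condition (e.g., an edge or depth map) and projects it
    to the transformer's embedding dimension so it can be combined with the
    base model's token features at each scale."""

    def __init__(self, in_channels: int = 3, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(256, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, control_image: torch.Tensor) -> torch.Tensor:
        # control_image: (batch, in_channels, H, W) -> (batch, seq_len, dim)
        feats = self.net(control_image)           # (batch, dim, H/8, W/8)
        return feats.flatten(2).transpose(1, 2)   # flatten spatial grid into tokens
```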
The researchers conducted experiments using the ImageNet dataset, pseudo-labeling five conditions for training: canny edge, depth map, normal map, holistically-nested edge detection (HED) map, and sketch. A random selection of 100 categories was used to train CAR, while the remaining 900 unseen categories were used to evaluate the model's generalizability and controllability.
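As an illustration of the pseudo-labeling step, one of these conditions (the canny edge map) can be derived from a training image with standard OpenCV calls; the file path and thresholds below are hypothetical choices, not the paper's settings.

```python
import cv2

# Load a training image and pseudo-label its canny-edge condition.
image = cv2.imread("example.jpg")                         # hypothetical path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)   # illustrative thresholds
cv2.imwrite("example_canny.png", edges)                   # paired (image, condition) sample
```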
The evaluation metrics included Fréchet Inception Distance (FID), Inception Score (IS), precision, and recall, with additional comparisons made against existing controllable generation methods like controllable network (ControlNet) and text-to-image adapter (T2I-Adapter).
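For context, FID can be computed with off-the-shelf tooling such as torchmetrics (which relies on the torch-fidelity package); the snippet below shows generic usage with random placeholder tensors, not the authors' evaluation code.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Both sets are uint8 image batches of shape (N, 3, H, W).
real_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)    # accumulate real-image statistics
fid.update(fake_images, real=False)   # accumulate generated-image statistics
print(fid.compute())                  # lower FID indicates closer distributions
```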
Quantitative assessments indicated that the CAR model exhibited superior performance to ControlNet and T2I-Adapter, as evidenced by FID reductions of 3.3, 2.3, and 5.1 across various conditions.
The gains in image quality were attributed to recent advancements in AR, which excelled in image generation by progressively scaling resolution. Notably, the CAR model also demonstrated a fivefold increase in inference speed compared to its counterparts, highlighting its efficiency for practical applications.
The conditions with well-defined objectives, such as HED and depth maps, yielded better performance metrics. In contrast, the sketch condition, which provides simpler, sparser guidance, resulted in more variability in image quality. The CAR model demonstrated strong controllability and high-quality image generation, successfully generalizing across unseen categories. Ablation studies further validated the importance of the learnable convolutional encoder and GPT-2-style transformer in improving performance.
Conclusion
In summary, CAR introduced a new approach to controlling autoregressive image generation by capturing multi-scale control representations and integrating them into pre-trained AR models.
It outperformed existing methods in controllability and image quality while reducing computational costs. However, limitations remain, such as inefficiencies in handling long image sequences and the need for improved control precision.
Future work could explore alternative injection strategies, such as adaptive or attention-based methods, and expand the framework for complex tasks like video generation.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report. Yao, Z., et al. (2024). CAR: Controllable Autoregressive Modeling for Visual Generation. arXiv. DOI: 10.48550/arXiv.2410.04671, https://arxiv.org/abs/2410.04671v1