Discover how ImageFolder transforms image generation by compacting tokens and accelerating inference, all while preserving high-quality details through innovative token prediction and quantization techniques.
Research: ImageFolder: Autoregressive Image Generation with Folded Tokens
In a research paper recently submitted to the arXiv preprint* server, researchers at Carnegie Mellon University, Adobe Research, and MBZUAI explored the impact of token length on image reconstruction and generation in visual generative models, such as diffusion models (DMs) and autoregressive (AR) models.
They introduced ImageFolder, a semantic tokenizer that balances reconstruction and generation quality by providing spatially aligned image tokens. Its key innovation, folded tokens, compresses token sequences during AR modeling, reducing sequence length and improving generation speed without sacrificing quality.
The proposed dual-branch product quantization technique captured both semantic and pixel-level details, leveraging independent quantization across branches to enhance generation efficiency and quality without increasing token length. Extensive experiments validated the effectiveness of ImageFolder.
Background
Image generation has advanced through DMs and AR models, which use tokenization to convert continuous image data into discrete tokens for generation. Prior work, such as the vector quantized generative adversarial network (VQGAN), uses vector quantization to encode image features. While models like SEED and TiTok improve tokenization by injecting semantic meaning, they face a trade-off: longer token sequences improve reconstruction but slow down generation, while shorter sequences degrade image quality.
This paper addressed these challenges by proposing ImageFolder, a semantic tokenizer that struck a balance between image reconstruction and efficient generation. The unique innovation of ImageFolder lies in its folded tokens, which allow token sequences to be compacted for parallel processing during AR generation.
Using product quantization, ImageFolder captures both pixel-level and semantic information through two branches: a semantic branch that applies regularization to distill semantic details, and a detail branch that focuses on pixel-level information. By folding tokens, the model reduces the computational load for AR models, significantly improving inference speed while maintaining high-quality image generation.
Unlike previous methods, it introduced folded tokens, allowing for shorter token lengths while maintaining high generation quality. Extensive experiments confirmed its superior performance, filling a gap between existing tokenization techniques and efficient image generation.
Product Quantization for Image Reconstruction and Generation
The researchers leveraged product quantization (PQ) to compress high-dimensional image data into multiple lower-dimensional tokens. PQ divided the input vector into sub-vectors, quantizing each independently, and reconstructed the original vector via concatenation.
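The idea can be illustrated with a minimal NumPy sketch of product quantization; the two-way split and codebook sizes below are illustrative choices, not the paper's actual configuration.

```python
import numpy as np

def product_quantize(z, codebooks):
    """Quantize a feature vector by splitting it into sub-vectors,
    matching each sub-vector against its own codebook, and
    reconstructing the vector by concatenating the chosen codes.
    z:         (D,) feature vector
    codebooks: list of (K_i, D_i) arrays with sum(D_i) == D
    """
    indices, parts, start = [], [], 0
    for cb in codebooks:
        sub = z[start:start + cb.shape[1]]        # sub-vector for this branch
        dists = np.linalg.norm(cb - sub, axis=1)  # distance to every code
        idx = int(np.argmin(dists))               # nearest code index
        indices.append(idx)
        parts.append(cb[idx])
        start += cb.shape[1]
    return indices, np.concatenate(parts)         # reconstruction via concatenation

# Illustrative example: a 16-d vector split into two 8-d branches,
# each quantized against its own 256-entry codebook.
rng = np.random.default_rng(0)
z = rng.standard_normal(16)
codebooks = [rng.standard_normal((256, 8)) for _ in range(2)]
ids, z_hat = product_quantize(z, codebooks)
```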
The ImageFolder tokenizer, introduced here, tokenized images into spatially aligned semantic and detail tokens. Crucially, the semantic tokens underwent a regularization process, ensuring the representation was compact and semantically rich, while the detail tokens captured pixel-specific information, both contributing to high-quality reconstruction and generation.
A core feature of the method is the quantizer dropout strategy, which enables residual quantizers to encode images at different bitrates, aiding AR prediction. The AR model, trained with ImageFolder tokens, predicts two tokens from the same logit and reconstructs the image, reducing token length while maintaining generation quality.
This parallel token prediction mechanism assumes token independence, which is a departure from traditional AR models that condition each token on the previous one. By doing so, the method drastically reduces token dependencies, resulting in faster and more efficient generation.
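A rough PyTorch sketch of this parallel prediction step is shown below. It assumes two separate classification heads reading the same hidden state; whether the heads or logits are shared is an implementation detail of the original model, so the layer names and sizes here are purely illustrative.

```python
import torch
import torch.nn as nn

class TwoTokenHead(nn.Module):
    """Predict a semantic token and a detail token from the same hidden state.
    Treating the two tokens as conditionally independent lets both be sampled
    in one autoregressive step, halving the effective sequence length."""
    def __init__(self, d_model: int, vocab: int):
        super().__init__()
        self.semantic_head = nn.Linear(d_model, vocab)  # logits for the semantic branch
        self.detail_head = nn.Linear(d_model, vocab)    # logits for the detail branch

    def forward(self, h: torch.Tensor):
        # h: (batch, d_model) hidden state at the current AR position
        sem_logits = self.semantic_head(h)
        det_logits = self.detail_head(h)
        sem_tok = torch.multinomial(sem_logits.softmax(-1), 1)  # sample semantic token
        det_tok = torch.multinomial(det_logits.softmax(-1), 1)  # sample detail token
        return sem_tok, det_tok

# Illustrative usage with random hidden states.
head = TwoTokenHead(d_model=512, vocab=4096)
sem, det = head(torch.randn(2, 512))
```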
The method also incorporated multiple loss functions, including reconstruction, vector quantization, adversarial, perceptual, and CLIP losses. These ensured accurate image reconstruction, high-level feature alignment, and stable training. The integration of product quantization across two branches—one for semantic and one for detail tokens—ensures that each branch captures distinct aspects of the image, leading to more efficient tokenization and higher-quality generation.
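Conceptually, these terms can be combined into a single weighted objective, as in the hedged sketch below; the weights are placeholders, and the perceptual and CLIP terms are assumed to be computed elsewhere (for example, with LPIPS and cosine similarity to CLIP features).

```python
import torch
import torch.nn.functional as F

def tokenizer_loss(x, x_hat, z_e, z_q, disc_fake_logits, perc_loss, clip_loss,
                   w_vq=1.0, w_adv=0.1, w_perc=1.0, w_clip=0.5):
    """Weighted sum of the loss terms described in the paper.
    All weights here are illustrative placeholders, not the paper's values.
    x / x_hat:          input image and its reconstruction
    z_e / z_q:          pre- and post-quantization features
    disc_fake_logits:   discriminator scores on reconstructions
    perc_loss:          precomputed perceptual (e.g., LPIPS) loss
    clip_loss:          precomputed CLIP-alignment loss for the semantic branch
    """
    rec = F.l1_loss(x_hat, x)                                            # pixel reconstruction
    vq = F.mse_loss(z_e, z_q.detach()) + F.mse_loss(z_q, z_e.detach())   # commitment + codebook
    adv = -disc_fake_logits.mean()                                       # generator adversarial term
    return rec + w_vq * vq + w_adv * adv + w_perc * perc_loss + w_clip * clip_loss
```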
Experimental Evaluation and Performance Analysis
The ImageFolder tokenizer was tested on the ImageNet 256x256 reconstruction and generation tasks, following the LlamaGen training recipe. Due to limited computational resources, the tokenizer was trained for only 200 thousand (K) iterations, with future updates planned for more advanced training schemes.
The performance was evaluated using key metrics such as Fréchet inception distance (FID), inception score (IS), precision, and recall, revealing competitive results compared to other state-of-the-art models like BigGAN, ablated DM (ADM), and visual AR (VAR).
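For reference, FID measures the Fréchet distance between Gaussians fitted to Inception features of real and generated images; the sketch below assumes those features have already been extracted into arrays.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feat_real, feat_fake):
    """FID between two sets of Inception features, shapes (N, D) and (M, D):
    the Fréchet distance between Gaussians fitted to each feature set."""
    mu_r, mu_f = feat_real.mean(0), feat_fake.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```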
A notable feature of the ImageFolder tokenizer was its use of multi-scale residual quantization, which significantly improved the FID score. The use of the quantizer dropout strategy allowed the model to learn progressively finer image details over multiple residual quantization steps, further refining the image representation.
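The sketch below illustrates one way residual quantization with quantizer dropout could be realized: each quantizer encodes what the previous ones missed, and training randomly truncates the number of active quantizers so the model learns coarse-to-fine reconstructions. The stand-in quantizers and truncation scheme are illustrative, not the paper's implementation.

```python
import random
import numpy as np

def residual_quantize(z, quantizers, training=False):
    """Apply quantizers to successive residuals. With quantizer dropout,
    training keeps only a random prefix of the quantizers, so the model
    learns to reconstruct images at several bitrates."""
    k = len(quantizers)
    if training:
        k = random.randint(1, len(quantizers))   # quantizer dropout: random truncation
    residual, z_hat = z.copy(), np.zeros_like(z)
    for q in quantizers[:k]:
        q_out = q(residual)    # approximation of the current residual
        z_hat += q_out
        residual -= q_out
    return z_hat

# Toy usage: scalar rounding at increasing precision stands in for learned codebooks.
quantizers = [lambda r, s=s: np.round(r * s) / s for s in (1, 2, 4)]
z_hat = residual_quantize(np.random.randn(8), quantizers, training=True)
```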
Adjustments such as quantizer dropout and the adoption of product quantization resulted in a performance boost, achieving a reconstruction FID (rFID) of 2.06 and a generation FID (gFID) of 5.96. These modifications helped the model represent images across different resolutions and enriched the latent space with semantic regularization.
The experiments also highlighted the tokenizer's computational efficiency, achieving comparable or superior performance to models like VAR with reduced token length, resulting in faster inference times. The conditional image generation experiment showed that ImageFolder could produce variations in generated images based on reference inputs, demonstrating its potential for novel applications.
The ability to teacher-force detail tokens during AR modeling further demonstrates ImageFolder’s versatility for conditional image generation tasks. Linear probing on the ImageNet validation set confirmed the model's strong representation capabilities, surpassing LlamaGen and VAR in top-1 accuracy.
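Linear probing itself is straightforward: a single linear classifier is fit on frozen features. The scikit-learn sketch below uses random placeholder arrays in place of real tokenizer features and ImageNet labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholders: in practice these would be frozen tokenizer features (N, D)
# and the corresponding ImageNet class labels.
rng = np.random.default_rng(0)
train_feats, train_labels = rng.standard_normal((1000, 256)), rng.integers(0, 10, 1000)
val_feats, val_labels = rng.standard_normal((200, 256)), rng.integers(0, 10, 200)

probe = LogisticRegression(max_iter=1000)   # a single linear layer on frozen features
probe.fit(train_feats, train_labels)
top1 = probe.score(val_feats, val_labels)   # top-1 accuracy of the linear probe
```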
Conclusion
In conclusion, the researchers introduced ImageFolder, a novel semantic image tokenizer aimed at balancing token length and reconstruction quality in autoregressive modeling. By employing product quantization and semantic regularization, ImageFolder captured both pixel-level and semantic information without increasing token length.
The use of folded tokens enabled parallel token prediction, enhancing generation efficiency while reducing token dependencies. Innovations such as folded spatially aligned tokens and quantizer dropout enhanced image generation and reconstruction performance.
Extensive experiments demonstrated competitive results against state-of-the-art models, with improved efficiency and reduced token dependencies. Although effective, the authors acknowledge that further improvements can be made with advanced training schemes, which they plan to explore in future work.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Li, X., Chen, H., Qiu, K., Kuen, J., Gu, J., Raj, B., & Lin, Z. (2024). ImageFolder: Autoregressive Image Generation with Folded Tokens. arXiv. DOI: 10.48550/arXiv.2410.01756, https://arxiv.org/abs/2410.01756