Discover how ImageFolder transforms image generation by compacting tokens and accelerating inference, all while preserving high-quality details through innovative token prediction and quantization techniques.
Research: ImageFolder: Autoregressive Image Generation with Folded Tokens
In a research paper recently submitted to the arXiv preprint* server, researchers at Carnegie Mellon University, Adobe Research, and MBZUAI explored the impact of token length on image reconstruction and generation in visual generative models, such as diffusion models (DMs) and autoregressive (AR) models.
They introduced ImageFolder, a semantic tokenizer that balances reconstruction and generation quality by providing spatially aligned image tokens. Its key innovation, folded tokens, compresses token sequences during AR modeling, reducing sequence length and improving generation speed without sacrificing quality.
The proposed dual-branch product quantization technique captured both semantic and pixel-level details, leveraging independent quantization across branches to enhance generation efficiency and quality without increasing token length. Extensive experiments validated the effectiveness of ImageFolder.
Background
Image generation has advanced through DMs and AR models, which use tokenization to convert continuous image data into discrete tokens for generation. Prior work, such as the vector quantized generative adversarial network (VQGAN), uses vector quantization to encode image features. While models like SEED and TiTok improve tokenization by injecting semantic meaning, they face a trade-off: longer token sequences improve reconstruction but slow down generation, while shorter sequences degrade image quality.
This paper addressed these challenges by proposing ImageFolder, a semantic tokenizer that struck a balance between image reconstruction and efficient generation. The unique innovation of ImageFolder lies in its folded tokens, which allow token sequences to be compacted for parallel processing during AR generation.
Using product quantization, ImageFolder captures both pixel-level and semantic information through two branches: a semantic branch that applies regularization to distill semantic details, and a detail branch that focuses on pixel-level information. By folding tokens, the model reduces the computational load for AR models, significantly improving inference speed while maintaining high-quality image generation.
Unlike previous methods, it introduced folded tokens, allowing for shorter token lengths while maintaining high generation quality. Extensive experiments confirmed its superior performance, filling a gap between existing tokenization techniques and efficient image generation.
Product Quantization for Image Reconstruction and Generation
The researchers leveraged product quantization (PQ) to compress high-dimensional image data into multiple lower-dimensional tokens. PQ divided the input vector into sub-vectors, quantizing each independently, and reconstructed the original vector via concatenation.
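The idea can be illustrated with a minimal NumPy sketch of product quantization; the two-way split and codebook sizes below are illustrative choices, not the paper's actual configuration.

```python
import numpy as np

def product_quantize(z, codebooks):
    """Quantize a feature vector by splitting it into sub-vectors,
    matching each sub-vector against its own codebook, and
    reconstructing the vector by concatenating the chosen codes.
    z:         (D,) feature vector
    codebooks: list of (K_i, D_i) arrays with sum(D_i) == D
    """
    indices, parts, start = [], [], 0
    for cb in codebooks:
        sub = z[start:start + cb.shape[1]]        # sub-vector for this branch
        dists = np.linalg.norm(cb - sub, axis=1)  # distance to every code
        idx = int(np.argmin(dists))               # nearest code index
        indices.append(idx)
        parts.append(cb[idx])
        start += cb.shape[1]
    return indices, np.concatenate(parts)         # reconstruction via concatenation

# Illustrative example: a 16-d vector split into two 8-d branches,
# each quantized against its own 256-entry codebook.
rng = np.random.default_rng(0)
z = rng.standard_normal(16)
codebooks = [rng.standard_normal((256, 8)) for _ in range(2)]
ids, z_hat = product_quantize(z, codebooks)
```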
The ImageFolder tokenizer, introduced here, tokenized images into spatially aligned semantic and detail tokens. Crucially, the semantic tokens underwent a regularization process, ensuring the representation was compact and semantically rich, while the detail tokens captured pixel-specific information, both contributing to high-quality reconstruction and generation.
A core feature of the method is the quantizer dropout strategy, which enables residual quantizers to encode images at different bitrates, aiding AR prediction. The AR model, trained with ImageFolder tokens, predicts two tokens from the same logit and reconstructs the image, reducing token length while maintaining generation quality.
This parallel token prediction mechanism assumes token independence, which is a departure from traditional AR models that condition each token on the previous one. By doing so, the method drastically reduces token dependencies, resulting in faster and more efficient generation.
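A rough PyTorch sketch of this parallel prediction step is shown below. It assumes two separate classification heads reading the same hidden state; whether the heads or logits are shared is an implementation detail of the original model, so the layer names and sizes here are purely illustrative.

```python
import torch
import torch.nn as nn

class TwoTokenHead(nn.Module):
    """Predict a semantic token and a detail token from the same hidden state.
    Treating the two tokens as conditionally independent lets both be sampled
    in one autoregressive step, halving the effective sequence length."""
    def __init__(self, d_model: int, vocab: int):
        super().__init__()
        self.semantic_head = nn.Linear(d_model, vocab)  # logits for the semantic branch
        self.detail_head = nn.Linear(d_model, vocab)    # logits for the detail branch

    def forward(self, h: torch.Tensor):
        # h: (batch, d_model) hidden state at the current AR position
        sem_logits = self.semantic_head(h)
        det_logits = self.detail_head(h)
        sem_tok = torch.multinomial(sem_logits.softmax(-1), 1)  # sample semantic token
        det_tok = torch.multinomial(det_logits.softmax(-1), 1)  # sample detail token
        return sem_tok, det_tok

# Illustrative usage with random hidden states.
head = TwoTokenHead(d_model=512, vocab=4096)
sem, det = head(torch.randn(2, 512))
```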
The method also incorporated multiple loss functions, including reconstruction, vector quantization, adversarial, perceptual, and CLIP losses. These ensured accurate image reconstruction, high-level feature alignment, and stable training. The integration of product quantization across two branches—one for semantic and one for detail tokens—ensures that each branch captures distinct aspects of the image, leading to more efficient tokenization and higher-quality generation.
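Conceptually, these terms can be combined into a single weighted objective, as in the hedged sketch below; the weights are placeholders, and the perceptual and CLIP terms are assumed to be computed elsewhere (for example, with LPIPS and cosine similarity to CLIP features).

```python
import torch
import torch.nn.functional as F

def tokenizer_loss(x, x_hat, z_e, z_q, disc_fake_logits, perc_loss, clip_loss,
                   w_vq=1.0, w_adv=0.1, w_perc=1.0, w_clip=0.5):
    """Weighted sum of the loss terms described in the paper.
    All weights here are illustrative placeholders, not the paper's values.
    x / x_hat:          input image and its reconstruction
    z_e / z_q:          pre- and post-quantization features
    disc_fake_logits:   discriminator scores on reconstructions
    perc_loss:          precomputed perceptual (e.g., LPIPS) loss
    clip_loss:          precomputed CLIP-alignment loss for the semantic branch
    """
    rec = F.l1_loss(x_hat, x)                                            # pixel reconstruction
    vq = F.mse_loss(z_e, z_q.detach()) + F.mse_loss(z_q, z_e.detach())   # commitment + codebook
    adv = -disc_fake_logits.mean()                                       # generator adversarial term
    return rec + w_vq * vq + w_adv * adv + w_perc * perc_loss + w_clip * clip_loss
```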
Experimental Evaluation and Performance Analysis
The ImageFolder tokenizer was tested on the ImageNet 256x256 reconstruction and generation tasks, following the LlamaGen training recipe. Due to limited computational resources, the tokenizer was trained for only 200 thousand (K) iterations, with future updates planned for more advanced training schemes.
The performance was evaluated using key metrics such as Fréchet inception distance (FID), inception score (IS), precision, and recall, revealing competitive results compared to other state-of-the-art models like BigGAN, ablated DM (ADM), and visual AR (VAR).
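For reference, FID measures the Fréchet distance between Gaussians fitted to Inception features of real and generated images; the sketch below assumes those features have already been extracted into arrays.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feat_real, feat_fake):
    """FID between two sets of Inception features, shapes (N, D) and (M, D):
    the Fréchet distance between Gaussians fitted to each feature set."""
    mu_r, mu_f = feat_real.mean(0), feat_fake.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```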
A notable feature of the ImageFolder tokenizer was its use of multi-scale residual quantization, which significantly improved the FID score. The use of the quantizer dropout strategy allowed the model to learn progressively finer image details over multiple residual quantization steps, further refining the image representation.
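The sketch below illustrates one way residual quantization with quantizer dropout could be realized: each quantizer encodes what the previous ones missed, and training randomly truncates the number of active quantizers so the model learns coarse-to-fine reconstructions. The stand-in quantizers and truncation scheme are illustrative, not the paper's implementation.

```python
import random
import numpy as np

def residual_quantize(z, quantizers, training=False):
    """Apply quantizers to successive residuals. With quantizer dropout,
    training keeps only a random prefix of the quantizers, so the model
    learns to reconstruct images at several bitrates."""
    k = len(quantizers)
    if training:
        k = random.randint(1, len(quantizers))   # quantizer dropout: random truncation
    residual, z_hat = z.copy(), np.zeros_like(z)
    for q in quantizers[:k]:
        q_out = q(residual)    # approximation of the current residual
        z_hat += q_out
        residual -= q_out
    return z_hat

# Toy usage: scalar rounding at increasing precision stands in for learned codebooks.
quantizers = [lambda r, s=s: np.round(r * s) / s for s in (1, 2, 4)]
z_hat = residual_quantize(np.random.randn(8), quantizers, training=True)
```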
Adjustments such as quantizer dropout and the adoption of product quantization resulted in a performance boost, achieving a reconstruction FID (rFID) of 2.06 and a generation FID (gFID) of 5.96. These modifications helped the model represent images across different resolutions and enriched the latent space with semantic regularization.
The experiments also highlighted the tokenizer's computational efficiency, achieving comparable or superior performance to models like VAR with reduced token length, resulting in faster inference times. The conditional image generation experiment showed that ImageFolder could produce variations in generated images based on reference inputs, demonstrating its potential for novel applications.
The ability to teacher-force detail tokens during AR modeling further demonstrates ImageFolder’s versatility for conditional image generation tasks. Linear probing on the ImageNet validation set confirmed the model's strong representation capabilities, surpassing LlamaGen and VAR in top-1 accuracy.
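Linear probing itself is straightforward: a single linear classifier is fit on frozen features. The scikit-learn sketch below uses random placeholder arrays in place of real tokenizer features and ImageNet labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholders: in practice these would be frozen tokenizer features (N, D)
# and the corresponding ImageNet class labels.
rng = np.random.default_rng(0)
train_feats, train_labels = rng.standard_normal((1000, 256)), rng.integers(0, 10, 1000)
val_feats, val_labels = rng.standard_normal((200, 256)), rng.integers(0, 10, 200)

probe = LogisticRegression(max_iter=1000)   # a single linear layer on frozen features
probe.fit(train_feats, train_labels)
top1 = probe.score(val_feats, val_labels)   # top-1 accuracy of the linear probe
```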
Conclusion
In conclusion, the researchers introduced ImageFolder, a novel semantic image tokenizer aimed at balancing token length and reconstruction quality in autoregressive modeling. By employing product quantization and semantic regularization, ImageFolder captured both pixel-level and semantic information without increasing token length.
The use of folded tokens enabled parallel token prediction, enhancing generation efficiency while reducing token dependencies. Innovations such as folded spatially aligned tokens and quantizer dropout enhanced image generation and reconstruction performance.
Extensive experiments demonstrated competitive results against state-of-the-art models, with improved efficiency and reduced token dependencies. Although effective, the authors acknowledge that further improvements can be made with advanced training schemes, which they plan to explore in future work.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Li, X., Chen, H., Qiu, K., Kuen, J., Gu, J., Raj, B., & Lin, Z. (2024). ImageFolder: Autoregressive Image Generation with Folded Tokens. arXiv. DOI: 10.48550/arXiv.2410.01756, https://arxiv.org/abs/2410.01756