ImageFolder: Autoregressive Image Generation with Folded Tokens

Discover how ImageFolder transforms image generation by compacting tokens and accelerating inference, all while preserving high-quality details through innovative token prediction and quantization techniques.

Research: ImageFolder: Autoregressive Image Generation with Folded Tokens

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

In a research paper recently submitted to the arXiv preprint* server, researchers at Carnegie Mellon University, Adobe Research, and MBZUAI explored the impact of token length on image reconstruction and generation in visual generative models, such as diffusion models (DMs) and autoregressive (AR) models.

They introduced ImageFolder, a semantic tokenizer that balances reconstruction and generation quality by producing spatially aligned image tokens. Its key innovation, folded tokens, compresses token sequences during AR modeling, reducing sequence length and accelerating generation without sacrificing quality.

The proposed dual-branch product quantization technique captured both semantic and pixel-level details, leveraging independent quantization across branches to enhance generation efficiency and quality without increasing token length. Extensive experiments validated the effectiveness of ImageFolder.

Background

Image generation has advanced through DMs and AR models, using tokenization to convert continuous image data into discrete tokens for generation. Prior work, such as vector quantized generative adversarial network (VQGAN), uses vector quantization to encode image features. While models like SEED and TiTok improve on tokenization by injecting semantic meaning, they face trade-offs—longer token sequences improve reconstruction but slow down generation, while shorter sequences degrade image quality.

This paper addressed these challenges by proposing ImageFolder, a semantic tokenizer that struck a balance between image reconstruction and efficient generation. The unique innovation of ImageFolder lies in its folded tokens, which allow token sequences to be compacted for parallel processing during AR generation.

Using product quantization, ImageFolder captures both pixel-level and semantic information with two branches: a semantic branch that introduces regularization to compress semantic details, and a detail branch that focuses on pixel-level information. By folding tokens, the model reduces the computational load for AR models, significantly enhancing inference speed while maintaining high-quality image generation.

Unlike previous methods, it introduced folded tokens, allowing for shorter token lengths while maintaining high generation quality. Extensive experiments confirmed its superior performance, filling a gap between existing tokenization techniques and efficient image generation.

Product Quantization for Image Reconstruction and Generation

The researchers leveraged product quantization (PQ) to compress high-dimensional image data into multiple lower-dimensional tokens. PQ divides the input vector into sub-vectors, quantizes each independently, and reconstructs the original vector by concatenating the quantized sub-vectors.
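The split-quantize-concatenate pattern described above can be sketched in a few lines of NumPy. The codebook sizes and dimensions below are illustrative placeholders, not the paper's actual settings:

```python
import numpy as np

def product_quantize(x, codebooks):
    """Quantize vector x by splitting it into sub-vectors, snapping each
    to its nearest codeword, and concatenating the results."""
    n_sub = len(codebooks)
    subs = np.split(x, n_sub)                # divide input into sub-vectors
    codes, recon = [], []
    for sub, book in zip(subs, codebooks):
        # nearest codeword by Euclidean distance, quantized independently
        idx = int(np.argmin(np.linalg.norm(book - sub, axis=1)))
        codes.append(idx)
        recon.append(book[idx])
    return codes, np.concatenate(recon)      # reconstruct via concatenation

# toy usage: an 8-dim vector split across two 4-dim codebooks of 16 entries
rng = np.random.default_rng(0)
books = [rng.normal(size=(16, 4)) for _ in range(2)]
x = rng.normal(size=8)
codes, x_hat = product_quantize(x, books)
```

Because each sub-vector indexes its own codebook, the effective codebook size grows multiplicatively (here 16 × 16 combinations) while each individual codebook stays small.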

The ImageFolder tokenizer, introduced here, tokenized images into spatially aligned semantic and detail tokens. Crucially, the semantic tokens underwent a regularization process, ensuring the representation was compact and semantically rich, while the detail tokens captured pixel-specific information, both contributing to high-quality reconstruction and generation.

A core feature of the method is the quantizer dropout strategy, which enables residual quantizers to encode images with different bitrates, aiding AR prediction. The AR model, trained with ImageFolder, predicted two tokens from the same logit and reconstructed the image, reducing token length while maintaining generation quality.

This parallel token prediction mechanism assumes token independence, which is a departure from traditional AR models that condition each token on the previous one. By doing so, the method drastically reduces token dependencies, resulting in faster and more efficient generation.
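Under the independence assumption above, one AR step can emit a semantic token and a detail token simultaneously. The sketch below is one plausible reading of "two tokens from the same logit," using two hypothetical projection heads over a shared hidden state; the head names and codebook sizes are assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_folded_pair(hidden, head_sem, head_det, rng):
    """One AR step: from a single hidden state, score the semantic and
    detail codebooks independently and sample both tokens in parallel."""
    p_sem = softmax(head_sem @ hidden)   # distribution over semantic codebook
    p_det = softmax(head_det @ hidden)   # distribution over detail codebook
    tok_sem = rng.choice(len(p_sem), p=p_sem)
    tok_det = rng.choice(len(p_det), p=p_det)
    return tok_sem, tok_det              # two tokens per step halves sequence length

rng = np.random.default_rng(0)
hidden = rng.normal(size=64)
tok_sem, tok_det = predict_folded_pair(
    hidden, rng.normal(size=(256, 64)), rng.normal(size=(256, 64)), rng)
```

The speedup comes from the loop structure: a folded sequence needs half as many sequential decoding steps, since neither token in a pair waits on the other.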

The method also incorporated multiple loss functions, including reconstruction, vector quantization, adversarial, perceptual, and CLIP losses. These ensured accurate image reconstruction, high-level feature alignment, and stable training. The integration of product quantization across two branches—one for semantic and one for detail tokens—ensures that each branch captures distinct aspects of the image, leading to more efficient tokenization and higher-quality generation.
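A multi-term objective like the one listed above is typically combined as a weighted sum. The sketch below only illustrates that structure; the weight values are placeholders, not the paper's settings:

```python
def tokenizer_loss(l_rec, l_vq, l_adv, l_perc, l_clip,
                   w_rec=1.0, w_vq=1.0, w_adv=0.1, w_perc=1.0, w_clip=0.1):
    """Weighted sum of reconstruction, vector-quantization, adversarial,
    perceptual, and CLIP loss terms (illustrative weights)."""
    return (w_rec * l_rec + w_vq * l_vq + w_adv * l_adv
            + w_perc * l_perc + w_clip * l_clip)

total = tokenizer_loss(1.0, 1.0, 1.0, 1.0, 1.0)
```

In practice the adversarial weight is often kept small relative to the reconstruction term to stabilize training, which is why it is down-weighted in this sketch.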

Experimental Evaluation and Performance Analysis

The ImageFolder tokenizer was tested on the ImageNet 256x256 reconstruction and generation tasks, following the LlamaGen training recipe. Due to limited computational resources, the tokenizer was trained for only 200,000 (200K) iterations, with future updates planned for more advanced training schemes.

The performance was evaluated using key metrics such as Fréchet inception distance (FID), inception score (IS), precision, and recall, revealing competitive results compared to other state-of-the-art models like BigGAN, ablated DM (ADM), and visual AR (VAR).
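FID measures the Fréchet distance between Gaussians fitted to Inception features of real and generated images; lower is better. The one-dimensional case below shows the formula's shape (the multivariate version replaces the scalar terms with a trace over covariance matrices):

```python
import numpy as np

def frechet_distance(mu1, var1, mu2, var2):
    """Fréchet distance between two 1-D Gaussians; FID applies the
    multivariate analogue to Inception-feature statistics."""
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * np.sqrt(var1 * var2)

# identical distributions score zero; distance grows with mean shift
d_same = frechet_distance(0.0, 1.0, 0.0, 1.0)
d_shift = frechet_distance(2.0, 1.0, 0.0, 1.0)
```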

A notable feature of the ImageFolder tokenizer was its use of multi-scale residual quantization, which significantly improved the FID score. The use of the quantizer dropout strategy allowed the model to learn progressively finer image details over multiple residual quantization steps, further refining the image representation.
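Residual quantization, the mechanism behind this progressive refinement, can be sketched as each quantizer encoding what the previous ones missed; truncating the chain (as quantizer dropout does during training) yields a coarser, lower-bitrate encoding of the same feature. Codebook sizes here are illustrative:

```python
import numpy as np

def residual_quantize(x, codebooks, n_active):
    """Residual quantization: each quantizer encodes the residual left by
    the previous ones. Using fewer active quantizers (as under quantizer
    dropout) produces a coarser reconstruction of the same feature."""
    recon = np.zeros_like(x)
    residual = x.copy()
    for book in codebooks[:n_active]:
        idx = int(np.argmin(np.linalg.norm(book - residual, axis=1)))
        recon = recon + book[idx]        # accumulate quantized residuals
        residual = x - recon             # what remains to be encoded
    return recon

rng = np.random.default_rng(0)
books = [rng.normal(size=(32, 8)) for _ in range(4)]
x = rng.normal(size=8)
coarse = residual_quantize(x, books, n_active=1)  # low-bitrate encoding
fine = residual_quantize(x, books, n_active=4)    # full-bitrate encoding
```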

Adjustments such as quantizer dropout and the adoption of product quantization resulted in a performance boost, achieving an rFID of 2.06 and gFID of 5.96. These modifications helped the model represent images across different resolutions and enriched the latent space with semantic regularization.

The experiments also highlighted the tokenizer's computational efficiency, achieving comparable or superior performance to models like VAR with reduced token length, resulting in faster inference times. The conditional image generation experiment showed that ImageFolder could produce variations in generated images based on reference inputs, demonstrating its potential for novel applications.

The ability to teacher-force detail tokens during AR modeling further demonstrates ImageFolder’s versatility for conditional image generation tasks. Linear probing on the ImageNet validation set confirmed the model's strong representation capabilities, surpassing LlamaGen and VAR in top-1 accuracy.
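Linear probing freezes the learned representation and fits only a linear classifier on top, so accuracy reflects the quality of the features themselves. A minimal sketch, using a least-squares fit against one-hot labels as a simple stand-in for the logistic-regression probes typically used in practice (the feature and class dimensions are illustrative):

```python
import numpy as np

def linear_probe(features, labels, n_classes):
    """Fit a linear classifier on frozen features via least squares
    against one-hot targets, and report top-1 accuracy."""
    Y = np.eye(n_classes)[labels]                     # one-hot targets
    W, *_ = np.linalg.lstsq(features, Y, rcond=None)  # closed-form linear fit
    preds = features @ W
    acc = float((preds.argmax(axis=1) == labels).mean())
    return W, acc

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))        # stand-in for frozen token features
labels = rng.integers(0, 4, size=200)
W, acc = linear_probe(feats, labels, 4)
```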

Conclusion

In conclusion, the researchers introduced ImageFolder, a novel semantic image tokenizer aimed at balancing token length and reconstruction quality in autoregressive modeling. By employing product quantization and semantic regularization, ImageFolder captured both pixel-level and semantic information without increasing token length.

The use of folded tokens enabled parallel token prediction, enhancing generation efficiency while reducing token dependencies. Innovations such as folded spatially aligned tokens and quantizer dropout enhanced image generation and reconstruction performance.

Extensive experiments demonstrated competitive results against state-of-the-art models, with improved efficiency and reduced token dependencies. Although effective, the authors acknowledge that further improvements can be made with advanced training schemes, which they plan to explore in future work.


Journal reference:
  • Preliminary scientific report. Li, X., Chen, H., Qiu, K., Kuen, J., Gu, J., Raj, B., & Lin, Z. (2024). ImageFolder: Autoregressive Image Generation with Folded Tokens. arXiv. DOI: 10.48550/arXiv.2410.01756, https://arxiv.org/abs/2410.01756

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.
