Researchers introduce a discrete-state framework that rethinks how visual models handle image segmentation and generation, achieving state-of-the-art results on MS COCO and competitive performance on benchmarks like ImageNet256.
Research: [MASK] is All You Need
In a research paper submitted to the arXiv preprint* server, researchers at Ludwig Maximilian University of Munich, Germany, explored connecting masked generative models (MGM) and non-autoregressive diffusion models using discrete-state models, focusing on scalability in the vision domain.
Using discrete states, they analyzed design choices such as timestep independence, noise schedules, and guidance strength, and introduced a mathematical framework inspired by the Kolmogorov equation. This framework describes the transition from masked tokens to data states through a combination of discrete flow matching and progressive unmasking. The study also recast tasks such as image segmentation as an unmasking process.
The authors further highlight the advantages of implicit timestep models, which remove timestep dependence, offering significant flexibility and adaptability compared to explicit timestep models.
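As a rough illustration of that distinction (not the authors' code; the class names and layer choices below are hypothetical, and the transformer backbone is omitted), an explicit-timestep denoiser conditions its token predictions on t, whereas an implicit-timestep denoiser predicts clean tokens from the partially masked sequence alone:

```python
import torch
import torch.nn as nn

class ExplicitTimestepDenoiser(nn.Module):
    """Hypothetical sketch: token predictions are conditioned on the timestep t."""
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, dim)   # +1 reserves an id for [MASK]
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        h = self.embed(tokens) + self.time_mlp(t.view(-1, 1, 1).float())
        return self.head(h)                              # logits over clean tokens

class ImplicitTimestepDenoiser(nn.Module):
    """Hypothetical sketch: no timestep input; the mask pattern itself encodes the noise level."""
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(tokens))
```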
Related Work
Past work has established connections between diffusion models and autoregressive models, particularly in text generation. Some have extended these ideas to vision tasks.
Methods such as the masked generative image transformer (MaskGIT), the masked generative video transformer (MAGVIT), and vector-quantized diffusion (VQ-Diffusion) have focused on MGM or discrete-state vision generation, often relying on heuristic sampling rules.
These studies primarily explored specific use cases without unifying MGM and diffusion models under a single framework.
Discrete Generative Framework
Drawing on diffusion and flow models, the authors proposed a discrete-state generative modeling framework. Building on the Kolmogorov equation, it introduces "discrete interpolants," in which data transitions from a fully masked state to clean data by progressively unmasking tokens under a noise schedule, guided by a trained vector field.
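In the notation common to discrete flow matching (the symbols below, a schedule κ(t) and a mask token [M], are our paraphrase rather than the paper's exact formulation), such an interpolant can be sketched as a time-dependent mixture between the mask state and the data token:

```latex
% Sketch of a masked discrete interpolant (assumed notation, not taken verbatim from the paper).
% x_1 is the clean data token, [M] the mask token, and \kappa(t) a monotone schedule
% with \kappa(0) = 0 (fully masked) and \kappa(1) = 1 (fully unmasked).
\[
  p_t\left(x_t \mid x_1\right)
  = \bigl(1 - \kappa(t)\bigr)\,\delta_{[\mathrm{M}]}(x_t)
  + \kappa(t)\,\delta_{x_1}(x_t)
\]
% The trained network approximates the posterior over clean tokens, p_\theta(x_1 \mid x_t),
% from which the generating vector field (transition rates) is assembled.
```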
A scheduler determines the masking and unmasking schedules, and training is optimized through a cross-entropy loss that incorporates masking and timestep-dependent weighting. To enhance stability and fidelity, the loss is restricted to masked token positions, a technique shown to reduce overfitting in vision tasks.
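A minimal training-loss sketch along these lines (our own illustration; the function name, the weighting function w(t), and the masking convention are assumptions based on the description above, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, clean_tokens, mask, t, w=lambda t: torch.ones_like(t)):
    """Cross-entropy computed only on masked positions, rescaled by a timestep weight w(t).

    logits:       (B, L, V) predicted distribution over clean tokens
    clean_tokens: (B, L)    ground-truth token indices
    mask:         (B, L)    True where the input token was replaced by [MASK]
    t:            (B,)      timestep used to corrupt each sample
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        clean_tokens.reshape(-1),
        reduction="none",
    ).reshape(clean_tokens.shape)                      # (B, L)
    per_token = per_token * mask.float()               # only masked positions contribute
    per_sample = per_token.sum(dim=1) / mask.float().sum(dim=1).clamp(min=1)
    return (w(t) * per_sample).mean()                  # timestep-dependent weighting w(t)
```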
The framework supports explicit timestep models, which depend on specific timesteps, and implicit timestep models, which are independent of timestep information. This makes it versatile for tasks like image editing or token sampling.
Various sampling methods are employed, including timestep-dependent and timestep-independent sampling as well as a heuristic greedy strategy similar to MaskGIT's. Data tokens are progressively unmasked until the sequence is fully reconstructed, with a final argmax operation ensuring high fidelity in the results.
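The greedy, MaskGIT-style procedure can be sketched roughly as follows (a simplified illustration under assumed choices for the reveal schedule and confidence rule; the Gumbel perturbation of confidence scores is discussed further in the experiments below):

```python
import torch

@torch.no_grad()
def greedy_unmask(model, tokens, mask_id, num_steps=12, temperature=1.0, gumbel_scale=1.0):
    """Iteratively replace [MASK] tokens with predicted tokens, most confident positions first.

    model:  maps a (B, L) token grid to (B, L, V) logits over clean tokens
    tokens: (B, L) grid initialised entirely to mask_id
    """
    B, L = tokens.shape
    for step in range(num_steps):
        logits = model(tokens)                                   # (B, L, V)
        probs = torch.softmax(logits / temperature, dim=-1)
        conf, pred = probs.max(dim=-1)                           # confidence and candidate token

        # Gumbel noise adds stochasticity to the confidence ranking (MGM-style sampling).
        gumbel = -torch.log(-torch.log(torch.rand_like(conf).clamp(min=1e-9)))
        score = conf.log() + gumbel_scale * gumbel
        score = score.masked_fill(tokens != mask_id, float("inf"))  # keep revealed tokens fixed

        # Linear reveal schedule used here for simplicity; the paper's schedule may differ.
        keep_masked = int(L * (1 - (step + 1) / num_steps))
        if keep_masked > 0:
            cutoff = score.sort(dim=-1).values[:, keep_masked - 1:keep_masked]
            reveal = score > cutoff
        else:
            reveal = torch.ones_like(tokens, dtype=torch.bool)
        tokens = torch.where(reveal & (tokens == mask_id), pred, tokens)

    # A final argmax pass fills any positions that remain masked.
    pred = model(tokens).argmax(dim=-1)
    return torch.where(tokens == mask_id, pred, tokens)
```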
A key innovation is the reinterpretation of segmentation as a form of unmasking. This reframing allows image-segmentation pairs to be modeled jointly, combining generative and discriminative approaches and extending the framework to multimodal learning tasks across diverse domains of generative modeling, as sketched below.
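As an illustration (our own sketch; the token layout, corruption rate, and function name are hypothetical), an image-segmentation pair can be treated as one token sequence in which only the segmentation tokens start out masked:

```python
import torch

def build_segmentation_input(image_tokens, seg_tokens, mask_id, train=True):
    """Concatenate image and segmentation tokens into one sequence.

    At inference the segmentation half is fully masked, so predicting a
    segmentation map becomes the same unmasking task used for generation.
    """
    if train:
        # During training, segmentation tokens are partially masked; a fixed 50% rate is a
        # placeholder for whatever noise schedule the model actually uses.
        keep = torch.rand_like(seg_tokens, dtype=torch.float) < 0.5
        seg_in = torch.where(keep, seg_tokens, torch.full_like(seg_tokens, mask_id))
    else:
        seg_in = torch.full_like(seg_tokens, mask_id)              # predict the whole mask
    return torch.cat([image_tokens, seg_in], dim=1)                # (B, L_img + L_seg)
```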
Image & Video Generation
The experiments conducted in this study focused on image and video generation using various datasets and evaluation metrics.
Datasets such as ImageNet256 and Microsoft common objects in context (MS COCO) were employed for image generation, while the Cityscapes dataset was used to train on image-segmentation mask pairs.
Video generation experiments primarily utilized the FaceForensics dataset. Frechet inception distance (FID) was used to evaluate image generation tasks, while Frechet video distance (FVD) assessed video generation performance.
The experiments consistently used the Stable Diffusion vector-quantized tokenizer with downsampling factor 8 (SD-VQ-F8), a discrete tokenizer trained on large datasets such as Open Images. Sampling strategies were systematically analyzed, including the introduction of Gumbel noise for MGM-style sampling, which adds stochasticity to the confidence scores and significantly improved convergence rates and performance in low-step evaluations.
Training also involved specific methodological choices, such as how timesteps were sampled. Unlike prior methods that use adaptive step sizes, a consistent fixed step size was adopted for fair comparison. The cross-entropy loss included a weighting mechanism w(t) to optimize visual fidelity, which outperformed standard ELBO-derived weights in empirical testing.
For additional training details, such as optimizer settings and graphics processing unit (GPU) usage, the appendix provides more comprehensive information.
The models evaluated on the MS COCO dataset demonstrated that the proposed methods, including implicit and explicit timestep models, achieved competitive or superior FID scores compared to existing state-of-the-art models.
Notably, the explicit timestep model achieved an FID score of 6.03, and the implicit timestep model achieved 5.65. These models, trained on MS COCO using discrete diffusion and MGM, outperformed several autoregressive and continuous diffusion models.
Similarly, experiments on the ImageNet256 dataset showed that both timestep models produced high-fidelity images, often rivaling or surpassing established diffusion techniques.
For video generation, experiments conducted on the FaceForensics dataset validated the scalability of the proposed methods from image to video generation. Adaptations to discrete states using learnable embeddings and linear layers performed better than continuous-state counterparts.
Ablation studies highlighted the critical role of Gumbel noise, temperature scaling, and classifier-free guidance in enhancing performance. This systematic analysis provided actionable insights for optimizing MGM and diffusion-based sampling.
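For example, classifier-free guidance in this discrete setting can be applied directly to the predicted logits before sampling (a generic sketch under our own assumptions; the guidance scale and temperature values are placeholders, and the paper's exact formulation may differ):

```python
import torch

def guided_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
                  guidance_scale: float = 2.0, temperature: float = 1.0) -> torch.Tensor:
    """Blend conditional and unconditional logits, then rescale by temperature.

    guidance_scale > 1 pushes samples toward the conditioning signal;
    temperature < 1 sharpens the resulting token distribution.
    """
    logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    return logits / temperature
```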
Conclusion
To sum up, the work extended discrete flow matching theory to vision tasks, generalizing from explicit to implicit timestep models. The intersection of diffusion and MGM was analyzed, achieving state-of-the-art results on MS COCO and competitive performance on ImageNet256. The method also demonstrated scalability to video generation on datasets like FaceForensics.
The authors also acknowledged the inherent challenge of irreversible denoising errors in mask-based methods and proposed future research directions to incorporate reversible stochastic interpolants for improved robustness.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.