By introducing TrigFlow, a trigonometric formulation of diffusion models, OpenAI's latest research eliminates instabilities in continuous-time consistency models, scaling them to 1.5 billion parameters while markedly improving training efficiency and sample quality.
Uncurated 1-step samples generated by the researchers' sCD-XXL model trained on ImageNet 512×512. Research: Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article recently submitted to the arXiv preprint* server, researchers at OpenAI addressed the instability of continuous-time consistency models (CMs) by proposing a simplified theoretical framework unifying previous parameterizations of diffusion models and CMs.
They introduced improvements in time-conditioning, network architecture, and training objectives, enabling continuous-time CMs to scale to unprecedented levels.
With only two sampling steps, the resulting models significantly narrowed the sample-quality gap with leading diffusion models.
Background
Past work on diffusion models has achieved remarkable results in generative artificial intelligence (AI), but sampling from these models is slow and computationally intensive, typically requiring many sampling steps, and the distillation techniques used to accelerate them are themselves costly and difficult to apply.
CMs address these limitations by removing the need for samples from a teacher diffusion model and by avoiding adversarial training, but earlier discrete-time versions suffered from significant discretization errors.
TrigFlow: Simplifying Diffusion Models
TrigFlow is a new trigonometric formulation of diffusion models that simplifies the construction of continuous-time CMs. It retains the core properties of elucidated diffusion models (EDM) while simplifying their key mathematical components.
The coefficients in TrigFlow are trigonometric functions of time, making them more straightforward while keeping the variance of the noisy samples stable across time steps.
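Schematically, and paraphrasing the preprint's notation (x0 is the clean data, z is Gaussian noise scaled by the data standard deviation, and the time t runs from 0 to π/2), the noisy sample and its time derivative take the trigonometric form below; this is a simplified rendering rather than a verbatim reproduction of the paper's equations.

```latex
% Simplified rendering of a TrigFlow-style interpolation (notation paraphrased):
% x_0: clean data, z: Gaussian noise scaled by the data standard deviation, t \in [0, \pi/2]
x_t = \cos(t)\, x_0 + \sin(t)\, z,
\qquad
\frac{\mathrm{d}x_t}{\mathrm{d}t} = -\sin(t)\, x_0 + \cos(t)\, z .
```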
In this formulation, the diffusion model is parameterized to reduce computational and mathematical complexity while ensuring stability. The proposed modifications tackle instability during training through this trigonometric parameterization, combined with simplified time embeddings and adaptive double normalization.
Previous formulations relied on complicated relationships between the diffusion time variable and the noise level, which introduced significant instabilities, and discrete-time models additionally suffered from discretization errors. TrigFlow sidesteps both problems, yielding increasingly accurate predictions as time steps shrink toward the continuous-time limit.
The new formulation leads to more stable training dynamics, particularly for continuous-time CMs. In the comparisons reported in the paper, TrigFlow outperforms prior formulations, especially in how it handles time derivatives in the diffusion process.
In previous work, the parameterization of CMs and the diffusion process adopted in EDM relied on complicated relationships involving the standard deviation of the noise and the time variable. TrigFlow eliminates this complexity by using trigonometric functions, cosine and sine, to model the skip and output coefficients, and it simplifies the diffusion process by defining the noisy sample as a trigonometric combination of the original data and the noise.
By doing so, the TrigFlow framework ensures that the variance across time steps remains stable throughout the process, thereby improving model performance and robustness.
TrigFlow’s integration of flow matching with trigonometric time evolution and v-prediction parameterization enhances training efficiency while maintaining the desirable characteristics of EDM, providing a more straightforward and interpretable diffusion framework.
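To make this concrete, the snippet below sketches what a flow-matching (v-prediction) training loss under such a trigonometric interpolation could look like in PyTorch. The function name, the data standard deviation, and the model interface are illustrative assumptions, not the authors' implementation.

```python
import torch

def trigflow_v_loss(model, x0, sigma_d=0.5):
    """Illustrative flow-matching / v-prediction loss under a trigonometric
    interpolation between data and noise (names and defaults are assumptions)."""
    # Sample a time in [0, pi/2] and Gaussian noise scaled by the data std.
    t = torch.rand(x0.shape[0], device=x0.device) * (torch.pi / 2)
    z = torch.randn_like(x0) * sigma_d
    t_b = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over non-batch dims

    # Trigonometric noisy sample: x_t = cos(t) * x0 + sin(t) * z.
    x_t = torch.cos(t_b) * x0 + torch.sin(t_b) * z

    # Velocity (v-prediction) target: dx_t/dt = -sin(t) * x0 + cos(t) * z.
    v_target = -torch.sin(t_b) * x0 + torch.cos(t_b) * z

    # The network predicts the velocity; train with a simple squared error.
    v_pred = model(x_t, t)
    return torch.mean((v_pred - v_target) ** 2)
```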
Moreover, the team proposes several improvements to address the instability often encountered during continuous-time CM training.
They suggest a parameterization that eliminates the instability caused by time derivatives and introduce techniques like adaptive double normalization and simplified positional embeddings to enhance stability.
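As a rough illustration of these two ideas, the sketch below pairs a plain sinusoidal time embedding with an adaptive group-normalization block whose conditioning-derived scale and shift are themselves normalized. The class names, frequency range, and normalization choice are hypothetical stand-ins for the paper's "simplified positional embeddings" and "adaptive double normalization", not its actual implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SinusoidalTimeEmbedding(nn.Module):
    """Plain sinusoidal time embedding with a modest frequency range
    (illustrative stand-in for a simplified positional embedding)."""
    def __init__(self, dim: int, max_freq: float = 1000.0):
        super().__init__()
        half = dim // 2
        freqs = torch.exp(torch.linspace(0.0, math.log(max_freq), half))
        self.register_buffer("freqs", freqs)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        angles = t[:, None] * self.freqs[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class DoublyNormalizedAdaGN(nn.Module):
    """Adaptive group norm whose conditioning-derived scale and shift are
    also normalized, keeping the modulation magnitudes bounded
    ('double normalization' in spirit; details are assumed)."""
    def __init__(self, channels: int, emb_dim: int, groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.to_scale_shift = nn.Linear(emb_dim, 2 * channels)

    def forward(self, x: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(emb).chunk(2, dim=-1)
        scale = F.normalize(scale, dim=-1)  # bound the scale magnitude
        shift = F.normalize(shift, dim=-1)  # bound the shift magnitude
        h = self.norm(x)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```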
Stabilizing the training dynamics allows continuous-time CMs to perform comparably to discrete-time models without compromising on the accuracy and computational efficiency of diffusion training.
Empirical results demonstrate that TrigFlow outperforms prior methods by offering a more robust and efficient framework for CM training, particularly in cases where previous models experienced numerical instabilities.
Efficient Continuous-Time Models
The study evaluates these improvements in the authors' simplified and stabilized CMs (sCMs), trained at scale on several challenging datasets, and emphasizes the importance of computing the tangent term accurately for numerical precision, as well as supporting memory-efficient attention computation.
Training large-scale diffusion models already relies on optimizations such as half-precision (FP16) arithmetic and Flash Attention. Within that regime, computing the tangent precisely, especially as the time variable approaches the endpoints 0 or π/2 (90 degrees), is vital for training stability.
The paper introduces a novel approach to tangent computation based on rearranging the Jacobian-vector product (JVP). This mitigates overflow in intermediate layers by applying trigonometric transformations that stabilize gradients during training. Specifically, the rearrangement accounts for the fact that tangent terms can overflow at certain time points, and carefully normalizing those terms enables more stable computation across layers.
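As a rough illustration of the general idea, not the authors' exact rearrangement, the sketch below obtains the time derivative of the network output via a Jacobian-vector product, evaluates it in full precision, and normalizes it before use. The helper name, the normalization constant, and the per-sample norm are assumptions.

```python
import torch
from torch.func import jvp

def normalized_time_tangent(model_fn, x_t, t, c: float = 0.1):
    """Compute d/dt model_fn(x_t, t) with a Jacobian-vector product and
    normalize it to keep magnitudes bounded under low-precision training.
    Illustrative sketch; `c` and the normalization scheme are assumptions."""
    # Evaluate the JVP in float32 to avoid FP16 overflow in intermediate layers.
    x32, t32 = x_t.float(), t.float()
    out, tangent = jvp(lambda tt: model_fn(x32, tt), (t32,), (torch.ones_like(t32),))
    # Per-sample normalization: tangent / (||tangent|| + c).
    flat = tangent.flatten(1)
    norm = flat.norm(dim=1).view(-1, *([1] * (tangent.dim() - 1)))
    return out, tangent / (norm + c)
```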
Experiments focused on consistency training (sCT) and consistency distillation (sCD) demonstrated that sCMs could produce high-quality samples on datasets like CIFAR-10 and ImageNet at reduced computational costs.
The training strategy allowed sCD models to converge rapidly, achieving comparable results to teacher models while utilizing less than 20% of the teacher's training compute.
Additionally, benchmarks revealed that sCMs consistently outperformed previous few-step methods, achieving competitive Fréchet inception distance (FID) scores against adversarial models and demonstrating significant reductions in FID values as model size increased.
The research also explored the scalability of continuous-time CMs, showcasing successful training across various configurations without introducing instability.
Findings indicated that while sCT is more efficient at smaller scales, it struggles with increased variance at larger scales, whereas sCD maintains consistent performance across different sizes.
Comparisons with variational score distillation (VSD) highlighted that sCD's two-step approach resulted in better fidelity and diversity scores, minimizing the risk of mode collapse typically associated with high guidance scales in diffusion models.
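For intuition about the two-step procedure, the sketch below denoises pure noise in one model call, partially re-noises the result at an intermediate time, and denoises once more. The intermediate time, noise scale, and function interface are illustrative assumptions consistent with the trigonometric noising described earlier, not the paper's settings.

```python
import torch

@torch.no_grad()
def two_step_sample(consistency_fn, shape, sigma_d=0.5, t_mid=1.1, device="cpu"):
    """Illustrative two-step consistency sampling (parameter values assumed)."""
    t_max = torch.tensor(torch.pi / 2, device=device)
    # Step 1: map pure noise at t = pi/2 directly to a clean sample estimate.
    x = consistency_fn(sigma_d * torch.randn(shape, device=device), t_max)
    # Partially re-noise the estimate at an intermediate time t_mid < pi/2.
    t_mid = torch.tensor(t_mid, device=device)
    x_noised = torch.cos(t_mid) * x + torch.sin(t_mid) * sigma_d * torch.randn(shape, device=device)
    # Step 2: denoise once more from the intermediate time.
    return consistency_fn(x_noised, t_mid)
```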
The study underscores the impact of these technical innovations, particularly tangent normalization and adaptive weighting, which make it possible to scale continuous-time CMs beyond what was previously feasible, and it highlights sCD's effectiveness and scalability for training high-quality generative models.
Conclusion
To sum up, the improved formulations, architectures, and training objectives simplified and stabilized the training of continuous-time CMs, allowing scaling up to 1.5 billion parameters on ImageNet 512×512. The study ablated the impact of the TrigFlow formulation, tangent normalization using JVP rearrangement, and adaptive weighting, confirming their effectiveness.
By combining these improvements, the method demonstrated predictable scalability across datasets and model sizes, outperforming other few-step sampling approaches at larger scales. Notably, two-step generation narrowed the FID gap with the teacher diffusion model to within 10%, whereas state-of-the-art diffusion models require significantly more sampling steps.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Lu, C., & Song, Y. (2024). Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models. arXiv. https://arxiv.org/abs/2410.11081