MIT and NVIDIA Create Lightning-Fast AI That Generates Ultra-Realistic Images

Massachusetts Institute of Technology, Mar 20 2025

A new AI model called HART blends speed with detail, generating photorealistic images nine times faster than current models—paving the way for smarter simulations, self-driving cars, and next-gen creative tools.

Research: HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

The ability to generate high-quality images quickly is crucial for producing realistic simulated environments. These environments can be used to train self-driving cars to avoid unpredictable hazards, making them safer on real streets.

However, the generative AI techniques increasingly being used to produce such images have drawbacks. One popular type of model, called a diffusion model, can create stunningly realistic images but is too slow and computationally intensive for many applications. On the other hand, the autoregressive models that power LLMs like ChatGPT are much faster, but they produce poorer-quality images that are often riddled with errors.

Researchers from MIT and NVIDIA developed a new approach that combines the best of both methods. Their hybrid image-generation tool uses an autoregressive model to quickly capture the big picture and a small diffusion model to refine the details of the image.

Their tool, known as HART (short for Hybrid Autoregressive Transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models, and it does so about nine times faster.

The generation process consumes fewer computational resources than typical diffusion models, enabling HART to run locally on a commercial laptop or smartphone. To generate an image, a user only needs to enter one natural language prompt into the HART interface.

HART could have a wide range of applications, such as helping researchers train robots to complete complex real-world tasks and aiding designers in producing striking scenes for video games.

"If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and refine the image with smaller brush strokes, your painting could look much better. That is the basic idea with HART," says Haotian Tang, PhD '25, co-lead author of a new paper on HART.

He is joined by co-lead author Yecheng Wu, an undergraduate student at Tsinghua University; senior author Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and a distinguished scientist of NVIDIA; and others at MIT, Tsinghua University, and NVIDIA. The research will be presented at the International Conference on Learning Representations.

The best of both worlds

Popular diffusion models, such as Stable Diffusion and DALL-E, are known to produce highly detailed images. These models generate images through an iterative process in which they predict some amount of random noise on each pixel, subtract the noise, and then repeat the process of predicting and "de-noising" multiple times until they generate a new image that is completely free of noise.
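
The predict-and-subtract loop described above can be illustrated with a toy sketch. It is not how Stable Diffusion or DALL-E is implemented: the four-value "image", the `toy_denoise` function, and the `oracle_noise` predictor (which cheats by knowing the clean image) are all hypothetical stand-ins for a trained neural network and a real noise schedule.

```python
import random

def toy_denoise(noisy, predict_noise, steps=30):
    """Repeatedly predict the noise on every pixel and subtract part of it."""
    image = list(noisy)
    for step in range(steps):
        noise = predict_noise(image)       # one prediction per pixel
        fraction = 1.0 / (steps - step)    # remove all remaining noise by the last step
        image = [p - fraction * n for p, n in zip(image, noise)]
    return image

# Hypothetical 4-pixel "image" with Gaussian noise added to it.
clean = [0.2, 0.8, 0.5, 0.1]
rng = random.Random(0)
noisy = [c + rng.gauss(0.0, 1.0) for c in clean]

def oracle_noise(image):
    """Stand-in for a trained model: returns the exact noise left on each pixel."""
    return [p - c for p, c in zip(image, clean)]

result = toy_denoise(noisy, oracle_noise, steps=30)
print([round(v, 3) for v in result])
```

Every step touches every pixel, so the cost grows with both the image size and the number of steps, which is the source of the slowness discussed next.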

Because the diffusion model de-noises all pixels in an image at each step, and there may be 30 or more steps, the process is slow and computationally expensive. But because the model has multiple chances to correct details it got wrong, the images are high-quality.

Autoregressive models, commonly used for predicting text, can generate images by predicting patches of an image sequentially, a few pixels at a time. They can't go back and correct their mistakes, but the sequential prediction process is much faster than diffusion.
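
As a rough illustration of sequential prediction, the sketch below generates tokens one at a time, with each new token depending only on the tokens already produced. The `toy_next_token` rule and its eight-token vocabulary are hypothetical placeholders for a trained transformer.

```python
def generate_tokens(next_token, length, start=0):
    """Generate a sequence one token at a time, conditioning on earlier tokens."""
    tokens = [start]
    while len(tokens) < length:
        # Each token is committed as soon as it is predicted and never revised.
        tokens.append(next_token(tokens))
    return tokens

def toy_next_token(prefix):
    """Hypothetical stand-in for a model: a fixed rule over an 8-token vocabulary."""
    return (prefix[-1] * 3 + len(prefix)) % 8

tokens = generate_tokens(toy_next_token, length=6)
print(tokens)  # [0, 1, 5, 2, 2, 3]
```

Each position is visited once, which is why this style of generation is fast, and also why an early mistake stays in the output.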

These models use representations known as tokens to make predictions. An autoregressive model utilizes an autoencoder to compress raw image pixels into discrete tokens and reconstruct the image from predicted tokens. While this boosts the model's speed, the information loss that occurs during compression causes errors when the model generates a new image.

With HART, the researchers developed a hybrid approach that uses an autoregressive model to predict compressed, discrete image tokens and then a small diffusion model to predict residual tokens. Residual tokens compensate for the model's information loss by capturing details left out by discrete tokens.
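
A minimal numeric sketch of the idea, under simplifying assumptions: the `CODEBOOK` values and the four-number `latent` are invented for illustration, and the residual is computed exactly here rather than predicted by a diffusion model as in HART.

```python
# Hypothetical codebook: the only values a discrete token can stand for.
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]

def quantize(latent):
    """Snap each continuous value to its nearest codebook entry."""
    return [min(CODEBOOK, key=lambda c: abs(c - v)) for v in latent]

latent = [0.10, 0.62, 0.93, 0.40]        # continuous features from an encoder
coarse = quantize(latent)                 # what discrete tokens alone can express
residual = [v - c for v, c in zip(latent, coarse)]    # detail lost to compression
refined = [c + r for c, r in zip(coarse, residual)]   # coarse picture plus detail

print(coarse)  # [0.0, 0.5, 1.0, 0.5]
print([round(r, 2) for r in residual])
```

The residuals are small compared with the values themselves, which is consistent with the article's point that the refinement stage has an easier job than generating a whole image.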

"We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like edges of an object, or a person's hair, eyes, or mouth. These are places where discrete tokens can make mistakes," says Tang.

Because the diffusion model only predicts the remaining details after the autoregressive model has done its job, it can accomplish the task in eight steps instead of the usual 30 or more a standard diffusion model requires to generate an entire image. This minimal overhead of the additional diffusion model allows HART to retain the speed advantage of the autoregressive model while significantly enhancing its ability to generate intricate image details.

"The diffusion model has an easier job to do, which leads to more efficiency," he adds.

On-device demo for HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Outperforming larger models

During the development of HART, the researchers encountered challenges in effectively integrating the diffusion model to enhance the autoregressive model. They found that incorporating the diffusion model in the early stages of the autoregressive process resulted in an accumulation of errors. Their final design instead applies the diffusion model only as the last step, to predict residual tokens, which significantly improved generation quality.

Their method, which combines an autoregressive transformer model with 700 million parameters and a lightweight diffusion model with 37 million parameters, can generate images of the same quality as those created by a diffusion model with 2 billion parameters, but it does so about nine times faster. It uses about 31 percent less computation than state-of-the-art models.
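
The parameter counts quoted above can be put side by side with a few lines of arithmetic. The sketch uses only the three figures stated in the article; the variable names are illustrative.

```python
AR_PARAMS = 700_000_000          # autoregressive transformer
DIFFUSION_PARAMS = 37_000_000    # lightweight diffusion model
BASELINE_PARAMS = 2_000_000_000  # diffusion model of comparable image quality

hart_total = AR_PARAMS + DIFFUSION_PARAMS
share_of_baseline = hart_total / BASELINE_PARAMS
diffusion_share = DIFFUSION_PARAMS / hart_total

print(f"HART total: {hart_total:,} parameters")
print(f"Fraction of the baseline's size: {share_of_baseline:.3f}")
print(f"Fraction of HART held by the diffusion model: {diffusion_share:.3f}")
```

By these figures, HART has 737 million parameters in total, a little over a third of the 2-billion-parameter baseline, and the diffusion component accounts for roughly 5 percent of them.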

Moreover, because HART uses an autoregressive model to do the bulk of the work, the same type of model that powers LLMs, it is better suited for integration with the new class of unified vision-language generative models. In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture.

"LLMs are a good interface for all sorts of models, like multimodal models and models that can reason. This is a way to push the intelligence to a new frontier. An efficient image-generation model would unlock a lot of possibilities," he says.

In the future, the researchers want to go down this path and build vision-language models on top of the HART architecture. Since HART is scalable and generalizable to multiple modalities, they also want to apply it to video generation and audio prediction tasks.

This research was partially funded by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the National Science Foundation. NVIDIA donated the GPU infrastructure for training this model.

Journal reference:

Preliminary scientific report. Tang, H., Wu, Y., Yang, S., Xie, E., Chen, J., Chen, J., Zhang, Z., Cai, H., Lu, Y., & Han, S. (2024). HART: Efficient Visual Generation with Hybrid Autoregressive Transformer. ArXiv. https://arxiv.org/abs/2410.10812