Lotus Model Advances Dense Prediction with State-of-the-Art Efficiency and Precision

By Soham Nandi. Reviewed by Joel Scanlon. Oct 3, 2024

By rethinking diffusion processes, the Lotus model achieves state-of-the-art performance in zero-shot depth and surface normal estimation, setting a new standard for efficiency and accuracy in dense prediction tasks.

We present Lotus, a diffusion-based visual foundation model for dense geometry prediction. With minimal training data, Lotus achieves SoTA performance in two key geometry perception tasks, i.e., zero-shot depth and normal estimation. “Avg. Rank” indicates the average ranking across all metrics, where lower values are better. Bar length represents the amount of training data used.

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv* preprint server, researchers presented "Lotus," a diffusion-based visual foundation model designed to improve dense prediction tasks by addressing limitations in existing diffusion methods. By incorporating an optimized parameterization strategy and offering both generative and discriminative variants, Lotus enhanced both prediction quality and efficiency. It achieved state-of-the-art performance in zero-shot depth and normal estimation without increasing data or model size and significantly boosted inference speed for applications like three-dimensional (3D) reconstruction.

Lotus introduces a pivotal change to the standard diffusion parameterization by using x0-prediction instead of the typical ϵ-prediction. This strategy directly predicts the clean signal rather than the noise, reducing variance and improving prediction stability, especially in dense prediction tasks. By reformulating the diffusion process around directly predicting annotations, Lotus improves both prediction quality and efficiency.
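The difference between the two parameterizations can be sketched in a few lines. This is a minimal illustration, assuming the standard forward process x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ϵ, which the article does not spell out. The function names and the `alpha_bar` value are hypothetical and are not taken from the Lotus code.

```python
import math
import random


def forward_diffuse(x0, alpha_bar, rng):
    """Noise a clean signal: x_t = sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(x0, eps)
    ]
    return x_t, eps


def training_target(x0, eps, parameterization):
    """Return what the network is asked to regress under each parameterization."""
    if parameterization == "x0":
        return list(x0)   # predict the clean signal (the annotation) directly
    if parameterization == "epsilon":
        return list(eps)  # predict the noise that was added
    raise ValueError(f"unknown parameterization: {parameterization}")


rng = random.Random(0)
clean = [0.1, 0.5, 0.9]  # toy stand-in for an annotation latent (assumption)
noisy, noise = forward_diffuse(clean, alpha_bar=0.5, rng=rng)
print("x0 target:     ", training_target(clean, noise, "x0"))
print("epsilon target:", training_target(clean, noise, "epsilon"))
```

Under x0-prediction the regression target is the annotation itself, so the target does not change with the sampled noise. Under ϵ-prediction the target is a fresh random draw for every training example.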

Moreover, the model comes in two variants: Lotus-G (generative) and Lotus-D (discriminative). While Lotus-G employs a stochastic generative process with noise input, allowing it to provide uncertainty estimates, Lotus-D eliminates the noise input and focuses on deterministic predictions, providing faster and more stable results. This flexibility ensures that the model can be tailored for different applications, such as 3D reconstruction or single-view estimation.
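How a stochastic variant can yield an uncertainty estimate while a deterministic one cannot is easy to show with a toy example. The sketch below does not run Lotus. `stochastic_predict` and `deterministic_predict` are hypothetical stand-ins, and taking the per-pixel standard deviation across repeated runs is an assumed way of summarizing uncertainty.

```python
import random
import statistics


def stochastic_predict(pixels, rng, noise_scale=0.05):
    """Hypothetical stand-in for a generative model: output varies with sampled noise."""
    return [p + rng.gauss(0.0, noise_scale) for p in pixels]


def deterministic_predict(pixels):
    """Hypothetical stand-in for a discriminative model: same input, same output."""
    return list(pixels)


def uncertainty_map(predictions):
    """Per-pixel standard deviation across several predictions of the same input."""
    return [statistics.pstdev(values) for values in zip(*predictions)]


rng = random.Random(0)
row = [1.0, 2.0, 3.0]  # toy one-row "depth map"
generative_runs = [stochastic_predict(row, rng) for _ in range(10)]
discriminative_runs = [deterministic_predict(row) for _ in range(10)]

print("generative uncertainty:    ", uncertainty_map(generative_runs))
print("discriminative uncertainty:", uncertainty_map(discriminative_runs))
```

The deterministic runs are identical, so their spread is zero everywhere. Only the stochastic runs produce a non-trivial map.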

Background

Dense prediction is a critical task in computer vision, with applications spanning 3D/four-dimensional (4D) reconstruction, tracking, and autonomous driving. While traditional deep learning approaches are advancing the field, they are limited by the scale and diversity of available training data, which leads to poor zero-shot generalization.

Recent studies have explored leveraging diffusion priors from pre-trained text-to-image diffusion models to enhance dense prediction performance. Models like Stable Diffusion, trained on large image datasets, have shown promise for these tasks. However, prior methods such as Marigold and GeoWizard fine-tuned these models for dense prediction tasks without optimizing the diffusion formulation. This approach failed to address the inherent differences between image generation and dense prediction, leading to inefficiencies, poor optimization, and vague outputs, particularly in detailed areas.

This paper introduced Lotus, a diffusion-based visual foundation model explicitly designed for dense prediction. Unlike previous models, Lotus avoided noise prediction, which causes high variance, and adopted a single-step diffusion process, significantly improving optimization and inference speed. Additionally, the paper proposed a novel detail preserver tuning strategy to maintain fine-grained accuracy without adding network parameters. As a result, Lotus achieved state-of-the-art performance in zero-shot depth and surface normal estimation while being far more efficient than existing models.

The model’s efficiency is further enhanced by reducing the number of time steps in the diffusion process. Instead of employing multi-step diffusion, Lotus uses a single-step formulation, which simplifies the optimization process and minimizes error propagation, enabling faster adaptation and higher efficiency in dense prediction tasks.

Methodology for Diffusion-Based Dense Prediction

The researchers outlined a diffusion formulation for dense prediction tasks, leveraging the principles of Stable Diffusion, which operated in a low-dimensional latent space for efficiency. They introduced a dual auto-encoder system for mapping between red-green-blue (RGB) and latent spaces, enabling effective annotation generation.

Depth maps of multiple inferences and uncertainty maps. Areas like the sky, object edges, and intricate details (e.g., cat whiskers) typically exhibit high uncertainty.

The diffusion process included forward noise addition and reverse denoising, with two primary parameterizations: ϵ-prediction, where the network predicts the added noise, and x0-prediction, where it predicts the clean sample. The former is standard for image generation but proved less effective for dense prediction because it produces larger pixel variance, leading to unstable predictions. The switch to x0-prediction allows Lotus to avoid this pitfall, offering more stable and accurate predictions.
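One way to build intuition for the instability of noise prediction is to look at how the clean signal is recovered from a predicted noise term. The sketch below assumes the standard forward process x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ϵ. The `alpha_bar` values are illustrative and are not the paper's noise schedule, and this may not be the exact analysis the authors give.

```python
import math


def x0_from_epsilon(x_t, eps_hat, alpha_bar):
    """Invert the forward process to recover x0 from a predicted noise value."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps_hat) / math.sqrt(alpha_bar)


def error_amplification(alpha_bar):
    """Factor by which an error in the predicted noise is scaled in the recovered x0."""
    return math.sqrt((1.0 - alpha_bar) / alpha_bar)


x0, eps = 0.5, 1.0
for alpha_bar in (0.5, 0.1, 0.01):  # smaller alpha_bar means a noisier time step
    x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
    x0_hat = x0_from_epsilon(x_t, eps + 0.01, alpha_bar)  # noise estimate off by 0.01
    print(alpha_bar, abs(x0_hat - x0), error_amplification(alpha_bar))
```

At noisy time steps ᾱ_t is small, so even a small error in the predicted noise is multiplied by a large factor once it is converted back to the clean signal. Predicting x0 directly avoids that conversion.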

The authors proposed enhancements to the model through two key methodologies: adjusting the number of time steps and implementing a detail preserver. By reducing training time steps to a single step, the model became more efficient and resilient to error propagation, optimizing performance in dense prediction tasks. This simplification enhanced adaptation and minimized computational load, making it practical for real-world applications.

Additionally, the detail preserver was introduced to maintain fine-grained details during predictions, addressing challenges in rendering intricate areas accurately. Overall, the findings advocated for refined parameterization and optimized training strategies in diffusion-based models, aiming for improved accuracy and efficiency in dense prediction tasks.

Experimental Setup and Performance Evaluation

In the experiments conducted for Lotus, the model was based on Stable Diffusion V2, with text conditioning disabled. The training setup involved fixing the time step at 1000 and using the Adam optimizer with a learning rate of 3×10⁻⁵. Experiments were performed on eight NVIDIA A800 graphics processing units (GPUs) with a batch size of 128. The discriminative variant was trained for 4,000 steps (~8.1 hours), while the generative variant extended to 10,000 steps (~20.3 hours). This dual approach of training a discriminative and generative model highlights Lotus’s adaptability and robustness across tasks.
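The reported step counts, batch size, and wall-clock times imply a training budget that is easy to work out. The short calculation below uses only the figures quoted above and treats the batch size of 128 as the total per optimization step, which is an assumption. The helper names are hypothetical.

```python
def samples_seen(steps, batch_size):
    """Total training samples processed, counting repeats."""
    return steps * batch_size


def seconds_per_step(hours, steps):
    """Average wall-clock time per optimization step."""
    return hours * 3600.0 / steps


BATCH_SIZE = 128
runs = {
    "discriminative": {"steps": 4_000, "hours": 8.1},
    "generative": {"steps": 10_000, "hours": 20.3},
}

for name, run in runs.items():
    total = samples_seen(run["steps"], BATCH_SIZE)
    pace = seconds_per_step(run["hours"], run["steps"])
    print(f"{name}: {total:,} samples, {pace:.2f} s/step")
```

Both variants work out to roughly 7.3 seconds per step, so the longer generative run reflects the larger number of steps rather than a slower step.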

The training data included two synthetic datasets, namely, Hypersim, which covered indoor scenes with around 39 thousand samples, and Virtual KITTI, featuring outdoor scenes with 20 thousand samples. A mixed dataset strategy was used, selecting 90% Hypersim and 10% Virtual KITTI for each batch, yielding better performance on real datasets.
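The 90/10 mixing strategy can be sketched as a simple batch sampler. This is an illustration only: the article does not say whether the split is enforced as fixed counts per batch or as a per-sample probability, so the fixed-count rounding below is an assumption, and `mixed_batch` is a hypothetical helper.

```python
import random


def mixed_batch(hypersim, vkitti, batch_size, hypersim_fraction, rng):
    """Draw one batch with a fixed share of samples from each dataset."""
    n_hypersim = round(batch_size * hypersim_fraction)
    n_vkitti = batch_size - n_hypersim
    batch = rng.sample(hypersim, n_hypersim) + rng.sample(vkitti, n_vkitti)
    rng.shuffle(batch)  # avoid a fixed dataset order inside the batch
    return batch


# Toy index lists standing in for the two synthetic datasets.
hypersim = [("hypersim", i) for i in range(1000)]
vkitti = [("vkitti", i) for i in range(500)]

batch = mixed_batch(hypersim, vkitti, batch_size=128, hypersim_fraction=0.9,
                    rng=random.Random(0))
counts = {name: sum(1 for source, _ in batch if source == name)
          for name in ("hypersim", "vkitti")}
print(counts)  # {'hypersim': 115, 'vkitti': 13}
```

With a batch size of 128, a 90% share rounds to 115 indoor samples and leaves 13 outdoor samples per batch.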

For evaluation, the model was tested on several datasets for both depth estimation and surface normal prediction. Metrics included the mean absolute relative error (AbsRel) for depth and the mean angular error for surface normals, among others. Lotus-G outperformed all generative baselines in zero-shot affine-invariant depth estimation while requiring only a single denoising step, which significantly improved inference speed. Lotus variants also ranked highly in surface normal prediction, achieving better performance across multiple datasets compared to state-of-the-art methods.
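The two named metrics have standard definitions, sketched below in plain Python. This is not the paper's evaluation code: it omits valid-pixel masking and the scale-and-shift alignment that affine-invariant depth evaluation would apply before computing AbsRel.

```python
import math


def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt) over pixels."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle, in degrees, between predicted and ground-truth normal vectors."""
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
        total += math.degrees(math.acos(cosine))
    return total / len(gt_normals)


# Toy values: each depth is off by 10%; one normal matches, one is perpendicular.
print(abs_rel([2.2, 3.6], [2.0, 4.0]))
print(mean_angular_error_deg([(0, 0, 1), (1, 0, 0)], [(0, 0, 1), (0, 0, 1)]))
```

Lower is better for both metrics. AbsRel is unitless, while the angular error is reported in degrees.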

Journal reference:

Preliminary scientific report. He, J., Li, H., Yin, W., Liang, Y., Li, L., Zhou, K., Liu, H., Liu, B., & Chen, Y.-C. (2024). Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction. ArXiv.org. DOI: 10.48550/arXiv.2409.18124, https://arxiv.org/abs/2409.18124

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.