New Method Boosts Object Counting Accuracy in Image Generation Models

Download PDF Copy

By Soham NandiReviewed by Joel ScanlonSep 29 2024

With a dynamic system that refines object counts during image generation, this breakthrough method solves a persistent challenge in AI, offering faster and more accurate results in generating images with the correct number of objects.

Research: Iterative Object Count Optimization for Text-to-image Diffusion Models

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv preprint* server, researchers proposed a novel solution to address the challenge of accurately generating a specified number of objects in text-to-image models. They introduced an optimization technique that uses a counting loss derived from a counting model. This approach enhanced accuracy by dynamically adjusting hyperparameters and modifying the text conditioning embedding, offering a plug-and-play solution to improve object counting in generated images with minimal additional training.

Background

Image generation models have significantly advanced, producing high-quality images through conditioning mechanisms. These models have wide applications, such as three-dimensional (3D) modeling, image editing, and improving classifiers. However, accurately generating a specific number of objects based on textual input remains a challenge.

When asked to depict a precise number, current models often generate more or fewer objects than requested. Previous efforts in visual question answering and text-image matching also faced this limitation, requiring specialized mechanisms to address counting challenges. While methods like classifier guidance and diffusion processes have been explored, they often rely on curated datasets, which limits their scalability and adaptability.

This paper introduced a novel solution that iteratively optimizes image generation with a counting loss based on object potentials. They addressed gaps by allowing for a dynamic adjustment of counting parameters during the image generation process. The proposed method leveraged pre-trained models and incrementally refined the counting accuracy, enabling precise object generation.

Additionally, the approach provided a zero-shot, plug-and-play solution that integrated seamlessly into existing generation models, filling the gaps left by previous work. Importantly, the method demonstrated significant improvements in counting accuracy without the need for additional fine-tuning or custom datasets.

We propose a plug-and-play optimization of object counting accuracy of a text-to-image model based on detection models.

We propose a plug-and-play optimization of object counting accuracy of a text-to-image model based on detection models.

Optimizing Object Count in Text-to-Image Models

The method outlined a strategy to steer text-to-image models by adjusting embeddings of a newly introduced "counting token." The process involved optimizing the counting token through an iterative diffusion model to control the number of objects generated in an image. The key aspect of the model involved generating images from random noise using a denoising diffusion process and fine-tuning them based on a target number of objects.

Our method optimizes the embeddings of a newly added token, namely the counting token. The optimization is done iteratively, where at each step, we calculate the number of objects from the desired class and derive a loss function. Once converged, the optimized token can be reused in different prompts to correct the number of depicted objects, as shown on the right.

A differentiable counting function, such as contrastive language–image pre-training (CLIP)-count, was employed to achieve accuracy. This function estimated object counts within generated images. The counting loss was minimized by comparing the estimated and desired object counts. A detection model, such as you-only-look-once (YOLO), was used to dynamically adjust the scaling factors of the potential maps to improve accuracy.

To ensure the overall image quality, a semantic loss using CLIP matching was introduced to ensure that the generated objects remained true to their intended class. The optimization process was completed by updating the embeddings of the counting token until the required accuracy was reached, with recent fast diffusion models allowing optimization in fewer steps, resulting in efficient outcomes.

Results and Benchmark Evaluation

The authors introduced a benchmark for evaluating object counting models, utilizing the few-shot object counting task (FSC)-147 dataset containing 147 categories. The team generated 3,674 samples, with 25 examples for each object class, ranging from one to 25 objects. Two detection methods, CLIP-count and YOLO, were employed to assess accuracy.

In comparison, the proposed method, which used static and dynamic scaling, outperformed baseline models like ControlNet and stable diffusion (SD). With a mean absolute error (MAE) of 4.13 using CLIP-count, the static scaling method achieved a significantly lower error than the 15.6 error of SD. For YOLO-based detection, dynamic scaling reduced errors, improving accuracy in counting generated objects.

Human annotators confirmed the method’s superior performance in depicting the correct object count and naturalness of images. The model was chosen 88% of the time for object count accuracy and 79.7% for image naturalness. Furthermore, the study highlighted the flexibility of the method, with trained tokens demonstrating robust performance across different object categories. Notably, in-domain token reuse performed slightly better than out-of-domain. The authors highlighted the novel ability to fine-tune image generation models for precise object counting, addressing a key challenge in generative artificial intelligence (AI).

Conclusion

In conclusion, the researchers developed a novel approach to enhance object count accuracy in text-to-image models using a counting loss mechanism derived from a counting model. Their method dynamically adjusted hyperparameters and modified text conditioning embeddings, optimizing object counting during image generation.

By refining object count estimates iteratively, this approach improved accuracy without needing additional optimization, showing clear improvements over previous methods. It also enabled the reuse of counting tokens across different classes, offering flexibility. This method addressed common issues in generative models, such as spatial relationships and compositional challenges, advancing plug-and-play techniques for object counting in AI-generated images.

Source:

Iterative Object Count Optimization for Text-to-image Diffusion Models - https://ozzafar.github.io/count_token

Journal reference:

Preliminary scientific report. Zafar, O., Wolf, L., & Schwartz, I. (2024). Iterative Object Count Optimization for Text-to-image Diffusion Models. ArXiv. DOI: abs/2408.11721, https://arxiv.org/abs/2408.11721

Posted in: AI Research News

Comments (0)

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.

Download PDF Copy

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

APA
Nandi, Soham. (2024, September 29). New Method Boosts Object Counting Accuracy in Image Generation Models. AZoAi. Retrieved on July 18, 2025 from https://www.azoai.com/news/20240929/New-Method-Boosts-Object-Counting-Accuracy-in-Image-Generation-Models.aspx.
MLA
Nandi, Soham. "New Method Boosts Object Counting Accuracy in Image Generation Models". AZoAi. 18 July 2025. <https://www.azoai.com/news/20240929/New-Method-Boosts-Object-Counting-Accuracy-in-Image-Generation-Models.aspx>.
Chicago
Nandi, Soham. "New Method Boosts Object Counting Accuracy in Image Generation Models". AZoAi. https://www.azoai.com/news/20240929/New-Method-Boosts-Object-Counting-Accuracy-in-Image-Generation-Models.aspx. (accessed July 18, 2025).
Harvard
Nandi, Soham. 2024. New Method Boosts Object Counting Accuracy in Image Generation Models. AZoAi, viewed 18 July 2025, https://www.azoai.com/news/20240929/New-Method-Boosts-Object-Counting-Accuracy-in-Image-Generation-Models.aspx.