InstantDrag Accelerates Image Editing by Eliminating Masks and Text Prompts

Leveraging FlowGen and FlowDiffusion, InstantDrag delivers lightning-fast drag-based image edits, cutting computational load and simplifying the workflow to enable real-time, high-quality, photorealistic results.

Study: InstantDrag: Improving Interactivity in Drag-based Image Editing.

In an article submitted to the arXiv preprint* server, researchers introduced InstantDrag, a drag-based image editing method that improved interactivity and speed by eliminating the need for optimization, masks, and text prompts.

They used two networks, FlowGen and FlowDiffusion, to learn motion dynamics from real-world videos, enabling quick, photorealistic edits. InstantDrag simplified the editing process while maintaining image quality, making it suitable for real-time applications.

Background

While promising for its precision and control, drag-based image editing has struggled to match the speed and interactivity of text-to-image models.

Early techniques, such as DragGAN, introduced the ability to manipulate image structures with pixel-level accuracy. However, these methods relied heavily on optimization techniques, which significantly slowed the editing process.

Additionally, text-guided editing approaches allowed users to modify high-frequency features but often failed to control specific regions precisely, requiring user inputs like masks and text prompts. This increased the complexity and reduced interactivity.

InstantDrag addressed these gaps by proposing an optimization-free pipeline for real-time drag-based image editing. Unlike prior works, InstantDrag eliminated the need for user inputs like masks and text prompts, allowing users to edit images by simply dragging points from source to target locations.

The method split the task into two stages: motion generation, handled by FlowGen, and motion-conditioned image generation, handled by FlowDiffusion. Through these components, the paper achieved faster, high-quality, photo-realistic edits with significantly less computational load, reducing editing time by up to 75 times compared to previous techniques. This made InstantDrag a powerful tool for interactive image editing. 

Proposed Methodology for Motion and Image Generation

The authors presented a novel drag-editing approach, dividing the task into two key components: motion generation and motion-conditioned image generation.

To achieve this, they introduced two networks: FlowGen, a generative adversarial network (GAN)-based model for generating motion, and FlowDiffusion, a diffusion-based network for motion-conditioned image generation.

FlowGen used a Pix2Pix-like architecture with a PatchGAN discriminator, enhanced with patch-based image inpainting, to improve the quality of the generated flow, translating the user's drag input (sparse flow) into dense optical flow. The generator received five channels of input, three for the image and two for the sparse drag instructions, and produced two channels of dense optical flow.
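To make this input-output interface concrete, below is a minimal PyTorch sketch of a FlowGen-style generator. Only the five-in, two-out channel layout follows the description above; the class name, layer configuration, and channel widths are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class FlowGenGenerator(nn.Module):
    """Minimal sketch of a Pix2Pix-style generator interface for FlowGen.

    Input  : 5 channels = RGB image (3) + sparse drag flow (2, i.e., dx/dy)
    Output : 2 channels = dense optical flow (dx/dy)
    The encoder-decoder body is a placeholder; the paper's exact layers may differ.
    """
    def __init__(self, base_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(5, base_channels, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_channels, base_channels * 2, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(base_channels * 2, base_channels, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base_channels, 2, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, image: torch.Tensor, sparse_flow: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W), sparse_flow: (B, 2, H, W) -> dense flow: (B, 2, H, W)
        return self.net(torch.cat([image, sparse_flow], dim=1))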

FlowGen’s training process was further refined by using random sparse flows to enhance robustness, ensuring it could handle diverse drag inputs while maintaining photorealistic motion.

FlowDiffusion was designed to reflect the motion condition in image generation. It used a 10-channel input—four for latent noise, four for the latent image, and two for the optical flow.

Unlike the InstructPix2Pix model, which relied on text prompts, FlowDiffusion used the flow channels to guide the denoising process and maintain consistency everywhere except the dragged regions.
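As a rough illustration of how such a 10-channel input could be assembled, the sketch below concatenates the latent noise, the latent image, and the optical flow resized to the latent resolution. The helper name and the simple bilinear resizing (without rescaling the flow values) are assumptions; the paper's exact conditioning details may differ.

import torch
import torch.nn.functional as F

def assemble_flowdiffusion_input(noisy_latent, image_latent, dense_flow):
    """Sketch of assembling a 10-channel, FlowDiffusion-style UNet input.

    noisy_latent : (B, 4, h, w) latent noise being denoised
    image_latent : (B, 4, h, w) latent of the source image
    dense_flow   : (B, 2, H, W) optical flow at pixel resolution (e.g., from FlowGen)
    """
    h, w = noisy_latent.shape[-2:]
    # Downsample the flow to latent resolution (value rescaling omitted for brevity).
    flow_latent = F.interpolate(dense_flow, size=(h, w), mode="bilinear", align_corners=False)
    return torch.cat([noisy_latent, image_latent, flow_latent], dim=1)  # (B, 10, h, w)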

The design enabled the model to work efficiently by processing motion without relying on additional text inputs, a key advantage that improved both interactivity and speed.

The authors also experimented with guidance scales and conditional dropout to optimize the network.
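The sketch below shows, under stated assumptions, how conditional dropout and InstructPix2Pix-style two-term classifier-free guidance could be applied to an image-plus-flow condition. The dropout rate and guidance scales are placeholders, not the paper's tuned values.

import torch

def apply_conditional_dropout(image_latent, dense_flow, p_drop=0.05):
    # Training-time sketch: occasionally zero out the image and/or flow condition so
    # the model also learns (partially) unconditional behavior, enabling
    # classifier-free guidance at inference. The 5% rate is an assumption.
    if torch.rand(1).item() < p_drop:
        image_latent = torch.zeros_like(image_latent)
    if torch.rand(1).item() < p_drop:
        dense_flow = torch.zeros_like(dense_flow)
    return image_latent, dense_flow

def guided_noise_prediction(eps_uncond, eps_img, eps_full, s_img=1.5, s_flow=3.0):
    # Inference-time sketch: combine an unconditional prediction, an image-conditioned
    # prediction, and a fully (image + flow) conditioned prediction, with separate
    # guidance scales for each condition (placeholder values).
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_flow * (eps_full - eps_img))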

The researchers used real-world video datasets, such as CelebV-Text, and optical flow datasets, like FlyingChairs, to train FlowGen and FlowDiffusion.

These datasets were carefully selected to simulate complex motion dynamics, allowing the networks to learn meaningful motion patterns while ensuring background consistency.

They proposed a sampling strategy to select sparse points, ensuring meaningful motion generation while avoiding localized or undesired movements.
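For intuition, the sketch below shows one way such a sampling strategy could look: candidate points are restricted to regions with sufficient motion (optionally within a foreground mask), and a handful of them are kept as the sparse drag input. The magnitude threshold and random selection are illustrative assumptions, not the authors' exact procedure.

import numpy as np

def sample_sparse_points(dense_flow, mask=None, num_points=5, min_magnitude=1.0, seed=0):
    """Select a few sparse drag points from a dense ground-truth flow field.

    dense_flow : (H, W, 2) optical flow for a training frame pair
    mask       : optional (H, W) foreground mask (e.g., from a detector)
    """
    magnitude = np.linalg.norm(dense_flow, axis=-1)
    valid = magnitude >= min_magnitude          # keep only pixels that actually move
    if mask is not None:
        valid &= mask.astype(bool)              # optionally restrict to the foreground
    ys, xs = np.nonzero(valid)
    if len(ys) == 0:
        return np.empty((0, 2), dtype=int), np.empty((0, 2))
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(ys), size=min(num_points, len(ys)), replace=False)
    coords = np.stack([ys[idx], xs[idx]], axis=1)      # (N, 2) pixel coordinates
    vectors = dense_flow[coords[:, 0], coords[:, 1]]   # (N, 2) drag vectors at those points
    return coords, vectors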

This dual-network approach demonstrated improved interactivity and efficiency in drag-based image editing, allowing users to achieve fine-grained control over motion while maintaining background consistency. 

Experimental Setup and Evaluation

The experiments for face editing were conducted using the CelebV-Text dataset, which consists of 70,000 high-quality video clips sampled at ten frames per second to form eight million frame pairs.

FlowFormer was used to estimate optical flow between these frame pairs, and YOLO (you-only-look-once) was employed for efficient mask generation.

For general scene editing, a two-stage fine-tuning approach was adopted for short videos. A user study was conducted on 22 samples from various domains, with additional qualitative evaluations performed for face manipulation.

Metrics such as peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), learned perceptual image patch similarity (LPIPS), and contrastive language-image pretraining (CLIP) image similarity scores were used for quantitative assessment, and all experiments ran on A6000 graphics processing units (GPUs).
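For reference, these metrics can be computed with common open-source packages. The sketch below, using scikit-image, the lpips package, and a Hugging Face CLIP model, is illustrative and may differ from the authors' exact evaluation setup.

import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

def evaluate_edit(original_path, edited_path):
    orig = np.array(Image.open(original_path).convert("RGB"))
    edit = np.array(Image.open(edited_path).convert("RGB"))

    psnr = peak_signal_noise_ratio(orig, edit, data_range=255)
    ssim = structural_similarity(orig, edit, channel_axis=-1, data_range=255)

    def to_tensor(a):
        # HWC uint8 -> NCHW float in [-1, 1], as expected by lpips
        return torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0

    lpips_score = lpips.LPIPS(net="alex")(to_tensor(orig), to_tensor(edit)).item()

    # CLIP image similarity: cosine similarity between the two image embeddings.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = proc(images=[Image.fromarray(orig), Image.fromarray(edit)], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    clip_sim = (feats[0] @ feats[1]).item()

    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lpips_score, "CLIP-I": clip_sim}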

The authors compared the proposed method, InstantDrag, against models like DragDiffusion and DragonDiffusion. InstantDrag excelled in preserving high-frequency details and generated more plausible motion without requiring inversion or masks.

It demonstrated superior generalizability across different domains, including facial videos and non-facial scenes like cartoons and drawings.

Human evaluations further indicated the model’s strong performance in instruction-following and identity preservation. However, limitations included difficulties with handling large motions and occasional inconsistencies in non-facial scenes.

Nonetheless, the model's strengths lay in its ability to generate realistic, consistent images with fewer user inputs, making it highly interactive and efficient. 

Conclusion

In conclusion, InstantDrag introduced a novel drag-based image editing method that significantly improved interactivity and speed by removing the need for optimization, masks, or text prompts.

By utilizing FlowGen and FlowDiffusion networks, InstantDrag enabled fast, photo-realistic edits, reducing editing time by up to 75 times compared to previous techniques.

The method's intuitive design simplified the editing process while maintaining image quality, making it highly suitable for real-time applications.

Despite limitations in handling large motions, InstantDrag represents a major advancement in real-time interactive image editing, enhancing user experience and efficiency across various domains.

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.

Source:
Journal reference:
  • Preliminary scientific report. Shin, J., Choi, D., & Park, J. (2024). InstantDrag: Improving Interactivity in Drag-based Image Editing. ArXiv.org. DOI: 10.48550/arXiv.2409.08857, https://arxiv.org/abs/2409.08857
Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.
