PANDORAPixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal

Dinh-Khoi Vo^1,2Van-Loc Nguyen^1,2Tam V. Nguyen³Minh-Triet Tran^1,2Trung-Nghia Le^1,2

¹University of Science, Ho Chi Minh City, Vietnam²Vietnam National University, Ho Chi Minh City, Vietnam³University of Dayton, Ohio, United States

{vdkhoi, nvloc}@selab.hcmus.edu.vn, tamnguyen@udayton.edu, {tmtriet, ltnghia}@fit.hcmus.edu.vn

AcceptedIEEE ICME 2026

🧾arXiv (Paper)🧾arXiv (Demo)SOON 💻Code 📦Dataset 🚀Gradio Demo 💬Feedback

Abstract

Removing objects from natural images remains a formidable challenge, often hindered by the inability to synthesize semantically appropriate content in the foreground while preserving background integrity. Existing methods often rely on fine-tuning, prompt engineering, or inference-time optimization, yet still struggle to maintain texture consistency, produce rigid or unnatural results, lack precise foreground-background disentanglement, and fail to flexibly handle multiple objects—ultimately limiting their scalability and practical applicability. In this paper, we propose a zero-shot object removal framework that operates directly on pre-trained text-to-image diffusion models—requiring no fine-tuning, no prompts, and no optimization. At the core is our Pixel-wise Attention Dissolution, which performs fine-grained, pixel-wise dissolution of object information by nullifying the most correlated keys for each masked pixel. This operation causes the object to vanish from the self-attention flow, allowing the coherent background context to seamlessly dominate the reconstruction. To complement this, we introduce Localized Attentional Disentanglement Guidance, which steers the denoising process toward latent manifolds that favor clean object removal. Together, Pixel-wise Attention Dissolution and Localized Attentional Disentanglement Guidance enable precise, non-rigid, scalable, and prompt-free multi-object erasure in a single pass. Experiments show our method outperforms state-of-the-art methods even with fine-tuned and prompt-guided baselines in both visual fidelity and semantic plausibility.

Object Removal

We propose a zero-shot object removal framework that operates directly on pre-trained diffusion models in a single pass, without any fine-tuning, prompt engineering, or inference-time optimization, thus fully leveraging their latent generative capacity for inpainting

Benchmark

The PANDORA benchmark evaluates single-object, multi-object, and mass-similar object removal with manually or automatically prepared binary masks.

Single-object samples

Multi-object samples

Mass-similar samples

🖱️Click to see results

⏱️Processing takes ~10 seconds - please be patient!

Click to see result

Clutter Erase Demo

The demo system supports manual mask drawing, prompt-based selection, click-based segmentation, and similar-object search, enabling prompt-free mass-similar and multi-object removal in a single pass.

Approach

Our framework performs zero-shot object removal directly on a pre-trained diffusion model. Given an input image I_s and a binary mask M specifying the target objects, the model produces an edited image I_t where the masked regions are erased and seamlessly reconstructed with contextually consistent background. The process begins with latent inversion to map the input image into the noise space while preserving unaffected regions in the denoising process. We then apply Pixel-wise Attention Dissolution (PAD) to disconnect masked query pixels from their most correlated keys, effectively dissolving object information at the attention level. Next, Localized Attentional Disentanglement Guidance (LADG) steers the denoising trajectory in latent space away from the object regions, refining the reconstruction to suppress residual artifacts.

Together, PAD and LADG enable precise, pixel-level control for single- and multi-object removal in a single forward pass, without any fine-tuning, prompt engineering, or inference-time optimization.

Qualitative Comparison

🔬Qualitative comparison on various object removal scenarios📊From left to right: original image with a mask, and results from different methods

🎯

Single-Object Removal

Top two rows

🎯🎯

Multi-Object Cases

Middle two rows

🎯🎯🎯

Mass-Similar Objects

Bottom two rows

⚡Zero-shot methods shown in the last four columns, with the last two columns showing our PANDORA method

Qualitative comparison of object removal methods

Additional Results

PANDORA handles diverse object-removal settings while preserving the visible background structure and texture consistency.

Additional qualitative comparison for PANDORA

Quantitative Comparison

Method	Text	FID↓	LPIPS↓	MSE↓	CLIP score↑
Fine-tuning-based methods (SD 2.1 backbone, except LaMa)
PowerPaint	✔	22.81	0.1322	0.0104	24.15
LaMa	✘	0.71	0.0012	0.0001	24.5
SD2-Inpaint	✘	17.93	0.1106	0.0073	24.06
SD2-Inpaint-wprompt	✔	18.01	0.1098	0.0072	24.32
Zero-shot methods (no retraining, SD 2.1 backbone)
CPAM	✘	25.25	0.0953	0.0048	24.49
PANDORA w/o PAD (Ours)	✘	27.3	0.0985	0.005	24.58
PANDORA w/o LADG (Ours)	✘	30.8	0.1007	0.0055	24.65
PANDORA (Ours)	✘	35.1	0.1064	0.0059	24.69
Zero-shot methods (no retraining, SD 1.5 backbone)
CPAM	✘	29.54	0.1564	0.0138	24.32
Attentive Eraser	✘	118.09	0.2567	0.027	24.42
PANDORA w/o PAD (Ours)	✘	35.59	0.1702	0.0156	24.4
PANDORA w/o LADG (Ours)	✘	42.17	0.1844	0.0171	24.55
PANDORA (Ours)	✘	44.98	0.1895	0.0184	24.57

📉Lower is better: FID, LPIPS, MSE

📈Higher is better: CLIP score

📝Text column: ✔ uses prompts; ✘ no prompts

⭐Bold numbers indicate best across all methods; yellow rows are Ours

Quantitative comparison of fine-tuned and zero-shot object removal methods averaged across all dataset types. PANDORA consistently achieves the best object removal quality with competitive background realism, without any retraining or textual prompts, demonstrating strong generalization across both Stable Diffusion v1.5 and v2.1 backbones. Removing LADG slightly reduces removal quality, while removing PAD causes a significant degradation.

Limitations

Threshold Sensitivity

A fixed percentile threshold in PAD may over-filter or under-filter attention depending on the object and image content.

Mask Quality

Accurate binary masks remain important. Undersized masks can leave residual object artifacts, while moderately oversized masks are usually tolerated.

Future Direction

Adaptive thresholding and automatic mask selection are promising directions for making the system more robust.

Related Projects

CPAM

Context-Preserving Adaptive Manipulation for zero-shot real image editing.

FocusDiff

Target-aware refocusing for tuning-free diffusion editing and 360-degree panorama editing.

BibTex

@inproceedings{Vo2026ICME,
  title = {PANDORA: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal},
  author = {Vo, Dinh-Khoi and Nguyen, Van-Loc and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle = {IEEE International Conference on Multimedia and Expo (ICME)},
  year = {2026},
  url = {https://arxiv.org/abs/2603.27555},
  code = {https://github.com/vdkhoi20/PANDORA},
}

@inproceedings{Vo2026DemoICME,
  title={Zero-Shot Mass-Similar and Multi-Object Removal in Single Pass},
  author={Dinh-Khoi Vo and Van-Loc Nguyen and Tam V. Nguyen and Minh-Triet Tran and Trung-Nghia Le},
  booktitle={IEEE International Conference on Multimedia and Expo (ICME)},
  year={2026},
  url = {https://vdkhoi20.github.io/PANDORA/},
  code = {https://github.com/vdkhoi20/PANDORA},
}

@article{vo2026cpam,
  title={CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing},
  author={Vo, Dinh-Khoi and Do, Thanh-Toan and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  journal={IEEE Transactions on Multimedia},
  year={2026},
  url={https://arxiv.org/abs/2506.18438},
  code={https://github.com/vdkhoi20/CPAM}
}

@inproceedings{vo2026focusdiff,
  title={Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention},
  author={Vo, Dinh-Khoi and Le-Hinh, Nhut-Thanh and Huynh, Viet-Tham and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle={International Conference on Computational Collective Intelligence},
  year={2026},
  url={https://vdkhoi20.github.io/FocusDiff/},
  code={https://github.com/vdkhoi20/FocusDiff}
}

Acknowledgment

💰

Funding and GPU Support

This research is funded by Vietnam National University - Ho Chi Minh City (VNU-HCM) under Grant Number B2026-18-17. This research used the GPUs provided by the Intelligent Systems Lab at the Faculty of Information Technology, University of Science, VNU-HCM.

🙏

User Study Participants

We extend our heartfelt gratitude to all participants who took part in our comprehensive user study. Your valuable time, thoughtful feedback, and detailed evaluations were instrumental in validating the effectiveness and usability of our PANDORA framework. Your insights helped us understand the practical impact of our zero-shot object removal approach and provided crucial evidence of its superiority over existing methods.

🎨

Website Design Inspiration

This website design is inspired by ObjectDrop. We thank the authors for their excellent work and creative design approach.

🚀

Demo Design Inspiration

Our Gradio demo design is inspired by MimicBrush. We thank the authors for their excellent work and creative design approach.