FocusDiff: Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

Accepted Paper ICCCI 2026

Dinh-Khoi Vo¹,² Nhut-Thanh Le-Hinh¹,² Viet-Tham Huynh¹,² Tam V. Nguyen³ Minh-Triet Tran¹,² Trung-Nghia Le¹,²

¹University of Science, Ho Chi Minh City, Vietnam

²Vietnam National University, Ho Chi Minh City, Vietnam

³University of Dayton, Ohio, United States

{vdkhoi,lhnthanh,hvtham}@selab.hcmus.edu.vn, tamnguyen@udayton.edu, {tmtriet,ltnghia}@fit.hcmus.edu.vn


Figure: FocusDiff teaser results. FocusDiff enables precise region-specific edits from simple prompts without fine-tuning, accurately modifying small objects while preserving surrounding content and visual coherence.

Abstract

Zero-shot text-guided diffusion has significantly advanced image editing; however, practical usability remains constrained by prompt brittleness, spillover edits, and failures on small or cluttered objects. We propose FocusDiff, a tuning-free framework for precise region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while transferring object identity, structure, and appearance to the edited output. Context-preserving modules further ensure background fidelity and global coherence. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments.

Approach

FocusDiff edits an input image I_s using a source object mask M and a target prompt P. The image is processed in two parallel latent flows, an original one and a blurred one. During denoising, refocusing cross-attention extracts object semantics from the blurred flow and transfers them to the edited flow, while CPAM-style context-preserving modules stabilize the unmasked background.

Figure: Overall pipeline of FocusDiff.

Refocusing Cross-Attention (RCA)

Selectively blurs non-target areas so diffusion attention concentrates on the masked object, then transfers the edited object semantics back into the target latent flow.
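
The selective blurring step can be illustrated with a minimal NumPy sketch. This is not the FocusDiff implementation: the source does not specify the blur kernel, its strength, or whether blurring happens in pixel or latent space, so the box filter, the `radius` parameter, and the function names `box_blur` and `blur_outside_mask` are all assumptions made for illustration.

```python
import numpy as np

def box_blur(image: np.ndarray, radius: int) -> np.ndarray:
    """Mean filter with a (2*radius+1)^2 window and edge padding.

    Assumption: a box filter stands in for whatever blur the method uses.
    """
    k = 2 * radius + 1
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros(image.shape, dtype=np.float64)
    # Sum all shifted copies of the image, then divide by the window size.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def blur_outside_mask(image: np.ndarray, mask: np.ndarray, radius: int = 4) -> np.ndarray:
    """Keep pixels where mask == 1 sharp and blur everything else.

    image: (H, W, C) array; mask: (H, W) array of 0/1 values.
    """
    image = image.astype(np.float64)
    blurred = box_blur(image, radius)
    m = mask.astype(np.float64)[..., None]  # broadcast the mask over channels
    return m * image + (1.0 - m) * blurred
```

Calling `blur_outside_mask(image, mask)` returns an array of the same shape in which the masked object is untouched and the surroundings are smoothed, which is the kind of input the description above suggests for steering attention toward the target.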

Context-Preserving Integration (CPI)

Reuses intermediate self-attention and localized extraction modules to preserve non-edited regions and suppress prompt leakage outside the mask.

Panoramic Extension

Crops a region of interest from a 360-degree panorama, edits it locally, then aligns the edited patch back to the panorama for immersive VR workflows.
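
A rough sketch of the crop-and-paste step, assuming the panorama is stored as an equirectangular array whose left and right edges are adjacent. The source does not say how the region of interest is extracted or re-aligned (for example, whether it is reprojected to a perspective view), so this flat crop with horizontal wrap-around and the names `crop_wrap` and `paste_wrap` are hypothetical.

```python
import numpy as np

def crop_wrap(pano: np.ndarray, top: int, left: int, height: int, width: int) -> np.ndarray:
    """Crop a window from an (H, W, C) panorama, wrapping around horizontally."""
    cols = np.arange(left, left + width) % pano.shape[1]
    return pano[top:top + height][:, cols].copy()

def paste_wrap(pano: np.ndarray, patch: np.ndarray, top: int, left: int) -> np.ndarray:
    """Return a copy of the panorama with the patch written back at the same window."""
    out = pano.copy()
    h, w = patch.shape[:2]
    cols = np.arange(left, left + w) % pano.shape[1]
    out[top:top + h, cols] = patch
    return out
```

A local editor would run on the output of `crop_wrap`, and `paste_wrap` would place the edited patch back using the same `top` and `left`. The modulo on the column indices lets a window straddle the panorama seam without special cases.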

Qualitative Comparison

Competing zero-shot localized editors often fail to modify small or specific objects, or introduce unintended background changes. FocusDiff produces fine-grained local edits with faithful background preservation.

Quantitative Comparison

We evaluate on LIMB using CLIPScore for text-image alignment and LPIPS for background preservation. The SD2.1 and SDXL variants demonstrate FocusDiff's generality across diffusion backbones.
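
For readers unfamiliar with CLIPScore, the sketch below shows one common formulation: a scaled, non-negative cosine similarity between an image embedding and a text embedding. The `scale=100.0` factor is an assumption (other formulations use a different weight), the exact variant used for the numbers on this page is not stated, and the embeddings are assumed to come from a CLIP encoder that is not shown here.

```python
import numpy as np

def clip_score(image_emb, text_emb, scale: float = 100.0) -> float:
    """Scaled cosine similarity between two embeddings, clipped at zero.

    Assumption: `scale` and the clipping follow a common CLIPScore convention;
    the source does not confirm which variant it reports.
    """
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    cos = image_emb @ text_emb / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    return scale * max(float(cos), 0.0)
```

Higher values indicate closer agreement between the edited image and the target prompt, which is why the table marks CLIPScore with an upward arrow.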

Method	CLIPScore ↑	LPIPS ↓
MasaCtrl	20.12	0.280
Blended-Diffusion	27.43	0.156
DiffEdit	27.75	0.148
LEDITS++	32.76	0.103
CPAM	33.45	0.101
FocusDiff-SD1.5 (Ours)	35.85	0.099
FocusDiff-SD2.1 (Ours)	35.61	0.068
FocusDiff-SDXL (Ours)	36.48	0.064

Ablation Study

Configuration	CLIPScore ↑	LPIPS ↓
FocusDiff-SD1.5 (Full Framework)	35.85	0.099
FocusDiff w/o RCA	31.42	0.145
FocusDiff w/o CPI	36.12	0.284
Baseline w/o Blurring Surrounding	29.80	0.312
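
To make the ablation easier to read, the following snippet computes each configuration's difference from the full framework, using only the numbers in the table above. The dictionary layout and the `deltas` helper are illustrative choices, not part of the released code.

```python
# Values copied from the ablation table (FocusDiff-SD1.5 is the full framework).
full = {"clip": 35.85, "lpips": 0.099}
ablations = {
    "FocusDiff w/o RCA": {"clip": 31.42, "lpips": 0.145},
    "FocusDiff w/o CPI": {"clip": 36.12, "lpips": 0.284},
    "Baseline w/o Blurring Surrounding": {"clip": 29.80, "lpips": 0.312},
}

def deltas(full, ablations):
    """Return (CLIPScore change, LPIPS change) relative to the full framework."""
    return {
        name: (round(row["clip"] - full["clip"], 2), round(row["lpips"] - full["lpips"], 3))
        for name, row in ablations.items()
    }

for name, (d_clip, d_lpips) in deltas(full, ablations).items():
    print(f"{name}: CLIPScore {d_clip:+.2f}, LPIPS {d_lpips:+.3f}")
```

Read this way, removing RCA costs 4.43 CLIPScore points, while removing CPI leaves CLIPScore slightly higher (+0.27) but raises LPIPS from 0.099 to 0.284, which suggests that RCA mainly drives edit fidelity and CPI mainly protects the background.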

360-Degree Panorama Editing

FocusDiff extends naturally to indoor panoramic images by cropping the target region, editing it locally, and blending it back into the full panorama. The VR interface supports mask drawing, object replacement, object removal, and preview inside an immersive scene.
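
One simple way to blend an edited patch back without a visible seam is a feathered alpha mask, sketched below. The source does not describe its blending procedure, so the linear ramp, the `border` width, and the names `feather_mask` and `blend_patch` are assumptions, and this version ignores horizontal wrap-around for brevity.

```python
import numpy as np

def feather_mask(height: int, width: int, border: int) -> np.ndarray:
    """Alpha map that rises linearly from 1/(border+1) at the outermost
    pixels to 1.0 at `border` pixels inside the patch."""
    ys = np.minimum(np.arange(height), np.arange(height)[::-1]) + 1
    xs = np.minimum(np.arange(width), np.arange(width)[::-1]) + 1
    dist = np.minimum(ys[:, None], xs[None, :])  # distance to nearest patch edge
    return np.clip(dist / float(border + 1), 0.0, 1.0)

def blend_patch(pano: np.ndarray, patch: np.ndarray, top: int, left: int, border: int = 8) -> np.ndarray:
    """Alpha-blend an (h, w, C) patch into an (H, W, C) panorama copy."""
    out = pano.astype(np.float64)  # astype returns a new array, so pano is untouched
    h, w = patch.shape[:2]
    alpha = feather_mask(h, w, border)[..., None]
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * patch + (1.0 - alpha) * region
    return out
```

The interior of the patch is copied as-is, and only a thin band near the patch boundary mixes edited and original pixels, which keeps the edit intact while softening the transition.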

Figure: Panoramic editing pipeline.

VR editing panel: choose panoramas and prompts.

VR mask generation canvas: select target regions.

LIMB Benchmark

LIMB is a localized image manipulation benchmark curated from PIE-Bench for fine-grained region-specific editing. It contains multi-object scenes and annotations designed to evaluate whether a method edits the requested target while preserving the rest of the image.

30 multi-object images

100 localized editing annotations

Small objects: challenging cluttered-scene cases

Fair comparison: shared masks and prompts

VR User Study

The VR panoramic editing system achieves an overall System Usability Scale (SUS) score of 77.12/100, above the commonly cited SUS average of 68, indicating strong usability for interactive object replacement and removal tasks.
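
For context, a SUS score is derived from ten questionnaire items rated 1 to 5. The sketch below follows the standard SUS scoring rule; the per-participant responses from this study are not published here, so the function names and any inputs you pass are illustrative only.

```python
def sus_score(responses):
    """Standard SUS scoring for one participant.

    responses: ten ratings on a 1-5 scale, in questionnaire order.
    Odd-numbered items contribute (rating - 1), even-numbered items
    contribute (5 - rating); the sum is multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

def mean_sus(all_responses):
    """Average SUS score over participants."""
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)
```

An overall figure such as the one reported above would be the mean of the individual participant scores, each of which ranges from 0 to 100.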

Related Projects & Citations

FocusDiff builds on the same line of tuning-free diffusion editing research as CPAM and related systems for localized manipulation, object removal, and interactive concept transfer.

PANDORA: Zero-Shot Object Removal

PANDORA is a related zero-shot object removal project in the same diffusion-editing research line.

CPAM: Context-Preserving Adaptive Manipulation

CPAM provides the context-preserving attention modules that inspire FocusDiff's preservation design.

PANDORA (ICME 2026)

@inproceedings{Vo2026ICME,
  title = {PANDORA: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal},
  author = {Vo, Dinh-Khoi and Nguyen, Van-Loc and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle = {IEEE International Conference on Multimedia and Expo (ICME)},
  year = {2026},
}

PANDORA Demo (ICME 2026)

@inproceedings{Vo2026DemoICME,
  title = {Zero-Shot Mass-Similar and Multi-Object Removal in Single Pass},
  author = {Vo, Dinh-Khoi and Nguyen, Van-Loc and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle = {IEEE International Conference on Multimedia and Expo (ICME)},
  year = {2026},
}

CPAM (IEEE TMM 2026)

@article{vo2026cpam,
  title = {CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing},
  author = {Vo, Dinh-Khoi and Do, Thanh-Toan and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  journal = {IEEE Transactions on Multimedia},
  year = {2026},
  url = {https://arxiv.org/abs/2506.18438},
  code = {https://github.com/vdkhoi20/CPAM}
}

iCONTRA (CHI EA 2024)

@inproceedings{vo2024icontra,
  title = {iCONTRA: Toward Thematic Collection Design Via Interactive Concept Transfer},
  author = {Vo, Dinh-Khoi and Ly, Duy-Nam and Le, Khanh-Duy and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle = {Extended Abstracts of the CHI Conference on Human Factors in Computing Systems},
  year = {2024}
}

EPEdit (SoICT 2024)

@inproceedings{nguyen2024epedit,
  title = {EPEdit: Redefining Image Editing with Generative AI and User-Centric Design},
  author = {Nguyen, Hoang-Phuc and Vo, Dinh-Khoi and Do, Trong-Le and Nguyen, Hai-Dang and Nguyen, Tan-Cong and Nguyen, Vinh-Tiep and Nguyen, Tam V. and Le, Khanh-Duy and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle = {International Symposium on Information and Communication Technology},
  year = {2024}
}

Acknowledgment

Funding and GPU Support

This research is funded by Vietnam National University - Ho Chi Minh City (VNU-HCM) under Grant Number B2026-18-17. Experiments were conducted on NVIDIA A100 GPU resources.

User Study Participants

We thank all participants who joined the VR-based panoramic editing user study and provided valuable feedback on object replacement, object removal, and system usability.

Website Design Inspiration

This project page follows the academic layout style of CPAM and related diffusion-editing project pages, with a FocusDiff-specific visual identity and paper assets.

BibTeX

@inproceedings{vo2026focusdiff,
  title = {Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention},
  author = {Vo, Dinh-Khoi and Le-Hinh, Nhut-Thanh and Huynh, Viet-Tham and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle = {International Conference on Computational Collective Intelligence},
  year = {2026}
}