FocusDiff: Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention

Accepted Paper ICCCI 2026

1 University of Science, Ho Chi Minh, Vietnam

2 Vietnam National University, Ho Chi Minh, Vietnam

3 University of Dayton, Ohio, United States

{vdkhoi,lhnthanh,hvtham}@selab.hcmus.edu.vn, tamnguyen@udayton.edu, {tmtriet,ltnghia}@fit.hcmus.edu.vn

FocusDiff teaser results
FocusDiff enables precise region-specific edits from simple prompts without fine-tuning, accurately modifying small objects while preserving surrounding content and visual coherence.

Abstract

Zero-shot text-guided diffusion has significantly advanced image editing; however, practical usability remains constrained by prompt brittleness, spillover edits, and failures on small or cluttered objects. We propose FocusDiff, a tuning-free framework for precise region-specific image manipulation based on refocusing cross-attention. Given a target region obtained through automated segmentation or manual selection, FocusDiff applies selective blurring to non-edit areas to guide attention toward the masked region while transferring object identity, structure, and appearance to the edited output. Context-preserving modules further ensure background fidelity and global coherence. We also extend FocusDiff to 360-degree indoor panorama editing and demonstrate its effectiveness within virtual reality environments.

Approach

FocusDiff edits an input image I_s given a source object mask M and a target prompt P. The image is processed in two latent flows: one from the original image and one from a selectively blurred copy. During denoising, refocusing cross-attention extracts object semantics from the blurred branch and transfers them to the edited branch, while CPAM-style context-preserving modules stabilize the unmasked background.

Overall Pipeline

FocusDiff pipeline

Refocusing Cross-Attention

Selectively blurs non-target areas so diffusion attention concentrates on the masked object, then transfers the edited object semantics back into the target latent flow.
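The selective blurring step can be illustrated with a minimal sketch. It assumes the blur is applied in pixel space with a simple box filter; the page does not specify the kernel, its size, or whether blurring happens on pixels or latents, so `box_blur`, `blur_outside_mask`, and `radius` are hypothetical names and choices used only for illustration.

```python
import numpy as np

def box_blur(image: np.ndarray, radius: int) -> np.ndarray:
    """Blur an H x W x C float image with a (2r+1) x (2r+1) box filter."""
    k = 2 * radius + 1
    h, w = image.shape[:2]
    # Replicate edge pixels so the output keeps the input size.
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def blur_outside_mask(image: np.ndarray, mask: np.ndarray, radius: int = 4) -> np.ndarray:
    """Keep pixels where mask == 1 sharp and blur everything else (assumed scheme)."""
    image = image.astype(np.float64)
    m = mask.astype(np.float64)[..., None]  # H x W x 1, broadcast over channels
    return m * image + (1.0 - m) * box_blur(image, radius)

# Usage: a random image with an 8 x 8 target region kept sharp.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
mask = np.zeros((32, 32))
mask[12:20, 12:20] = 1
blurred = blur_outside_mask(image, mask)
```

Only the masked object keeps its high-frequency detail, which is the cue the refocusing step relies on to concentrate attention on the target.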

Context-Preserving Integration

Reuses intermediate self-attention and localized extraction modules to preserve non-edited regions and suppress prompt leakage outside the mask.
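The goal of keeping everything outside the mask untouched can be shown with a much simpler stand-in: mask-based latent blending. This is not the CPAM-style attention reuse described above; it only illustrates the intended effect. The function names and the downsampling factor of 8 (a common VAE stride in latent diffusion models) are assumptions.

```python
import numpy as np

def downsample_mask(mask: np.ndarray, factor: int = 8) -> np.ndarray:
    """Average-pool a binary H x W mask by `factor`, then re-binarize it."""
    h, w = mask.shape
    cropped = mask[:h - h % factor, :w - w % factor]
    pooled = cropped.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    # A latent cell counts as "edit" if any pixel it covers is masked.
    return (pooled > 0).astype(np.float64)

def blend_latents(z_edit: np.ndarray, z_source: np.ndarray, latent_mask: np.ndarray) -> np.ndarray:
    """Take edited latents inside the mask and source latents elsewhere."""
    m = latent_mask[None, ...]  # broadcast over the channel axis of C x h x w latents
    return m * z_edit + (1.0 - m) * z_source

# Usage: a 64 x 64 pixel mask mapped onto 4 x 8 x 8 latents.
pixel_mask = np.zeros((64, 64))
pixel_mask[16:32, 16:32] = 1
latent_mask = downsample_mask(pixel_mask)
blended = blend_latents(np.ones((4, 8, 8)), np.zeros((4, 8, 8)), latent_mask)
```

Applied at each denoising step, a blend like this pins the background to the source latents; attention-based modules aim for the same outcome with softer boundaries.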

Panoramic Extension

Crops a region of interest from a 360-degree panorama, edits it locally, then aligns the edited patch back to the panorama for immersive VR workflows.

Qualitative Comparison

Competing zero-shot localized editors often fail to modify small or specific objects, or introduce unintended background changes. FocusDiff produces fine-grained local edits with faithful background preservation.

Qualitative comparison on LIMB

Quantitative Comparison

We evaluate on LIMB using CLIPScore for text-image alignment and LPIPS for background preservation. The SD2.1 and SDXL variants demonstrate FocusDiff's generality across diffusion backbones.
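For reference, a CLIPScore-style value can be computed from precomputed image and text embeddings as a scaled, non-negative cosine similarity. The sketch below assumes a scale of 100, which is consistent with the range of values reported in the table but is not stated on this page; the original CLIPScore definition uses a weight of 2.5 instead. The embeddings themselves would come from a CLIP model, which is outside this sketch.

```python
import numpy as np

def clip_score(image_emb, text_emb, scale: float = 100.0) -> float:
    """Scaled, non-negative cosine similarity between two embedding vectors."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    cos = image_emb @ text_emb / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    # Negative similarities are clamped to zero.
    return max(scale * float(cos), 0.0)

# Usage with toy 2-D vectors in place of real CLIP embeddings.
aligned = clip_score([1.0, 0.0], [2.0, 0.0])
orthogonal = clip_score([1.0, 0.0], [0.0, 3.0])
```

Higher is better for CLIPScore, while LPIPS is a perceptual distance, so lower values indicate a better-preserved background.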

Method                    CLIPScore ↑   LPIPS ↓
MasaCtrl                  20.12         0.280
Blended-Diffusion         27.43         0.156
DiffEdit                  27.75         0.148
LEDITS++                  32.76         0.103
CPAM                      33.45         0.101
FocusDiff-SD1.5 (Ours)    35.85         0.099
FocusDiff-SD2.1 (Ours)    35.61         0.068
FocusDiff-SDXL (Ours)     36.48         0.064

Ablation Study

FocusDiff ablation study
Configuration                        CLIPScore ↑   LPIPS ↓
FocusDiff-SD1.5 (Full Framework)     35.85         0.099
FocusDiff w/o RCA                    31.42         0.145
FocusDiff w/o CPI                    36.12         0.284
Baseline w/o Blurring Surrounding    29.80         0.312

360-Degree Panorama Editing

FocusDiff extends naturally to indoor panoramic images by cropping the target region, editing it locally, and blending it back into the full panorama. The VR interface supports mask drawing, object replacement, object removal, and preview inside an immersive scene.
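The crop, edit, and blend-back loop can be sketched as follows. The sketch assumes a plain rectangular crop in equirectangular pixel space with horizontal wrap-around and a linear feather at the patch border; the page does not specify the projection or the blending scheme, so `crop_wrapped`, `paste_wrapped`, and `feather` are hypothetical.

```python
import numpy as np

def crop_wrapped(pano: np.ndarray, top: int, left: int, height: int, width: int) -> np.ndarray:
    """Crop an ROI from an equirectangular panorama, wrapping across the left/right seam."""
    cols = np.arange(left, left + width) % pano.shape[1]
    return pano[top:top + height][:, cols].copy()

def paste_wrapped(pano: np.ndarray, patch: np.ndarray, top: int, left: int, feather: int = 8) -> np.ndarray:
    """Blend an edited patch back, fading toward the original near the patch border."""
    height, width = patch.shape[:2]
    cols = np.arange(left, left + width) % pano.shape[1]
    # Distance (in pixels, starting at 1) from each row/column to the nearest patch edge.
    ramp_y = np.minimum(np.arange(height) + 1, np.arange(height)[::-1] + 1)
    ramp_x = np.minimum(np.arange(width) + 1, np.arange(width)[::-1] + 1)
    # Weight rises from 1/feather at the border to 1 at `feather` pixels inside.
    weight = np.clip(np.minimum.outer(ramp_y, ramp_x) / feather, 0.0, 1.0)[..., None]
    out = pano.astype(np.float64)
    region = out[top:top + height][:, cols]
    out[top:top + height, cols] = weight * patch + (1.0 - weight) * region
    return out

# Usage: crop a region that straddles the seam, "edit" it, and paste it back.
rng = np.random.default_rng(0)
pano = rng.random((64, 128, 3))
patch = crop_wrapped(pano, top=10, left=120, height=20, width=16)
edited = np.zeros_like(patch)  # placeholder for the locally edited crop
result = paste_wrapped(pano, edited, top=10, left=120)
```

A production pipeline would more likely reproject the region to a perspective view before editing to avoid equirectangular distortion; the rectangular crop here keeps the sketch short.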

Panoramic Editing Pipeline

Panoramic editing pipeline
VR editing panel
Editing panel for choosing panoramas and prompts.
VR mask generation canvas
Mask generation canvas for selecting target regions.
Panoramic image editing results

LIMB Benchmark

LIMB is a localized image manipulation benchmark curated from PIE-Bench for fine-grained region-specific editing. It contains multi-object scenes and annotations designed to evaluate whether a method edits the requested target while preserving the rest of the image.

30 multi-object images
100 localized editing annotations
Small objects: challenging cluttered-scene cases
Fair comparison: shared masks and prompts

VR User Study

The VR panoramic editing system obtains an overall System Usability Scale score of 77.12/100, indicating strong usability for interactive object replacement and removal tasks.
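For context, a System Usability Scale score is derived from ten items answered on a 1 to 5 scale: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to reach a 0 to 100 range. The sketch below implements this standard scoring; the responses in the usage example are made up for illustration and are not data from the study.

```python
def sus_score(responses):
    """Score one 10-item SUS questionnaire answered on a 1-5 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten responses, each an integer from 1 to 5")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, ... are positively worded; items 2, 4, 6, ... negatively.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5

def mean_sus(all_responses):
    """Average the per-participant SUS scores."""
    return sum(sus_score(r) for r in all_responses) / len(all_responses)

# Usage with two hypothetical participants.
participants = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],  # best possible answers
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],  # neutral answers
]
overall = mean_sus(participants)
```

The overall score reported above would correspond to `mean_sus` taken over all study participants.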

SUS usability result

Acknowledgment

Funding and GPU Support

This research is funded by Vietnam National University - Ho Chi Minh City (VNU-HCM) under Grant Number B2026-18-17. Experiments were conducted on NVIDIA A100 GPU resources.

User Study Participants

We thank all participants who joined the VR-based panoramic editing user study and provided valuable feedback on object replacement, object removal, and system usability.

BibTeX

@inproceedings{vo2026focusdiff,
  title = {Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention},
  author = {Vo, Dinh-Khoi and Le-Hinh, Nhut-Thanh and Huynh, Viet-Tham and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle = {International Conference on Computational Collective Intelligence},
  year = {2026}
}