CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing

Dinh-Khoi Vo¹,², Thanh-Toan Do³, Tam V. Nguyen⁴, Minh-Triet Tran¹,², Trung-Nghia Le¹,²

¹University of Science, Ho Chi Minh City, Vietnam
²Vietnam National University, Ho Chi Minh City, Vietnam
³Monash University, Melbourne, Victoria, Australia
⁴University of Dayton, Ohio, United States

vdkhoi@selab.hcmus.edu.vn, toan.do@monash.edu, tamnguyen@udayton.edu, {tmtriet, ltnghia}@fit.hcmus.edu.vn

Accepted to IEEE Transactions on Multimedia


Abstract

Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. CPAM can be integrated with multiple diffusion backbones, including SD1.5, SD2.1, and SDXL, demonstrating strong generalization across model architectures. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques.

IMBA Benchmark

IMBA extends TEdBench with richer annotations for controllable object-level image editing. It augments each sample with object prompts, alteration masks, and explicit editing preference labels, enabling evaluation of object retention, object modification, and background alteration in multi-object real images.
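To make the annotation structure concrete, the following is a minimal sketch of how one IMBA sample could be represented in code. The class and field names (`ImbaSample`, `object_prompt`, `alteration_mask_path`, `preference`) and the `EditPreference` labels are hypothetical, chosen only to mirror the annotations described above; they are not taken from the released dataset format.

```python
from dataclasses import dataclass
from enum import Enum


class EditPreference(Enum):
    """Hypothetical labels for the three evaluation settings named above."""
    OBJECT_RETENTION = "object_retention"
    OBJECT_MODIFICATION = "object_modification"
    BACKGROUND_ALTERATION = "background_alteration"


@dataclass(frozen=True)
class ImbaSample:
    """Illustrative record for one editing sample (field names are assumptions)."""
    image_path: str            # source real image
    target_prompt: str         # text describing the desired edit
    object_prompt: str         # text naming the object the edit concerns
    alteration_mask_path: str  # mask of the region that is allowed to change
    preference: EditPreference


def count_by_preference(samples):
    """Count how many samples carry each editing preference label."""
    counts = {p: 0 for p in EditPreference}
    for sample in samples:
        counts[sample.preference] += 1
    return counts
```

Grouping samples by preference label in this way is what would allow object retention, object modification, and background alteration to be scored separately.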

Editing Samples: 104

Retention / Modification Cases: 43 / 97

Background Alteration Cases

Approach

We propose Context-Preserving Adaptive Manipulation (CPAM) to edit a source image I_s, given a source object mask M_s and a target text prompt P_t, and generate a new image I_t that aligns with P_t. The mask is supplied by the MaskInputModule, which can derive it in several ways, such as manual drawing, click-based extraction, or text prompts using SAM. Notably, I_t may differ spatially from I_s: objects or the background can be modified while all other regions remain unchanged. To achieve this, we introduce a preservation adaptation module that adjusts self-attention to align the semantic content from intermediate latent noise to the current edited noise, ensuring the retention of the original object and background during the editing process. To prevent the target prompt from causing unwanted changes in regions that should not be modified, we propose a localized extraction module that enables targeted editing while preserving the remaining details. Additionally, we propose mask-guidance strategies for diverse image manipulation tasks. Below are the overall CPAM architecture, the detailed mechanism, and the zero-shot editing algorithm.
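As a rough illustration of mask-guided self-attention preservation, the sketch below lets queries from the edited branch read keys and values from the source branch, with object and background handled separately and then blended by a target mask. This is one common way to realize the idea and is an assumption on our part, not the exact CPAM formulation; the function names, the flattened token layout, and the use of separate source and target masks are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_attention(q, k, v, key_mask):
    """Scaled dot-product attention restricted to keys where key_mask is True."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(key_mask[None, :], scores, -1e9)  # block disallowed keys
    return softmax(scores) @ v


def preserved_self_attention(q_tgt, k_src, v_src, src_obj_mask, tgt_obj_mask):
    """Illustrative preservation step (an assumption, not the paper's exact rule).

    q_tgt: (N, d) queries from the edited branch.
    k_src, v_src: (N, d) keys and values from the source branch.
    src_obj_mask, tgt_obj_mask: (N,) boolean masks over flattened spatial tokens.
    """
    out_obj = masked_attention(q_tgt, k_src, v_src, src_obj_mask)   # object content
    out_bg = masked_attention(q_tgt, k_src, v_src, ~src_obj_mask)   # background content
    m = tgt_obj_mask[:, None].astype(q_tgt.dtype)
    # Tokens inside the target object region take object content,
    # all other tokens take background content.
    return m * out_obj + (1.0 - m) * out_bg
```

Keeping the source and target masks separate reflects the point that I_t may differ spatially from I_s: the object can occupy a different region in the edited image while still drawing its appearance from the source object.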

Overall Architecture

Detailed Mechanism

Zero-Shot Editing Algorithm

Mechanism Analysis

CPAM separates preservation and localized editing by coordinating self-attention and cross-attention. The analysis below highlights how localized extraction reduces prompt leakage into non-target regions and how the attention behavior supports object-level manipulation while preserving scene context.
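The idea of limiting prompt influence to the target region can be sketched as follows: cross-attention is computed once with the target-prompt embeddings and once with a neutral conditioning, and the target result is kept only inside the edit mask. The neutral conditioning (for example, an empty-prompt embedding), the function names, and the blending rule are assumptions made for illustration and may differ from the localized extraction module in CPAM.

```python
import numpy as np


def cross_attention(q, k_text, v_text):
    """Scaled dot-product attention from image tokens to text tokens."""
    scores = q @ k_text.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_text


def localized_cross_attention(q, kv_target, kv_neutral, edit_mask):
    """Apply target-prompt conditioning only where edit_mask is True.

    q: (N, d) image-token queries.
    kv_target, kv_neutral: (keys, values) pairs, each of shape (T, d).
    edit_mask: (N,) boolean mask over flattened spatial tokens.
    """
    out_target = cross_attention(q, *kv_target)
    out_neutral = cross_attention(q, *kv_neutral)
    m = edit_mask[:, None].astype(q.dtype)
    # Outside the mask the target prompt has no effect on the output.
    return m * out_target + (1.0 - m) * out_neutral
```

Because tokens outside the mask never see the target-prompt values, the prompt cannot leak into non-target regions through this layer.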

Localized Cross-Attention Extraction

Attention Insights

Qualitative Comparison

The figure shows a qualitative comparison of CPAM against leading state-of-the-art image editing techniques. CPAM compares favorably with existing methods across various real image editing tasks, including object replacement, view/pose changes, object removal, background alteration, and addition of new objects. It modifies diverse aspects of an image while preserving the original background and avoiding unintended changes to non-target regions. The updated visualization includes CPAM results on SD1.5, SD2.1, and SDXL.

Additional Results

CPAM supports diverse editing intents, including object replacement, region-specific manipulation, pose and viewpoint changes, background editing, and addition or removal of objects, while preserving non-target regions.

More Visual Results

Quantitative Comparison

We conduct quantitative evaluations and a user study to assess the effectiveness of CPAM against state-of-the-art image editing methods. We compare methods on functional capabilities, text-image alignment (CLIPScore), background preservation (LPIPS, DreamSim, and RMSE), and subjective user ratings across key dimensions. Our evaluation dataset, IMBA (Image Manipulation BenchmArk), comprises 104 carefully curated samples with detailed annotations for diverse editing tasks, including object retention, modification, and background alteration.

Functional Capabilities Comparison

Functional comparison across editing methods. ✓ indicates supported features, ✗ indicates not supported. Local Edit: region-specific editing. Obj. Removal: object removal capability. Caption-Free: no original image caption required. Mask Ctrl: mask-based region control. Hi-Guidance: compatibility with high classifier-free guidance scales.

Method	Local Edit	Obj. Removal	Caption-Free	Mask Ctrl	Hi-Guidance
SDEdit	✗	✗	✓	✗	✗
MasaCtrl	✓	✗	✓	✓	✗
PnP	✗	✗	✗	✗	✗
FPE	✗	✗	✗	✗	✗
DiffEdit	✓	✗	✓	✓	✗
Pix2Pix-Zero	✗	✗	✗	✗	✗
LEDITS++	✓	✗	✓	✓	✗
Imagic (FT)	✗	✗	✗	✗	✗
CPAM (Ours)	✓	✓	✓	✓	✓

CPAM is the only method that supports all five functional capabilities, demonstrating its versatility in handling diverse image editing tasks.

Comparison with State-of-the-Art Methods

Comparison using CLIPScore for text-image alignment and LPIPS, DreamSim, and RMSE for background preservation. Bold indicates best scores, underline indicates second best.

Method	CLIPScore ↑	LPIPS (bg) ↓	DreamSim ↓	RMSE ↓
SDEdit	28.19	0.386	0.239	57.16
MasaCtrl	28.82	0.246	0.121	32.51
PnP	29.03	0.238	0.075	24.36
FPE	29.02	0.201	0.056	22.06
DiffEdit	28.58	0.182	0.080	35.81
Pix2Pix-Zero	27.01	0.229	0.117	31.30
LEDITS++	28.74	0.210	0.104	43.30
Imagic (FT)	30.34	0.462	0.286	81.55
CPAM-SD1.5 (Ours)	29.26	0.180	0.072	23.42
CPAM-SD2.1 (Ours)	29.08	0.125	0.044	19.13
CPAM-SDXL (Ours)	29.77	0.118	0.044	18.90

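CPAM-SDXL achieves the strongest background preservation, with the lowest LPIPS (0.118) and RMSE (18.90), and ties CPAM-SD2.1 for the best DreamSim score (0.044), demonstrating that the method generalizes across diffusion backbones. Imagic (FT) obtains the highest CLIPScore (30.34) but the weakest background preservation on all three metrics.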
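Background-preservation metrics are typically computed only over pixels outside the edited region. The sketch below shows RMSE restricted to the background for a single image pair. The function name, the mask convention (True marks the edited region), and the 0 to 255 pixel range are assumptions; the exact evaluation protocol behind the table is not specified here.

```python
import numpy as np


def background_rmse(source, edited, edit_mask):
    """RMSE between source and edited images over background pixels only.

    source, edited: (H, W, C) arrays, assumed to share the same value range.
    edit_mask: (H, W) boolean array, True where editing is allowed.
    """
    src = np.asarray(source, dtype=np.float64)
    out = np.asarray(edited, dtype=np.float64)
    background = ~np.asarray(edit_mask, dtype=bool)
    if not background.any():
        raise ValueError("mask covers the whole image; no background to compare")
    diff = src[background] - out[background]  # shape: (num_background_pixels, C)
    return float(np.sqrt(np.mean(diff ** 2)))
```

A lower value means the method changed the background less; changes inside the mask do not affect the score.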

User Study Results

Participants rated image editing methods on a scale of 1 (very bad) to 6 (very good). Bold indicates best scores, underline indicates second best.

Method	Object Retention	Background Retention	Realistic	Satisfaction
SDEdit	3.63	3.19	3.38	2.42
MasaCtrl	4.01	4.17	4.32	3.11
PnP	4.61	4.49	4.20	2.63
FPE	4.50	4.44	4.33	2.53
DiffEdit	4.58	4.57	4.40	3.13
Pix2Pix-Zero	2.11	4.23	1.84	1.93
LEDITS++	4.38	4.95	4.57	3.26
Imagic (FT)	3.74	3.48	4.30	4.82
CPAM (Ours)	4.72	5.09	4.69	3.30

CPAM achieves the best user ratings in object retention (4.72), background retention (5.09), and realism (4.69), and ranks second in satisfaction (3.30) behind Imagic (FT) (4.82).
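The per-criterion ranking can be checked directly from the ratings in the table above. The short sketch below copies the table values into a dictionary and returns the best and second-best method for a given criterion; the helper name `top_two` is ours and purely illustrative.

```python
# Mean ratings copied from the user study table, in column order.
CRITERIA = ("Object Retention", "Background Retention", "Realistic", "Satisfaction")

RATINGS = {
    "SDEdit": (3.63, 3.19, 3.38, 2.42),
    "MasaCtrl": (4.01, 4.17, 4.32, 3.11),
    "PnP": (4.61, 4.49, 4.20, 2.63),
    "FPE": (4.50, 4.44, 4.33, 2.53),
    "DiffEdit": (4.58, 4.57, 4.40, 3.13),
    "Pix2Pix-Zero": (2.11, 4.23, 1.84, 1.93),
    "LEDITS++": (4.38, 4.95, 4.57, 3.26),
    "Imagic (FT)": (3.74, 3.48, 4.30, 4.82),
    "CPAM (Ours)": (4.72, 5.09, 4.69, 3.30),
}


def top_two(ratings, criterion):
    """Return the (best, second best) method names for one criterion."""
    index = CRITERIA.index(criterion)
    ranked = sorted(ratings, key=lambda name: ratings[name][index], reverse=True)
    return ranked[0], ranked[1]


if __name__ == "__main__":
    for criterion in CRITERIA:
        best, second = top_two(RATINGS, criterion)
        print(f"{criterion}: best={best}, second={second}")
```

Run as a script, this lists the two top-rated methods for each of the four criteria, which makes it easy to confirm which entries would be bold and underlined.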

Ablation and User Study

Removing Localized Extraction or Preservation Adaptation degrades background preservation and editing stability. The user study further shows that CPAM is preferred for object retention, background retention, and realism.

Module Ablation

User Study Ratings

Demo Video

Limitations

CPAM remains bounded by the representational capacity of the pretrained diffusion backbone and by mask quality. Large pose or viewpoint changes, imprecise masks, and very small target regions may still produce unstable edits.

Pose and Viewpoint Changes

Imprecise Masks

Small or Specific Regions

Application

CPAM is designed as a general, training-free attention manipulation framework that can be instantiated across diverse image editing scenarios. Below, we present representative systems that build directly on CPAM’s core mechanisms. They span an interactive research prototype (iCONTRA), a practical end-user application (EPEdit), prompt-free object removal (PANDORA), and precise region-focused editing (FocusDiff), illustrating the framework’s extensibility across problem settings.

iCONTRA — Interactive Concept Transfer (CHI '24)

iCONTRA further demonstrates CPAM’s applicability to concept-level consistency in creative workflows. It incorporates a CPAM-based zero-shot editing algorithm that progressively integrates visual information from initial exemplars without fine-tuning, enabling coherent concept transfer across generated items. This formulation allows designers to efficiently create high-quality, thematically consistent collections with reduced effort.

iCONTRA Paper →

EPEdit — Efficient Photo Editor

EPEdit packages CPAM-based zero-shot editing algorithms into a practical end-user system for comprehensive photo manipulation. By leveraging CPAM’s training-free attention control, EPEdit supports a wide range of editing tasks—including object removal, replacement, pose adjustment, background modification, and thematic collection design—while maintaining efficiency, usability, and low deployment cost.

EPEdit Paper →

PANDORA — Zero-Shot Object Removal

PANDORA represents the foundational instantiation of CPAM for prompt-free object removal. By operationalizing CPAM’s pixel-wise attention dissolution and localized attentional guidance, PANDORA enables precise, non-rigid, and scalable multi-object erasure in a single pass without fine-tuning or prompt engineering.

Visit PANDORA →

FocusDiff — Target-Aware Refocusing

Building upon the same CPAM principles, FocusDiff extends attention refocusing and preservation mechanisms to region-specific text-guided editing, addressing prompt brittleness, spillover artifacts, and failures on small or cluttered objects. CPAM’s localized preservation strategies naturally generalize to FocusDiff’s refocused cross-attention, further enabling globally consistent editing in challenging settings such as 360° indoor panoramas and virtual reality environments.

Visit FocusDiff →

CPAM (IEEE TMM 2026)

@article{vo2026cpam,
  title={CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing},
  author={Vo, Dinh-Khoi and Do, Thanh-Toan and Nguyen, Tam V and Tran, Minh-Triet and Le, Trung-Nghia},
  journal={IEEE Transactions on Multimedia},
  year={2026},
  url={https://arxiv.org/abs/2506.18438},
  code={https://github.com/vdkhoi20/CPAM}
}

FocusDiff (ICCCI 2026)

@inproceedings{vo2026focusdiff,
  title={Toward 360-Degree Indoor Panorama Editing via Tuning-Free Diffusion Model with Refocusing Cross-Attention},
  author={Vo, Dinh-Khoi and Le-Hinh, Nhut-Thanh and Huynh, Viet-Tham and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle={International Conference on Computational Collective Intelligence},
  year={2026}
}

PANDORA (ICME 2026)

@inproceedings{Vo2026ICME,
  title = {PANDORA: Pixel-wise Attention Dissolution and Latent Guidance for Zero-Shot Object Removal},
  author = {Vo, Dinh-Khoi and Nguyen, Van-Loc and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle = {IEEE International Conference on Multimedia and Expo (ICME)},
  year = {2026},
  url = {https://arxiv.org/abs/2603.27555},
  code = {https://github.com/vdkhoi20/PANDORA},
}

@inproceedings{Vo2026DemoICME,
  title={Zero-Shot Mass-Similar and Multi-Object Removal in Single Pass},
  author={Dinh-Khoi Vo and Van-Loc Nguyen and Tam V. Nguyen and Minh-Triet Tran and Trung-Nghia Le},
  booktitle={IEEE International Conference on Multimedia and Expo (ICME)},
  year={2026},
  url = {https://vdkhoi20.github.io/PANDORA/},
  code = {https://github.com/vdkhoi20/PANDORA},
}

EPEdit: Redefining Image Editing with Generative AI (2024)

@inproceedings{nguyen2024epedit,
  title={EPEdit: Redefining Image Editing with Generative AI and User-Centric Design},
  author={Nguyen, Hoang-Phuc and Vo, Dinh-Khoi and Do, Trong-Le and Nguyen, Hai-Dang and Nguyen, Tan-Cong and Nguyen, Vinh-Tiep and Nguyen, Tam V and Le, Khanh-Duy and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle={International Symposium on Information and Communication Technology},
  pages={272--283},
  year={2024},
  organization={Springer}
}

iCONTRA: Interactive Concept Transfer (CHI 2024)

@inproceedings{10.1145/3613905.3650788,
author = {Vo, Dinh-Khoi and Ly, Duy-Nam and Le, Khanh-Duy and Nguyen, Tam V. and Tran, Minh-Triet and Le, Trung-Nghia},
title = {iCONTRA: Toward Thematic Collection Design Via Interactive Concept Transfer},
year = {2024},
isbn = {9798400703317},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3613905.3650788},
doi = {10.1145/3613905.3650788},
abstract = {Creating thematic collections in industries demands innovative designs and cohesive concepts. Designers may face challenges in maintaining thematic consistency when drawing inspiration from existing objects, landscapes, or artifacts. While AI-powered graphic design tools offer help, they often fail to generate cohesive sets based on specific thematic concepts. In response, we introduce iCONTRA, an interactive CONcept TRAnsfer system. With a user-friendly interface, iCONTRA enables both experienced designers and novices to effortlessly explore creative design concepts and efficiently generate thematic collections. We also propose a zero-shot image editing algorithm, eliminating the need for fine-tuning models, which gradually integrates information from initial objects, ensuring consistency in the generation process without influencing the background. A pilot study suggests iCONTRA's potential to reduce designers' efforts. Experimental results demonstrate its effectiveness in producing consistent and high-quality object concept transfers. iCONTRA stands as a promising tool for innovation and creative exploration in thematic collection design. The source code will be available at: https://github.com/vdkhoi20/iCONTRA.},
booktitle = {Extended Abstracts of the CHI Conference on Human Factors in Computing Systems},
articleno = {382},
numpages = {8},
keywords = {Diffusion model, Thematic collection design, Zero-shot image editing},
location = {Honolulu, HI, USA},
series = {CHI EA '24}
}

Acknowledgment

💰

Funding and GPU Support

This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant Number 102.05-2023.31. This research used the GPUs provided by the Intelligent Systems Lab at the Faculty of Information Technology, University of Science, VNU-HCM.

🙏

User Study Participants

We extend our heartfelt gratitude to all 20 participants who took part in our comprehensive user study. Your valuable time, thoughtful feedback, and detailed evaluations across 50 randomly shuffled images were instrumental in validating the effectiveness and usability of our CPAM framework. Your insights helped us understand the practical impact of our zero-shot real image editing approach and provided crucial evidence of its superiority over existing state-of-the-art methods.

🎨

Website Design Inspiration

This website design is inspired by ObjectDrop. We thank the authors for their excellent work and creative design approach.