
💡 Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing

CVPR 2025 | 📖 Paper | ✨ Project page

Authors    Yoonjeon Kim1*, Soohyun Ryu1*, Yeonsung Jung1, Hyunkoo Lee1, Joowon Kim1, June Yong Yang1, Jaeryong Hwang2, Eunho Yang1,3†
           1KAIST, 2Korea Navy Academy, 3AITRICS
           *Equal contribution, †Corresponding author

(Teaser figure)

Get Started

Installation

  1. Clone AugCLIP.
git clone https://github.com/augclip/augclip
cd augclip
  2. Create and activate the environment; here we show an example using conda.
conda create -n augclip_eval python=3.8
conda activate augclip_eval
pip3 install torch torchvision torchaudio
pip3 install matplotlib openai pillow scikit-learn torchmetrics
pip3 install git+https://github.com/openai/CLIP.git

Notebook

Open the notebook here and enter your OpenAI API token.
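A minimal way to supply the token is shown below; this is a sketch assuming the notebook reads the standard OPENAI_API_KEY environment variable, and the exact mechanism the notebook expects may differ.

# Sketch only: supply the OpenAI token via the standard environment variable.
# Replace the placeholder with your own key before running the notebook.
import os
os.environ["OPENAI_API_KEY"] = "sk-..."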

Checkpoints

You can obtain the checkpoints used to evaluate Segment Consistency and DINO similarity from the links below.

Model name    Download
BiSeNet-v2    download
DINO          download
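As a reference point, the sketch below computes a DINO similarity score between a source and an edited image. It assumes the standard torch.hub DINO ViT-S/16 weights rather than the checkpoint linked above, and the helper name dino_similarity is hypothetical, not part of the repo's scripts.

# Sketch: cosine similarity between DINO [CLS] features of source and edited images.
import torch
from PIL import Image
from torchvision import transforms

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
dino.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

def dino_similarity(src_path, edit_path):
    # Encode both images and compare their global (CLS) representations.
    src = preprocess(Image.open(src_path).convert("RGB")).unsqueeze(0)
    edit = preprocess(Image.open(edit_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        f_src, f_edit = dino(src), dino(edit)
    return torch.nn.functional.cosine_similarity(f_src, f_edit).item()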

🔆 Abstract

The development of vision-language and generative models has significantly advanced text-guided image editing, which seeks the preservation of core elements in the source image while implementing modifications based on the target text. However, existing metrics have a context-blindness problem, indiscriminately applying the same evaluation criteria on completely different pairs of source image and target text, biasing towards either modification or preservation. Directional CLIP similarity, the only metric that considers both source image and target text, is also biased towards modification aspects and attends to irrelevant editing regions of the image. We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification aspects, depending on the specific context of a given source image and target text. This is done by deriving the CLIP representation of an ideally edited image, that preserves the source image with necessary modifications to align with target text. More specifically, using a multi-modal large language model, AugCLIP augments the textual descriptions of the source and target, then calculates a modification vector through a hyperplane that separates source and target attributes in CLIP space. Extensive experiments on five benchmark datasets, encompassing a diverse range of editing scenarios, show that AugCLIP aligns remarkably well with human evaluation standards, outperforming existing metrics.
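For intuition, here is a minimal sketch of the hyperplane step described above, assuming CLIP ViT-B/32 text features and a scikit-learn linear SVM. The toy descriptions and names such as source_descs are hypothetical; this is an illustration of the idea, not the authors' actual pipeline.

# Sketch: separate source vs. target attribute descriptions in CLIP space with a
# linear SVM and treat the hyperplane normal as a candidate modification direction.
import clip
import torch
import numpy as np
from sklearn.svm import LinearSVC

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical attribute descriptions (in practice, augmented by a multi-modal LLM).
source_descs = ["a cat with short fur", "a cat sitting on a sofa"]
target_descs = ["a dog with short fur", "a dog sitting on a sofa"]

def encode_texts(texts):
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize CLIP features
    return feats.float().cpu().numpy()

src = encode_texts(source_descs)
tgt = encode_texts(target_descs)

# Fit a hyperplane separating source attributes (label 0) from target attributes (label 1).
X = np.concatenate([src, tgt])
y = np.array([0] * len(src) + [1] * len(tgt))
svm = LinearSVC().fit(X, y)

# The unit normal of the hyperplane points from source toward target attributes.
modification_vec = svm.coef_[0] / np.linalg.norm(svm.coef_[0])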


📚 Citation

@article{kim2025preserve,
  title={Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing},
  author={Kim, Yoonjeon and Ryu, Soohyun and Jung, Yeonsung and Lee, Hyunkoo and Kim, Joowon and Yang, June Yong and Hwang, Jaeryong and Yang, Eunho},
  journal={arXiv preprint arXiv:2410.11374},
  year={2024}
}
