Elad Richardson, Yuval Alaluf, Ali Mahdavi-Amiri, Daniel Cohen-Or
Tel Aviv University, Simon Fraser University

Text-guided image generation enables the creation of visual content from textual descriptions. However, certain visual concepts cannot be effectively conveyed through language alone. This has sparked a renewed interest in utilizing the CLIP image embedding space for more visually-oriented tasks through methods such as IP-Adapter. Interestingly, the CLIP image embedding space has been shown to be semantically meaningful, in the sense that linear operations within this space yield semantically meaningful results. Yet, the specific meaning of these operations can vary unpredictably across different images. To harness this potential, we introduce pOps, a framework that trains specific semantic operators directly on CLIP image embeddings. Each pOps operator is built upon a pretrained Diffusion Prior model. While the Diffusion Prior model was originally trained to map between text embeddings and image embeddings, we demonstrate that it can be tuned to accommodate new input conditions, resulting in a diffusion operator. Working directly over image embeddings not only improves our ability to learn semantic operations but also allows us to directly use a textual CLIP loss as additional supervision when needed. We show that pOps can be used to learn a variety of photo-inspired operators with distinct semantic meanings, highlighting the semantic diversity and potential of our proposed approach.

Different operators trained using pOps. Our method learns operators that are applied directly in the image embedding space, resulting in a variety of semantic operations that can then be realized as images using an image diffusion model.
Official implementation of the paper "pOps: Photo-Inspired Diffusion Operators"
To set up the environment with all necessary dependencies, please run:
pip install -r requirements.txt
We provide pretrained models for our different operators via a Hugging Face model card.
To run a binary operator, simply use the scripts.infer_binary script with the corresponding config file.
python -m scripts.infer_binary --config_path=configs/infer/texturing.yaml
# or
python -m scripts.infer_binary --config_path=configs/infer/union.yaml
# or
python -m scripts.infer_binary --config_path=configs/infer/scene.yaml

This will automatically download the pretrained model and run inference on the default input images.
Configuration is managed by pyrallis. Some useful flags for the scripts.infer_binary script are:

- --output_dir_name: The name of the output directory where the results will be saved.
- --dir_a: The path to the directory containing the input images for the first input.
- --dir_b: The path to the directory containing the input images for the second input.
- --vis_mean: Also show results for the mean of the two inputs.
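For example, a full invocation combining these flags might look like the following sketch; the two input directory paths are placeholders for your own image folders, and the explicit True value assumes pyrallis-style boolean parsing:

python -m scripts.infer_binary --config_path=configs/infer/texturing.yaml \
    --output_dir_name=texturing_results \
    --dir_a=/path/to/object_images \
    --dir_b=/path/to/texture_images \
    --vis_mean=True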
For compositions of multiple operators, note that the inference script outputs both the resulting images and the corresponding CLIP embeddings.
Thus, you can simply feed a directory of embeddings to either --dir_a or --dir_b. Useful filtering flags are:
- --file_exts_a (or --file_exts_b): Filter inputs to only .jpg images or .pth embeddings.
- --name_filter_a (or --name_filter_b): Filter inputs to only images with specific names.
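As a sketch of such a composition (the output path placeholder and the exact flag value format are assumptions, not verified against the provided configs):

# First operator: writes both result images and .pth CLIP embeddings
python -m scripts.infer_binary --config_path=configs/infer/texturing.yaml --output_dir_name=textured
# Second operator: reuse the saved embeddings as the first input
python -m scripts.infer_binary --config_path=configs/infer/scene.yaml \
    --dir_a=<path_to_textured_outputs> \
    --file_exts_a=.pth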
To sample results with missing input conditions, use the --drop_condition_a or --drop_condition_b flags.
Finally, to use the IP-Adapter with the inference script, pass the --use_ipadapter flag; for additional depth conditioning, pass the --use_depth flag.
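For instance (the explicit True values assume pyrallis-style boolean parsing and are illustrative):

# Sample with the second input dropped (unconditional on input b)
python -m scripts.infer_binary --config_path=configs/infer/scene.yaml --drop_condition_b=True
# Decode with IP-Adapter and additional depth conditioning
python -m scripts.infer_binary --config_path=configs/infer/texturing.yaml --use_ipadapter=True --use_depth=True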
To run the instruct operator, use the scripts.infer_instruct script with the corresponding config file.
python -m scripts.infer_instruct --config_path=configs/infer/instruct.yaml

We provide several scripts for data generation under the data_generation directory:
- generate_textures.py: Generates texturing data.
- generate_scenes.py: Generates scene data.
- generate_unions.py: Generates union data.
The scene operator also requires random backgrounds which can be generated using the generate_random_images.py script.
python -m data_generation.generate_random_images --output_dir=datasets/random_backgrounds --type=scenes

The generate_random_images.py script can also be used to generate random images for the other operators:
python -m data_generation.generate_random_images --output_dir=datasets/random_images --type=objects

These images can be used for the unconditional steps in training, as described in the training section below.
Training itself is managed by the scripts.train script. See the configs/training directory for the different training configurations.
python -m scripts.train --config_path=configs/training/texturing.yaml
# or
python -m scripts.train --config_path=configs/training/scene.yaml
# or
python -m scripts.train --config_path=configs/training/union.yaml
# or
python -m scripts.train --config_path=configs/training/instruct.yaml
# or
python -m scripts.train --config_path=configs/training/clothes.yaml

The operator itself is defined via the --mode flag, which selects the specific operator to train.
Relevant data paths and validation paths can be set in the configuration file.
Use the optional --randoms_dir flag to specify the directory of random images used for the unconditional steps.
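Putting these together, a training run might look like the following sketch; the --mode value shown is illustrative and should match the operator expected by your chosen config:

python -m scripts.train --config_path=configs/training/texturing.yaml \
    --mode=texturing \
    --randoms_dir=datasets/random_images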
Our codebase heavily relies on the Kandinsky model.
If you use this code for your research, please cite the following paper:
@article{richardson2024pops,
title={pOps: Photo-Inspired Diffusion Operators},
author={Richardson, Elad and Alaluf, Yuval and Mahdavi-Amiri, Ali and Cohen-Or, Daniel},
journal={arXiv preprint arXiv:2406.01300},
year={2024}
}