EditAR: Unified Conditional Generation with Autoregressive Models
Jiteng Mu, Nuno Vasconcelos, Xiaolong Wang
University of California, San Diego
Diffusion models have made significant advances in text-guided synthesis tasks. Recent progress in controllable image generation and editing is largely driven by diffusion-based methods. Although diffusion models perform exceptionally well in specific tasks with tailored designs, establishing a unified model is still challenging. In contrast, autoregressive models inherently feature a unified tokenized representation, which simplifies the creation of a single foundational model for various tasks. In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, and segmentation-to-image. The model takes both images and instructions as inputs, and predicts the edited image tokens in a vanilla next-token paradigm. To enhance text-to-image alignment, we further propose to distill knowledge from foundation models into the autoregressive modeling process. We evaluate its effectiveness across diverse tasks on established benchmarks, showing competitive performance with various state-of-the-art task-specific methods.
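For intuition, the sketch below shows (in illustrative code, not the released implementation) how conditional generation reduces to plain next-token prediction: condition-image tokens and the encoded instruction form the prefix, and the model is supervised with cross-entropy on the target image tokens. All names here are hypothetical.

```python
import torch.nn.functional as F

def next_token_loss(model, cond_tokens, text_emb, target_tokens):
    """Unified conditional generation as next-token prediction (illustrative).

    cond_tokens:   VQ indices of the conditioning image, shape (B, Lc)
    text_emb:      encoded instruction (e.g., from flan-t5-xl), shape (B, Lt, D)
    target_tokens: VQ indices of the edited/target image, shape (B, Li)
    """
    # The prefix (condition tokens + instruction embedding) is consumed first;
    # the transformer then predicts each target token from everything to its left.
    logits = model(cond_tokens, text_emb, target_tokens)  # (B, Li, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_tokens.reshape(-1))
```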
The codebase is implemented in PyTorch 2.2.1 with Python 3.10 and tested on Ubuntu 20.04.6 LTS.
- Environment Setup. Please follow `install.sh` to install the packages listed in `requirements.txt`, then download all pre-trained checkpoints as instructed below (a scripted download sketch follows this list).
- Download the text encoder model flan-t5-xl and put it at `./pretrained_models/t5-ckpt/flan-t5-xl`. Download the vqvae model vq_ds16_t2i.pt from LlamaGen and put it at `./pretrained_models/vq_ds16_t2i.pt`.
- (Required for training) Download the pre-trained text-to-image model t2i_XL_stage2_512.pt from LlamaGen and put it at `./pretrained_models/t2i_XL_stage2_512.pt`.
- (Required for inference) Download our trained model editar_release.pt and put it at `./checkpoints/editar/editar_release.pt`.
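If you prefer to script these downloads, here is a minimal sketch using `huggingface_hub`. The repo ids (`google/flan-t5-xl`, `FoundationVision/LlamaGen`) are our assumptions about where the checkpoints are hosted; please verify them against the respective release pages.

```python
from huggingface_hub import snapshot_download, hf_hub_download

# Text encoder: mirror the full flan-t5-xl repo into the expected folder.
snapshot_download(repo_id="google/flan-t5-xl",
                  local_dir="./pretrained_models/t5-ckpt/flan-t5-xl")

# LlamaGen checkpoints (repo id assumed; check the LlamaGen README).
hf_hub_download(repo_id="FoundationVision/LlamaGen",
                filename="vq_ds16_t2i.pt",
                local_dir="./pretrained_models")
hf_hub_download(repo_id="FoundationVision/LlamaGen",
                filename="t2i_XL_stage2_512.pt",  # only required for training
                local_dir="./pretrained_models")
```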
To edit a single image, put the source image and instruction text in `./examples` as demonstrated, then run:
```
python3 autoregressive/sample/sample_edit_example.py --gpt-ckpt ./checkpoints/editar/editar_release.pt --cfg-scale 3 --seed 83
```
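Here `--cfg-scale` sets the classifier-free guidance strength and `--seed` fixes the sampling seed. For reference, below is a minimal sketch of the standard CFG combination on next-token logits; it illustrates what such a scale typically controls and is not necessarily the exact released implementation.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               cfg_scale: float = 3.0) -> torch.Tensor:
    """Standard classifier-free guidance on next-token logits.

    The model is run twice per step, with and without the conditioning;
    cfg_scale extrapolates from the unconditional toward the conditional
    prediction. cfg_scale = 1.0 recovers the conditional logits.
    """
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)
```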
Data Preparation (Training). For image editing, download SEED-Data-Edit-Unsplash and the PIPE Dataset. For image translation, we follow ControlNet++ to download the depth, canny, and segmentation (COCOStuff) train sets. Each parquet dataset is then processed with process_data_HF.py by specifying the source path and target path.
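As a rough illustration of this conversion step (the actual logic lives in process_data_HF.py; the schema below is hypothetical), processing one parquet shard might look like:

```python
import io
from pathlib import Path

import pandas as pd
from PIL import Image

def process_parquet(src_path: str, target_dir: str) -> None:
    """Illustrative parquet -> folder conversion; column names are assumed."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    df = pd.read_parquet(src_path)
    for i, row in df.iterrows():
        # Hypothetical columns: 'source_image' / 'target_image' hold encoded
        # image bytes and 'instruction' holds the edit text; adjust these to
        # the dataset's actual schema.
        Image.open(io.BytesIO(row["source_image"])).save(out / f"{i:08d}_src.png")
        Image.open(io.BytesIO(row["target_image"])).save(out / f"{i:08d}_tgt.png")
        (out / f"{i:08d}.txt").write_text(row["instruction"])
```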
The folder ends up looking like:

```
./data/
    Seedx_Unsplash_HF/
    PIPE_HF/
    MultiGen-20M_depth_HF/
    Captioned_COCOStuff_HF/
```

Training. We provide an example in train.sh. Please modify train.sh accordingly to run on your system.
Data Preparation (Evaluation). For image editing, please refer to Direct Inversion to download the PIE-Bench dataset. For image translation, we follow ControlNet++ to download the depth, canny, and segmentation (COCOStuff) validation sets. Each parquet dataset is then processed with process_data_HF.py by specifying the source path and target path.
The folder ends up looking like:

```
./data/
    PIE_Bench_Dataset/
    MultiGen-20M_depth_eval_HF/
    Captioned_COCOStuff_eval_HF/
```

Evaluation. Please replace $TESTSET with one of `PIE-bench`, `depth`, `canny`, or `segmentation` to evaluate on the different benchmark datasets:
```
python3 autoregressive/sample/sample_edit_folder.py --gpt-ckpt ./checkpoints/editar/editar_release.pt --cfg-scale 3 --testset $TESTSET
```
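To sweep every benchmark in one run, a small driver can loop over the $TESTSET values above (a convenience sketch; the flags mirror the command shown):

```python
import subprocess

# Run the evaluation command once per benchmark split.
for testset in ["PIE-bench", "depth", "canny", "segmentation"]:
    subprocess.run([
        "python3", "autoregressive/sample/sample_edit_folder.py",
        "--gpt-ckpt", "./checkpoints/editar/editar_release.pt",
        "--cfg-scale", "3",
        "--testset", testset,
    ], check=True)
```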
The implementation is mainly built on top of LlamaGen. We also thank the authors of ControlNetPlusPlus, ControlAR, SmartEdit, and Dino-v2 for releasing their code.
The majority of this project is licensed under the MIT License. Portions of the project are under the separate licenses of the referenced projects.
```bibtex
@article{mu2025editAR,
  title={EditAR: Unified Conditional Generation with Autoregressive Models},
  author={Mu, Jiteng and Vasconcelos, Nuno and Wang, Xiaolong},
  journal={arXiv preprint arXiv:2501.04699},
  year={2025}
}
```