News:
- [10/13] MonoFormer now supports interleaved multi-modal data and multi-image inputs and outputs for both image understanding and generation. The legacy single-image generation codebase has been moved to the single-image branch.
- [9/25] We released the model weights for class-conditional generation on ImageNet.
- [9/24] We released the training code and inference code.
- [9/24] We released MonoFormer: One Transformer for Both Diffusion and Autoregression. Check out our paper on arXiv.
TODO:
- Release the training and inference code for class-conditional generation.
- Release the code for text-to-image generation.
- Release the model weights for ImageNet generation.
- Support interleaved multi-modal outputs.
- Support image understanding tasks with CLIP or VAE as the visual encoder.
- Support multi-image inputs and outputs for understanding and diffusion generation.
- Release the model weights for text-to-image generation.
- Clone this repository and navigate to the project folder:
```shell
git clone https://github.com/monoformer/MonoFormer.git
cd MonoFormer
```
- Install packages:
```shell
conda create -n monoformer python=3.9 -y
conda activate monoformer
pip install -r requirements.txt
```

| Resolution | Dataset | Checkpoint |
|---|---|---|
| 256 | ImageNet | MonoFormer_ImageNet_256 |
| 256 | JourneyDB, UltraChat | To be released |
- Prepare dataset
MonoFormer is trained on ImageNet for class-conditional image generation, JourneyDB for text-to-image generation, UltraChat for text-to-text generation, and LLaVA Instruction Tuning for multi-modal understanding. Please prepare the appropriate dataset for each respective task.
For a detailed guide on how to organize the data and how to prepare your own datasets for training, please refer to data/README.md.
Additionally, you can pre-extract VAE features using the script tools/extract_vae_features.py, so that training can load cached latents instead of re-encoding images, which accelerates training.
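The caching pattern can be sketched as follows. This is a minimal illustration only: `encode` below is a stand-in for the real Stability AI VAE that tools/extract_vae_features.py uses, and all function names here are hypothetical.

```python
import os
import numpy as np

def encode(image):
    """Stand-in for the VAE encoder (the real script uses the pretrained
    Stability AI VAE). Maps an HxWx3 image to an (H/8)x(W/8)x4 latent,
    mirroring the 8x spatial downsampling of SD-style VAEs."""
    h, w = image.shape[0] // 8, image.shape[1] // 8
    pooled = image[: h * 8 : 8, : w * 8 : 8, :].mean(axis=-1, keepdims=True)
    return np.repeat(pooled, 4, axis=-1)

def cache_latents(images, out_dir):
    """Encode every image once and save its latent as <out_dir>/<name>.npy.
    Training can then load the cached latents instead of re-running the VAE."""
    os.makedirs(out_dir, exist_ok=True)
    for name, image in images.items():
        np.save(os.path.join(out_dir, name + ".npy"), encode(image))
```

At train time, the dataloader would then read the cached .npy latents directly rather than invoking the VAE on every batch.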
- Prepare pretrained models
We initialize the LLM with TinyLlama and use the pretrained VAE from Stability AI to extract latent representations. Both models are downloaded automatically from HuggingFace; alternatively, you can download them manually and specify the pretrained path in the config.
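If you download the models manually, the config can point at the local copies. A hypothetical fragment, where the field names are illustrative assumptions rather than the repository's actual schema (check the files in ./configs for the real field names):

```python
# Illustrative only: these field names are assumptions, not the repo's
# actual config schema -- see the provided files in ./configs.
llm_pretrained_path = "/path/to/local/TinyLlama"
vae_pretrained_path = "/path/to/local/stabilityai-vae"
```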
- Start training
We have provided training configurations for various datasets and tasks in ./configs. You can modify these configurations or create your own for training. All training scripts are located in ./scripts.
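Several of the provided configs train on dataset mixtures (for example, the name configs/train_journeydb_ultrachat_9_1.py suggests a 9:1 sampling ratio between JourneyDB and UltraChat). A rough sketch of ratio-weighted mixing, not the repository's actual implementation:

```python
import random

def mixed_sampler(datasets, weights, num_samples, seed=0):
    """Yield (dataset_index, item) pairs, choosing the source dataset for
    each draw with probability proportional to its weight (e.g. 9:1)."""
    rng = random.Random(seed)
    source_ids = list(range(len(datasets)))
    for _ in range(num_samples):
        i = rng.choices(source_ids, weights=weights, k=1)[0]
        yield i, rng.choice(datasets[i])
```

With `weights=[9, 1]`, roughly 90% of the drawn samples come from the first dataset.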
Train on ImageNet dataset for class-conditional image generation:
```shell
torchrun --nproc_per_node 8 train.py \
--output_dir results/monoformer_imagenet \
--lr 1e-4 \
--batch_size_per_gpu 16 \
--gradient_accumulation_steps 2 \
--max_grad_norm 2.0 \
--max_steps 500000 \
--checkpointing_steps 10000 --log_steps 100 \
--mixed_precision bf16 --grad_precision fp32 \
--resolution 256 \
--config_file configs/train_imagenet.py
```

Train on a mixture of the JourneyDB and UltraChat datasets for text-to-image generation and text-to-text generation:
```shell
torchrun --nproc_per_node 8 train.py \
--output_dir results/monoformer_journeydb_ultrachat \
--lr 1e-4 \
--batch_size_per_gpu 16 \
--gradient_accumulation_steps 2 \
--max_grad_norm 2.0 \
--max_steps 500000 \
--checkpointing_steps 50000 --log_steps 100 \
--mixed_precision bf16 --grad_precision fp32 \
--resolution 256 \
--config_file configs/train_journeydb_ultrachat_9_1.py
```

Train on a mixture of the JourneyDB and LLaVA instruction tuning datasets for text-to-image generation and image understanding:
```shell
torchrun --nproc_per_node 8 train.py \
--output_dir results/monoformer_llava_journeydb \
--lr 1e-4 \
--batch_size_per_gpu 16 \
--gradient_accumulation_steps 2 \
--max_grad_norm 2.0 \
--max_steps 500000 \
--checkpointing_steps 50000 --log_steps 100 \
--mixed_precision bf16 --grad_precision fp32 \
--resolution 256 \
--config_file configs/train_llava_journeydb.py
```

Please refer to notebooks/infer_dit.ipynb for running inference in a Jupyter notebook.
Inference for image generation:
```shell
CUDA_VISIBLE_DEVICES=0 torchrun --master_port 39500 --nproc_per_node 1 infer_dit.py --ckpt $ckpt --resolution 256 --ema
```

Inference for text generation or multi-modal generation:
```shell
CUDA_VISIBLE_DEVICES=0 torchrun --master_port 39500 --nproc_per_node 1 infer_mllm.py --ckpt $ckpt
```

Sample class-conditional generation results on ImageNet and save the sampled images as NumPy arrays for evaluation:
```shell
sample_dir='samples/imagenet_cfg_2_step_20_n_10k'
ckpt_dir='results/monoformer_imagenet_res256_bf16_bs32_lr1e-4/checkpoint-50000'
torchrun --nnodes=1 --nproc_per_node=8 sample_ddp.py --ckpt $ckpt_dir \
--per-proc-batch-size 64 \
--resolution 256 \
--ema \
--num-fid-samples 10000 \
--imagenet_labels_path ./data/imagenet_labels.json \
--cfg-scale 2 \
--num-sampling-steps 20 \
--sample-dir $sample_dir
```

If you find MonoFormer useful for your research, please cite:
```bibtex
@article{zhao2024monoformer,
  title={MonoFormer: One Transformer for Both Diffusion and Autoregression},
  author={Zhao, Chuyang and Song, Yuxing and Wang, Wenhao and Feng, Haocheng and Ding, Errui and Sun, Yifan and Xiao, Xinyan and Wang, Jingdong},
  journal={arXiv preprint arXiv:2409.16280},
  year={2024}
}
```

This codebase borrows from DiT, Large-DiT, LLaVA, and TinyLlama. Thanks for their great work.

