News:
- [10/13] MonoFormer now supports interleaved multi-modal data and multi-image inputs and outputs for both image understanding and generation. The legacy single-image generation codebase has been moved to the single-image branch.
- [9/25] We released the model weights for class-conditional generation on ImageNet.
- [9/24] We released the training code and inference code.
- [9/24] We released MonoFormer: One Transformer for Both Diffusion and Autoregression. Check out our paper on arXiv.
TODO:
- Release the training and inference code for class-conditional generation.
- Release the code for text-to-image generation.
- Release the model weights for ImageNet generation.
- Support interleaved multi-modal outputs.
- Support image understanding tasks with CLIP or VAE as the visual encoder.
- Support multi-image inputs and outputs for understanding and diffusion generation.
- Release the model weights for text-to-image generation.
- Clone this repository and navigate to the project folder:
```shell
git clone https://github.com/monoformer/MonoFormer.git
cd MonoFormer
```
- Install packages:
```shell
conda create -n monoformer python=3.9 -y
conda activate monoformer
pip install -r requirements.txt
```

| Resolution | Dataset | Checkpoint |
|---|---|---|
| 256 | ImageNet | MonoFormer_ImageNet_256 |
| 256 | JourneyDB, UltraChat | To be released |
- Prepare dataset
MonoFormer is trained on ImageNet for class-conditional image generation, JourneyDB for text-to-image generation, UltraChat for text-to-text generation, and LLaVA Instruction Tuning for multi-modal understanding. Please prepare the appropriate dataset for each respective task.
For a detailed guide on how to organize the data and how to prepare your own datasets for training, please refer to data/README.md.
Additionally, you can pre-extract VAE features using the script tools/extract_vae_features.py, so that training can load cached latents instead of re-encoding images, which accelerates training.
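The caching pattern can be sketched as follows. This is a minimal illustration only: `encode` below is a stand-in for the real Stability AI VAE that tools/extract_vae_features.py uses, and all function names here are hypothetical.

```python
import os
import numpy as np

def encode(image):
    """Stand-in for the VAE encoder (the real script uses the pretrained
    Stability AI VAE). Maps an HxWx3 image to an (H/8)x(W/8)x4 latent,
    mirroring the 8x spatial downsampling of SD-style VAEs."""
    h, w = image.shape[0] // 8, image.shape[1] // 8
    pooled = image[: h * 8 : 8, : w * 8 : 8, :].mean(axis=-1, keepdims=True)
    return np.repeat(pooled, 4, axis=-1)

def cache_latents(images, out_dir):
    """Encode every image once and save its latent as <out_dir>/<name>.npy.
    Training can then load the cached latents instead of re-running the VAE."""
    os.makedirs(out_dir, exist_ok=True)
    for name, image in images.items():
        np.save(os.path.join(out_dir, name + ".npy"), encode(image))
```

At train time, the dataloader would then read the cached .npy latents directly rather than invoking the VAE on every batch.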
- Prepare pretrained models
We initialize the LLM with TinyLlama and use the pretrained VAE from Stability AI to extract latent representations. Both models are downloaded automatically from HuggingFace; alternatively, you can download them manually and specify the pretrained path in the config.
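If you download the models manually, the config can point at the local copies. A hypothetical fragment, where the field names are illustrative assumptions rather than the repository's actual schema (check the files in ./configs for the real field names):

```python
# Illustrative only: these field names are assumptions, not the repo's
# actual config schema -- see the provided files in ./configs.
llm_pretrained_path = "/path/to/local/TinyLlama"
vae_pretrained_path = "/path/to/local/stabilityai-vae"
```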
- Start training
We have provided training configurations for various datasets and tasks in ./configs. You can modify these configurations or create your own for training. All training scripts are located in ./scripts.
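Several of the provided configs train on dataset mixtures (for example, the name configs/train_journeydb_ultrachat_9_1.py suggests a 9:1 sampling ratio between JourneyDB and UltraChat). A rough sketch of ratio-weighted mixing, not the repository's actual implementation:

```python
import random

def mixed_sampler(datasets, weights, num_samples, seed=0):
    """Yield (dataset_index, item) pairs, choosing the source dataset for
    each draw with probability proportional to its weight (e.g. 9:1)."""
    rng = random.Random(seed)
    source_ids = list(range(len(datasets)))
    for _ in range(num_samples):
        i = rng.choices(source_ids, weights=weights, k=1)[0]
        yield i, rng.choice(datasets[i])
```

With `weights=[9, 1]`, roughly 90% of the drawn samples come from the first dataset.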
Train on ImageNet dataset for class-conditional image generation:
```shell
torchrun --nproc_per_node 8 train.py \
--output_dir results/monoformer_imagenet \
--lr 1e-4 \
--batch_size_per_gpu 16 \
--gradient_accumulation_steps 2 \
--max_grad_norm 2.0 \
--max_steps 500000 \
--checkpointing_steps 10000 --log_steps 100 \
--mixed_precision bf16 --grad_precision fp32 \
--resolution 256 \
--config_file configs/train_imagenet.py
```

Train on a mixture of the JourneyDB and UltraChat datasets for text-to-image generation and text-to-text generation:
```shell
torchrun --nproc_per_node 8 train.py \
--output_dir results/monoformer_journeydb_ultrachat \
--lr 1e-4 \
--batch_size_per_gpu 16 \
--gradient_accumulation_steps 2 \
--max_grad_norm 2.0 \
--max_steps 500000 \
--checkpointing_steps 50000 --log_steps 100 \
--mixed_precision bf16 --grad_precision fp32 \
--resolution 256 \
--config_file configs/train_journeydb_ultrachat_9_1.py
```

Train on a mixture of the JourneyDB and LLaVA instruction tuning datasets for text-to-image generation and image understanding:
```shell
torchrun --nproc_per_node 8 train.py \
--output_dir results/monoformer_llava_journeydb \
--lr 1e-4 \
--batch_size_per_gpu 16 \
--gradient_accumulation_steps 2 \
--max_grad_norm 2.0 \
--max_steps 500000 \
--checkpointing_steps 50000 --log_steps 100 \
--mixed_precision bf16 --grad_precision fp32 \
--resolution 256 \
--config_file configs/train_llava_journeydb.py
```

Please refer to notebooks/infer_dit.ipynb for running inference in a Jupyter notebook.
Inference for image generation:
```shell
CUDA_VISIBLE_DEVICES=0 torchrun --master_port 39500 --nproc_per_node 1 infer_dit.py --ckpt $ckpt --resolution 256 --ema
```

Inference for text generation or multi-modal generation:
```shell
CUDA_VISIBLE_DEVICES=0 torchrun --master_port 39500 --nproc_per_node 1 infer_mllm.py --ckpt $ckpt
```

Sample class-conditional generation results on ImageNet and save the sampled images as NumPy arrays for evaluation:
```shell
sample_dir='samples/imagenet_cfg_2_step_20_n_10k'
ckpt_dir='results/monoformer_imagenet_res256_bf16_bs32_lr1e-4/checkpoint-50000'
torchrun --nnodes=1 --nproc_per_node=8 sample_ddp.py --ckpt $ckpt_dir \
--per-proc-batch-size 64 \
--resolution 256 \
--ema \
--num-fid-samples 10000 \
--imagenet_labels_path ./data/imagenet_labels.json \
--cfg-scale 2 \
--num-sampling-steps 20 \
--sample-dir $sample_dir
```

If you find MonoFormer useful for your research, please cite:
```bibtex
@article{zhao2024monoformer,
  title={MonoFormer: One Transformer for Both Diffusion and Autoregression},
  author={Zhao, Chuyang and Song, Yuxing and Wang, Wenhao and Feng, Haocheng and Ding, Errui and Sun, Yifan and Xiao, Xinyan and Wang, Jingdong},
  journal={arXiv preprint arXiv:2409.16280},
  year={2024}
}
```

This codebase borrows from DiT, Large-DiT, LLaVA, and TinyLlama. Thanks for their great work.

