- [05/28/2025] Training code of MAETok released.
- [02/05/2025] MAETok and its 512 and 256 SiT models released. LightningDiT models and updated training scripts are coming soon.
- [12/19/2024] 512 SiT models and DiT models released. We also updated the training scripts.
- [12/18/2024] All models have been released at https://huggingface.co/SoftVQVAE. Check out the demo here.
```bash
conda create -n softvq python=3.10 -y
conda activate softvq
pip install -r requirements.txt
```
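A quick sanity check that the environment sees PyTorch and a GPU (the `torchrun` commands below assume a CUDA machine):

```python
# verify the environment: PyTorch import, version, and CUDA visibility
import torch

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"{torch.cuda.device_count()} GPU(s), e.g. {torch.cuda.get_device_name(0)}")
```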
| Tokenizer | Image Size | rFID | Hugging Face |
|---|---|---|---|
| MAETok-B-128 | 256 | 0.48 | Model Weight |
| MAETok-B-128-512 | 512 | 0.62 | Model Weight |
| Generative Model | Image Size | Tokenizer | gFID (w/o CFG) | gFID (w/ CFG) | Hugging Face |
|---|---|---|---|---|---|
| SiT-XL | 256 | MAETok-B-128 | 2.31 | 1.67 | Model Weight |
| SiT-XL | 512 | MAETok-B-128-512 | 2.79 | 1.69 | Model Weight |
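For quick experiments outside the provided scripts, the released checkpoints can be fetched straight from the Hugging Face Hub. This is only a minimal sketch: `hf_hub_download` is the real `huggingface_hub` call, but the repo id, filename, and tokenizer class are assumptions — check the model definitions in this repo for the actual loading entry point.

```python
# minimal sketch: fetch a released checkpoint from the Hugging Face Hub.
# NOTE: the repo id, filename, and class below are assumptions, not the
# repo's verified API; see the model definitions for the real entry point.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="SoftVQVAE/maetok-b-128",   # hypothetical repo id under the SoftVQVAE org
    filename="model.safetensors",       # hypothetical checkpoint filename
)

# intended round trip once the tokenizer is instantiated (names hypothetical):
# tokenizer = MAETok.load(ckpt_path)
# tokens = tokenizer.encode(images)    # (B, 128, d) continuous 1-D tokens
# recon  = tokenizer.decode(tokens)    # (B, 3, 256, 256) reconstructions
```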
| Tokenizer | Image Size | rFID | Hugging Face |
|---|---|---|---|
| SoftVQ-L-64 | 256 | 0.61 | Model Weight |
| SoftVQ-BL-64 | 256 | 0.65 | Model Weight |
| SoftVQ-B-64 | 256 | 0.88 | Model Weight |
| SoftVQ-L-32 | 256 | 0.74 | Model Weight |
| SoftVQ-BL-32 | 256 | 0.68 | Model Weight |
| SoftVQ-B-32 | 256 | 0.89 | Model Weight |
| SoftVQ-BL-64 | 512 | 0.71 | Model Weight |
| SoftVQ-L-32 | 512 | 0.64 | Model Weight |
| Generative Model | Image Size | Tokenizer | gFID (w/o CFG) | gFID (w/ CFG) | Hugging Face |
|---|---|---|---|---|---|
| SiT-XL | 256 | SoftVQ-L-64 | 5.35 | 1.86 | Model Weight |
| SiT-XL | 256 | SoftVQ-BL-64 | 5.80 | 1.88 | Model Weight |
| SiT-XL | 256 | SoftVQ-B-64 | 5.98 | 1.78 | Model Weight |
| SiT-XL | 256 | SoftVQ-L-32 | 7.59 | 2.44 | Model Weight |
| SiT-XL | 256 | SoftVQ-BL-32 | 7.67 | 2.44 | Model Weight |
| SiT-XL | 256 | SoftVQ-B-32 | 7.99 | 2.51 | Model Weight |
| SiT-XL | 512 | SoftVQ-BL-64 | 7.96 | 2.21 | Model Weight |
| SiT-XL | 512 | SoftVQ-L-32 | 10.97 | 4.23 | Model Weight |
| Generative Model | Image Size | Tokenizer | gFID (w/o CFG) | gFID (w/ CFG) | Hugging Face |
|---|---|---|---|---|---|
| DiT-XL | 256 | SoftVQ-L-64 | 5.83 | 2.93 | Model Weight |
| DiT-XL | 256 | SoftVQ-L-32 | 9.07 | 3.69 | Model Weight |
Train Tokenizer
```bash
torchrun --nproc_per_node=8 train/train_tokenizer.py --config configs/softvq-l-64.yaml
```
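For orientation, the core of MAETok training is a masked-autoencoder objective on the tokenizer: a random subset of tokens is hidden and the decoder must reconstruct from the visible ones. Below is a toy sketch of the masking step; shapes and names are illustrative, not the actual `train_tokenizer.py` code.

```python
# toy sketch of MAE-style random masking; illustrative only,
# not the actual train_tokenizer.py implementation
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.4):
    """Keep a random subset of tokens, MAE-style."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    # per-sample random permutation; keep the first num_keep positions
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D)), idx

visible, kept = random_mask(torch.randn(8, 128, 16))  # (8, 76, 16) visible tokens
# the decoder reconstructs pixels/features from `visible`; the training loss
# is a reconstruction term on the masked positions
```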
Train SiT
```bash
torchrun --nproc_per_node=8 train/train_sit.py \
  --report-to="wandb" --allow-tf32 --mixed-precision="bf16" --seed=0 \
  --path-type="linear" --prediction="v" --weighting="lognormal" \
  --model="SiT-XL/1" --vae-model='softvq-l-64' \
  --output-dir="experiments/sit" --exp-index=1 \
  --data-dir=./imagenet/train
```
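The flags above select a linear interpolant (`--path-type="linear"`), velocity prediction (`--prediction="v"`), and lognormal time weighting (`--weighting="lognormal"`). A minimal sketch of that loss, assuming the standard SiT formulation and reading "lognormal" as logit-normal time sampling:

```python
# minimal sketch of a SiT linear-interpolant, v-prediction loss with
# logit-normal time sampling; `model` is any network taking (x_t, t, cond)
import torch

def sit_linear_v_loss(model, x, cond):
    B = x.shape[0]
    # "lognormal" weighting, read as logit-normal sampling: t = sigmoid(z), z ~ N(0, 1)
    t = torch.sigmoid(torch.randn(B, device=x.device))
    t_ = t.view(B, *([1] * (x.dim() - 1)))
    eps = torch.randn_like(x)
    x_t = (1 - t_) * x + t_ * eps      # linear interpolant (--path-type="linear")
    target = eps - x                   # velocity d(x_t)/dt for the linear path
    v_pred = model(x_t, t, cond)       # --prediction="v"
    return ((v_pred - target) ** 2).mean()
```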
Train DiT
```bash
torchrun --nproc_per_node=8 train/train_dit.py \
  --data-path ./imagenet/train --results-dir experiments/dit \
  --model DiT-XL/1 --epochs 1400 --global-batch-size 256 \
  --mixed-precision bf16 --vae-model='softvq-l-64' \
  --noise-schedule cosine --disable-compile
```
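`--noise-schedule cosine` selects a cosine diffusion schedule; assuming the standard Nichol & Dhariwal formulation, the betas can be derived as below.

```python
# sketch of the cosine noise schedule (Nichol & Dhariwal, 2021), assuming
# --noise-schedule cosine refers to the standard formulation
import math

def cosine_betas(T: int, s: float = 0.008, max_beta: float = 0.999):
    def alpha_bar(t):
        return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    betas = []
    for i in range(T):
        # beta_i = 1 - alpha_bar(t_{i+1}) / alpha_bar(t_i), clipped for stability
        betas.append(min(1 - alpha_bar((i + 1) / T) / alpha_bar(i / T), max_beta))
    return betas

betas = cosine_betas(1000)
```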
Reconstruction
```bash
torchrun --nproc_per_node=8 inference/reconstruct_vq.py \
  --data-path ./ImageNet/val --vq-model SoftVQVAE/softvq-l-64
```
SiT Generation
```bash
torchrun --nproc_per_node=8 inference/generate_sit.py \
  --tf32 True --model SoftVQVAE/sit-xl_softvq-b-64 --cfg-scale 1.75 \
  --path-type cosine --num-steps 250 --guidance-high 0.7 \
  --vae-model softvq-l-64
```
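`--cfg-scale` and `--guidance-high` suggest classifier-free guidance applied over only part of the sampling trajectory. A sketch of that combination follows; the exact interval semantics are an assumption, so treat `inference/generate_sit.py` as the source of truth.

```python
# sketch of classifier-free guidance limited to part of the trajectory;
# the semantics of --guidance-high are an assumption: here guidance is
# active only for time fractions t <= guidance_high
import torch

def guided_velocity(model, x_t, t, cond, null_cond,
                    cfg_scale: float = 1.75, guidance_high: float = 0.7):
    v_cond = model(x_t, t, cond)
    if float(t.mean()) > guidance_high:       # outside the guidance interval
        return v_cond
    v_uncond = model(x_t, t, null_cond)
    # standard CFG update: extrapolate away from the unconditional prediction
    return v_uncond + cfg_scale * (v_cond - v_uncond)
```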
DiT Generation
```bash
torchrun --nproc_per_node=8 inference/generate_dit.py \
  --model SoftVQVAE/dit-xl_softvq-b-64 --cfg-scale 1.75 \
  --noise-schedule cosine --num-sampling-steps 250 \
  --vae-model softvq-l-64
```
Evaluation
We use the [ADM evaluation toolkit](https://github.com/openai/guided-diffusion/tree/main/evaluations) to compute the FID/IS of generated samples.
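The ADM evaluator compares a sample batch against a reference batch, both stored as `.npz` files of uint8 `N×H×W×3` images. A sketch of packing generated PNGs into that format (the folder name is hypothetical):

```python
# pack generated PNGs into the uint8 NHWC .npz format the ADM evaluator reads
import glob
import numpy as np
from PIL import Image

files = sorted(glob.glob("samples/*.png"))   # hypothetical output folder
arr = np.stack([np.asarray(Image.open(f).convert("RGB"), dtype=np.uint8) for f in files])
assert arr.ndim == 4 and arr.shape[-1] == 3  # (N, H, W, 3) uint8
np.savez("sample_batch.npz", arr_0=arr)
```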
GMM Fitting
```bash
# save the training latents first
torchrun --nproc_per_node=8 inference/cache_latent.py \
  --dataset imagenet --data-path imagenet/train \
  --sample-dir saved_latent --vae-name maetok-b-128

# fit the GMMs
python inference/gmm_fit.py --use_gpu 0 --exp maetok-b-128 --n_iter 500 \
  --samples_per_class 100 --components 5 10 50 100 200 300
```
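As a CPU reference for the sweep above, the same component counts can be fit with scikit-learn on the cached latents; the `.npy` layout of `saved_latent/` is an assumption here.

```python
# reference sketch: fit GMMs with several component counts on cached latents
# using scikit-learn; the .npy path/layout below is an assumption
import numpy as np
from sklearn.mixture import GaussianMixture

latents = np.load("saved_latent/maetok-b-128.npy")   # hypothetical path, shape (N, d)
for k in (5, 10, 50, 100, 200, 300):
    gmm = GaussianMixture(n_components=k, covariance_type="diag", max_iter=500)
    gmm.fit(latents)
    print(k, gmm.lower_bound_)   # average per-sample log-likelihood bound
```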
```bibtex
@inproceedings{chen2025maetok,
  title={Masked Autoencoders Are Effective Tokenizers for Diffusion Models},
  author={Hao Chen and Yujin Han and Fangyi Chen and Xiang Li and Yidong Wang and Jindong Wang and Ze Wang and Zicheng Liu and Difan Zou and Bhiksha Raj},
  booktitle={ICML},
  year={2025}
}

@inproceedings{chen2025softvqvae,
  title={SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer},
  author={Hao Chen and Ze Wang and Xiang Li and Ximeng Sun and Fangyi Chen and Jiang Liu and Jindong Wang and Bhiksha Raj and Zicheng Liu and Emad Barsoum},
  booktitle={CVPR},
  year={2025}
}
```
A large portion of our code is borrowed from LlamaGen, VAR, ImageFolder, DiT, SiT, and REPA.

