- [05/28/2025] Training code of MAETok released.
- [02/05/2025] MAETok and its 512 and 256 SiT models released. LightningDiT models and updated training scripts are coming soon.
- [12/19/2024] 512 SiT models and DiT models released. We also updated the training scripts.
- [12/18/2024] All models have been released at https://huggingface.co/SoftVQVAE. Check out the demo here.
```bash
conda create -n softvq python=3.10 -y
conda activate softvq
pip install -r requirements.txt
```
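A quick sanity check that the environment sees PyTorch and a GPU (the `torchrun` commands below assume a CUDA machine):

```python
# verify the environment: PyTorch import, version, and CUDA visibility
import torch

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"{torch.cuda.device_count()} GPU(s), e.g. {torch.cuda.get_device_name(0)}")
```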
| Tokenizer | Image Size | rFID | Hugging Face |
|---|---|---|---|
| MAETok-B-128 | 256 | 0.48 | Model Weight |
| MAETok-B-128-512 | 512 | 0.62 | Model Weight |
| Generative Model | Image Size | Tokenizer | gFID (w/o CFG) | gFID (w/ CFG) | Hugging Face |
|---|---|---|---|---|---|
| SiT-XL | 256 | MAETok-B-128 | 2.31 | 1.67 | Model Weight |
| SiT-XL | 512 | MAETok-B-128-512 | 2.79 | 1.69 | Model Weight |
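For quick experiments outside the provided scripts, the released checkpoints can be fetched straight from the Hugging Face Hub. This is only a minimal sketch: `hf_hub_download` is the real `huggingface_hub` call, but the repo id, filename, and tokenizer class are assumptions — check the model definitions in this repo for the actual loading entry point.

```python
# minimal sketch: fetch a released checkpoint from the Hugging Face Hub.
# NOTE: the repo id, filename, and class below are assumptions, not the
# repo's verified API; see the model definitions for the real entry point.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="SoftVQVAE/maetok-b-128",   # hypothetical repo id under the SoftVQVAE org
    filename="model.safetensors",       # hypothetical checkpoint filename
)

# intended round trip once the tokenizer is instantiated (names hypothetical):
# tokenizer = MAETok.load(ckpt_path)
# tokens = tokenizer.encode(images)    # (B, 128, d) continuous 1-D tokens
# recon  = tokenizer.decode(tokens)    # (B, 3, 256, 256) reconstructions
```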
| Tokenizer | Image Size | rFID | Hugging Face |
|---|---|---|---|
| SoftVQ-L-64 | 256 | 0.61 | Model Weight |
| SoftVQ-BL-64 | 256 | 0.65 | Model Weight |
| SoftVQ-B-64 | 256 | 0.88 | Model Weight |
| SoftVQ-L-32 | 256 | 0.74 | Model Weight |
| SoftVQ-BL-32 | 256 | 0.68 | Model Weight |
| SoftVQ-B-32 | 256 | 0.89 | Model Weight |
| SoftVQ-BL-64 | 512 | 0.71 | Model Weight |
| SoftVQ-L-32 | 512 | 0.64 | Model Weight |
| Generative Model | Image Size | Tokenizer | gFID (w/o CFG) | gFID (w/ CFG) | Hugging Face |
|---|---|---|---|---|---|
| SiT-XL | 256 | SoftVQ-L-64 | 5.35 | 1.86 | Model Weight |
| SiT-XL | 256 | SoftVQ-BL-64 | 5.80 | 1.88 | Model Weight |
| SiT-XL | 256 | SoftVQ-B-64 | 5.98 | 1.78 | Model Weight |
| SiT-XL | 256 | SoftVQ-L-32 | 7.59 | 2.44 | Model Weight |
| SiT-XL | 256 | SoftVQ-BL-32 | 7.67 | 2.44 | Model Weight |
| SiT-XL | 256 | SoftVQ-B-32 | 7.99 | 2.51 | Model Weight |
| SiT-XL | 512 | SoftVQ-BL-64 | 7.96 | 2.21 | Model Weight |
| SiT-XL | 512 | SoftVQ-L-32 | 10.97 | 4.23 | Model Weight |
| Generative Model | Image Size | Tokenizer | gFID (w/o CFG) | gFID (w/ CFG) | Hugging Face |
|---|---|---|---|---|---|
| DiT-XL | 256 | SoftVQ-L-64 | 5.83 | 2.93 | Model Weight |
| DiT-XL | 256 | SoftVQ-L-32 | 9.07 | 3.69 | Model Weight |
Train Tokenizer
```bash
torchrun --nproc_per_node=8 train/train_tokenizer.py --config configs/softvq-l-64.yaml
```
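For orientation, the core of MAETok training is a masked-autoencoder objective on the tokenizer: a random subset of tokens is hidden and the decoder must reconstruct from the visible ones. Below is a toy sketch of the masking step; shapes and names are illustrative, not the actual `train_tokenizer.py` code.

```python
# toy sketch of MAE-style random masking; illustrative only,
# not the actual train_tokenizer.py implementation
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.4):
    """Keep a random subset of tokens, MAE-style."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    # per-sample random permutation; keep the first num_keep positions
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D)), idx

visible, kept = random_mask(torch.randn(8, 128, 16))  # (8, 76, 16) visible tokens
# the decoder reconstructs pixels/features from `visible`; the training loss
# is a reconstruction term on the masked positions
```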
Train SiT
```bash
torchrun --nproc_per_node=8 train/train_sit.py \
  --report-to="wandb" --allow-tf32 --mixed-precision="bf16" --seed=0 \
  --path-type="linear" --prediction="v" --weighting="lognormal" \
  --model="SiT-XL/1" --vae-model='softvq-l-64' \
  --output-dir="experiments/sit" --exp-index=1 \
  --data-dir=./imagenet/train
```
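The flags above select a linear interpolant (`--path-type="linear"`), velocity prediction (`--prediction="v"`), and lognormal time weighting (`--weighting="lognormal"`). A minimal sketch of that loss, assuming the standard SiT formulation and reading "lognormal" as logit-normal time sampling:

```python
# minimal sketch of a SiT linear-interpolant, v-prediction loss with
# logit-normal time sampling; `model` is any network taking (x_t, t, cond)
import torch

def sit_linear_v_loss(model, x, cond):
    B = x.shape[0]
    # "lognormal" weighting, read as logit-normal sampling: t = sigmoid(z), z ~ N(0, 1)
    t = torch.sigmoid(torch.randn(B, device=x.device))
    t_ = t.view(B, *([1] * (x.dim() - 1)))
    eps = torch.randn_like(x)
    x_t = (1 - t_) * x + t_ * eps      # linear interpolant (--path-type="linear")
    target = eps - x                   # velocity d(x_t)/dt for the linear path
    v_pred = model(x_t, t, cond)       # --prediction="v"
    return ((v_pred - target) ** 2).mean()
```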
Train DiT
```bash
torchrun --nproc_per_node=8 train/train_dit.py \
  --data-path ./imagenet/train --results-dir experiments/dit \
  --model DiT-XL/1 --epochs 1400 --global-batch-size 256 \
  --mixed-precision bf16 --vae-model='softvq-l-64' \
  --noise-schedule cosine --disable-compile
```
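`--noise-schedule cosine` selects a cosine diffusion schedule; assuming the standard Nichol & Dhariwal formulation, the betas can be derived as below.

```python
# sketch of the cosine noise schedule (Nichol & Dhariwal, 2021), assuming
# --noise-schedule cosine refers to the standard formulation
import math

def cosine_betas(T: int, s: float = 0.008, max_beta: float = 0.999):
    def alpha_bar(t):
        return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    betas = []
    for i in range(T):
        # beta_i = 1 - alpha_bar(t_{i+1}) / alpha_bar(t_i), clipped for stability
        betas.append(min(1 - alpha_bar((i + 1) / T) / alpha_bar(i / T), max_beta))
    return betas

betas = cosine_betas(1000)
```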
Reconstruction
```bash
torchrun --nproc_per_node=8 inference/reconstruct_vq.py \
  --data-path ./ImageNet/val --vq-model SoftVQVAE/softvq-l-64
```
SiT Generation
```bash
torchrun --nproc_per_node=8 inference/generate_sit.py \
  --tf32 True --model SoftVQVAE/sit-xl_softvq-b-64 --cfg-scale 1.75 \
  --path-type cosine --num-steps 250 --guidance-high 0.7 \
  --vae-model softvq-l-64
```
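`--cfg-scale` and `--guidance-high` suggest classifier-free guidance applied over only part of the sampling trajectory. A sketch of that combination follows; the exact interval semantics are an assumption, so treat `inference/generate_sit.py` as the source of truth.

```python
# sketch of classifier-free guidance limited to part of the trajectory;
# the semantics of --guidance-high are an assumption: here guidance is
# active only for time fractions t <= guidance_high
import torch

def guided_velocity(model, x_t, t, cond, null_cond,
                    cfg_scale: float = 1.75, guidance_high: float = 0.7):
    v_cond = model(x_t, t, cond)
    if float(t.mean()) > guidance_high:       # outside the guidance interval
        return v_cond
    v_uncond = model(x_t, t, null_cond)
    # standard CFG update: extrapolate away from the unconditional prediction
    return v_uncond + cfg_scale * (v_cond - v_uncond)
```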
DiT Generation
```bash
torchrun --nproc_per_node=8 inference/generate_dit.py \
  --model SoftVQVAE/dit-xl_softvq-b-64 --cfg-scale 1.75 \
  --noise-schedule cosine --num-sampling-steps 250 \
  --vae-model softvq-l-64
```
Evaluation
We use the [ADM evaluation toolkit](https://github.com/openai/guided-diffusion/tree/main/evaluations) to compute the FID/IS of generated samples.
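The ADM evaluator compares a sample batch against a reference batch, both stored as `.npz` files of uint8 `N×H×W×3` images. A sketch of packing generated PNGs into that format (the folder name is hypothetical):

```python
# pack generated PNGs into the uint8 NHWC .npz format the ADM evaluator reads
import glob
import numpy as np
from PIL import Image

files = sorted(glob.glob("samples/*.png"))   # hypothetical output folder
arr = np.stack([np.asarray(Image.open(f).convert("RGB"), dtype=np.uint8) for f in files])
assert arr.ndim == 4 and arr.shape[-1] == 3  # (N, H, W, 3) uint8
np.savez("sample_batch.npz", arr_0=arr)
```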
GMM Fitting
```bash
# save the training latents first
torchrun --nproc_per_node=8 inference/cache_latent.py \
  --dataset imagenet --data-path imagenet/train \
  --sample-dir saved_latent --vae-name maetok-b-128

# fit the GMMs
python inference/gmm_fit.py --use_gpu 0 --exp maetok-b-128 --n_iter 500 \
  --samples_per_class 100 --components 5 10 50 100 200 300
```
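As a CPU reference for the sweep above, the same component counts can be fit with scikit-learn on the cached latents; the `.npy` layout of `saved_latent/` is an assumption here.

```python
# reference sketch: fit GMMs with several component counts on cached latents
# using scikit-learn; the .npy path/layout below is an assumption
import numpy as np
from sklearn.mixture import GaussianMixture

latents = np.load("saved_latent/maetok-b-128.npy")   # hypothetical path, shape (N, d)
for k in (5, 10, 50, 100, 200, 300):
    gmm = GaussianMixture(n_components=k, covariance_type="diag", max_iter=500)
    gmm.fit(latents)
    print(k, gmm.lower_bound_)   # average per-sample log-likelihood bound
```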
```bibtex
@inproceedings{chen2025maetok,
  title={Masked Autoencoders Are Effective Tokenizers for Diffusion Models},
  author={Hao Chen and Yujin Han and Fangyi Chen and Xiang Li and Yidong Wang and Jindong Wang and Ze Wang and Zicheng Liu and Difan Zou and Bhiksha Raj},
  booktitle={ICML},
  year={2025}
}

@inproceedings{chen2025softvqvae,
  title={SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer},
  author={Hao Chen and Ze Wang and Xiang Li and Ximeng Sun and Fangyi Chen and Jiang Liu and Jindong Wang and Bhiksha Raj and Zicheng Liu and Emad Barsoum},
  booktitle={CVPR},
  year={2025}
}
```
A large portion of our code is borrowed from LlamaGen, VAR, ImageFolder, DiT, SiT, and REPA.

