# [ICML25 Spotlight] Masked Autoencoders Are Effective Tokenizers for Diffusion Models

arXiv · Hugging Face models

*Images generated with 128 tokens from the autoencoder*

# [CVPR25] SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

arXiv · Hugging Face models

*Images generated with 32 and 64 tokens*

## Change Logs

## Setup

```bash
conda create -n softvq python=3.10 -y
conda activate softvq
pip install -r requirements.txt
```
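
After installing, a quick check (a minimal sketch, not part of this repo) confirms that PyTorch can see the GPUs the `torchrun` commands below expect:

```python
# Environment sanity check (not from this repo): verify that the PyTorch
# installed via requirements.txt can see the GPUs that torchrun will use.
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"Visible GPUs: {torch.cuda.device_count()}")  # the commands below assume 8
```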

## Models

### MAETok Tokenizers

| Tokenizer | Image Size | rFID | Hugging Face |
|---|---|---|---|
| MAETok-B-128 | 256 | 0.48 | Model Weight |
| MAETok-B-128-512 | 512 | 0.62 | Model Weight |
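
The checkpoints are hosted on the Hugging Face Hub (links in the tables). A minimal download sketch with `huggingface_hub` follows; the repo id below is a hypothetical placeholder, so substitute the id behind the corresponding "Model Weight" link:

```python
# Minimal download sketch (not from this repo). The repo_id is a HYPOTHETICAL
# placeholder -- use the id from the "Model Weight" link in the table above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="MAETok/maetok-b-128")  # hypothetical repo id
print(f"Checkpoint files downloaded to: {local_dir}")
```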

### SiT-XL Models on MAETok

| Generative Model | Image Size | Tokenizer | gFID (w/o CFG) | gFID (w/ CFG) | Hugging Face |
|---|---|---|---|---|---|
| SiT-XL | 256 | MAETok-B-128 | 2.31 | 1.67 | Model Weight |
| SiT-XL | 512 | MAETok-B-128-512 | 2.79 | 1.69 | Model Weight |

### SoftVQ-VAE Tokenizers

| Tokenizer | Image Size | rFID | Hugging Face |
|---|---|---|---|
| SoftVQ-L-64 | 256 | 0.61 | Model Weight |
| SoftVQ-BL-64 | 256 | 0.65 | Model Weight |
| SoftVQ-B-64 | 256 | 0.88 | Model Weight |
| SoftVQ-L-32 | 256 | 0.74 | Model Weight |
| SoftVQ-BL-32 | 256 | 0.68 | Model Weight |
| SoftVQ-B-32 | 256 | 0.89 | Model Weight |
| SoftVQ-BL-64 | 512 | 0.71 | Model Weight |
| SoftVQ-L-32 | 512 | 0.64 | Model Weight |

### SiT-XL Models on SoftVQ-VAE

| Generative Model | Image Size | Tokenizer | gFID (w/o CFG) | gFID (w/ CFG) | Hugging Face |
|---|---|---|---|---|---|
| SiT-XL | 256 | SoftVQ-L-64 | 5.35 | 1.86 | Model Weight |
| SiT-XL | 256 | SoftVQ-BL-64 | 5.80 | 1.88 | Model Weight |
| SiT-XL | 256 | SoftVQ-B-64 | 5.98 | 1.78 | Model Weight |
| SiT-XL | 256 | SoftVQ-L-32 | 7.59 | 2.44 | Model Weight |
| SiT-XL | 256 | SoftVQ-BL-32 | 7.67 | 2.44 | Model Weight |
| SiT-XL | 256 | SoftVQ-B-32 | 7.99 | 2.51 | Model Weight |
| SiT-XL | 512 | SoftVQ-BL-64 | 7.96 | 2.21 | Model Weight |
| SiT-XL | 512 | SoftVQ-L-32 | 10.97 | 4.23 | Model Weight |

### DiT-XL Models on SoftVQ-VAE

| Generative Model | Image Size | Tokenizer | gFID (w/o CFG) | gFID (w/ CFG) | Hugging Face |
|---|---|---|---|---|---|
| DiT-XL | 256 | SoftVQ-L-64 | 5.83 | 2.93 | Model Weight |
| DiT-XL | 256 | SoftVQ-L-32 | 9.07 | 3.69 | Model Weight |

## Training

### Train Tokenizer

```bash
torchrun --nproc_per_node=8 train/train_tokenizer.py --config configs/softvq-l-64.yaml
```

### Train SiT

```bash
torchrun --nproc_per_node=8 train/train_sit.py \
    --report-to="wandb" --allow-tf32 --mixed-precision="bf16" --seed=0 \
    --path-type="linear" --prediction="v" --weighting="lognormal" \
    --model="SiT-XL/1" --vae-model='softvq-l-64' \
    --output-dir="experiments/sit" --exp-index=1 \
    --data-dir=./imagenet/train
```

### Train DiT

```bash
torchrun --nproc_per_node=8 train/train_dit.py \
    --data-path ./imagenet/train --results-dir experiments/dit \
    --model DiT-XL/1 --epochs 1400 --global-batch-size 256 \
    --mixed-precision bf16 --vae-model='softvq-l-64' \
    --noise-schedule cosine --disable-compile
```

## Inference

### Reconstruction

```bash
torchrun --nproc_per_node=8 inference/reconstruct_vq.py --data-path ./ImageNet/val --vq-model SoftVQVAE/softvq-l-64
```
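
Alongside the rFID numbers in the tables above, a quick per-image spot check of reconstruction quality can be done with PSNR. The sketch below is illustrative only (not part of this repo), and the file paths are placeholders for one validation image and its reconstruction:

```python
# PSNR spot check for one reconstruction (illustrative only; paths are placeholders).
import numpy as np
from PIL import Image

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two uint8 RGB images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

orig = np.asarray(Image.open("ImageNet/val/example.JPEG").convert("RGB").resize((256, 256)))
rec = np.asarray(Image.open("reconstructions/example.JPEG").convert("RGB"))
print(f"PSNR: {psnr(orig, rec):.2f} dB")
```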

### SiT Generation

```bash
torchrun --nproc_per_node=8 inference/generate_sit.py \
    --tf32 True --model SoftVQVAE/sit-xl_softvq-b-64 --cfg-scale 1.75 \
    --path-type cosine --num-steps 250 --guidance-high 0.7 \
    --vae-model softvq-l-64
```

### DiT Generation

```bash
torchrun --nproc_per_node=8 inference/generate_dit.py \
    --model SoftVQVAE/dit-xl_softvq-b-64 --cfg-scale 1.75 \
    --noise-schedule cosine --num-sampling-steps 250 \
    --vae-model softvq-l-64
```

## Evaluation

We use the ADM evaluation toolkit to compute the FID/IS of generated samples.
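
For a quick, rough FID estimate in Python without the ADM toolkit, `torchmetrics` ships an Inception-based implementation (install with `pip install "torchmetrics[image]"`). This is a sanity-check sketch only and will not reproduce the evaluation protocol behind the numbers above:

```python
# Quick FID estimate with torchmetrics (NOT the ADM toolkit used for the
# reported numbers -- treat as a sanity check only). Expects uint8 images.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Stand-ins for real validation images and generated samples, shape (N, 3, H, W).
real = torch.randint(0, 256, (50, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (50, 3, 256, 256), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
print(f"FID: {fid.compute().item():.2f}")
```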

## GMM Fitting

```bash
# cache the training-set latents first
torchrun --nproc_per_node=8 inference/cache_latent.py --dataset imagenet --data-path imagenet/train --sample-dir saved_latent --vae-name maetok-b-128
# then fit the GMM
python inference/gmm_fit.py --use_gpu 0 --exp maetok-b-128 --n_iter 500 --samples_per_class 100 --components 5 10 50 100 200 300
```
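
In essence, `inference/gmm_fit.py` fits Gaussian mixtures with several component counts to the cached latents. The sketch below shows the idea with scikit-learn; the `.npy` path and the `diag` covariance are assumptions for illustration, not the script's actual settings:

```python
# Essence of the GMM step (a sketch, not inference/gmm_fit.py itself): fit
# mixtures with several component counts to cached latents and compare the
# average log-likelihood. The .npy path below is a placeholder.
import numpy as np
from sklearn.mixture import GaussianMixture

latents = np.load("saved_latent/maetok-b-128.npy")  # placeholder path
latents = latents.reshape(latents.shape[0], -1)     # flatten tokens to (N, D)

for k in (5, 10, 50, 100, 200, 300):  # component counts from the command above
    gmm = GaussianMixture(n_components=k, covariance_type="diag", max_iter=500)
    gmm.fit(latents)
    print(f"components={k}: avg log-likelihood={gmm.score(latents):.2f}")
```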

## Reference

```bibtex
@inproceedings{chen2025maetok,
    title={Masked Autoencoders Are Effective Tokenizers for Diffusion Models},
    author={Hao Chen and Yujin Han and Fangyi Chen and Xiang Li and Yidong Wang and Jindong Wang and Ze Wang and Zicheng Liu and Difan Zou and Bhiksha Raj},
    booktitle={ICML},
    year={2025},
}

@inproceedings{chen2025softvqvae,
    title={SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer},
    author={Hao Chen and Ze Wang and Xiang Li and Ximeng Sun and Fangyi Chen and Jiang Liu and Jindong Wang and Bhiksha Raj and Zicheng Liu and Emad Barsoum},
    booktitle={CVPR},
    year={2025},
}
```

## Acknowledgements

A large portion of our code is borrowed from LlamaGen, VAR, ImageFolder, DiT, SiT, and REPA.
