This is a PyTorch/GPU implementation of the paper *Holistic Tokenizer for Autoregressive Image Generation*.
This repo contains:
- A simple PyTorch implementation of the Hita tokenizer and various AR generative models.
- Pre-trained Hita tokenizers and AR generative models trained on ImageNet.
- Training and evaluation scripts for the tokenizer and the generative models.
- Pre-trained models on Hugging Face for easy access.
- [2025/07/24] Image tokenizers and AR models for class-conditional image generation are released.
- [2025/07/24] All code of Hita has been released.
- [2025/07/03] Hita has been released. Check out the paper for details.
- [2025/06/26] Hita has been accepted by ICCV 2025!
If you are not using Linux, do NOT proceed.
- Clone this repository and navigate to the Hita folder:
```bash
git clone https://github.com/CVMI-Lab/Hita.git
cd Hita
```
- Install the package:
```bash
conda create -n hita python=3.10 -y
conda activate hita
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
- Install additional packages for training, as required:
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
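After installation, an optional sanity check (not part of the official Hita scripts) is to confirm that PyTorch sees your GPUs and that flash-attn, the training extra, imports cleanly:

```python
# Optional environment sanity check (not part of the official Hita scripts).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

try:
    import flash_attn  # noqa: F401  (only required for training)
    print("flash-attn import OK")
except ImportError:
    print("flash-attn not installed (only needed for training)")
```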
In this repo, we release:
- Two image tokenizers: Hita-V(anilla) and Hita-U(ltra).
- Class-conditional autoregressive generative models ranging from 100M to 3B parameters.
Hita-V is the tokenizer used in the original paper. Hita-U is an updated version that adopts more advanced techniques, such as a DINO-based discriminator and the pre-trained vision foundation model (VFM) feature-reconstruction objective proposed in VFMTok, and it achieves better image reconstruction and generation quality. A conceptual sketch of this combined objective is given below, followed by the reconstruction results.
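The snippet below is only a rough illustration of what such a combined objective can look like. The module interfaces (`tokenizer`, `vfm`), the loss weights, and the omission of the GAN and perceptual terms are assumptions made for the sketch; it is not Hita's actual implementation.

```python
import torch
import torch.nn.functional as F

def tokenizer_losses(tokenizer, vfm, images, w_img=1.0, w_vfm=1.0):
    """Illustrative combined objective: image reconstruction + VFM feature reconstruction.

    `tokenizer` is assumed to return (recon_image, recon_vfm_features, vq_loss);
    `vfm` is a frozen foundation model (e.g. DINOv2-L) providing target features.
    The adversarial term (PatchGAN for Hita-V, DINO discriminator for Hita-U) is omitted.
    """
    recon_img, recon_feat, vq_loss = tokenizer(images)

    # Pixel-space reconstruction of the input image.
    loss_img = F.l1_loss(recon_img, images)

    # Feature-space reconstruction of frozen VFM features (the extra Hita-U objective).
    with torch.no_grad():
        target_feat = vfm(images)
    loss_vfm = 1.0 - F.cosine_similarity(recon_feat, target_feat, dim=-1).mean()

    return w_img * loss_img + w_vfm * loss_vfm + vq_loss
```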
| Method | tokens | rFID (256x256) | rIS (256x256) | weight |
|---|---|---|---|---|
| Hita-V | 569 | 1.03 | 198.5 | hita-vanilla.pt |
| Hita-U | 569 | 0.57 | 221.8 | hita-ultra.pt |
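The 569 tokens per image are consistent with the settings used later in this README: at a 336x336 input with 16x downsampling, the patch grid is 21x21 = 441 tokens, and `--num-slots 128` suggests 128 additional holistic tokens, giving 441 + 128 = 569. To inspect a downloaded checkpoint, a plain `torch.load` is enough; the nested key names below are what one would typically look for, not guaranteed for these files:

```python
import torch

# Inspect a downloaded tokenizer checkpoint on the CPU.
ckpt = torch.load("pretrained_models/hita-ultra.pt", map_location="cpu")

# Training-time checkpoints often nest the weights under keys such as "model"
# or "ema"; fall back to the raw object otherwise (key names are assumptions).
state_dict = ckpt.get("ema", ckpt.get("model", ckpt)) if isinstance(ckpt, dict) else ckpt
print(type(ckpt), len(state_dict))
```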
| Method | params | epochs | FID | IS | weight |
|---|---|---|---|---|---|
| HitaV-B | 111M | 50 | 5.85 | 212.3 | HitaV-B-50e.pt |
| HitaV-B | 111M | 300 | 4.33 | 238.9 | HitaV-B-300e.pt |
| HitaV-L | 343M | 50 | 3.75 | 262.1 | HitaV-L-50e.pt |
| HitaV-L | 343M | 300 | 2.86 | 267.3 | HitaV-L-300e.pt |
| HitaV-XL | 775M | 50 | 2.98 | 253.4 | HitaV-XL-50e.pt |
| HitaV-XXL | 1.4B | 50 | 2.70 | 274.8 | HitaV-XXL-50e.pt |
| HitaV-2B | 2.0B | 50 | 2.59 | 281.9 | HitaV-2B-50e.pt |
| Method | params | epochs | FID | IS | weight |
|---|---|---|---|---|---|
| HitaU-B | 111M | 50 | 4.21 | 229.0 | HitaU-B-50e.pt |
| HitaU-B | 111M | 250 | 3.49 | 237.5 | HitaU-B-250e.pt |
| HitaU-L | 343M | 50 | 2.97 | 273.3 | HitaU-L-50e.pt |
| HitaU-L | 343M | 250 | 2.44 | 274.6 | HitaU-L-250e.pt |
| HitaU-XL | 775M | 50 | 2.40 | 276.3 | HitaU-XL-50e.pt |
| HitaU-XL | 775M | 100 | 2.16 | 275.3 | HitaU-XL-100e.pt |
| HitaU-XXL | 1.4B | 50 | 2.07 | 273.8 | HitaU-XXL-50e.pt |
| HitaU-XXL | 1.4B | 100 | 2.01 | 276.4 | HitaU-XXL-100e.pt |
| HitaU-2B | 2.0B | 50 | 1.93 | 286.0 | HitaU-2B-50e.pt |
| HitaU-2B | 2.0B | 100 | 1.82 | 282.9 | HitaU-2B-100e.pt |
When reconstruction of the pre-trained VFM features and of the original image is performed simultaneously during tokenizer training, we find that the resulting Hita-U(ltra) tokenizer, once integrated into the AR generative models, enables image generation without CFG guidance (a short sketch of CFG logit mixing follows the table below).
| Method | params | epochs | FID | IS | weight |
|---|---|---|---|---|---|
| HitaU-B | 111M | 50 | 8.32 | 108.5 | HitaU-B-50e.pt |
| HitaU-B | 111M | 250 | 5.19 | 138.9 | HitaU-B-250e.pt |
| HitaU-L | 343M | 50 | 3.96 | 151.8 | HitaU-L-50e.pt |
| HitaU-L | 343M | 250 | 2.46 | 188.9 | HitaU-L-250e.pt |
| HitaU-XL | 775M | 50 | 2.66 | 178.9 | HitaU-XL-50e.pt |
| HitaU-XL | 775M | 100 | 2.21 | 195.8 | HitaU-XL-100e.pt |
| HitaU-XXL | 1.4B | 50 | 2.21 | 196.0 | HitaU-XXL-50e.pt |
| HitaU-XXL | 1.4B | 100 | 1.84 | 217.2 | HitaU-XXL-100e.pt |
| HitaU-2B | 2.0B | 50 | 1.97 | 208.6 | HitaU-2B-50e.pt |
| HitaU-2B | 2.0B | 100 | 1.69 | 233.0 | HitaU-2B-100e.pt |
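For reference, classifier-free guidance in AR sampling is usually implemented by running the model on a conditional and an unconditional (class-dropout) copy of the sequence and mixing the next-token logits. The helper below is a generic illustration, not the repo's exact sampling code; CFG-free generation corresponds to `cfg_scale = 1.0`, i.e. using the conditional logits directly.

```python
import torch

def cfg_logits(cond_logits: torch.Tensor,
               uncond_logits: torch.Tensor,
               cfg_scale: float) -> torch.Tensor:
    """Standard classifier-free-guidance mixing of next-token logits.

    cfg_scale == 1.0 reduces to the conditional logits (no guidance), which is
    the setting the CFG-free Hita-U results above correspond to.
    """
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)
```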
- Download the DINOv2-L pre-trained foundation model from the official model zoo.
- Create symbolic links in this directory that point to the pre-trained DINOv2-L model folder and to the ImageNet training dataset folder.
- Create a dataset script for your own dataset. Here, we provide a template for training the tokenizer and AR generative models on the ImageNet dataset in LMDB format (a minimal sketch of such a dataset follows the commands below).
```bash
ln -s DINOv2-L_folder init_models
ln -s ImageNetFolder imagenet
```
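If you write your own dataset script, a minimal LMDB-backed dataset might look like the sketch below. It only illustrates the general pattern; the key scheme, record format, and `label` handling of the provided template will differ, so treat every name here as a placeholder.

```python
import io

import lmdb
from PIL import Image
from torch.utils.data import Dataset


class LMDBImageNet(Dataset):
    """Minimal sketch of an LMDB-backed ImageNet dataset (illustrative only)."""

    def __init__(self, lmdb_path: str, transform=None):
        # For multi-worker DataLoaders the environment is usually opened lazily
        # inside __getitem__; it is opened eagerly here to keep the sketch short.
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin(write=False) as txn:
            # Assumes one record per entry; adapt to the template's key scheme.
            self.length = txn.stat()["entries"]
        self.transform = transform

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        with self.env.begin(write=False) as txn:
            raw = txn.get(f"{index}".encode("ascii"))
        # Hypothetical record format: raw JPEG bytes; the real template would
        # also store and return the class label alongside the image.
        image = Image.open(io.BytesIO(raw)).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        label = 0  # placeholder for the class index
        return image, label
```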
- Train the Hita-V tokenizer:
```bash
export NODE_COUNT=1
export NODE_RANK=0
export PROC_PER_NODE=8
scripts/torchrun.sh vq_train.py --image-size 336 --results-dir output --mixed-precision bf16 --codebook-embed-dim 8 --disc-type patchgan \
--data-path imagenet/lmdb/train_lmdb --global-batch-size 256 --num-workers 4 --ckpt-every 5000 --epochs 50 --log-every 1 --lr 1e-4 \
--transformer-config configs/hita_vqgan.yaml --ema --z-channels 512
```
- Train the Hita-U tokenizer:
```bash
scripts/torchrun.sh vq_train.py --image-size 336 --results-dir output --mixed-precision bf16 --codebook-embed-dim 8 --disc-type dinogan \
--data-path imagenet/lmdb/train_lmdb --global-batch-size 256 --num-workers 4 --ckpt-every 5000 --epochs 50 --log-every 1 --lr 1e-4 \
--transformer-config configs/hita_vqgan_ultra.yaml --ema --z-channels 512 --enable-vfm-recon
```
- Train the AR generative models (a conceptual sketch of the training objective follows the command):
```bash
model_type='GPT-L' # 'GPT-B' 'GPT-XL' 'GPT-XXL' 'GPT-2B'
scripts/torchrun.sh \
train_c2i.py --gpt-type c2i --image-size 336 --gpt-model ${model_type} --downsample-size 16 --num-workers 4 \
--anno-file imagenet/lmdb/train_lmdb --global-batch-size 256 --ckpt-every 10000 --ema --log-every 1 \
--results-dir output/vanilla --vq-ckpt pretrained_models/hita-tok.pt --epochs 300 --codebook-embed-dim 8 \
--codebook-slots-embed-dim 12 --transformer-config-file configs/hita_vqgan.yaml --mixed-precision bf16 --lr 1e-4
```
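Conceptually, `train_c2i.py` trains a class-conditional (c2i) decoder-only transformer with next-token prediction over the discrete indices produced by the frozen tokenizer. The sketch below only illustrates that idea; the interfaces `tokenizer.encode_indices` and `gpt(..., cond=...)`, as well as the sequence layout, are assumptions rather than the repo's actual code.

```python
import torch
import torch.nn.functional as F

def c2i_training_step(gpt, tokenizer, images, class_ids, optimizer):
    """Illustrative class-conditional next-token training step (not the repo's exact code)."""
    with torch.no_grad():
        # The frozen tokenizer maps each image to a sequence of discrete token
        # indices (holistic/slot tokens followed by patch tokens in Hita).
        token_ids = tokenizer.encode_indices(images)    # (B, L), assumed helper

    # Condition on the class label and predict every token from its prefix.
    logits = gpt(token_ids[:, :-1], cond=class_ids)     # (B, L-1, vocab), assumed signature
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```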
- Resume from an AR generative model checkpoint:
```bash
model_type='GPT-L'
scripts/torchrun.sh \
train_c2i.py --gpt-type c2i --image-size 336 --gpt-model ${model_type} --downsample-size 16 --num-workers 4 \
--anno-file imagenet/lmdb/train_lmdb --global-batch-size 270 --ckpt-every 10000 --ema --log-every 1 \
--results-dir output/vanilla --vq-ckpt pretrained_models/hita-tok.pt --epochs 300 --codebook-embed-dim 8 \
--codebook-slots-embed-dim 12 --transformer-config-file configs/hita_vqgan.yaml --mixed-precision bf16 \
--lr 1e-4 --gpt-ckpt output/vanilla/${model_type}/${model_type}-{ckpt_name}.pt
```
- Evaluate a pretrained Hita-V tokenizer:
```bash
scripts/torchrun.sh \
vqgan_test.py --vq-model VQ-16 --image-size 336 --output_dir recons --batch-size 50 \
--transformer-config-file configs/hita_vqgan.yaml --z-channels 512 \
--vq-ckpt pretrained_models/hita-tok.pt
```
- Evaluate a pretrained Hita-U tokenizer:
```bash
scripts/torchrun.sh \
vqgan_test.py --vq-model VQ-16 --image-size 336 --output_dir recons --batch-size 50 \
--transformer-config-file configs/hita_vqgan_ultra.yaml --z-channels 512 \
--vq-ckpt pretrained_models/hita-ultra.pt
```
- Evaluate a pretrained AR generative model (a note on FID/IS evaluation follows the command):
```bash
model_type='GPT-L' # 'GPT-B' 'GPT-XL' 'GPT-XXL' 'GPT-2B'
scripts/torchrun.sh \
test_net.py --vq-ckpt pretrained_models/hita-ultra.pt --gpt-ckpt output/ultra/${model_type}/${model_type}-$1.pt \
--num-slots 128 --gpt-model ${model_type} --image-size 336 --compile --sample-dir samples --cfg-scale $2 \
--image-size-eval 256 --precision bf16 --per-proc-batch-size $3 --codebook-embed-dim 8 --codebook-slots-embed-dim 12 \
--transformer-config-file configs/hita_vqgan_ultra.yaml
```
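This README does not spell out how FID/IS are computed from the generated samples. Codebases in the LlamaGen lineage typically pack the generated images into an `.npz` file and feed it to OpenAI's guided-diffusion (ADM) evaluation suite; if that applies here, a packing step could look like the sketch below. The `samples/` layout and file format are assumptions.

```python
import glob

import numpy as np
from PIL import Image

# Hypothetical: collect the PNGs written to --sample-dir into a single .npz
# in the format expected by the ADM (guided-diffusion) evaluation suite.
paths = sorted(glob.glob("samples/**/*.png", recursive=True))
images = np.stack([np.asarray(Image.open(p).convert("RGB")) for p in paths])  # (N, 256, 256, 3) uint8
np.savez("samples.npz", arr_0=images)
print(images.shape)
```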
If you find Hita useful for your research and applications, please kindly cite it using this BibTeX:

```bibtex
@article{zheng2025holistic,
  title={Holistic Tokenizer for Autoregressive Image Generation},
  author={Zheng, Anlin and Wang, Haochen and Zhao, Yucheng and Deng, Weipeng and Wang, Tiancai and Zhang, Xiangyu and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2507.02358},
  year={2025}
}

@article{zheng2025vision,
  title={Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation},
  author={Zheng, Anlin and Wen, Xin and Zhang, Xuanyang and Ma, Chuofan and Wang, Tiancai and Yu, Gang and Zhang, Xiangyu and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2507.08441},
  year={2025}
}
```
The majority of this project is licensed under the MIT License. Portions of the project are available under the separate licenses of the referenced projects, as detailed in the corresponding files.
Our codebase builds upon several excellent open-source projects, including LlamaGen and Paintmind. We are grateful to the communities behind them.
This codebase has been cleaned up but has not undergone extensive testing. If you encounter any issues or have questions, please open a GitHub issue. We appreciate your feedback!
