Wenqi Ouyang1
Zeqi Xiao1
Danni Yang2
Yifan Zhou1
Shuai Yang3
Lei Yang2
Jianlou Si2
Xingang Pan1
1S-Lab, Nanyang Technological University, 2SenseTime Research
3Wangxuan Institute of Computer Technology, Peking University
The official repo for "TokensGen: Harnessing Condensed Tokens for Long Video Generation".
(Teaser video: teaser_small.mp4)
- [2025-12-09] Our code and weights have been released
- [2025-07-20] Our project page has been established
- [2025-06-26] Our paper is accepted to ICCV 2025
- [x] Release code and weights
Overview of the model. Left: Overall Framework for TokensGen. Right: Trainable Modules.
git clone https://github.com/Vicky0522/TokensGen.git
cd TokensGen
# install required packages
conda env create -f environment.yml
# install longvgen
python setup.py develop
Download the CogVideoX-5b, To2V, and T2To weights, and place them under the folder weights/.
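The README does not pin down the exact subfolder names under weights/, so the sketch below is one plausible layout — the names CogVideoX-5b, To2V, and T2To are assumptions; match whatever checkpoint paths your YAML configs reference:

```shell
# Hypothetical weights/ layout -- the subfolder names are assumptions;
# align them with the checkpoint paths in your config/infer/*.yaml files.
mkdir -p weights/CogVideoX-5b weights/To2V weights/T2To

# Sanity check: confirm every expected checkpoint folder is present.
for d in CogVideoX-5b To2V T2To; do
    if [ -d "weights/$d" ]; then
        echo "found weights/$d"
    else
        echo "missing weights/$d" >&2
    fi
done
```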
# single-GPU inference
CUDA_VISIBLE_DEVICES=0 python infer_cogvideo_mp_fifo.py --config config/infer/edit.yaml
# multi-GPU inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer_cogvideo_mp_fifo.py --config config/infer/edit.yaml
# single-GPU inference
CUDA_VISIBLE_DEVICES=0 python infer_cogvideo_mp_fifo.py --config config/infer/gen.yaml
# multi-GPU inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer_cogvideo_mp_fifo.py --config config/infer/gen.yaml
Download the MiraData dataset.
Change video_dir and csv_file in the YAML file to your own paths. We provide the CSV files used to train the T2To model.
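A sketch of the two fields to edit (the key names follow the sentence above; the paths are placeholders and the surrounding YAML structure is an assumption):

```yaml
# config/train/cogvideo_5b_vaevip_4x8x12_to2v.yaml (excerpt, structure assumed)
video_dir: /path/to/MiraData/videos    # root folder of the downloaded videos
csv_file: /path/to/t2to_train.csv      # one of the provided training CSVs
```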
After setting the paths, run the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train_cogvideo_to2v.py --config config/train/cogvideo_5b_vaevip_4x8x12_to2v.yaml
# After To2V training finishes, run the data-processing script to obtain VAE latents for the long videos (only the selected videos are computed):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch calculate_vae_latents.py --config config/dataprocess/cogvideo_5b_vaevip_4x8x12_calculate_vae_latents.yaml
# Change video_dir to the path of the computed VAE latents, then train the T2To model
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train_cogvideo_t2to.py --config config/train/cogvideo_5b_vaevip_4x8x12_t2to.yaml
If our work is helpful for your research, please consider citing:
@inproceedings{ouyang2025tokensgen,
title={TokensGen: Harnessing Condensed Tokens for Long Video Generation},
author={Ouyang, Wenqi and Xiao, Zeqi and Yang, Danni and Zhou, Yifan and Yang, Shuai and Yang, Lei and Si, Jianlou and Pan, Xingang},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={18197--18206},
year={2025}
}
