TokensGen: Harnessing Condensed Tokens for Long Video Generation

Wenqi Ouyang¹, Zeqi Xiao¹, Danni Yang², Yifan Zhou¹, Shuai Yang³, Lei Yang², Jianlou Si², Xingang Pan¹
¹S-Lab, Nanyang Technological University  ²SenseTime Research  ³Wangxuan Institute of Computer Technology, Peking University

The official repo for "TokensGen: Harnessing Condensed Tokens for Long Video Generation".

(Teaser video: teaser_small.mp4)

🔥 News

  • [2025-12-09] Code and weights have been released.
  • [2025-07-20] Our project page is live.
  • [2025-06-26] Our paper has been accepted to ICCV 2025.

🔧 TODO

  • ✅ Release code and weights (done, see News)

🧐 Methods

Overview of the model. Left: Overall Framework for TokensGen. Right: Trainable Modules.

🌿 Installation with conda

git clone https://github.com/Vicky0522/TokensGen.git
cd TokensGen
# install required packages
conda env create -f environment.yml
# activate the new environment (its name is defined in environment.yml)
conda activate <env-name>
# install the longvgen package in development mode
python setup.py develop
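
As an optional sanity check (assuming the environment above is active and that setup.py registers the package under the name longvgen, as the comment suggests), verify the install and GPU visibility:

python -c "import longvgen; print('longvgen OK')"
python -c "import torch; print(torch.cuda.device_count(), 'CUDA device(s) visible')"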

🚀 Quick Start

Download the weights

Download the CogVideoX-5b base model together with the To2V and T2To checkpoints, and place them under a weights/ folder in the repository root.
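
One plausible layout after downloading (the folder names here are illustrative; what matters is that they match the checkpoint paths referenced in the config/infer/*.yaml files):

weights/
├── CogVideoX-5b/   # base text-to-video model
├── To2V/           # tokens-to-video model (short clips conditioned on condensed tokens)
└── T2To/           # text-to-tokens model (long-range condensed tokens)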

Editing (To2V)

# single-GPU inference
CUDA_VISIBLE_DEVICES=0 python infer_cogvideo_mp_fifo.py --config config/infer/edit.yaml

# multi-GPU inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer_cogvideo_mp_fifo.py --config config/infer/edit.yaml

Generation (T2To + To2V)

# single-GPU inference
CUDA_VISIBLE_DEVICES=0 python infer_cogvideo_mp_fifo.py --config config/infer/gen.yaml

# multi-GPU inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer_cogvideo_mp_fifo.py --config config/infer/gen.yaml
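
Long rollouts can take a while; a generic shell pattern (not specific to this repo) is to run inference in the background and follow the log:

CUDA_VISIBLE_DEVICES=0 nohup python infer_cogvideo_mp_fifo.py --config config/infer/gen.yaml > gen.log 2>&1 &
tail -f gen.log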

⚙️ Training

Download the Dataset

Download the MiraData dataset.

Train the To2V and T2To Models

Set video_dir and csv_file in the training YAML files to your own paths. We provide the CSV files that were used to train the T2To model. A minimal sketch of the relevant entries is shown below (only the two field names come from this README; the values are placeholders):
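
# illustrative entries in config/train/cogvideo_5b_vaevip_4x8x12_to2v.yaml —
# the exact structure is defined in the YAML file itself
video_dir: /path/to/MiraData/videos
csv_file: /path/to/MiraData/train.csv

After setting the paths, run the following commands: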

# 1) train the To2V model
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train_cogvideo_to2v.py --config config/train/cogvideo_5b_vaevip_4x8x12_to2v.yaml

# 2) after To2V training finishes, run the data processing script to precompute VAE latents for the long videos (latents are computed only for the selected videos)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch calculate_vae_latents.py --config config/dataprocess/cogvideo_5b_vaevip_4x8x12_calculate_vae_latents.yaml

# 3) point video_dir at the computed VAE latents, then train the T2To model
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train_cogvideo_t2to.py --config config/train/cogvideo_5b_vaevip_4x8x12_t2to.yaml
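
If accelerate has not been configured on your machine yet, the standard Hugging Face accelerate workflow applies (generic usage, not specific to this repo):

# one-time interactive setup
accelerate config
# or set the process count explicitly, e.g. one process per GPU
accelerate launch --num_processes 8 train_cogvideo_to2v.py --config config/train/cogvideo_5b_vaevip_4x8x12_to2v.yaml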

✏️ Citation

If you find our work helpful for your research, please consider citing:

@inproceedings{ouyang2025tokensgen,
  title={TokensGen: Harnessing Condensed Tokens for Long Video Generation},
  author={Ouyang, Wenqi and Xiao, Zeqi and Yang, Danni and Zhou, Yifan and Yang, Shuai and Yang, Lei and Si, Jianlou and Pan, Xingang},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={18197--18206},
  year={2025}
}
