PyTorch implementation of TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models
✨ If you want the ModelScope version, please find the code on the main branch.
Despite remarkable achievements in video synthesis, achieving granular control over complex dynamics, such as the nuanced movement of multiple interacting objects, remains a significant hurdle for dynamic world modeling, compounded by the need to handle object appearance and disappearance, drastic scale changes, and cross-frame instance consistency. These challenges hinder video generation that faithfully mimics real-world complexity, limiting its utility for applications that require high-level realism and controllability, including advanced scene simulation and training of perception systems. To address this, we propose TrackDiffusion, a novel video generation framework affording fine-grained, trajectory-conditioned motion control via diffusion models, which facilitates precise manipulation of object trajectories and interactions and overcomes the prevalent problems of scale and continuity disruptions. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly enforces inter-frame consistency of multiple objects, a critical factor overlooked in the current literature. Moreover, we demonstrate that video sequences generated by TrackDiffusion can be used as training data for visual perception models. To the best of our knowledge, this is the first work to apply video diffusion models with tracklet conditions and to demonstrate that the generated frames can improve the performance of object trackers.
The framework generates video frames based on the provided tracklets and employs the Instance Enhancer to reinforce the temporal consistency of foreground instances. A new gated cross-attention layer is inserted to take in the instance information.
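For intuition, here is a minimal PyTorch sketch of such a gated cross-attention layer; the module name, tensor shapes, and zero-initialized gate are illustrative assumptions and do not reproduce the exact layer used in this repository.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Illustrative gated cross-attention block (not the exact layer used here).

    `hidden_states` are frame features from the UNet; `instance_tokens` carry
    per-tracklet instance information. A learnable gate, initialized to zero,
    lets the pretrained backbone start unchanged and gradually admit the new
    instance conditioning during fine-tuning.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: the block is an identity mapping at the start.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor, instance_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the video features; keys/values from the instance tokens.
        residual = hidden_states
        x = self.norm(hidden_states)
        attn_out, _ = self.attn(x, instance_tokens, instance_tokens)
        return residual + torch.tanh(self.gate) * attn_out


# Toy shapes: 2 frames, 64 spatial tokens of width 320, 4 instance tokens.
frames = torch.randn(2, 64, 320)
instances = torch.randn(2, 4, 320)
out = GatedCrossAttention(dim=320)(frames, instances)
print(out.shape)  # torch.Size([2, 64, 320])
```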
The code is tested with PyTorch==2.0.1 and CUDA 11.8 on A800 servers. To set up the Python environment, run:
cd ${ROOT}
pip install -r requirements.txt
Then, continue to install the third_party requirements:
pip install https://download.openmmlab.com/mmcv/dist/cu117/torch2.0.0/mmcv-2.0.0-cp310-cp310-manylinux1_x86_64.whl
git clone https://github.com/open-mmlab/mmtracking.git -b dev-1.x
cd mmtracking
pip install -e .
cd third_party/diffusers
pip install -e .
Please download the datasets from the official websites.
YouTube-VIS
The YouTube-VIS 2019 dataset can be downloaded from OpenDataLab (recommended for users in China): https://opendatalab.com/YouTubeVIS2019/download
We also provide caption files for the YTVIS dataset; please download them from Google Drive.
| ModelScope Version | Stable Video Diffusion Version |
|---|---|
| weight | Our training is based on stabilityai/stable-video-diffusion-img2vid. You can obtain the weights for Stage 1 and Stage 2 from the following links: Stage1, Stage2 |
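As a quick sanity check that the base weights are accessible, the snippet below loads the stabilityai/stable-video-diffusion-img2vid checkpoint through the standard diffusers API. This is only a hedged sketch: TrackDiffusion itself is trained through the patched diffusers under third_party/, so this is not the project's training or inference entry point.

```python
import torch
from diffusers import StableVideoDiffusionPipeline

# Load the base image-to-video checkpoint that the TrackDiffusion stages start from.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# The tracklet conditioning lives in the fine-tuned UNet from Stage 1/2;
# the base pipeline above only verifies that the pretrained weights load.
print(type(pipe.unet).__name__)
```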
We use CocoVID to maintain all datasets in this codebase, so you need to convert the official annotations to this style. We provide conversion scripts; usage is as follows:
cd ./third_party/mmtracking
# YouTube-VIS 2021
python ./tools/dataset_converters/youtubevis/youtubevis2coco.py -i ./data/youtube_vis_2021 -o ./data/youtube_vis_2021/annotations --version 2021
The folder structure will be as follows after you run these scripts:
│ ├── youtube_vis_2021
│ │ │── train
│ │ │ │── JPEGImages
│ │ │ │── instances.json (the official annotation files)
│ │ │ │── ......
│ │ │── valid
│ │ │ │── JPEGImages
│ │ │ │── instances.json (the official annotation files)
│ │ │ │── ......
│ │ │── test
│ │ │ │── JPEGImages
│ │ │ │── instances.json (the official annotation files)
│ │ │ │── ......
│ │ │── annotations (the converted annotation file)
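To verify the conversion, a small sketch like the one below can inspect the CocoVID-style JSON. The annotation file name and the exact key names (videos / images / annotations with instance_id) are assumptions based on the usual CocoVID layout; adjust them to your actual output.

```python
import json
from collections import defaultdict

# Hypothetical path; point this at the file produced by youtubevis2coco.py.
ann_path = "./data/youtube_vis_2021/annotations/youtube_vis_2021_train.json"

with open(ann_path) as f:
    coco_vid = json.load(f)

print("videos:", len(coco_vid["videos"]))
print("frames:", len(coco_vid["images"]))
print("boxes :", len(coco_vid["annotations"]))

# Group per-frame boxes into tracklets via their instance id.
tracklets = defaultdict(list)
for ann in coco_vid["annotations"]:
    tracklets[ann["instance_id"]].append((ann["image_id"], ann["bbox"]))

lengths = [len(v) for v in tracklets.values()]
print("tracklets:", len(tracklets), "avg length:", sum(lengths) / max(len(lengths), 1))
```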
Launch training (with 8x A800 GPUs):
If you encounter an error similar to `AssertionError: MMEngine==0.10.3 is used but incompatible. Please install mmengine>=0.0.0, <0.2.0.`, jump directly to that line of code and comment it out; the sketch below shows roughly what that version guard looks like.
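This is only an approximation for reference, not a copy of the mmtracking source; the guard most likely lives in mmtrack/__init__.py on the dev-1.x branch, and the exact file, variable names, and version bounds may differ in your checkout.

```python
# Approximate shape of the version guard (likely in mmtrack/__init__.py);
# file location, variable names, and version bounds may differ.
import mmengine
from mmengine.utils import digit_version

mmengine_minimum_version = '0.0.0'
mmengine_maximum_version = '0.2.0'
mmengine_version = digit_version(mmengine.__version__)

# This is the assertion to comment out if it blocks training:
# assert (mmengine_version >= digit_version(mmengine_minimum_version)
#         and mmengine_version < digit_version(mmengine_maximum_version)), \
#     f'MMEngine=={mmengine.__version__} is used but incompatible. ' \
#     f'Please install mmengine>={mmengine_minimum_version}, ' \
#     f'<{mmengine_maximum_version}.'
```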
bash ./scripts/t2v.sh
Stage 1: Training with RGB boxes
# Launch training with 8x A800 GPUs:
bash ./scripts/stage1.sh
Stage 2: Training with boxes only
# Launch training with 8x A800 GPUs:
bash ./scripts/stage2.sh
Check demo.ipynb for more details.
- Comparison of TrackDiffusion with other methods on generation quality:
- Training visual perception models with frames generated by TrackDiffusion:
More results can be found in the main paper.
We aim to construct a controllable and flexible pipeline for corner-case generation of perception data and for visual world modeling! Check out our latest works:
- GeoDiffusion: text-prompted geometric controls for 2D object detection.
- MagicDrive: multi-view street scene generation for 3D object detection.
- TrackDiffusion: multi-object video generation for multi-object tracking (MOT).
- DetDiffusion: customized corner case generation.
- Geom-Erasing: geometric controls for implicit concept removal.
@misc{li2024trackdiffusion,
title={TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models},
author={Pengxiang Li and Kai Chen and Zhili Liu and Ruiyuan Gao and Lanqing Hong and Guo Zhou and Hua Yao and Dit-Yan Yeung and Huchuan Lu and Xu Jia},
year={2024},
eprint={2312.00651},
archivePrefix={arXiv},
primaryClass={cs.CV}
}