PyTorch implementation of TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models
✨ If you want the ModelScope version, please find the code on the main branch.
Despite remarkable achievements in video synthesis, achieving granular control over complex dynamics, such as the nuanced movement of multiple interacting objects, remains a significant hurdle for dynamic world modeling, compounded by the need to handle object appearance and disappearance, drastic scale changes, and cross-frame instance consistency. These challenges hinder video generation that faithfully mimics real-world complexity, limiting its utility for applications that require high-level realism and controllability, including advanced scene simulation and training of perception systems. To address this, we propose TrackDiffusion, a novel video generation framework affording fine-grained, trajectory-conditioned motion control via diffusion models, which facilitates precise manipulation of object trajectories and interactions and overcomes the prevalent problems of scale and continuity disruptions. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly enforces inter-frame consistency of multiple objects, a critical factor overlooked in the current literature. Moreover, we demonstrate that video sequences generated by TrackDiffusion can be used as training data for visual perception models. To the best of our knowledge, this is the first work to apply video diffusion models with tracklet conditions and to demonstrate that the generated frames can improve the performance of object trackers.
The framework generates video frames based on the provided tracklets and employs the Instance Enhancer to reinforce the temporal consistency of foreground instances. A new gated cross-attention layer is inserted to take in the instance information.
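For intuition, here is a minimal PyTorch sketch of such a gated cross-attention layer; the module name, tensor shapes, and zero-initialized gate are illustrative assumptions and do not reproduce the exact layer used in this repository.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Illustrative gated cross-attention block (not the exact layer used here).

    `hidden_states` are frame features from the UNet; `instance_tokens` carry
    per-tracklet instance information. A learnable gate, initialized to zero,
    lets the pretrained backbone start unchanged and gradually admit the new
    instance conditioning during fine-tuning.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: the block is an identity mapping at the start.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor, instance_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the video features; keys/values from the instance tokens.
        residual = hidden_states
        x = self.norm(hidden_states)
        attn_out, _ = self.attn(x, instance_tokens, instance_tokens)
        return residual + torch.tanh(self.gate) * attn_out


# Toy shapes: 2 frames, 64 spatial tokens of width 320, 4 instance tokens.
frames = torch.randn(2, 64, 320)
instances = torch.randn(2, 4, 320)
out = GatedCrossAttention(dim=320)(frames, instances)
print(out.shape)  # torch.Size([2, 64, 320])
```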
The code is tested with PyTorch==2.0.1 and CUDA 11.8 on A800 servers. To set up the Python environment, run:
cd ${ROOT}
pip install -r requirements.txt
Then, continue to install the third_party requirements:
pip install https://download.openmmlab.com/mmcv/dist/cu117/torch2.0.0/mmcv-2.0.0-cp310-cp310-manylinux1_x86_64.whl
git clone https://github.com/open-mmlab/mmtracking.git -b dev-1.x
cd mmtracking
pip install -e .
cd third_party/diffusers
pip install -e .
Please download the datasets from the official websites.
YouTube-VIS
The YouTube-VIS 2019 dataset can be downloaded from OpenDataLab (recommended for users in China): https://opendatalab.com/YouTubeVIS2019/download
We also provide caption files for the YTVIS dataset; please download them from Google Drive.
| ModelScope Version | Stable Video Diffusion Version |
|---|---|
| weight | Our training is based on stabilityai/stable-video-diffusion-img2vid. You can obtain the weights for Stage 1 and Stage 2 from the following links: Stage1, Stage2 |
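As a quick sanity check that the base weights are accessible, the snippet below loads the stabilityai/stable-video-diffusion-img2vid checkpoint through the standard diffusers API. This is only a hedged sketch: TrackDiffusion itself is trained through the patched diffusers under third_party/, so this is not the project's training or inference entry point.

```python
import torch
from diffusers import StableVideoDiffusionPipeline

# Load the base image-to-video checkpoint that the TrackDiffusion stages start from.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# The tracklet conditioning lives in the fine-tuned UNet from Stage 1/2;
# the base pipeline above only verifies that the pretrained weights load.
print(type(pipe.unet).__name__)
```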
We use CocoVID to maintain all datasets in this codebase, so you need to convert the official annotations to this style. We provide conversion scripts; usage is as follows:
cd ./third_party/mmtracking
# YouTube-VIS 2021
python ./tools/dataset_converters/youtubevis/youtubevis2coco.py -i ./data/youtube_vis_2021 -o ./data/youtube_vis_2021/annotations --version 2021
The folder structure will be as follows after you run these scripts:
│ ├── youtube_vis_2021
│ │ │── train
│ │ │ │── JPEGImages
│ │ │ │── instances.json (the official annotation files)
│ │ │ │── ......
│ │ │── valid
│ │ │ │── JPEGImages
│ │ │ │── instances.json (the official annotation files)
│ │ │ │── ......
│ │ │── test
│ │ │ │── JPEGImages
│ │ │ │── instances.json (the official annotation files)
│ │ │ │── ......
│ │ │── annotations (the converted annotation file)
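To verify the conversion, a small sketch like the one below can inspect the CocoVID-style JSON. The annotation file name and the exact key names (videos / images / annotations with instance_id) are assumptions based on the usual CocoVID layout; adjust them to your actual output.

```python
import json
from collections import defaultdict

# Hypothetical path; point this at the file produced by youtubevis2coco.py.
ann_path = "./data/youtube_vis_2021/annotations/youtube_vis_2021_train.json"

with open(ann_path) as f:
    coco_vid = json.load(f)

print("videos:", len(coco_vid["videos"]))
print("frames:", len(coco_vid["images"]))
print("boxes :", len(coco_vid["annotations"]))

# Group per-frame boxes into tracklets via their instance id.
tracklets = defaultdict(list)
for ann in coco_vid["annotations"]:
    tracklets[ann["instance_id"]].append((ann["image_id"], ann["bbox"]))

lengths = [len(v) for v in tracklets.values()]
print("tracklets:", len(tracklets), "avg length:", sum(lengths) / max(len(lengths), 1))
```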
Launch training (with 8x A800 GPUs):
If you encounter an error similar to `AssertionError: MMEngine==0.10.3 is used but incompatible. Please install mmengine>=0.0.0, <0.2.0.`, jump directly to that line of code and comment it out; the sketch below shows roughly what that version guard looks like.
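This is only an approximation for reference, not a copy of the mmtracking source; the guard most likely lives in mmtrack/__init__.py on the dev-1.x branch, and the exact file, variable names, and version bounds may differ in your checkout.

```python
# Approximate shape of the version guard (likely in mmtrack/__init__.py);
# file location, variable names, and version bounds may differ.
import mmengine
from mmengine.utils import digit_version

mmengine_minimum_version = '0.0.0'
mmengine_maximum_version = '0.2.0'
mmengine_version = digit_version(mmengine.__version__)

# This is the assertion to comment out if it blocks training:
# assert (mmengine_version >= digit_version(mmengine_minimum_version)
#         and mmengine_version < digit_version(mmengine_maximum_version)), \
#     f'MMEngine=={mmengine.__version__} is used but incompatible. ' \
#     f'Please install mmengine>={mmengine_minimum_version}, ' \
#     f'<{mmengine_maximum_version}.'
```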
bash ./scripts/t2v.sh
Stage 1: Training with RGB boxes
# Launch training with 8x A800 GPUs:
bash ./scripts/stage1.sh
Stage 2: Training with boxes only
# Launch training with 8x A800 GPUs:
bash ./scripts/stage2.sh
Check demo.ipynb for more details.
- Comparison of TrackDiffusion with other methods on generation quality:
- Training visual perception models with frames generated by TrackDiffusion:
More results can be found in the main paper.
We aim to construct a controllable and flexible pipeline for corner-case generation of perception data and for visual world modeling! Check out our latest works:
- GeoDiffusion: text-prompted geometric controls for 2D object detection.
- MagicDrive: multi-view street scene generation for 3D object detection.
- TrackDiffusion: multi-object video generation for multi-object tracking (MOT).
- DetDiffusion: customized corner case generation.
- Geom-Erasing: geometric controls for implicit concept removal.
@misc{li2024trackdiffusion,
title={TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models},
author={Pengxiang Li and Kai Chen and Zhili Liu and Ruiyuan Gao and Lanqing Hong and Guo Zhou and Hua Yao and Dit-Yan Yeung and Huchuan Lu and Xu Jia},
year={2024},
eprint={2312.00651},
archivePrefix={arXiv},
primaryClass={cs.CV}
}