
Repurposing Video Diffusion Transformers for Robust Point Tracking

Soowon Son1 · Honggyu An1 · Chaehyun Kim1 · Hyunah Ko1 · Jisu Nam1 · Dahyun Chung1 ·
Siyoon Jin1 · Jung Yi1 · Jaewon Min1 · Junhwa Hur2† · Seungryong Kim1†

1KAIST AI      2Google DeepMind

†Co-corresponding authors

Project Page · arXiv

TL;DR: DiTracker repurposes video Diffusion Transformers (DiTs) for point tracking with softmax-based matching, LoRA adaptation, and cost fusion, achieving stronger robustness and faster convergence on challenging benchmarks.
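The matching step at the heart of this pipeline is easiest to see in code. Below is a minimal sketch of the softmax-based cost computation (the `cost_softmax` option listed in the training table further down); the `softmax_cost` helper and all tensor names and shapes are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of softmax-based cost computation from DiT query/key features.
# All names and shapes here are illustrative assumptions.
import torch
import torch.nn.functional as F

def softmax_cost(query, key, temperature=1.0):
    """Correlate per-point query features against per-frame key features.

    query: (N, C)   features of the N tracked query points
    key:   (H*W, C) key features of one target frame
    Returns an (N, H*W) cost map, softmax-normalized over spatial positions
    (as opposed to a plain normalized dot product).
    """
    logits = query @ key.t() / temperature  # (N, H*W) similarity logits
    return F.softmax(logits, dim=-1)        # probability-like cost map

# Toy usage: 8 query points against a 30x45 feature map with 64 channels.
q = torch.randn(8, 64)
k = torch.randn(30 * 45, 64)
cost = softmax_cost(q, k)
print(cost.shape, cost.sum(dim=-1))  # (8, 1350); each row sums to 1
```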


🔧 Environment Setup

Clone the repository and set up the environment:

```bash
git clone https://github.com/cvlab-kaist/DiTracker.git
cd DiTracker

conda create -n DiTracker python=3.11 -y
conda activate DiTracker
pip install -r requirements.txt
pip install -e .

# Install modified diffusers library
cd diffusers
pip install -e .
cd ..
```
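After installation, a quick import check (a minimal sketch; it only assumes that `torch` and the modified `diffusers` import cleanly) can confirm the environment is usable:

```python
# Sanity-check the environment after installation.
import torch
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers (modified):", diffusers.__version__)
```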

📁 Dataset Preparation

Evaluation Datasets

Download the TAP-Vid-DAVIS, TAP-Vid-Kinetics, and ITTO-MOSE datasets for evaluation.

Organize the datasets with the following directory structure:

```
/path/to/data/
├── tapvid/
│   ├── davis/
│   └── kinetics/
└── itto/
    └── mose/
```
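A small script like the following (an illustrative sketch; `dataset_root` and the expected subdirectories simply mirror the tree above) can verify the layout before running evaluation:

```python
# Verify the expected dataset layout under dataset_root.
from pathlib import Path

dataset_root = Path("/path/to/data")  # adjust to your path
expected = ["tapvid/davis", "tapvid/kinetics", "itto/mose"]

for rel in expected:
    path = dataset_root / rel
    status = "ok" if path.is_dir() else "MISSING"
    print(f"{status:8s} {path}")
```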

Training Dataset

For training, we use the Kubric-MOVi-F dataset from CoTracker3: download the CoTracker3 Kubric Dataset.

🚀 Inference

Pre-trained DiTracker weights are included in the `./checkpoint` directory. Use these weights to evaluate on various benchmarks and challenging scenarios.

Evaluation on Benchmarks

Run the following commands to evaluate DiTracker on different benchmarks:

```bash
# ITTO-MOSE
python evaluate.py --config-name eval_itto_mose_first dataset_root=/path/to/data

# TAP-Vid-DAVIS
python evaluate.py --config-name eval_tapvid_davis_first dataset_root=/path/to/data

# TAP-Vid-Kinetics
python evaluate.py --config-name eval_tapvid_kinetics_first dataset_root=/path/to/data
```

Note: ITTO-MOSE evaluation includes detailed metrics on motion dynamics and reappearance frequency.

Evaluation on Corruptions

Test robustness under various ImageNet-C corruption types:

```bash
python evaluate.py dataset_root=/path/to/data severity=5
```

- `severity`: corruption intensity, from 1 to 5; higher values indicate stronger corruption.
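For intuition, the snippet below sketches one ImageNet-C-style corruption (Gaussian noise) whose strength grows with `severity`. The per-level noise scales here are assumptions for illustration only; the benchmark defines its own values per corruption type.

```python
# Illustrative ImageNet-C-style corruption: Gaussian noise scaled by severity.
import numpy as np

def gaussian_noise(frame, severity=5):
    """frame: float32 array in [0, 1], shape (H, W, 3)."""
    sigma = [0.04, 0.06, 0.08, 0.10, 0.12][severity - 1]  # assumed scales
    noisy = frame + np.random.normal(scale=sigma, size=frame.shape)
    return np.clip(noisy, 0.0, 1.0).astype(np.float32)

frame = np.random.rand(480, 720, 3).astype(np.float32)
corrupted = gaussian_noise(frame, severity=5)
print(corrupted.shape, corrupted.dtype)  # (480, 720, 3) float32
```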

Visualization

To visualize tracked trajectories, add the `visualize=True` option:

```bash
python evaluate_corruption.py --config-name eval_itto_mose_first dataset_root=/path/to/data visualize=True
```

🏋️ Training

To train DiTracker from scratch:

```bash
python train.py --ckpt_path ./output --dataset_root /path/to/data
```

All training parameters are configured to match the paper's specifications. Experiments were conducted on NVIDIA RTX A6000 GPUs.

Key Training Parameters

Other parameters can be customized, but for best performance we recommend keeping the parameters below at the default values described in the paper.

| Parameter | Default Value | Description |
| --- | --- | --- |
| `--model_path` | `CogVideoX-2B` | Video DiT backbone model |
| `--layer_hooks` | `[17]` | Layer indices in the video DiT for query-key extraction |
| `--head_hooks` | `[2]` | Attention head indices for query-key extraction |
| `--model_resolution` | `[480, 720]` | Input resolution (height × width) |
| `--cost_softmax` | `True` | Use softmax for cost calculation (vs. normalized dot product) |
| `--resnet_fuse_mode` | `"concat"` | ResNet fusion: `"add"` (average), `"concat"`, or `None` (disable) |
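To make the `--resnet_fuse_mode` options concrete, here is an illustrative sketch of the two fusion modes applied to a pair of cost maps; the `fuse_costs` helper and all shapes are assumptions, not the repository's implementation.

```python
# Sketch of combining a DiT cost map with a ResNet cost map.
import torch

def fuse_costs(dit_cost, resnet_cost, mode="concat"):
    """dit_cost, resnet_cost: (N, H, W) cost maps for N query points."""
    if mode == "concat":
        # Stack the two cost maps as separate channels: (N, 2, H, W).
        return torch.stack([dit_cost, resnet_cost], dim=1)
    if mode == "add":
        # "add" averages the two maps, keeping the (N, H, W) shape.
        return 0.5 * (dit_cost + resnet_cost)
    if mode is None:
        return dit_cost.unsqueeze(1)  # fusion disabled: DiT cost only
    raise ValueError(f"unknown fuse mode: {mode}")

fused = fuse_costs(torch.rand(8, 30, 45), torch.rand(8, 30, 45), mode="concat")
print(fused.shape)  # torch.Size([8, 2, 30, 45])
```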

🙏 Acknowledgements

This code is built upon CoTracker3 and Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. We sincerely thank the authors for their excellent work and for making their code publicly available.

📝 Citation

If you find DiTracker useful for your research, please consider citing:

```bibtex
@misc{son2025repurposingvideodiffusiontransformers,
      title={Repurposing Video Diffusion Transformers for Robust Point Tracking},
      author={Soowon Son and Honggyu An and Chaehyun Kim and Hyunah Ko and Jisu Nam and Dahyun Chung and Siyoon Jin and Jung Yi and Jaewon Min and Junhwa Hur and Seungryong Kim},
      year={2025},
      eprint={2512.20606},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.20606},
}
```
