Soowon Son1 · Honggyu An1 · Chaehyun Kim1 · Hyunah Ko1 · Jisu Nam1 · Dahyun Chung1 · Siyoon Jin1 · Jung Yi1 · Jaewon Min1 · Junhwa Hur2† · Seungryong Kim1†
1KAIST AI 2Google DeepMind
†Co-corresponding authors
TL;DR: DiTracker repurposes video Diffusion Transformers (DiTs) for point tracking with softmax-based matching, LoRA adaptation, and cost fusion, achieving stronger robustness and faster convergence on challenging benchmarks.
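As a rough, hedged sketch of the softmax-based matching idea (tensor names, shapes, and the temperature are illustrative assumptions, not DiTracker's actual code), matching a tracked point against a target frame can be expressed as a softmax over dot-product similarities between DiT query-key features:

```python
import torch
import torch.nn.functional as F

def softmax_matching(query_feat: torch.Tensor, key_feats: torch.Tensor,
                     temperature: float = 1.0) -> torch.Tensor:
    """Toy sketch: match one query point against all target-frame locations.

    query_feat: (C,) feature of the tracked point from a DiT attention layer.
    key_feats:  (H*W, C) features of the target frame from the same layer/head.
    Returns a (H*W,) matching distribution over target locations.
    """
    logits = key_feats @ query_feat / temperature  # dot-product similarities
    return F.softmax(logits, dim=-1)               # softmax turns them into a distribution

# Illustrative usage with random features (shapes are arbitrary here).
q = torch.randn(64)
k = torch.randn(30 * 40, 64)
match = softmax_matching(q, k)
print(match.shape, float(match.sum()))  # torch.Size([1200]) 1.0
```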
Clone the repository and set up the environment:
git clone https://github.com/cvlab-kaist/DiTracker.git
cd DiTracker
conda create -n DiTracker python=3.11 -y
conda activate DiTracker
pip install -r requirements.txt
pip install -e .
# Install modified diffusers library
cd diffusers
pip install -e .
cd ..

Download the following datasets for evaluation: TAP-Vid-DAVIS, TAP-Vid-Kinetics, and ITTO-MOSE.
Organize the datasets with the following directory structure:
/path/to/data/
├── tapvid/
│   ├── davis/
│   └── kinetics/
└── itto/
    └── mose/
For training, we use the Kubric-MOVi-F dataset from CoTracker3. Download the CoTracker3 Kubric dataset.
Pre-trained DiTracker weights are included in the ./checkpoint directory. Use these weights to evaluate on various benchmarks and challenging scenarios.
Run the following commands to evaluate DiTracker on different benchmarks:
# ITTO-MOSE
python evaluate.py --config-name eval_itto_mose_first dataset_root=/path/to/data
# TAP-Vid-DAVIS
python evaluate.py --config-name eval_tapvid_davis_first dataset_root=/path/to/data
# TAP-Vid-Kinetics
python evaluate.py --config-name eval_tapvid_kinetics_first dataset_root=/path/to/data

Note: ITTO-MOSE evaluation includes detailed metrics on motion dynamics and reappearance frequency.
Test robustness under various ImageNet-C corruption types:
python evaluate.py dataset_root=/path/to/data severity=5

`severity`: Corruption intensity. Higher values indicate stronger corruption.
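To sweep every corruption severity in one go (ImageNet-C defines levels 1 through 5), a small wrapper like the one below can loop over the `severity` values; this helper is a convenience sketch and not part of the released scripts:

```python
# Hypothetical helper: run the corruption evaluation at each ImageNet-C
# severity level (1 = mildest, 5 = strongest).
import subprocess

for severity in range(1, 6):
    subprocess.run(
        ["python", "evaluate.py", "dataset_root=/path/to/data", f"severity={severity}"],
        check=True,
    )
```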
To visualize tracked trajectories, add the visualize=True option:
python evaluate_corruption.py --config-name eval_itto_mose_first dataset_root=/path/to/data visualize=True

To train DiTracker from scratch:
python train.py --ckpt_path ./output --dataset_root /path/to/data

All training parameters are configured to match the paper's specifications. Experiments were conducted on NVIDIA RTX A6000 GPUs.
The key parameters are listed below. Other parameters can be customized, but for best performance we recommend keeping them at the default values described in the paper.
| Parameter | Default Value | Description |
|---|---|---|
| `--model_path` | `CogVideoX-2B` | Video DiT backbone model |
| `--layer_hooks` | `[17]` | Layer indices in the video DiT for query-key extraction |
| `--head_hooks` | `[2]` | Attention head indices for query-key extraction |
| `--model_resolution` | `[480, 720]` | Input resolution (height × width) |
| `--cost_softmax` | `True` | Use softmax for cost calculation (vs. normalized dot product) |
| `--resnet_fuse_mode` | `"concat"` | ResNet fusion mode: `"add"` (average), `"concat"`, or `None` (disable) |
This code is built upon CoTracker3 and Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. We sincerely thank the authors for their excellent work and for making their code publicly available.
If you find DiTracker useful for your research, please consider citing:
@misc{son2025repurposingvideodiffusiontransformers,
title={Repurposing Video Diffusion Transformers for Robust Point Tracking},
author={Soowon Son and Honggyu An and Chaehyun Kim and Hyunah Ko and Jisu Nam and Dahyun Chung and Siyoon Jin and Jung Yi and Jaewon Min and Junhwa Hur and Seungryong Kim},
year={2025},
eprint={2512.20606},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.20606},
}