MVTracker: Multi-View 3D Point Tracking (ICCV 2025 Oral)

[Teaser GIFs: SelfCap, DexYCB, 4D-Dress (stretching), 4D-Dress (avatar-move)]

MVTracker is the first data-driven multi-view 3D point tracker for tracking arbitrary 3D points across multiple cameras. It fuses multi-view features into a unified 3D feature point cloud, within which it leverages kNN-based correlation to capture spatiotemporal relationships across views. A transformer then iteratively refines the point tracks, handling occlusions and adapting to varying camera setups without per-sequence optimization.

Updates

  • August 28, 2025: Public release.

Quick Start

This repo was validated on Python 3.10.12, PyTorch 2.3.0 (CUDA 12.1), cuDNN 8903, and gcc 11.3.0. If you want a fresh minimal environment that runs the Hub demo and demo.py:

conda create -n 3dpt python=3.10.12 -y
conda activate 3dpt
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r https://raw.githubusercontent.com/ethz-vlg/mvtracker/refs/heads/main/requirements.txt

# Optional, speeds up the model
pip install --upgrade --no-build-isolation flash-attn==2.5.8  # Speeds up attention
pip install "git+https://github.com/ethz-vlg/pointcept.git@2082918#subdirectory=libs/pointops"  # Speeds up kNN search; may require gcc 11.3.0: conda install -c conda-forge gcc_linux-64=11.3.0 gxx_linux-64=11.3.0 gcc=11.3.0 gxx=11.3.0

With the minimal dependencies in place, you can try MVTracker directly via PyTorch Hub:

import torch
import numpy as np
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"
mvtracker = torch.hub.load("ethz-vlg/mvtracker", "mvtracker", pretrained=True, device=device)

# Example input from demo sample (downloaded automatically)
sample = np.load(hf_hub_download("ethz-vlg/mvtracker", "data_sample.npz"))
rgbs = torch.from_numpy(sample["rgbs"]).float()
depths = torch.from_numpy(sample["depths"]).float()
intrs = torch.from_numpy(sample["intrs"]).float()
extrs = torch.from_numpy(sample["extrs"]).float()
query_points = torch.from_numpy(sample["query_points"]).float()

with torch.no_grad():
    results = mvtracker(
        rgbs=rgbs[None].to(device) / 255.0,
        depths=depths[None].to(device),
        intrs=intrs[None].to(device),
        extrs=extrs[None].to(device),
        query_points_3d=query_points[None].to(device),
    )

pred_tracks = results["traj_e"].cpu()  # [T,N,3]
pred_vis = results["vis_e"].cpu()      # [T,N]
print(pred_tracks.shape, pred_vis.shape)

Alternatively, you can run our interactive demo:

python demo.py --rerun save --lightweight

By default this saves a lightweight .rrd recording (e.g., mvtracker_demo.rrd) that you can open in any Rerun viewer. The simplest option is to drag and drop the file into the online viewer. For the best experience, you can also install Rerun locally (pip install rerun-sdk==0.21.0; rerun). Results can be explored interactively in the viewer with WASD/QE navigation, mouse rotation and zoom, and timeline playback controls.
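If you ran the PyTorch Hub example above instead of demo.py, you can also write the predicted tracks to an .rrd recording yourself with a few lines of rerun-sdk. This is a minimal sketch, not the logging code used by demo.py; the entity path, colors, and radius are arbitrary choices, and the reshape merely tolerates an optional leading batch dimension:

import rerun as rr

# Flatten an optional leading batch dimension to get [T, N, 3] tracks and [T, N] visibility.
tracks = pred_tracks.reshape(-1, pred_tracks.shape[-2], 3)
vis = pred_vis.reshape(tracks.shape[0], -1)

rr.init("mvtracker_hub_tracks")
rr.save("hub_tracks.rrd")  # open later with `rerun hub_tracks.rrd` or the online viewer

for t in range(tracks.shape[0]):
    rr.set_time_sequence("frame", t)
    colors = [(0, 255, 0) if v else (255, 0, 0) for v in (vis[t] > 0.5).tolist()]  # green = visible, red = occluded
    rr.log("pred_tracks", rr.Points3D(positions=tracks[t].numpy(), colors=colors, radii=0.01))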

[Interactive viewer on a cluster or with GUI support - click to expand]

If you are working on a cluster, you can stream results directly to your laptop by forwarding a port (ssh -R 9876:localhost:9876 user@cluster) and then running the demo in streaming mode (python demo.py --rerun stream), which sends live data into your local Rerun instance. If you are running the demo locally with GUI support, you can automatically spawn a Rerun window (python demo.py --rerun spawn).

Installation

You can use a pretrained model directly via PyTorch Hub (see Quick Start above), or clone this repo if you want to run our demo, evaluation, or training. We recommend using PyTorch with CUDA for best performance. CPU-only runs are possible but very slow.

git clone https://github.com/ethz-vlg/mvtracker.git
cd mvtracker

To extend the conda environment from the Quick Start to support training and evaluation, install the full requirements with pip install -r requirements.full.txt. Baselines based on SpatialTracker V1 also require cupy, which is installed below together with TensorFlow:

pip install tensorflow==2.12.1 tensorflow-datasets tensorflow-graphics tensorboard
pip install cupy-cuda12x==12.2.0
python -m cupyx.tools.install_library --cuda 12.x --library cutensor
python -m cupyx.tools.install_library --cuda 12.x --library nccl
python -m cupyx.tools.install_library --cuda 12.x --library cudnn

Datasets

To benchmark multi-view 3D point tracking, we provide preprocessed versions of three datasets:

  • MV-Kubric: a synthetic training dataset adapted from single-view Kubric into a multi-view setting.
  • Panoptic Studio: evaluation benchmark with real-world activities such as basketball, juggling, and toy play (10 sequences).
  • DexYCB: evaluation benchmark with real-world hand–object interactions (10 sequences).
[Downloading our preprocessed datasets - click to expand]

You can download and extract them as follows (~72 GB after extraction):

# MV-Kubric (simulated + DUSt3R depths)
wget https://huggingface.co/datasets/ethz-vlg/mv3dpt-datasets/resolve/main/kubric-multiview--test.tar.gz -P datasets/
wget https://huggingface.co/datasets/ethz-vlg/mv3dpt-datasets/resolve/main/kubric-multiview--test--dust3r-depth.tar.gz -P datasets/
tar -xvzf datasets/kubric-multiview--test.tar.gz -C datasets/
tar -xvzf datasets/kubric-multiview--test--dust3r-depth.tar.gz -C datasets/
rm datasets/kubric-multiview*.tar.gz

# Panoptic Studio (optimization-based depth from Dynamic3DGS)
wget https://huggingface.co/datasets/ethz-vlg/mv3dpt-datasets/resolve/main/panoptic-multiview.tar.gz -P datasets/
tar -xvzf datasets/panoptic-multiview.tar.gz -C datasets/
rm datasets/panoptic-multiview.tar.gz

# DexYCB (Kinect + DUSt3R depths)
wget https://huggingface.co/datasets/ethz-vlg/mv3dpt-datasets/resolve/main/dex-ycb-multiview.tar.gz -P datasets/
wget https://huggingface.co/datasets/ethz-vlg/mv3dpt-datasets/resolve/main/dex-ycb-multiview--dust3r-depth.tar.gz -P datasets/
tar -xvzf datasets/dex-ycb-multiview.tar.gz -C datasets/
tar -xvzf datasets/dex-ycb-multiview--dust3r-depth.tar.gz -C datasets/
rm datasets/dex-ycb-multiview*.tar.gz

# $ du -sch datasets/*
# 31G     kubric-multiview
# 13G     panoptic-multiview
# 29G     dex-ycb-multiview
# 72G     total
[Regenerating datasets from scratch - click to expand]

If you wish to regenerate datasets from scratch, we provide scripts with docstrings that explain usage and list the commands we used. For licensing and usage terms, please refer to the original datasets.

For quick testing, we also release a small demo sample (~200 MB):

python demo.py --random_query_points

Our generic loader GenericSceneDataset supports adding new datasets. It can compute depths on the fly with DUSt3R, VGGT, MonoFusion, or MoGe-2, and can also estimate camera poses with VGGT.
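If you just want to run the pretrained model on your own capture without writing a dataset class, a simple route is to pack it into the same layout as the demo sample from the Quick Start (keys rgbs, depths, intrs, extrs, query_points) and load it the same way. The shapes below are assumptions that mirror the multi-view video convention of the Quick Start example, not a specification of GenericSceneDataset; inspect data_sample.npz to confirm the exact shapes and query-point convention:

import numpy as np

V, T, H, W, N = 4, 24, 288, 512, 64  # example sizes for a capture with V cameras and T frames
rgbs = np.zeros((V, T, 3, H, W), dtype=np.uint8)            # assumed: uint8 multi-view video
depths = np.zeros((V, T, 1, H, W), dtype=np.float32)        # assumed: metric depth aligned with the RGB frames
intrs = np.tile(np.eye(3, dtype=np.float32), (V, T, 1, 1))  # assumed: pinhole intrinsics per view and frame
extrs = np.tile(np.eye(3, 4, dtype=np.float32), (V, T, 1, 1))  # assumed: world-to-camera extrinsics [R | t]
query_points = np.zeros((N, 4), dtype=np.float32)           # assumed: frame index plus 3D position per query

np.savez("my_scene.npz", rgbs=rgbs, depths=depths, intrs=intrs, extrs=extrs, query_points=query_points)
# Load it exactly like the demo sample in the Quick Start: sample = np.load("my_scene.npz")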

Evaluation

Evaluation is driven by Hydra configs. See mvtracker/cli/eval.py and configs/eval.yaml for details.

To evaluate MVTracker with our best model, first download the checkpoint from Hugging Face:

wget https://huggingface.co/ethz-vlg/mvtracker/resolve/main/mvtracker_200000_june2025.pth -P checkpoints/

Then run:

python -m mvtracker.cli.eval \
  experiment_path=logs/mvtracker \
  model=mvtracker \
  datasets.eval.names=[kubric-multiview-v3-views0123] \
  restore_ckpt_path=checkpoints/mvtracker_200000_june2025.pth

# Expected result:
# {
#   "eval_kubric-multiview-v3-views0123/model__ate_visible__dynamic-static-mean": 5.07,
#   "eval_kubric-multiview-v3-views0123/model__average_jaccard__dynamic-static-mean": 81.42,
#   "eval_kubric-multiview-v3-views0123/model__average_pts_within_thresh__dynamic-static-mean": 90.00
# }

To evaluate a baseline, e.g. CoTracker3-Online (auto-downloaded checkpoint), run:

python -m mvtracker.cli.eval experiment_path=logs/cotracker3-online model=cotracker3_online

# Expected result:
# {
#   "eval_panoptic-multiview-views1_7_14_20/model__average_jaccard__any": 74.56
# }

For more baselines and dataset setups (e.g. varying camera counts, camera subsets, etc.), see scripts/slurm/eval.sh for the commands used in our experiments.

[Details on evaluation parameters - click to expand]

The evaluation datasets are specified with datasets.eval.names. Each name is parsed by the dataset from_name() factory (see e.g. DexYCBMultiViewDataset.from_name), which supports modifiers such as -views, -duster, -novelviews, -removehand, -2dpt, or -cached. This makes it easy to select subsets of cameras, enable different depth sources, or ensure deterministic track sampling. The main labeled benchmarks are:

  • Kubric (synthetic) — e.g. kubric-multiview-v3-views0123
  • Panoptic Studio (real) — e.g. panoptic-multiview-views1_7_14_20
  • DexYCB (real) — e.g. dex-ycb-multiview-views0123

For reproducibility of our main results, we also provide cached variants of each benchmark, which freeze the track selection exactly as used in our paper. Without -cached, tracks are sampled with a fixed random seed, which is reproducible within a single setup; the cached variants guarantee identical tracks across environments. The following cached variants are included in the released datasets:

  • kubric-multiview-v3-views0123-cached
  • kubric-multiview-v3-duster0123-cached
  • panoptic-multiview-views1_7_14_20-cached
  • panoptic-multiview-views27_16_14_8-cached
  • panoptic-multiview-views1_4_7_11-cached
  • dex-ycb-multiview-views0123-cached
  • dex-ycb-multiview-duster0123-cached
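Any of these cached names can be dropped into the eval command shown earlier. For example, to run the cached DexYCB benchmark with DUSt3R depths:

python -m mvtracker.cli.eval \
  experiment_path=logs/mvtracker \
  model=mvtracker \
  datasets.eval.names=[dex-ycb-multiview-duster0123-cached] \
  restore_ckpt_path=checkpoints/mvtracker_200000_june2025.pth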

Training

To run a small overfitting test that fits into 24 GB of GPU memory:

python -m mvtracker.cli.train +experiment=mvtracker_overfit_mini

To train a full-scale MVTracker on an 80 GB GPU:

python -m mvtracker.cli.train +experiment=mvtracker_overfit

Practical Considerations

[Scene normalization - click to expand]

Performance depends strongly on scene normalization. MVTracker was trained on Kubric with randomized but bounded scales and camera setups. At test time, scenes with very different scales, rotations, or translations must be aligned to this distribution. Our generic loader provides an automatic normalization that assumes the ground plane is parallel to the XY plane. This automatic normalization worked reasonably well for 4D-Dress, Hi4D, EgoExo4D, and SelfCap. For Panoptic and DexYCB, we applied manual similarity transforms, which are encoded in the respective dataloaders. Robust, general-purpose normalization remains an open challenge.
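To make the alignment concrete: a world-space similarity transform p' = s·R·p + t leaves the intrinsics untouched, re-expresses the world-to-camera extrinsics, and scales depths and 3D points by s. Below is a minimal numpy sketch of that bookkeeping, meant as an illustration of the idea rather than the normalization code in our loaders:

import numpy as np

def apply_world_similarity(extrs, depths, points, s, R, t):
    """Re-express a scene after the world-space similarity p' = s * R @ p + t.

    extrs:  [..., 3, 4] world-to-camera extrinsics [R_e | t_e]
    depths: metric depth maps of any shape (simply scaled by s)
    points: [..., 3] world-space 3D points (e.g. query points)
    """
    R_e, t_e = extrs[..., :3], extrs[..., 3]
    R_new = R_e @ R.T                                        # camera rotation w.r.t. the new world frame
    t_new = s * t_e - np.einsum("...ij,j->...i", R_new, t)   # chosen so that X_cam' = s * X_cam
    extrs_new = np.concatenate([R_new, t_new[..., None]], axis=-1)
    return extrs_new, s * depths, s * points @ R.T + t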

[Challenges and future directions - click to expand]

The central challenge in multi-view 3D point tracking is 4D reconstruction: obtaining depth maps that are accurate, temporally consistent, and available in real time, especially under sparse-view setups. MVTracker performs well when sensor depth and camera calibration are provided, but in settings where both must be estimated, errors in reconstruction quickly make tracking unreliable. While learned motion priors help tolerate moderate noise, they cannot replace a robust reconstruction backbone. We believe progress will hinge on methods that jointly solve depth estimation and tracking for mutual refinement, or large-scale foundation models for 4D reconstruction and tracking that fully leverage data and compute. We hope the community will direct future efforts toward this goal.

Acknowledgements

Our code builds upon and was inspired by many prior works, including SpaTracker, CoTracker, and DUSt3R. We thank the authors for releasing their code and pretrained models. We are also grateful to the maintainers of Rerun for their helpful visualization toolkit.

Citation

If you find our repository useful, please consider giving it a star ⭐ and citing our work:

@inproceedings{rajic2025mvtracker,
  title     = {Multi-View 3D Point Tracking},
  author    = {Raji{\v{c}}, Frano and Xu, Haofei and Mihajlovic, Marko and Li, Siyuan and Demir, Irem and G{\"u}ndo{\u{g}}du, Emircan and Ke, Lei and Prokudin, Sergey and Pollefeys, Marc and Tang, Siyu},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}

