Video Depth Propagation
Luigi Piccinelli, Thiemo Wandel, Christos Sakaridis, Wim Abbeloos, Luc Van Gool
3DV 2026 | Paper at arXiv 2512.10725
- [ ] Releasing training code.
- [ ] Releasing evaluation datasets.
- 14.12.2025: Model and code released.
- 04.11.2025: VeloDepth is accepted at 3DV 2026!
Check out more results on our website!
These are not hard requirements; other configurations might also work, but they have not been tested:
- Linux
- Python 3.11+
- CUDA 12.1+
Install the environment needed to run VeloDepth with:
export VENV_DIR=<YOUR-VENVS-DIR>
export NAME=velodepth
python -m venv $VENV_DIR/$NAME
source $VENV_DIR/$NAME/bin/activate
# Install VeloDepth and dependencies (more recent CUDAs work fine)
pip install -e . --extra-index-url https://download.pytorch.org/whl/cu121
# Install Pillow-SIMD (Optional)
pip uninstall pillow
CC="cc -mavx2" pip install -U --force-reinstall pillow-simd
# Install KNN (for evaluation only)
cd ./velodepth/ops/knn;bash compile.sh;cd ../../../

If you use conda, you should change the following:
python -m venv $VENV_DIR/$NAME -> conda create -n $NAME python=3.11
source $VENV_DIR/$NAME/bin/activate -> conda activate $NAME
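Before running the demo, a quick sanity check of the environment can be done from Python; this is only a sketch that confirms the package imports and reports whether CUDA is visible:

import torch
import velodepth

# Basic environment sanity check: the package imports and CUDA is visible (if installed).
print("velodepth imported from:", velodepth.__file__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())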
Run VeloDepth on the given assets to test your installation (you can check this script as a guideline for further usage):

python ./scripts/demo.py --video ./assets/demo/bears.mp4 --out_dir ./data

If everything runs correctly, demo.py will save the RGB frames and the corresponding depth maps for each processed frame of the video bears.mp4.
After installing the dependencies, you can load the pre-trained models easily from Hugging Face as follows:
from velodepth.models import VeloDepth
model = VeloDepth.from_pretrained("lpiccinelli/velodepth")

Then you can generate the metric 3D estimation and rays prediction directly from the RGB frames only, as follows:
import cv2
import numpy as np
import torch

# Move to CUDA, if any
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load the RGB frames; the normalization will be taken care of by the model
video_path = "./assets/demo/bears.mp4"
stride = 1         # sample every Nth frame
max_frames = None  # optionally limit the number of frames

cap = cv2.VideoCapture(video_path)
frames = []
idx = 0
while True:
    ok, frame_bgr = cap.read()
    if not ok:
        break
    if idx % stride == 0:
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        frames.append(frame_rgb)
        if max_frames is not None and len(frames) >= max_frames:
            break
    idx += 1
cap.release()

# Stack as a T x C x H x W uint8 tensor
rgbs = torch.from_numpy(np.stack(frames, axis=0)).permute(0, 3, 1, 2)
predictions = model.infer(rgbs, normalize=True)
# Point Cloud in Camera Coordinate
xyz = predictions["points"]
# Unprojected rays
rays = predictions["rays"]
# Metric Depth Estimation
depth = predictions["depth"]
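The snippet below is a minimal sketch for dumping the predicted depth maps to 16-bit PNGs; it assumes depth holds one metric depth map (in meters) per frame, and the millimeter scaling and output folder are illustrative choices:

import os

import numpy as np
from PIL import Image

out_dir = "./data/depth_png"  # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

# Assumption: one depth map per frame; squeeze any singleton channel dimension.
depth_np = depth.squeeze().float().cpu().numpy()
for i, d in enumerate(depth_np):
    # Store metric depth in millimeters as 16-bit PNG (an assumed convention).
    d_mm = np.clip(d * 1000.0, 0, 65535).astype(np.uint16)
    Image.fromarray(d_mm).save(os.path.join(out_dir, f"depth_{i:05d}.png"))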
You can use ground-truth camera parameters or rays as input to the model as well:

import json

from velodepth.utils.camera import (Pinhole, OPENCV, Fisheye624, MEI, Spherical)
camera_path = "assets/demo/scannet.json" # any other json file
with open(camera_path, "r") as f:
camera_dict = json.load(f)
params = torch.tensor(camera_dict["params"])
name = camera_dict["name"]
camera = eval(name)(params=params)
predictions = model.infer(rgbs, camera)
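For reference, the camera JSON only needs the name of the camera class and its params; the sketch below writes a hypothetical pinhole file from Python, where the (fx, fy, cx, cy) parameter ordering is an assumption that you should verify against assets/demo/scannet.json:

import json

import torch

from velodepth.utils.camera import Pinhole

# Hypothetical pinhole intrinsics: the (fx, fy, cx, cy) ordering is an assumption;
# check it against the provided assets/demo/scannet.json before relying on it.
camera_dict = {"name": "Pinhole", "params": [500.0, 500.0, 320.0, 240.0]}
with open("my_camera.json", "w") as f:
    json.dump(camera_dict, f)

# The same dict can also be used to build the camera object directly in code.
camera = Pinhole(params=torch.tensor(camera_dict["params"]))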
To use the forward method for your custom training, you should:
- Take care of the dataloading (a minimal preprocessing sketch is provided after the snippet below):
a) ImageNet-normalization
b) Long-edge based resizing (and padding) with the input shape provided in image_shape under configs
c) BxCxHxW format
d) If any intrinsics are given, adapt them according to your resizing
- Format the input data structure as:
data = {"image": rgb, "rays": rays}
predictions = model(data, {})
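The following is a minimal preprocessing sketch covering steps a)-d) above; the target long-edge size, the padding scheme, and the pinhole intrinsics layout are illustrative assumptions, so align them with the image_shape entry in your config:

import torch
import torch.nn.functional as F

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def preprocess(rgb_uint8: torch.Tensor, K: torch.Tensor | None = None, target: int = 518):
    """rgb_uint8: B x C x H x W uint8 tensor. target: assumed long-edge size (check image_shape in the config)."""
    # a) ImageNet normalization.
    x = rgb_uint8.float() / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD

    # b) Long-edge based resizing, then pad the short edge to a square (padding choice is illustrative).
    h, w = x.shape[-2:]
    scale = target / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    x = F.interpolate(x, size=(new_h, new_w), mode="bilinear", align_corners=False)
    x = F.pad(x, (0, target - new_w, 0, target - new_h))  # c) output stays B x C x H x W

    # d) Rescale 3x3 pinhole intrinsics to the resized frame
    #    (layout assumed: fx=K[0,0], fy=K[1,1], cx=K[0,2], cy=K[1,2]).
    if K is not None:
        scale_h, scale_w = new_h / h, new_w / w
        K = K.clone().float()
        K[..., 0, 0] *= scale_w  # fx
        K[..., 1, 1] *= scale_h  # fy
        K[..., 0, 2] *= scale_w  # cx
        K[..., 1, 2] *= scale_h  # cy
    return x, K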
To run locally, you can use the script ./scripts/infer.py via the following command:

python ./scripts/infer.py --input IMAGE_PATH --output OUTPUT_FOLDER --config-file configs/eval/vitl.json --camera-path CAMERA_JSON --save --save-ply

Usage: python ./scripts/demo.py [OPTIONS]
Options:
--video PATH Path to input video OR a folder with frames.
--frames_from_folder Interpret --video as a folder of frames.
--camera_json PATH Optional camera JSON applied to all frames.
--out_dir PATH Optional output directory for RGB/depth/rays and PLY.
--ply_frame INTEGER Save a PLY for this frame index.
--max_frames INTEGER Limit number of frames.
--stride INTEGER Sample every Nth frame.
--resolution_level INT Model resolution bucket [0..9].
--interpolation TEXT Output upsampling mode (bilinear | bicubic).
See also ./scripts/demo.py
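For example, combining a few of the options above (the values are only illustrative):

python ./scripts/demo.py --video ./assets/demo/bears.mp4 --out_dir ./data --stride 2 --max_frames 100 --ply_frame 0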
The available model propagates the predictions of UniK3D, which serves as the base keyframe monodepth model.
Please visit Hugging Face or click on the links above to access the repo models with weights. You can load VeloDepth as the following:
from velodepth.models import VeloDepth
model = VeloDepth.from_pretrained("lpiccinelli/velodepth")

In addition, we provide loading from TorchHub as:
model = torch.hub.load("lpiccinelli-eth/velodepth", "VeloDepth", pretrained=True, trust_repo=True, force_reload=True)

Please visit docs/train for more information about training.
Please visit docs/eval for more information about running evaluation.
The metrics are delta_1 for accuracy and tau_5 for pairwise consistency (please check the paper for mathematical details), computed over the metric 4D point cloud (higher is better) in zero-shot evaluation.
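For reference, a minimal sketch of the delta_1 computation is given below; it uses the standard 1.25 threshold and assumes dense predicted and ground-truth depths with a validity mask (tau_5 is defined in the paper and not reproduced here):

import torch

def delta_1(pred: torch.Tensor, gt: torch.Tensor, mask: torch.Tensor) -> float:
    # Fraction of valid pixels whose depth ratio max(pred/gt, gt/pred) is below 1.25.
    ratio = torch.maximum(pred / gt, gt / pred)
    return ratio[mask].lt(1.25).float().mean().item()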
If you find any bugs in the code, please report them to Luigi Piccinelli ([email protected]).
If you find our work useful in your research, please consider citing our publication:
@inproceedings{piccinelli2026velodepth,
title = {Video Depth Propagation},
author = {Piccinelli, Luigi and Wandel, Thiemo and Sakaridis, Christos and Abbeloos, Wim and Van Gool, Luc},
booktitle = {Proceedings of the International Conference on 3D Vision (3DV)},
year = {2026}
}

This software is released under the Creative Commons BY-NC 4.0 license. You can view a license summary here.
This work is funded by Toyota Motor Europe via the research project TRACE-Zurich (Toyota Research on Automated Cars Europe).


