Tuan Duc Ngo1
Jiahui Huang2
Seoung Wug Oh2
Kevin Blackburn-Matzen2
Evangelos Kalogerakis1,3
Chuang Gan1
Joon-Young Lee2
1UMass Amherst, 2Adobe Research, 3TU Crete
CVPR 2026
DAGE delivers accurate, consistent 3D geometry and fine-grained, high-resolution depth maps while remaining efficient and scalable.
DAGE is a dual-stream transformer that disentangles global coherence from fine detail for geometry estimation from uncalibrated multi-view/video inputs.
- LR stream builds view-consistent representations and estimates cameras efficiently.
- HR stream preserves sharp boundaries and fine structures per-frame.
- Lightweight adapter fuses the two via cross-attention without disturbing the pretrained single-frame pathway.
- Scales resolution and clip length independently, supports inputs up to 2K, and achieves state-of-the-art results on video geometry estimation and multi-view reconstruction.
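To make the fusion step above more concrete, here is a minimal NumPy sketch of a cross-attention adapter in which HR tokens query LR tokens and the result is added back residually. Everything in it is an assumption for illustration: the class name `CrossAttentionAdapter`, the single-head layout, the token dimension, and in particular the zero-initialised gate, which is one common way to leave a pretrained pathway untouched at the start of training. It is not taken from the DAGE code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class CrossAttentionAdapter:
    """Hypothetical single-head adapter: HR tokens attend to LR tokens."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        # Assumption: a gate initialised to zero, so the adapter starts out
        # as an identity mapping on the HR tokens.
        self.gate = 0.0

    def __call__(self, hr_tokens, lr_tokens):
        q = hr_tokens @ self.w_q                         # (T_hr, D)
        k = lr_tokens @ self.w_k                         # (T_lr, D)
        v = lr_tokens @ self.w_v                         # (T_lr, D)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T_hr, T_lr)
        return hr_tokens + self.gate * (attn @ v)        # residual fusion


rng = np.random.default_rng(0)
adapter = CrossAttentionAdapter(dim=32, rng=rng)
hr_tokens = rng.normal(size=(3600, 32))  # e.g. an 840x840 frame at patch size 14
lr_tokens = rng.normal(size=(324, 32))   # e.g. a 252x252 frame at patch size 14
fused = adapter(hr_tokens, lr_tokens)
```

With the gate at zero, `fused` equals `hr_tokens`, which is the sense in which such an adapter would not disturb the single-frame pathway until the gate is learned.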
- [Mar, 2026] Initial release with inference code, training code, and model checkpoint.
git clone https://github.com/ngoductuanlhp/DAGE.git
cd DAGE
bash scripts/instal_env.sh
conda activate dage
This creates a conda environment with Python 3.10, PyTorch 2.10.0 (CUDA 13.0), and all required dependencies.
Run on the included demo data or your own video/image folder:
# Run with default settings on demo data
bash demo.sh
# Or run directly with custom arguments
# Default: LR at 252px, HR at 3600 tokens (~840x840 for square images)
python inference/infer_dage.py --checkpoint TuanNgo/DAGE
# Higher LR resolution (better camera poses, more compute)
python inference/infer_dage.py --checkpoint TuanNgo/DAGE --lr_max_size 518
# Higher HR resolution up to 2K (sharper pointmaps)
python inference/infer_dage.py --checkpoint TuanNgo/DAGE --hr_max_size 1920
# Memory-efficient chunking for GPUs with <40GB VRAM (lower chunk_size if OOM)
python inference/infer_dage.py --checkpoint TuanNgo/DAGE --hr_max_size 1920 --chunk_size 8
Arguments:
| Argument | Default | Description |
|---|---|---|
| `--checkpoint` | `TuanNgo/DAGE` | Path to model checkpoint |
| `--output_dir` | `quali_results/dage` | Directory to save results |
| `--lr_max_size` | `252` | Max resolution for the LR stream |
| `--hr_max_size` | `None` | Max resolution for the HR stream (auto-computed from 3600 tokens if not set) |
| `--chunk_size` | `None` | Chunk size for HR stream (enables memory-efficient chunked inference) |
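As a rough illustration of what chunked inference over a clip means, the sketch below splits a frame sequence into consecutive chunks and processes each one separately. The helpers `chunk_ranges` and `run_chunked` are hypothetical names introduced here; how DAGE actually chunks the HR stream internally is not described in this README.

```python
def chunk_ranges(num_frames, chunk_size=None):
    """Return (start, end) index pairs covering num_frames frames.

    chunk_size=None mirrors the default above: everything in one pass.
    """
    if chunk_size is None:
        return [(0, num_frames)]
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


def run_chunked(frames, process, chunk_size=None):
    """Apply `process` to each chunk and concatenate the per-frame results."""
    outputs = []
    for start, end in chunk_ranges(len(frames), chunk_size):
        outputs.extend(process(frames[start:end]))
    return outputs
```

Peak memory then scales with the chunk size rather than the clip length, which is why lowering `--chunk_size` is the suggested remedy for out-of-memory errors.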
Input: Place videos (.mp4, .MOV) or image folders in assets/demo_data/.
Output: For each input, the script saves:
- `<name>_disp_colored.mp4`: colorized disparity video
- `<name>_depth_colored.mp4`: colorized depth video
- `<name>.npy`: dictionary with `pointmap`, `pointmap_global`, `pointmap_mask`, `rgb`, and `extrinsics`
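The saved `.npy` dictionary can be read back with NumPy. The sketch below assumes the file was written with `np.save` on a Python dict (which NumPy stores as a 0-d object array), that `pointmap_global` and `rgb` have shape `(N, H, W, 3)`, and that `pointmap_mask` has shape `(N, H, W)`. The helper names `load_result` and `merge_point_cloud` are hypothetical.

```python
import numpy as np


def load_result(path):
    """Load a saved result dictionary from a .npy file."""
    data = np.load(path, allow_pickle=True)
    # np.save wraps a dict in a 0-d object array; .item() unwraps it.
    if data.dtype == object and data.shape == ():
        return data.item()
    return data


def merge_point_cloud(result):
    """Collect all valid world-space points and their colors across frames."""
    points = np.asarray(result["pointmap_global"])            # assumed (N, H, W, 3)
    colors = np.asarray(result["rgb"])                        # assumed (N, H, W, 3)
    mask = np.asarray(result["pointmap_mask"]).astype(bool)   # assumed (N, H, W)
    return points[mask], colors[mask]                         # (M, 3), (M, 3)
```

Note that `allow_pickle=True` executes pickled content, so it should only be used on files you produced yourself.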
Our checkpoint is available on the Hugging Face Hub: TuanNgo/DAGE
Or you can manually download the checkpoint and place it in the checkpoints/ directory:
mkdir -p checkpoints
gdown --fuzzy https://drive.google.com/file/d/1BsBJ7MTarlBP5RjCVfPQoQMsCxccBabF/view?usp=sharing -O ./checkpoints/
- Input: `torch.Tensor` of shape `(B, N, 3, H, W)` with pixel values in `[0, 1]`.
- Output: A `dict` with the following keys:
| Key | Shape | Description |
|---|---|---|
| `local_points` | `(B, N, H, W, 3)` | Per-view 3D point maps in local camera space |
| `conf` | `(B, N, H, W, 1)` | Confidence logits (apply `torch.sigmoid()` for probabilities) |
| `camera_poses` | `(B, N, 4, 4)` | Camera-to-world transformation matrices (OpenCV convention) |
| `metric_scale` | `(B, 1)` | Predicted metric scale factor |
| `global_points` | `(B, N, H, W, 3)` | 3D points in world space (after `infer()`) |
| `mask` | `(B, N, H, W)` | Binary confidence mask (after `infer()`) |
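The relationship between the local point maps, the camera-to-world poses, and world-space points can be sketched as a standard rigid transform, shown here in NumPy for a single batch element. This is an illustration only: whether `infer()` also applies `metric_scale` or other post-processing is not stated here, and the 0.5 threshold and the helper names `local_to_world` and `confidence_mask` are assumptions.

```python
import numpy as np


def sigmoid(x):
    """Map confidence logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-x))


def local_to_world(local_points, camera_poses):
    """Transform per-view points with camera-to-world matrices.

    local_points: (N, H, W, 3) points in each camera's frame.
    camera_poses: (N, 4, 4) camera-to-world matrices.
    """
    rotation = camera_poses[:, :3, :3]       # (N, 3, 3)
    translation = camera_poses[:, :3, 3]     # (N, 3)
    world = np.einsum("nij,nhwj->nhwi", rotation, local_points)
    return world + translation[:, None, None, :]


def confidence_mask(conf_logits, threshold=0.5):
    """Binary mask from (N, H, W, 1) logits; the threshold is an assumed value."""
    return sigmoid(conf_logits[..., 0]) > threshold
```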
import torch
from einops import rearrange
from dage.models.dage import DAGE
from dage.utils.data_utils import read_video, resize_to_max_side
# --- Setup ---
device = 'cuda'
model = DAGE.from_pretrained('checkpoints/model.pt').to(device).eval()
# --- Load Data ---
# read_video returns (frames, H, W, fps)
# Options: stride=N, max_frames=N, force_num_frames=N
video, H, W, fps = read_video('path/to/video.mp4', stride=10, max_frames=100)
# Prepare tensors (B, N, C, H, W), values in [0, 1]
lr_max_size = 252
hr_max_size = 518 # or 1022 / 1918
lr_video, lr_height, lr_width = resize_to_max_side(video, lr_max_size)
hr_video, hr_height, hr_width = resize_to_max_side(video, hr_max_size)
hr_num_tokens = (hr_height // 14) * (hr_width // 14)
lr_video = rearrange(torch.from_numpy(lr_video), 't h w c -> 1 t c h w').float().to(device) / 255.0
hr_video = rearrange(torch.from_numpy(hr_video), 't h w c -> 1 t c h w').float().to(device) / 255.0
# --- Inference ---
with torch.no_grad():
    output = model.infer(
        hr_video=hr_video,
        lr_video=lr_video,
        lr_max_size=lr_max_size,
        hr_num_tokens=hr_num_tokens,
        chunk_size=None,  # optional, for memory efficiency
    )
# Access outputs
local_points = output['local_points'] # (N, H, W, 3)
global_points = output['global_points'] # (N, H, W, 3)
camera_poses = output['camera_poses'] # (N, 4, 4)
mask = output['mask'] # (N, H, W)
Both streams require resolutions that are multiples of the patch size (14). The HR stream defaults to 3600 tokens total (e.g., 840x840 for square images, 630x1120 for 9:16), but can be overridden with `--hr_max_size`.
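The arithmetic behind the token budget can be checked with a short sketch. It assumes the resolution is chosen as the largest patch-aligned size that keeps the aspect ratio and stays within the budget, rounding down. The exact rounding rule used by the inference script is not stated here, and the helper names are hypothetical.

```python
import math

PATCH_SIZE = 14


def hr_size_from_token_budget(height, width, num_tokens=3600, patch_size=PATCH_SIZE):
    """Return a patch-aligned (height, width) using at most num_tokens patches."""
    # rows * cols <= num_tokens with cols / rows close to width / height.
    rows = max(1, math.isqrt(num_tokens * height // width))
    cols = max(1, rows * width // height)
    return rows * patch_size, cols * patch_size


def count_tokens(height, width, patch_size=PATCH_SIZE):
    """Number of patches for a frame, matching the hr_num_tokens formula above."""
    return (height // patch_size) * (width // patch_size)


square = hr_size_from_token_budget(1000, 1000)   # (840, 840)
wide = hr_size_from_token_budget(1080, 1920)     # (630, 1120)
```

Both results use exactly 60 x 60 and 45 x 80 patches respectively, that is, 3600 tokens each.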
We use viser for interactive 3D point cloud visualization. The inference script saves .npy files that can be directly visualized.
Dynamic scenes (pointmaps are rendered sequentially with playback controls):
python visualization/vis_pointmaps.py --data_path quali_results/dage/<name>.npy
# NOTE: add --filter_edge to remove floating points at edges, if any
# python visualization/vis_pointmaps.py --data_path quali_results/dage/<name>.npy --filter_edge
Static scenes (all frames are merged into a single point cloud in a shared coordinate frame):
python visualization/vis_pointmaps_all.py --data_path quali_results/dage/<name>.npy
# NOTE: add --filter_edge to remove floating points at edges, if any
# python visualization/vis_pointmaps_all.py --data_path quali_results/dage/<name>.npy --filter_edge
See docs/TRAINING.md for detailed instructions on data preparation, loss functions, and configuration.
See docs/EVALUATION.md for detailed instructions.
DAGE/
├── assets/
│   └── demo_data/                  # Demo videos for inference
├── configs/
│   └── model_config_dage.yaml      # Model architecture config
├── dage/                           # Main package
│   ├── models/
│   │   ├── dage.py                 # DAGE model
│   │   ├── dinov2/                 # DINOv2 backbone
│   │   ├── layers/                 # Transformer blocks, attention, camera head
│   │   └── moge/                   # MoGe encoder components
│   └── utils/                      # Geometry, visualization, data loading
├── evaluation/                     # Benchmark evaluation
├── inference/
│   └── infer_dage.py               # Main inference script
├── scripts/
│   ├── eval/                       # Evaluation bash scripts
│   ├── infer/                      # Inference bash scripts
│   └── instal_env.sh               # Environment setup
├── setup.py
├── third_party/                    # Code for related work (VGGT, Pi3, Cut3r, etc)
└── training/
    ├── dataloaders/                # Video dataloaders & dataset configs
    ├── loss/                       # Loss functions
    ├── train_dage_stage{1,2,3}.py  # Three-stage training scripts
    └── training_configs/           # YAML configs for trainings
Our work builds upon several open-source projects:
If you find our work useful, please consider citing:
@inproceedings{ngo2026dage,
title={DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation},
author={Ngo, Tuan Duc and Huang, Jiahui and Oh, Seoung Wug and Blackburn-Matzen, Kevin and Kalogerakis, Evangelos and Gan, Chuang and Lee, Joon-Young},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}
The code in this repository is released under the CC BY-NC 4.0 license, unless otherwise specified.