
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

Tuan Duc Ngo¹, Jiahui Huang², Seoung Wug Oh², Kevin Blackburn-Matzen²,
Evangelos Kalogerakis¹,³, Chuang Gan¹, Joon-Young Lee²

¹UMass Amherst    ²Adobe Research    ³TU Crete

CVPR 2026

Paper Project Page

DAGE delivers accurate, consistent 3D geometry and fine-grained, high-resolution depth maps while remaining efficient and scalable.

🧭 Overview

DAGE is a dual-stream transformer that disentangles global coherence from fine detail for geometry estimation from uncalibrated multi-view/video inputs.

  • The low-resolution (LR) stream builds view-consistent representations and estimates cameras efficiently.
  • The high-resolution (HR) stream preserves sharp boundaries and fine structures in each frame.
  • A lightweight adapter fuses the two streams via cross-attention without disturbing the pretrained single-frame pathway.
  • Resolution and clip length scale independently. DAGE supports inputs up to 2K and achieves state-of-the-art results on video geometry estimation and multi-view reconstruction.
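The adapter bullet can be illustrated with a toy single-head cross-attention layer in NumPy, in which HR tokens act as queries over LR tokens. This is a sketch under our own assumptions, not DAGE's implementation. The class name CrossAttentionAdapter, the single head, and the zero-initialized output projection are illustrative choices. Zero-initializing the last projection is one common way to add a fusion branch whose contribution starts at zero, so pretrained single-frame features pass through unchanged until the adapter is trained.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class CrossAttentionAdapter:
    """Hypothetical single-head adapter: HR tokens (queries) attend to LR tokens (keys/values)."""

    def __init__(self, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        # Zero-initialized output projection: the residual update is zero at init,
        # so the HR (single-frame) features are returned unchanged.
        self.w_o = np.zeros((dim, dim))

    def __call__(self, hr_tokens: np.ndarray, lr_tokens: np.ndarray) -> np.ndarray:
        # hr_tokens: (N_hr, D), lr_tokens: (N_lr, D)
        q = hr_tokens @ self.w_q
        k = lr_tokens @ self.w_k
        v = lr_tokens @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N_hr, N_lr)
        return hr_tokens + (attn @ v) @ self.w_o  # residual fusion


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    hr = rng.normal(size=(3600, 64))  # e.g. 60x60 HR patch tokens for one frame
    lr = rng.normal(size=(324, 64))   # e.g. 18x18 LR patch tokens (252 px / 14)
    fused = CrossAttentionAdapter(dim=64)(hr, lr)
    print(fused.shape)              # (3600, 64)
    print(np.allclose(fused, hr))   # True: zero-initialized adapter is an identity
```

When the output projection is zero, the fused tokens equal the HR tokens exactly. Training that projection is what lets LR context flow into the HR features.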

📢 Updates

  • [Mar 2026] Initial release with inference and training code and a model checkpoint.

🚀 Quick Start

🛠️ 1. Clone & Install Dependencies

git clone https://github.com/ngoductuanlhp/DAGE.git
cd DAGE

bash scripts/instal_env.sh
conda activate dage

This creates a conda environment with Python 3.10, PyTorch 2.10.0 (CUDA 13.0), and all required dependencies.

🎬 2. Run Inference

Run on the included demo data or your own video/image folder:

# Run with default settings on demo data
bash demo.sh

# Or run directly with custom arguments

# Default: LR at 252px, HR at 3600 tokens (~840x840 for square images)
python inference/infer_dage.py --checkpoint TuanNgo/DAGE

# Higher LR resolution (better camera poses, more compute)
python inference/infer_dage.py --checkpoint TuanNgo/DAGE --lr_max_size 518

# Higher HR resolution up to 2K (sharper pointmaps)
python inference/infer_dage.py --checkpoint TuanNgo/DAGE --hr_max_size 1920

# Memory-efficient chunking for GPUs with <40GB VRAM (lower chunk_size if OOM)
python inference/infer_dage.py --checkpoint TuanNgo/DAGE --hr_max_size 1920 --chunk_size 8

Arguments:

| Argument | Default | Description |
|---|---|---|
| --checkpoint | TuanNgo/DAGE | Path or Hugging Face Hub ID of the model checkpoint |
| --output_dir | quali_results/dage | Directory where results are saved |
| --lr_max_size | 252 | Maximum image side (px) for the LR stream |
| --hr_max_size | None | Maximum image side (px) for the HR stream (if unset, derived from a 3600-token budget) |
| --chunk_size | None | Chunk size for the HR stream (enables memory-efficient chunked inference) |
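--chunk_size reduces peak memory at the cost of speed by running the HR stream on a few frames at a time. How infer_dage.py splits the work internally is not documented here. The sketch below only illustrates the general pattern, using a hypothetical process_in_chunks helper and a stand-in function in place of the real HR stream. The pattern gives identical results only when each frame's output depends on that frame alone (plus any shared, precomputed context).

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_in_chunks(
    frames: Sequence[T],
    fn: Callable[[Sequence[T]], List[R]],
    chunk_size: Optional[int] = None,
) -> List[R]:
    """Apply fn to consecutive slices of at most chunk_size frames (hypothetical helper).

    chunk_size=None processes all frames at once (fastest, highest memory);
    smaller values lower peak memory because only one chunk is live at a time.
    """
    if chunk_size is None:
        return list(fn(frames))
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    results: List[R] = []
    for start in range(0, len(frames), chunk_size):
        results.extend(fn(frames[start:start + chunk_size]))
    return results


# Stand-in for a per-frame HR computation that records the size of each chunk it sees.
seen_chunk_sizes: List[int] = []


def fake_hr_stream(chunk: Sequence[int]) -> List[int]:
    seen_chunk_sizes.append(len(chunk))
    return [frame * 2 for frame in chunk]


out = process_in_chunks(list(range(20)), fake_hr_stream, chunk_size=8)
print(out[:4], seen_chunk_sizes)  # [0, 2, 4, 6] [8, 8, 4]
```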

Input: Place videos (.mp4, .MOV) or image folders in assets/demo_data/.
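The following hypothetical helper mirrors this input convention. find_inputs is our own name. It matches video suffixes case-insensitively, which the inference script may or may not do, and it treats every subdirectory as a folder of frames.

```python
from pathlib import Path
from typing import List

VIDEO_SUFFIXES = {".mp4", ".mov"}  # compared case-insensitively, so .MOV also matches


def find_inputs(root: str) -> List[Path]:
    """Hypothetical helper: list videos and image folders directly under root."""
    inputs: List[Path] = []
    for entry in sorted(Path(root).iterdir()):
        if entry.is_file() and entry.suffix.lower() in VIDEO_SUFFIXES:
            inputs.append(entry)   # a video file
        elif entry.is_dir():
            inputs.append(entry)   # a folder of image frames
    return inputs


if __name__ == "__main__":
    demo_dir = Path("assets/demo_data")
    if demo_dir.is_dir():
        for item in find_inputs(str(demo_dir)):
            print(item.name)
```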

Output: For each input, the script saves:

  • <name>_disp_colored.mp4: colorized disparity video
  • <name>_depth_colored.mp4: colorized depth video
  • <name>.npy: dictionary with pointmap, pointmap_global, pointmap_mask, rgb, and extrinsics
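A .npy file that holds a dictionary is normally written with np.save on a Python dict. NumPy wraps the dict in a 0-d object array, so it has to be loaded with allow_pickle=True and unwrapped with .item(). The sketch below assumes that is how the result files were written. The helper name load_dage_result is ours, and the array shapes in the round-trip demo are placeholders, not the script's real output shapes. Only load .npy files you trust, because allow_pickle=True runs pickle.

```python
import os
import tempfile

import numpy as np


def load_dage_result(path: str) -> dict:
    """Load a dict saved with np.save (stored as a 0-d object array)."""
    return np.load(path, allow_pickle=True).item()


# Round-trip demo with dummy arrays; the shapes are illustrative only.
dummy = {
    "pointmap": np.zeros((2, 4, 5, 3), dtype=np.float32),
    "pointmap_global": np.zeros((2, 4, 5, 3), dtype=np.float32),
    "pointmap_mask": np.ones((2, 4, 5), dtype=bool),
    "rgb": np.zeros((2, 4, 5, 3), dtype=np.uint8),
    "extrinsics": np.tile(np.eye(4, dtype=np.float32), (2, 1, 1)),
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npy")
    np.save(path, dummy)
    result = load_dage_result(path)

print(sorted(result))  # ['extrinsics', 'pointmap', 'pointmap_global', 'pointmap_mask', 'rgb']
```

For a real result, call load_dage_result on quali_results/dage/<name>.npy and inspect each value's .shape.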

🤗 3. Model Checkpoints

Our checkpoint is available on the 🤗 Hugging Face Hub: TuanNgo/DAGE

Alternatively, download the checkpoint manually and place it in the checkpoints/ directory:

mkdir -p checkpoints

gdown --fuzzy https://drive.google.com/file/d/1BsBJ7MTarlBP5RjCVfPQoQMsCxccBabF/view?usp=sharing -O ./checkpoints/

📘 Detailed Usage

🔄 Model Input & Output

  • Input: torch.Tensor of shape (B, N, 3, H, W) with pixel values in [0, 1].
  • Output: a dict with the following keys:

| Key | Shape | Description |
|---|---|---|
| local_points | (B, N, H, W, 3) | Per-view 3D point maps in local camera space |
| conf | (B, N, H, W, 1) | Confidence logits (apply torch.sigmoid() for probabilities) |
| camera_poses | (B, N, 4, 4) | Camera-to-world transformation matrices (OpenCV convention) |
| metric_scale | (B, 1) | Predicted metric scale factor |
| global_points | (B, N, H, W, 3) | 3D points in world space (after infer()) |
| mask | (B, N, H, W) | Binary confidence mask (after infer()) |
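Because camera_poses are camera-to-world matrices, a point in a view's camera frame maps to world space as X_world = R X_cam + t, where R is the top-left 3x3 block and t is the last column. The NumPy sketch below applies that transform and turns confidence logits into a boolean mask with a sigmoid. Both helper names and the 0.5 threshold are our assumptions. The global_points and mask returned by infer() may involve further steps, such as metric scaling or a different threshold. The helpers work on a single clip, so drop the batch axis if present and convert to NumPy first, e.g. output['local_points'][0].cpu().numpy().

```python
import numpy as np


def local_to_world(local_points: np.ndarray, camera_poses: np.ndarray) -> np.ndarray:
    """Map per-view camera-space points (N, H, W, 3) to world space.

    camera_poses: (N, 4, 4) camera-to-world matrices [R | t; 0 0 0 1].
    Computes X_world = R @ X_cam + t for every pixel of every view.
    """
    rot = camera_poses[:, :3, :3]   # (N, 3, 3)
    trans = camera_poses[:, :3, 3]  # (N, 3)
    world = np.einsum("nij,nhwj->nhwi", rot, local_points)
    return world + trans[:, None, None, :]


def conf_to_mask(conf_logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Sigmoid the (N, H, W, 1) logits and threshold them; the threshold is a guess."""
    prob = 1.0 / (1.0 + np.exp(-conf_logits[..., 0]))  # -> (N, H, W)
    return prob > threshold


rng = np.random.default_rng(0)
pts = rng.normal(size=(2, 3, 4, 3))
poses = np.tile(np.eye(4), (2, 1, 1))
poses[1, :3, 3] = [1.0, 2.0, 3.0]  # second camera is translated
world = local_to_world(pts, poses)
print(world.shape)  # (2, 3, 4, 3)
```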

💡 Example Code Snippet

import torch
from einops import rearrange

from dage.models.dage import DAGE
from dage.utils.data_utils import read_video, resize_to_max_side

# --- Setup ---
device = 'cuda'
model = DAGE.from_pretrained('checkpoints/model.pt').to(device).eval()

# --- Load Data ---
# read_video returns (frames, H, W, fps)
# Options: stride=N, max_frames=N, force_num_frames=N
video, H, W, fps = read_video('path/to/video.mp4', stride=10, max_frames=100)

# Prepare tensors (B, N, C, H, W), values in [0, 1]

lr_max_size = 252
hr_max_size = 518 # or 1022 / 1918

lr_video, lr_height, lr_width = resize_to_max_side(video, lr_max_size)
hr_video, hr_height, hr_width = resize_to_max_side(video, hr_max_size)  
hr_num_tokens = (hr_height // 14) * (hr_width // 14)

lr_video = rearrange(torch.from_numpy(lr_video), 't h w c -> 1 t c h w').float().to(device) / 255.0
hr_video = rearrange(torch.from_numpy(hr_video), 't h w c -> 1 t c h w').float().to(device) / 255.0

# --- Inference ---
with torch.no_grad():
    output = model.infer(
        hr_video=hr_video,
        lr_video=lr_video,
        lr_max_size=lr_max_size,
        hr_num_tokens=hr_num_tokens,
        chunk_size=None,  # optional, for memory efficiency
    )

# Access outputs
local_points = output['local_points']   # (N, H, W, 3)
global_points = output['global_points'] # (N, H, W, 3)
camera_poses = output['camera_poses']   # (N, 4, 4)
mask = output['mask']                   # (N, H, W)
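In the OpenCV camera convention the +z axis points forward, so the z component of local_points is per-pixel depth, and disparity can be taken as its reciprocal. The sketch below computes both from a NumPy array, e.g. output['local_points'].cpu().numpy(). The helper name is ours, and how the inference script normalizes and colorizes these maps for the output videos is not described here, so only raw values are produced.

```python
import numpy as np


def depth_and_disparity(local_points: np.ndarray, eps: float = 1e-6):
    """Depth = z of camera-space points; disparity = 1 / depth where depth > eps, else 0."""
    depth = local_points[..., 2]
    disparity = np.zeros_like(depth)
    valid = depth > eps
    disparity[valid] = 1.0 / depth[valid]
    return depth, disparity


pts = np.zeros((1, 2, 2, 3))  # one frame, 2x2 pixels
pts[..., 2] = [[2.0, 4.0], [0.5, 0.0]]
depth, disp = depth_and_disparity(pts)
print(disp[0])  # reciprocal depths; the zero-depth pixel stays 0
```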

📐 Resolution Handling

Both streams require image sides that are multiples of the patch size (14 px). By default, the HR stream uses a budget of 3600 patch tokens per frame (e.g., 840x840 for square images, 630x1120 for 9:16), which can be overridden with --hr_max_size.
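The token budget and the max-side option can be turned into concrete sizes with a little integer arithmetic. The helpers below are a hypothetical sketch. hr_size_from_tokens picks the largest patch grid that roughly keeps the input aspect ratio within the budget, which reproduces the 840x840 and 630x1120 examples. max_side_size floors to multiples of 14 after scaling the longer side. The repository's own rounding, including that of resize_to_max_side, may differ. count_tokens matches the (H // 14) * (W // 14) expression in the example snippet.

```python
import math
from typing import Tuple

PATCH = 14  # patch size shared by the LR and HR streams


def count_tokens(height: int, width: int) -> int:
    """Patch tokens for one frame: (H // 14) * (W // 14)."""
    return (height // PATCH) * (width // PATCH)


def hr_size_from_tokens(height: int, width: int, budget: int = 3600) -> Tuple[int, int]:
    """Hypothetical rule: largest patch grid with ~the input aspect ratio and <= budget tokens.

    rows = floor(sqrt(budget * H / W)), cols = floor(sqrt(budget * W / H)), so
    rows * cols <= budget. Integer math avoids float rounding surprises.
    """
    rows = max(1, math.isqrt(budget * height // width))
    cols = max(1, math.isqrt(budget * width // height))
    return rows * PATCH, cols * PATCH


def max_side_size(height: int, width: int, max_size: int) -> Tuple[int, int]:
    """Hypothetical rule: scale the longer side to max_size, floor both sides to a multiple of 14."""
    long_side = max(height, width)
    new_h = height * max_size // long_side
    new_w = width * max_size // long_side
    return max(PATCH, new_h // PATCH * PATCH), max(PATCH, new_w // PATCH * PATCH)


print(hr_size_from_tokens(1000, 1000))  # (840, 840)
print(hr_size_from_tokens(1080, 1920))  # (630, 1120) for a 1080x1920 (HxW) frame
print(max_side_size(1080, 1920, 252))   # (140, 252)
print(count_tokens(140, 252))           # 180
```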

👀 Visualization

We use viser for interactive 3D point cloud visualization. The inference script saves .npy files that can be directly visualized.

Dynamic scenes: renders the pointmaps frame by frame with playback controls:

python visualization/vis_pointmaps.py --data_path quali_results/dage/<name>.npy

# Optional: remove floating points at edges (if present)
# python visualization/vis_pointmaps.py --data_path quali_results/dage/<name>.npy --filter_edge
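Floating points typically appear along depth discontinuities, where a pixel straddles foreground and background. The exact criterion behind --filter_edge is not documented here. The sketch below shows one common heuristic as an assumption: it drops any pixel whose depth differs from one of its 4-neighbours by more than a relative threshold. The function name edge_mask and the 5% default are ours.

```python
import numpy as np


def edge_mask(depth: np.ndarray, rel_threshold: float = 0.05) -> np.ndarray:
    """Return True for pixels whose depth is close to every existing 4-neighbour.

    depth: (..., H, W). A pixel is dropped if |d - d_neighbour| / d > rel_threshold
    for any neighbour; border pixels are only compared with neighbours that exist.
    """
    d = np.maximum(depth, 1e-6)  # avoid division by zero
    keep = np.ones(depth.shape, dtype=bool)
    dv = np.abs(depth[..., 1:, :] - depth[..., :-1, :])  # vertical jumps
    keep[..., :-1, :] &= dv / d[..., :-1, :] <= rel_threshold
    keep[..., 1:, :] &= dv / d[..., 1:, :] <= rel_threshold
    dh = np.abs(depth[..., :, 1:] - depth[..., :, :-1])  # horizontal jumps
    keep[..., :, :-1] &= dh / d[..., :, :-1] <= rel_threshold
    keep[..., :, 1:] &= dh / d[..., :, 1:] <= rel_threshold
    return keep


# A step from depth 1 to depth 5 between columns 1 and 2: both columns at the step are dropped.
depth = np.array([[1.0, 1.0, 5.0, 5.0]] * 3)
print(edge_mask(depth)[0])
```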

Static scenes: merges all frames into a single point cloud in a shared coordinate frame:

python visualization/vis_pointmaps_all.py --data_path quali_results/dage/<name>.npy

# Optional: remove floating points at edges (if present)
# python visualization/vis_pointmaps_all.py --data_path quali_results/dage/<name>.npy --filter_edge
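Merging a static scene amounts to keeping the valid world-space points of every frame and stacking them into one cloud. The sketch below assumes the arrays come from the saved dictionary (pointmap_global, pointmap_mask, rgb) and that rgb has the same resolution as the pointmaps. The helper name merge_frames is ours. The resulting (M, 3) points and colors are the form a point-cloud viewer such as viser expects.

```python
import numpy as np


def merge_frames(points: np.ndarray, mask: np.ndarray, rgb: np.ndarray):
    """Flatten per-frame world points into one cloud, keeping only masked pixels.

    points: (N, H, W, 3) world-space points, e.g. the file's 'pointmap_global'.
    mask:   (N, H, W) validity mask, e.g. 'pointmap_mask'.
    rgb:    (N, H, W, 3) colors aligned with the points.
    Returns (M, 3) points and (M, 3) colors, where M is the number of valid pixels.
    """
    mask = mask.astype(bool)
    return points[mask], rgb[mask]


pts = np.arange(2 * 2 * 3 * 3, dtype=float).reshape(2, 2, 3, 3)
msk = np.zeros((2, 2, 3), dtype=bool)
msk[0, 0, 0] = True
msk[1, 1, 2] = True
col = np.ones((2, 2, 3, 3), dtype=np.uint8)
xyz, colors = merge_frames(pts, msk, col)
print(xyz.shape, colors.shape)  # (2, 3) (2, 3)
```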

🎓 Training

See docs/TRAINING.md for detailed instructions on data preparation, loss functions, and configuration.

📊 Evaluation

See docs/EVALUATION.md for detailed instructions.

🗂️ Project Structure

DAGE/
├── assets/
│   └── demo_data/                  # Demo videos for inference
├── configs/
│   └── model_config_dage.yaml      # Model architecture config
├── dage/                           # Main package
│   ├── models/
│   │   ├── dage.py                 # DAGE model
│   │   ├── dinov2/                 # DINOv2 backbone
│   │   ├── layers/                 # Transformer blocks, attention, camera head
│   │   └── moge/                   # MoGe encoder components
│   └── utils/                      # Geometry, visualization, data loading
├── evaluation/                     # Benchmark evaluation
├── inference/
│   └── infer_dage.py               # Main inference script
├── scripts/
│   ├── eval/                       # Evaluation bash scripts
│   ├── infer/                      # Inference bash scripts
│   └── instal_env.sh               # Environment setup
├── setup.py
├── third_party/                    # Code for related work (VGGT, Pi3, Cut3r, etc.)
└── training/
    ├── dataloaders/                # Video dataloaders & dataset configs
    ├── loss/                       # Loss functions
    ├── train_dage_stage{1,2,3}.py  # Three-stage training scripts
    └── training_configs/           # YAML configs for training

🙏 Acknowledgements

Our work builds upon several open-source projects:

📝 Citation

If you find our work useful, please consider citing:

@inproceedings{ngo2026dage,
  title={DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation},
  author={Ngo, Tuan Duc and Huang, Jiahui and Oh, Seoung Wug and Blackburn-Matzen, Kevin and Kalogerakis, Evangelos and Gan, Chuang and Lee, Joon-Young},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

βš–οΈ License

The code in this repository is released under the CC BY-NC 4.0 license, unless otherwise specified.
