VGGT-Ω


Abstract

Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-Ω, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol.

We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-Ω uses only ∼30% of the GPU memory of its predecessor, which allows us to train VGGT-Ω with 15× more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-Ω achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, e.g., improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding.
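
As a rough illustration of the register-attention idea described above, the sketch below builds a boolean attention mask in which tokens attend freely within their own frame, while attention across frames is allowed only between register tokens. This is one plausible reading of "restricts inter-frame information exchange to these registers", not the authors' implementation. The function name, the token layout (registers first in each frame), and the toy sizes are all assumptions made for this example.

```python
def register_attention_mask(num_frames, regs_per_frame, patches_per_frame):
    """Build a toy attention mask (hypothetical layout, not the paper's code).

    Tokens are ordered frame by frame; within each frame the register
    tokens come first, followed by the patch tokens.
    mask[q][k] is True when query token q may attend to key token k.
    """
    tokens_per_frame = regs_per_frame + patches_per_frame
    total = num_frames * tokens_per_frame
    frame_of = [t // tokens_per_frame for t in range(total)]
    is_register = [(t % tokens_per_frame) < regs_per_frame for t in range(total)]

    mask = [
        [
            # Same frame: always allowed. Different frames: registers only.
            frame_of[q] == frame_of[k] or (is_register[q] and is_register[k])
            for k in range(total)
        ]
        for q in range(total)
    ]
    return mask, is_register


def count_allowed(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


# Toy configuration: 3 frames, 2 registers and 4 patch tokens per frame.
mask, is_register = register_attention_mask(
    num_frames=3, regs_per_frame=2, patches_per_frame=4
)
allowed_pairs = count_allowed(mask)   # pairs permitted by the sketch
global_pairs = len(mask) ** 2         # pairs under full global attention
print(allowed_pairs, global_pairs)
```

In this toy setting the mask permits far fewer query-key pairs than full global attention, and the gap widens as the number of frames grows, because only the small set of registers communicates across frames. The actual memory savings reported above also depend on the other architectural changes, which this sketch does not model.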

Key Takeaways

Reconstruction is a powerful and scalable proxy task for spatial understanding.
Registers are natural carriers of global information, and register attention makes that exchange efficient.
Latents learned through reconstruction (a) benefit VLA models and (b) can be readily aligned with language.
Camera parameters and depth maps are sufficient and compact targets for reconstruction.
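
To see why camera parameters and depth maps can suffice as reconstruction targets, the sketch below lifts a depth map to world-space 3D points using a pinhole camera. It is a minimal illustration under stated assumptions, not the model's actual parameterization: depth is taken to be z-depth along the optical axis, the intrinsics are `fx`, `fy`, `cx`, `cy`, and the pose `(R, t)` maps camera coordinates to world coordinates. The function name `unproject_depth` is hypothetical.

```python
def unproject_depth(depth, fx, fy, cx, cy, R, t):
    """Lift a depth map to world-space points (illustrative sketch).

    depth: list of rows of z-depth values (assumed convention).
    fx, fy, cx, cy: pinhole intrinsics in pixels.
    R, t: camera-to-world rotation (3x3) and translation (3,), an
          assumed convention; other pipelines use world-to-camera.
    Returns a list of (x, y, z) world points in row-major pixel order.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            # Back-project pixel (u, v) to a camera-space point at depth d.
            cam = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Apply the camera-to-world rigid transform.
            world = tuple(
                sum(R[i][j] * cam[j] for j in range(3)) + t[i]
                for i in range(3)
            )
            points.append(world)
    return points


# Toy example: a 2x2 depth map at constant depth 2, identity rotation,
# and a camera shifted by one unit along the world z-axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = unproject_depth(
    depth=[[2.0, 2.0], [2.0, 2.0]],
    fx=1.0, fy=1.0, cx=0.5, cy=0.5,
    R=identity, t=(0.0, 0.0, 1.0),
)
print(points)
```

Because every pixel's 3D position follows from its depth and the camera, a separate per-pixel point-map target carries no extra information under these assumptions.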

Point-of-View Examples

These point-of-view examples pair the input video (upper left) with a reconstruction rendered from the estimated camera poses (upper right). This rendering is a stringent test of pose accuracy, since even small camera-pose errors create visible misalignment with the input. The lower panel provides an interactive 3D viewer.

Acknowledgements

We are deeply grateful to Bach-Thuan Bui, Ang Cao, Robert Carrillo, Vikas Chandra, Linzhuo Chen, Yuchao Dai, Andrew Davison, Alexei Efros, Haiwen Feng, David Forsyth, Jian Gao, Zhongrui Gui, Junlin Han, Tengda Han, Kaiming He, Wenlong Huang, George Hulm, Menglin Jia, Hanwen Jiang, Haian Jin, Linyi Jin, Angjoo Kanazawa, Nikhil Keetha, Zihang Lai, Hongdong Li, Runjia Li, Weiyu Li, Zhengqi Li, Philipp Lindenberger, Shaohui Liu, Shikun Liu, Iurii Makarov, Dmytro Mishkin, Linfei Pan, Feike Postmes, Michaël Ramamonjisoa, Janahan Ramanan, Belal Shaheen, Roman Shapovalov, You Shen, Yujun Shen, Kevin Sheridan, Zifan Shi, Stanislaw Szymanowicz, Letian Wang, Qianqian Wang, Yunnan Wang, Michael Wu, Rundi Wu, Shangzhe Wu, Junyu Xie, Yinghao Xu, Nan Xue, Ceyuan Yang, Jihan Yang, Chuhan Zhang, Junyi Zhang, Tianyuan Zhang, Kecheng Zheng, Yiran Zhong, and Andrew Zisserman for their discussions and support.

We thank Mikhail Parkhomenko for kindly granting us permission to use the film Transmission in our demo. We are especially grateful to Tianhong Li for insightful discussions on why dense decoding heads should be MLP-only (also see JiT). We are grateful to Tianrun Chen and Mo Xin for their help with data preprocessing and benchmarking, and to Bingyi Kang for discussions on improving DPT. Discussions with Noah Snavely, Ben Poole, Aleksander Holynski, and Jon Barron inspired parts of our exposition in the Further Insights and Discussion sections. We particularly appreciate the detailed discussions with Haotong Lin and Yifan Wang on numerous implementation details.

On a more personal note, this work is named Omega (Ω) because it is Jianyuan's last PhD paper. In trying to make this work feel complete, he kept adding new content and pursuing observations that he hoped to understand more deeply, which inevitably led to months of delay. He is deeply thankful to all coauthors for their patience, generosity, and understanding throughout this process.

Citation

@inproceedings{wang2026vggtomega,
  title     = {{VGGT-$\Omega$}},
  author    = {Jianyuan Wang and Minghao Chen and Shangzhan Zhang and Nikita Karaev and Johannes Sch{\"o}nberger and Patrick Labatut and Piotr Bojanowski and David Novotny and Andrea Vedaldi and Christian Rupprecht},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}