Jianyuan Wang1,2
Minghao Chen1
Shangzhan Zhang1
Nikita Karaev1
Johannes Schönberger2
Patrick Labatut2
Piotr Bojanowski2
David Novotny
Andrea Vedaldi1,2
Christian Rupprecht1
Before using the models, please request access to the checkpoints here. Once your request is approved, you can download the checkpoints. Please note that access requests are reviewed by an automated process based on the information provided in the request.
| Model | Resolution | Text alignment | Download |
|---|---|---|---|
| VGGT-Omega-1B-512 | 512 | No | Link |
| VGGT-Omega-1B-256-Text-Alignment | 256 | Yes | Link |
The authors are not involved in the review process and cannot approve or reject individual applications. However, the 🤗 Hugging Face demo is available to everyone.
First, clone this repository and install the dependencies:
```bash
git clone git@github.com:facebookresearch/vggt-omega.git
cd vggt-omega
pip install -r requirements.txt
pip install -e .
```

Now, try the model with a few lines of code:
```python
import torch

from vggt_omega.models import VGGTOmega
from vggt_omega.utils.load_fn import load_and_preprocess_images
from vggt_omega.utils.pose_enc import encoding_to_camera

checkpoint_path = "path/to/vggt_omega_1b_512.pt"
image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]

# Build the model and load the downloaded weights.
model = VGGTOmega().to("cuda").eval()
model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

images = load_and_preprocess_images(image_names, image_resolution=512).to("cuda")

with torch.inference_mode():
    predictions = model(images)

# Decode the pose encoding into camera extrinsics and intrinsics.
extrinsics, intrinsics = encoding_to_camera(
    predictions["pose_enc"],
    predictions["images"].shape[-2:],
)

depth = predictions["depth"]
depth_conf = predictions["depth_conf"]

# The first token along the token axis is the camera token; the rest are registers.
camera_and_register_tokens = predictions["camera_and_register_tokens"]
camera_tokens = camera_and_register_tokens[:, :, :1]
registers = camera_and_register_tokens[:, :, 1:]
```

For the text-aligned checkpoint, use `VGGTOmega(enable_alignment=True)` with `image_resolution=256` and read `predictions["text_alignment_embedding"]`.
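The token split in the example above can be checked without a GPU or a checkpoint. The sketch below uses a NumPy array as a stand-in for `predictions["camera_and_register_tokens"]`. The `(batch, frames, tokens, channels)` layout is an assumption read off the slicing in the example, and the sizes (including the number of register tokens) are hypothetical.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
batch, frames, num_registers, channels = 1, 3, 4, 8

# Stand-in for predictions["camera_and_register_tokens"], assumed to be laid
# out as (batch, frames, 1 + num_registers, channels), camera token first.
tokens = np.zeros((batch, frames, 1 + num_registers, channels), dtype=np.float32)

camera_tokens = tokens[:, :, :1]  # slice keeps the token axis: (1, 3, 1, 8)
registers = tokens[:, :, 1:]      # remaining tokens:           (1, 3, 4, 8)
camera_vectors = tokens[:, :, 0]  # integer index drops the axis: (1, 3, 8)

print(camera_tokens.shape, registers.shape, camera_vectors.shape)
```

Slicing with `:1` rather than indexing with `0` keeps the token axis, so `camera_tokens` and `registers` can be concatenated back along that axis if needed.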
Install the demo dependencies:
```bash
pip install -r requirements_demo.txt
```

Launch the Gradio demo with a local checkpoint path:

```bash
python demo_gradio.py \
    --checkpoint checkpoints/VGGT-Omega-1B-512/model.pt \
    --image-resolution 512
```

The demo accepts uploaded images or a video, runs camera and depth inference, and visualizes the depth-unprojected point cloud and predicted cameras as a GLB scene.
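To illustrate what "depth-unprojected point cloud" means, here is a minimal NumPy sketch of unprojecting one depth map. It is not the demo's code. It assumes pinhole 3x3 intrinsics, 3x4 world-to-camera extrinsics `[R | t]` in the OpenCV convention, and pixel coordinates taken at integer positions; the function name `unproject_depth` is hypothetical, and the repository's conventions may differ.

```python
import numpy as np


def unproject_depth(depth, intrinsics, extrinsics):
    """Lift an (H, W) depth map to (H, W, 3) world-space points.

    Assumes pinhole intrinsics (3x3) and world-to-camera extrinsics
    [R | t] (3x4). These conventions are assumptions of this sketch.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]

    # Back-project each pixel along its ray, scaled by the depth value.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    cam_points = np.stack([x, y, depth], axis=-1)  # (H, W, 3)

    # X_cam = R @ X_world + t  =>  X_world = R^T @ (X_cam - t).
    rotation, translation = extrinsics[:, :3], extrinsics[:, 3]
    return (cam_points - translation) @ rotation


# Tiny usage example with made-up values.
depth = np.full((2, 3), 2.0)
K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
E = np.hstack([np.eye(3), np.array([[0.0], [0.0], [1.0]])])
points = unproject_depth(depth, K, E)
print(points.shape)  # (2, 3, 3)
```

Repeating this for every frame and concatenating the results, optionally after filtering by a confidence map such as `depth_conf`, yields a single point cloud in a shared world frame.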
We benchmark the end-to-end peak GPU memory usage of VGGT-Omega-1B-512 on a
single NVIDIA A100 GPU with 624x416 input images. The measurement covers the full
inference program, from loading the model weights onto the GPU through the
forward pass, so it includes both the memory needed to store the model itself
and the memory used by inference activations and buffers. In other words, a GPU
with at least the listed available memory is able to run the corresponding
number of input frames under this setup.
| Input Frames | 1 | 10 | 25 | 50 | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|---|---|---|---|
| Peak Memory (GB) | 6.02 | 6.67 | 7.80 | 9.66 | 13.37 | 20.82 | 28.26 | 35.71 | 43.15 |
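The figures in the table grow almost linearly, at roughly 0.074 GB per additional frame. The sketch below interpolates the table to estimate how many frames fit in a given memory budget. The helper names are hypothetical, and the estimates only apply to this benchmark setup (A100, 624x416 inputs); leave headroom on other hardware or software stacks.

```python
import numpy as np

# Values copied from the benchmark table.
FRAMES = np.array([1, 10, 25, 50, 100, 200, 300, 400, 500])
PEAK_GB = np.array([6.02, 6.67, 7.80, 9.66, 13.37, 20.82, 28.26, 35.71, 43.15])


def estimate_peak_gb(num_frames):
    """Linearly interpolate peak memory within the measured range (1 to 500)."""
    if not FRAMES[0] <= num_frames <= FRAMES[-1]:
        raise ValueError("num_frames is outside the measured range")
    return float(np.interp(num_frames, FRAMES, PEAK_GB))


def max_frames_for_budget(budget_gb):
    """Largest measured-range frame count whose estimate fits the budget.

    Returns 0 if even a single frame is estimated not to fit.
    """
    candidates = np.arange(FRAMES[0], FRAMES[-1] + 1)
    estimates = np.interp(candidates, FRAMES, PEAK_GB)
    fitting = candidates[estimates <= budget_gb]
    return int(fitting[-1]) if fitting.size else 0


print(estimate_peak_gb(150))
print(max_frames_for_budget(24.0))
```

The function deliberately refuses to extrapolate beyond 500 frames, since the table gives no evidence about behavior outside the measured range.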
The benchmark uses `load_and_preprocess_images` with the default `mode="balanced"` and
`image_resolution=512`. For these roughly 3:2 landscape images, this produces 624x416
inputs. You can set `mode="max_size"` to resize the longest side to 512 instead; for the
same aspect ratio, this gives about 512x336 inputs and uses less GPU memory.
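One way to read the two modes is sketched below. It is an assumption that happens to reproduce the sizes quoted above, not the library's implementation: `"balanced"` is modeled as keeping the pixel count near `image_resolution ** 2`, `"max_size"` as scaling the longest side to `image_resolution`, and both as snapping each side to a multiple of 16. The function `target_size` and its `multiple` parameter are hypothetical.

```python
import math


def target_size(width, height, image_resolution=512, mode="balanced", multiple=16):
    """Return a (width, height) guess for the resized input.

    Sketch only: the area-preserving rule and the multiple-of-16 snapping
    are assumptions chosen to match the sizes quoted in this README.
    """
    aspect = width / height
    if mode == "balanced":
        # Keep the pixel count close to image_resolution ** 2.
        new_w = image_resolution * math.sqrt(aspect)
        new_h = image_resolution / math.sqrt(aspect)
    elif mode == "max_size":
        # Scale so that the longest side equals image_resolution.
        scale = image_resolution / max(width, height)
        new_w, new_h = width * scale, height * scale
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    def snap(value):
        return max(multiple, round(value / multiple) * multiple)

    return snap(new_w), snap(new_h)


print(target_size(3000, 2000, mode="balanced"))  # (624, 416)
print(target_size(3000, 2000, mode="max_size"))  # (512, 336)
```

Under this reading, `"max_size"` yields about 34% fewer pixels for a 3:2 image (512x336 versus 624x416), which is consistent with the lower memory use noted above.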
See the LICENSE file for details about the license under which this code is made available.
```bibtex
@misc{wang2026vggtomega,
  title={VGGT-$\Omega$},
  author={Jianyuan Wang and Minghao Chen and Shangzhan Zhang and Nikita Karaev and Johannes Schönberger and Patrick Labatut and Piotr Bojanowski and David Novotny and Andrea Vedaldi and Christian Rupprecht},
  year={2026},
  eprint={2605.15195},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.15195},
}
```