
VGGT-Long: Chunk it, Loop it, Align it, Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

This repository contains the source code for our work:

VGGT-Long: Chunk it, Loop it, Align it, Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on the KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios.

Demo videos: VGGT-Long-kitti.mp4, VGGT-Long-staris.mp4, VGGT-Long-Cyberpunk.mp4

Figures: system overview and method details.

News

[11 Dec 2025] Released DA3-Streaming.

[04 Sep 2025] Released Pi-Long.

[22 Jul 2025] Submitted to arXiv.

Updates

[12 Dec 2025] 1. We refactored the original architecture to support arbitrary foundation models, including VGGT, Pi3, and MapAnything; the current pipeline can easily be extended to future 3D foundation models. 2. Leveraging MapAnything's multimodal inputs and its ability to predict metric (real-world) scale, Map-Long achieves strong performance in the metric-scale setting with $\text{SE}(3)$ alignment.

[05 Nov 2025] We have uploaded the mobile-phone input images from the demo to Google Drive, since such complex, large-scale scenes seem to be quite rare in other public datasets; feel free to use them for your own demos. See the section "Self-Collected Dataset Used in Demo Video" in README.md.

[08 Oct 2025] 1. We have added support for $\text{SE}(3)$ alignment, which you can enable in the config.yaml file. Recent 3D models such as MapAnything now support metric scale, and under metric scale the 6-DoF $\text{SE}(3)$ alignment is more stable than the 7-DoF $\text{SIM}(3)$ alignment. If you are using such models, we provide a code reference for $\text{SE}(3)$ alignment. 2. We have fixed a bug in the vectorized_reservoir_sampling function in loop_utils/sim3utils.py. Special thanks to @Horace89 for the assistance!

[22 Sep 2025] We uploaded the demo video to RedNote (and later to YouTube, on 06 Oct 2025).

[04 Sep 2025] We have developed Pi-Long as a complementary project to Pi3 and VGGT-Long. Benefiting from Pi3's outstanding performance, Pi-Long performs even better at the kilometer scale. Feel free to check it out.

[02 Aug 2025] Updated the licensing terms of VGGT-Long to reflect the upstream dependency license (See VGGT for the changes). Please see the License Section for full details.

[30 Jul 2025] Chunk alignment sped up (0.273 s/iter $\rightarrow$ 0.175 s/iter on my machine).

[23 Jul 2025] Fixed some bugs in scripts/download_weights.sh.

[22 Jul 2025] Submitted to arXiv.

[15 Jul 2025] To help you better understand our project, we have updated some visualizations.

[14 Jun 2025] GitHub code release.

Setup, Installation & Running

🖥️ 1 - Hardware and System Environment

This project was developed, tested, and run in the following hardware/system environment:

Hardware Environment:
    CPU(s): Intel Xeon(R) Gold 6128 CPU @ 3.40GHz × 12
    GPU(s): NVIDIA RTX 4090 (24 GiB VRAM)
    RAM: 67.0 GiB (DDR4, 2666 MT/s)
    Disk: Dell 8TB 7200RPM HDD (SATA, Seq. Read 220 MiB/s)

System Environment:
    Linux System: Ubuntu 22.04.3 LTS
    CUDA Version: 11.8
    cuDNN Version: 9.1.0
    NVIDIA Drivers: 555.42.06
    Conda version: 23.9.0 (Miniconda)

📦 2 - Environment Setup

Note: This repository contains a significant amount of C++ code, but our goal is to make it as out-of-the-box usable as possible for researchers, as many deep learning researchers may not be familiar with C++ compilation. Currently, the code for VGGT-Long can run in a pure Python environment, which means you can skip all the C++ compilation steps in the README.

Step 1: Dependency Installation

Creating a virtual environment using conda (or miniconda),

conda create -n vggt-long python=3.10.18
conda activate vggt-long
# pip version created by conda: 25.1

Next, install PyTorch,

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
# Verified to work with CUDA 11.8 and torch 2.5.1
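
To quickly confirm that the CUDA build of PyTorch is active (a small sanity check, not part of the official setup), you can run:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# Expect something like: 2.5.1+cu118 11.8 True (with a working NVIDIA driver)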

Install other requirements,

pip install -r requirements.txt

Step 2: Weights Download

Download all the pre-trained weights needed (by default, weights for VGGT, Pi3, and MapAnything are downloaded):

bash ./scripts/download_weights.sh

You can skip the next two steps if you would like to run VGGT-Long in pure Python.

Step 3 (Optional) : Compile Loop-Closure Correction Module

We provide a Python-based Sim3 solver, so VGGT-Long can run the loop closure correction solving without compiling C++ code. However, we still recommend installing the C++ solver as it is more stable and faster.

python setup.py install

We made a simple figure (below) to help you better understand Sec. 3.2 of the paper.

Figure: illustration of the loop-closure correction (Sec. 3.2).
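
For intuition, the core operation in both chunk alignment and loop-closure correction is estimating a similarity (or rigid) transform between two overlapping point sets. Below is a minimal, self-contained Umeyama-style sketch for illustration only; it is not the solver shipped in loop_utils/sim3utils.py or the C++ module, and the function name and shapes are hypothetical. Setting with_scale=False corresponds to the 6-DoF $\text{SE}(3)$ case mentioned in the Updates; with_scale=True gives the 7-DoF $\text{SIM}(3)$ case.

import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    """Estimate s, R, t minimizing || dst - (s * R @ src + t) ||^2.

    src, dst: (N, 3) arrays of corresponding 3D points from two overlapping chunks.
    with_scale=False fixes s = 1, i.e. a rigid SE(3) fit instead of Sim(3).
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centred point sets, then SVD
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Guard against a reflection being returned instead of a rotation
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(axis=0).sum() if with_scale else 1.0
    t = mu_dst - s * R @ mu_src
    return s, R, t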

Step 4 (Optional) : Compile DBoW Loop-Closure Detection Module

The DBoW module provides a VPR (visual place recognition) model that can run inference on CPU only. You can skip this step.

See details

Install the OpenCV C++ API.

sudo apt-get install -y libopencv-dev

Install DBoW2

cd DBoW2
mkdir -p build && cd build
cmake ..
make
sudo make install
cd ../..

Install the image retrieval package:

pip install ./DPRetrieval

Step 5 (Optional): Install mapanything as a package into the vggt-long environment if you want to use MapAnything. Link

🚀 3 - Running the code

python vggt_long.py --image_dir ./path_of_images

You can modify the parameters in the configs/base_config.yaml file. If you have created multiple yaml files to explore the effects of different parameters, you can specify the file path by adding --config to the command. For example:

python vggt_long.py --image_dir ./path_of_images --config ./configs/base_config.yaml

You can change the model field in the config to use a different foundation model.
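
As a small, hedged sketch of how you might derive a new config programmatically (the model key and the value strings below are placeholders; check configs/base_config.yaml in your checkout for the exact names):

import yaml

# Load the base config shipped with the repository
with open("configs/base_config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg.get("model"))   # inspect the currently selected foundation model
cfg["model"] = "pi3"      # hypothetical value; use the strings documented in base_config.yaml

# Write a new config and pass it to the run command via --config
with open("configs/my_config.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

Then launch with --config ./configs/my_config.yaml as shown above.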

If your input is a video rather than an image folder, run the following commands to extract frames before running python vggt_long.py:

mkdir ./extract_images
ffmpeg -i your_video.mp4 -vf "fps=5,scale=518:-1" ./extract_images/frame_%06d.png

🛠️ 4 - Possible Problems You May Encounter

We have collected some common problems and their solutions below; if you encounter similar issues, you can refer to them.

See details

a. Error about libGL.so.1.

The error comes from opencv-python. Run the following command to install the missing system dependency:

sudo apt-get install -y libgl1-mesa-glx

b. Unable to install faiss-gpu (which is used in Loop Closure)

for example,

ERROR: Could not find a version that satisfies the requirement faiss-gpu (from versions: none)
ERROR: No matching distribution found for faiss-gpu

To address this issue, you can modify the requirements.txt file as follows:

...
faiss-gpu -> faiss-gpu-cu11 or faiss-gpu-cu12
...

Then reinstall requirements:

pip install -r requirements.txt

You can also find some alternative solutions at this link (Stackoverflow).

If this problem remains unsolved, you may consider proceeding to Step 4 in the Environment Setup.
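
Once one of the CUDA-specific wheels is installed, a quick sanity check (not part of the project scripts) is to ask faiss how many GPUs it can see:

python -c "import faiss; print(faiss.get_num_gpus())"
# A positive number means the GPU build is active; an error or 0 means it is not.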

c. Module torch has no attribute uint64.

As discussed in #21, you can downgrade the safetensors library to 0.5.3.

d. Significant drift in a video recorded with your own mobile device.

This issue is likely caused by either minimal movement in your video or an excessively high frame rate, both of which lead to accumulated drift. We have observed that with very dense input, where the movement between consecutive frames is small, the model's drift can grow to noticeable levels. You could try extracting video frames at a lower frame rate, such as 1 fps (similar to keyframe selection in visual SLAM):

ffmpeg -i your_video.mp4 -vf "fps=1,scale=518:-1" ./extract_images/frame_%06d.png

You may also consider switching to Pi-Long / Map-Long / DA3-Long, as better base models can help mitigate this issue to some extent.

Please ensure that the videos you record are free of motion blur, as the base model currently handles motion blur with limited stability. Some tips for reducing blur:

  • Record at higher frame rates (e.g., 60 FPS) to reduce exposure time per frame
  • Use a camera stabilizer or enable stabilization features
  • Prefer wide-angle lenses, which exhibit less apparent blur from camera shake due to their larger field of view
  • Ensure adequate lighting in dark environments to prevent the camera from increasing shutter time
  • If familiar with photography, use professional mode to increase shutter speed while widening the aperture and increasing ISO

🚨 5 - Important Notice: Memory Management & Requirements

In long-sequence scenarios, CPU and GPU memory limitations have always been a core challenge. VGGT-Long resolves the GPU memory limitations encountered by VGGT through chunk-based input partitioning. As for CPU memory, we achieve a lower footprint by storing intermediate results on disk: the consequences of CPU memory overflow are far more severe than GPU issues, since a GPU OOM usually just terminates the program, whereas a CPU OOM can freeze the whole system, which we absolutely want to avoid. VGGT-Long automatically retrieves the locally stored intermediate results when needed, and deletes these temporary files upon completion to prevent excessive disk space consumption. This implementation implies two key considerations (a rough sketch of the disk-caching idea follows the two points below):

  1. Before running, sufficient disk space must be reserved (approximately 50 GiB for the 4500-frame KITTI 00 sequence, or about 5 GiB for a 300-frame short sequence);

  2. The actual runtime depends on your disk I/O speed and memory-disk bandwidth, which may vary significantly across different computer systems.
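
As a rough, hypothetical sketch of the disk-spilling idea described above (not the actual implementation in this repository; the directory, file names, and shapes are made up), per-chunk predictions can be written to a scratch directory and memory-mapped back only when a chunk is needed for alignment:

import os
import numpy as np

CACHE_DIR = "./tmp_chunk_cache"   # hypothetical scratch directory
os.makedirs(CACHE_DIR, exist_ok=True)

def save_chunk(idx, points, confs):
    """Write one chunk's predictions to disk so they can be dropped from RAM."""
    np.save(os.path.join(CACHE_DIR, f"chunk_{idx:04d}_pts.npy"), points)
    np.save(os.path.join(CACHE_DIR, f"chunk_{idx:04d}_conf.npy"), confs)

def load_chunk(idx):
    """Memory-map a chunk back only when it is needed for alignment."""
    pts = np.load(os.path.join(CACHE_DIR, f"chunk_{idx:04d}_pts.npy"), mmap_mode="r")
    conf = np.load(os.path.join(CACHE_DIR, f"chunk_{idx:04d}_conf.npy"), mmap_mode="r")
    return pts, conf

def cleanup():
    """Delete the temporary files once the final result has been written."""
    for name in os.listdir(CACHE_DIR):
        os.remove(os.path.join(CACHE_DIR, name))
    os.rmdir(CACHE_DIR)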

Datasets

Our test datasets are all sourced from publicly available autonomous driving datasets, and you can download them according to the official instructions.

Waymo Open Dataset: Main page, V1.4.1. If you encounter any problems with the Waymo data, the reference code in #33 might be helpful; we have noticed that different people seem to handle this dataset differently.

Virtual KITTI Dataset (V1.3.1): Link

KITTI Dataset Odometry Track: Link

Self-Collected Dataset Used in Demo Video

We have uploaded the self-collected scenes used in the demo video to Google Drive, as we found it difficult to find similarly complex scenarios, specifically long sequences in large scenes, in other datasets. We have extracted RGB frames in PNG format from the original videos, which you can read directly.

Initially, we intended to run COLMAP on these scenes to provide a visual reference, since there is no ground truth for the captured scenes. However, we found that COLMAP seems difficult to optimize in such large-scale scenarios. We may update the reconstruction results from COLMAP as a visual reference later, once we locate the bugs.

Download link (~13 GiB): Google Drive

COLMAP failed on my machine 🥺. If you succeed in running it on these scenes, please contact me!

There are 4 scenes in the zip file:

Preview images: iphone-captured scenes and game footage.

Acknowledgements

Our project is based on VGGT, DPV-SLAM, and GigaSLAM. Our work would not have been possible without these excellent repositories.

Citation

If you find our work helpful, please consider citing:

@misc{deng2025vggtlongchunkitloop,
      title={VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences}, 
      author={Kai Deng and Zexin Ti and Jiawei Xu and Jian Yang and Jin Xie},
      year={2025},
      eprint={2507.16443},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.16443}, 
}

License

The VGGT-Long codebase follows VGGT's license; please refer to ./LICENSE.txt for the applicable terms. For commercial use, please follow VGGT's guidance and use the commercial version of the pre-trained weights: Link of VGGT-1B-Commercial.

More Experiments

Result visualizations: KITTI and Waymo sequences.
