Related Repo: Pi-Long | DA3-Streaming | VGGT-Long-Gsplat
This repository contains the source code for our work:
VGGT-Long: Chunk it, Loop it, Align it, Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences
Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision, or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on the KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios.
VGGT-Long-kitti.mp4
VGGT-Long-staris.mp4
VGGT-Long-Cyberpunk.mp4
[11 Dec 2025] Released DA3-Streaming.
[04 Sep 2025] Released Pi-Long.
[22 Jul 2025] Submitted to arXiv.
[12 Dec 2025] 1. We refactored the original architecture to support arbitrary foundation models, including VGGT, Pi3, and MapAnything; the current pipeline can easily be extended to future 3D foundation models. 2. Leveraging MapAnything's multimodal inputs and its ability to predict metric/real scale, Map-Long achieves strong performance in the metric-scale setting.
[05 Nov 2025] We have uploaded the input images captured by a mobile phone for the demo to Google Drive, as such complex large-scale scenes seem to be quite rare in other public datasets; if you need them for your own demo, see the "Self-Collected Dataset Used in Demo Video" section in README.md.
[08 Oct 2025] 1. We have updated the config.yaml file. Recent developments in 3D models like MapAnything now support metric scale; under such metric scale, 6-DoF alignment can be used (see the vectorized_reservoir_sampling function in loop_utils/sim3utils.py). Special thanks to @Horace89 for the assistance!
[22 Sep 2025] We uploaded the demo video to RedNote (and later to YouTube on 06 Oct 2025).
[04 Sep 2025] We have developed Pi-Long as a complementary project to Pi3 and VGGT-Long. Benefiting from Pi3's outstanding performance, Pi-Long performs even better at the kilometer scale. Feel free to check it out.
[02 Aug 2025] Updated the licensing terms of VGGT-Long to reflect the upstream dependency license (See VGGT for the changes). Please see the License Section for full details.
[30 Jul 2025] Chunk alignment sped up (0.273 s/iter → 0.175 s/iter on my machine).
[23 Jul 2025] Fixed some bugs in scripts/download_weights.sh.
[22 Jul 2025] Submitted to arXiv.
[15 Jul 2025] To help you better understand our project, we have updated some visualizations.
[14 Jun 2025] GitHub code release.
This project was developed, tested, and run in the following hardware/system environment:
Hardware Environment:
CPU(s): Intel Xeon(R) Gold 6128 CPU @ 3.40GHz × 12
GPU(s): NVIDIA RTX 4090 (24 GiB VRAM)
RAM: 67.0 GiB (DDR4, 2666 MT/s)
Disk: Dell 8TB 7200RPM HDD (SATA, Seq. Read 220 MiB/s)
System Environment:
Linux System: Ubuntu 22.04.3 LTS
CUDA Version: 11.8
cuDNN Version: 9.1.0
NVIDIA Drivers: 555.42.06
Conda version: 23.9.0 (Miniconda)
Note: This repository contains a significant amount of C++ code, but our goal is to make it as out-of-the-box usable as possible for researchers, as many deep learning researchers may not be familiar with C++ compilation. Currently, the code for VGGT-Long can run in a pure Python environment, which means you can skip all the C++ compilation steps in the README.
Create a virtual environment using conda (or miniconda):
conda create -n vggt-long python=3.10.18
conda activate vggt-long
# pip version created by conda: 25.1
Next, install PyTorch:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
# Verified to work with CUDA 11.8 and torch 2.5.1
Install the other requirements:
pip install -r requirements.txt
Download all the pre-trained weights needed (weights for VGGT, Pi3, and MapAnything are downloaded by default):
bash ./scripts/download_weights.sh
You can skip the next two steps if you would like to run VGGT-Long in pure Python.
We provide a Python-based Sim3 solver, so VGGT-Long can solve the loop-closure correction without compiling any C++ code. However, we still recommend installing the C++ solver, as it is more stable and faster.
python setup.py install
We made a simple figure (below) to help you better understand Sec. 3.2 of the paper.
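For intuition, here is a minimal sketch (not the repository's solver; names are illustrative) of how a Sim(3) transform can be estimated between two adjacent chunks from corresponding 3D points in their overlap, using the closed-form Umeyama method:

```python
# Minimal Umeyama-style Sim(3) estimation (illustrative, not the repo's solver).
# `src` and `dst` are assumed to be (N, 3) arrays of corresponding points
# taken from the frames shared by two neighbouring chunks.
import numpy as np

def umeyama_sim3(src, dst):
    """Estimate scale s, rotation R, translation t such that dst_i ≈ s * R @ src_i + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    x, y = src - mu_s, dst - mu_d
    cov = y.T @ x / len(src)                      # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_x = (x ** 2).sum() / len(src)             # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_x
    t = mu_d - s * R @ mu_s
    return s, R, t

# Applying the estimate to all points of the source chunk:
# aligned = (s * (R @ pts.T)).T + t
```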
The DBoW-based VPR model is used to run visual place recognition (VPR) inference on the CPU only. You can skip this step.
See details
Install the OpenCV C++ API.
sudo apt-get install -y libopencv-dev
Install DBoW2:
cd DBoW2
mkdir -p build && cd build
cmake ..
make
sudo make install
cd ../..
Install the image retrieval:
pip install ./DPRetrieval
Step 5 (Optional): Install mapanything as a package into the vggt-long environment if you want to use MapAnything. Link
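To make the role of the retrieval step above more concrete, here is a generic, self-contained illustration (not the DPRetrieval or DBoW2 API) of how loop-closure candidates can be selected from per-image global descriptors by cosine similarity:

```python
# Generic loop-candidate retrieval from global image descriptors
# (illustrative only; real descriptors would come from a VPR model).
import numpy as np

def loop_candidates(descriptors, min_gap=50, threshold=0.9):
    """Return (i, j) frame pairs that look alike but are far apart in time."""
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sim = d @ d.T                               # pairwise cosine similarity
    pairs = []
    for i in range(len(d)):
        for j in range(i + min_gap, len(d)):    # skip temporal neighbours
            if sim[i, j] > threshold:
                pairs.append((i, j))
    return pairs

# Toy usage with random 256-D descriptors for 200 frames.
print(loop_candidates(np.random.randn(200, 256)))
```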
python vggt_long.py --image_dir ./path_of_images
You can modify the parameters in the configs/base_config.yaml file. If you have created multiple yaml files to explore the effects of different parameters, you can specify the file path by adding --config to the command. For example:
python vggt_long.py --image_dir ./path_of_images --config ./configs/base_config.yaml
You can change the 'model' in config to use a different foundation model.
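As a hedged example (only the 'model' key and the --config flag are confirmed by this README; the output file name and the value "pi3" below are illustrative), you can create a variant of the base config programmatically and pass it via --config:

```python
# Create a config variant that switches the foundation model (illustrative sketch).
import yaml  # PyYAML

with open("configs/base_config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["model"] = "pi3"  # hypothetical value; use whichever model name your config expects

with open("configs/pi3_config.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

# Then run:
#   python vggt_long.py --image_dir ./path_of_images --config ./configs/pi3_config.yaml
```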
If you start from a video, you can run the following commands to extract frames before running python vggt_long.py:
mkdir ./extract_images
ffmpeg -i your_video.mp4 -vf "fps=5,scale=518:-1" ./extract_images/frame_%06d.png
You may encounter some problems. We have collected common questions and their solutions below; if you run into similar issues, please refer to them.
See details
The error comes from opencv-python; please run the following command to install the system dependency.
sudo apt-get install -y libgl1-mesa-glx
For example,
ERROR: Could not find a version that satisfies the requirement faiss-gpu (from versions: none)
ERROR: No matching distribution found for faiss-gpu
To address this issue, you can modify the requirements.txt file as follows:
...
faiss-gpu -> faiss-gpu-cu11 or faiss-gpu-cu12
...
Then reinstall the requirements:
pip install -r requirements.txt
You can also find some alternative solutions at this link (Stack Overflow).
If the problem still remains unsolved, you may consider proceeding to Step 4 of the Environment Setup.
As discussed in #21, you could downgrade the safetensors library to 0.5.3.
This issue is likely caused by either minimal movement in your video or an excessively high frame rate, leading to accumulated drift. We have observed that with very dense input where movement between consecutive frames is small, the model's drift can increase to noticeable levels. You could try extracting video frames at a lower frame rate, such as 1fps (this is similar to keyframe processing in Visual SLAM):
ffmpeg -i your_video.mp4 -vf "fps=1,scale=518:-1" ./extract_images/frame_%06d.pngYou may also consider switching to Pi-Long / Map-Long / DA3-Long, as better base models can help mitigate this issue to some extent.
Please ensure that the videos you record are free from motion blur, as the base model currently handles motion blur with limited stability.
- Record at higher frame rates (e.g., 60 FPS) to reduce exposure time per frame
- Use a camera stabilizer or enable stabilization features
- Prefer wide-angle lenses, which exhibit less apparent blur from camera shake due to their larger field of view
- Ensure adequate lighting in dark environments to prevent the camera from increasing shutter time
- If familiar with photography, use professional mode to increase shutter speed while widening the aperture and increasing ISO
In long-sequence scenarios, CPU and GPU memory limitations have always been a core challenge. VGGT-Long resolves the GPU memory limitations encountered by VGGT through chunk-based input partitioning. As for CPU memory constraints, we keep CPU memory usage low by storing intermediate results on disk (the consequences of CPU memory overflow are far more severe than GPU issues: while GPU OOM may simply terminate the program, CPU OOM can freeze the entire system, which we absolutely want to avoid). VGGT-Long automatically retrieves the locally stored intermediate results when needed, and upon completion these temporary files are automatically deleted to prevent excessive disk space consumption. This implementation implies two key considerations:
- Before running, sufficient disk space must be reserved (approximately 50 GiB for the 4,500-frame KITTI 00 sequence, or ~5 GiB for a 300-frame short sequence);
- The actual runtime depends on your disk I/O speed and memory-disk bandwidth, which may vary significantly across different computer systems.
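As a rough illustration of this disk-backed storage (a minimal sketch under assumed names; the directory, file layout, and functions below are not the actual implementation), per-chunk results can be written to a scratch directory, loaded lazily when a later stage needs them, and removed at the end:

```python
# Minimal sketch of disk-backed chunk storage (illustrative, not the repo code).
import os
import shutil
import numpy as np

TMP_DIR = "./tmp_chunks"  # hypothetical scratch directory; reserve enough disk space here
os.makedirs(TMP_DIR, exist_ok=True)

def save_chunk(idx, points, colors):
    """Offload one chunk's intermediate result to disk to keep CPU RAM usage low."""
    np.savez_compressed(os.path.join(TMP_DIR, f"chunk_{idx:04d}.npz"),
                        points=points, colors=colors)

def load_chunk(idx):
    """Lazily read a chunk back only when a later stage (e.g. alignment) needs it."""
    data = np.load(os.path.join(TMP_DIR, f"chunk_{idx:04d}.npz"))
    return data["points"], data["colors"]

def cleanup():
    """Delete all temporary files once the final map has been written."""
    shutil.rmtree(TMP_DIR, ignore_errors=True)
```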
Our test datasets are all sourced from publicly available autonomous driving datasets, and you can download them according to the official instructions.
Waymo Open Dataset: Main page, V1.4.1. If you encounter any problems with Waymo, the reference code in #33 might be helpful. We have noticed that different people seem to handle this dataset differently.
Virtual KITTI Dataset (V1.3.1): Link
KITTI Dataset Odometry Track: Link
We have uploaded the self-collected scenes used in the demo video to Google Drive, as it might be difficult to find similarly complex scenarios (specifically, long sequences in large scenes) in other datasets. We have extracted RGB frames in PNG format from the original videos, which you can read directly.
Initially, we intended to run COLMAP on these scenes to provide you with a visual reference, as there is no ground truth for the captured scenes. However, we found that COLMAP seems difficult to optimize in such large-scale scenarios. We may add the COLMAP reconstruction results as a visual reference later once we locate the issue.
Download link (~13 GiB): Google Drive
COLMAP failed on my machine 🥺. If you succeed in getting it to work on these scenes with COLMAP, please contact me!
There are 4 scenes in the zip file:
Our project is based on VGGT, DPV-SLAM, and GigaSLAM. Our work would not have been possible without these excellent repositories.
If you find our work helpful, please consider citing:
@misc{deng2025vggtlongchunkitloop,
title={VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences},
author={Kai Deng and Zexin Ti and Jiawei Xu and Jian Yang and Jin Xie},
year={2025},
eprint={2507.16443},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.16443},
}
The VGGT-Long codebase follows VGGT's license; please refer to ./LICENSE.txt for the applicable terms. For commercial use, please follow VGGT's guidance and use the commercial version of the pre-trained weights: VGGT-1B-Commercial.