Related Repo: Pi-Long | DA3-Streaming | VGGT-Long-Gsplat
This repository contains the source code for our work:
VGGT-Long: Chunk it, Loop it, Align it, Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences
Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision, or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on the KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios.
VGGT-Long-kitti.mp4
VGGT-Long-staris.mp4
VGGT-Long-Cyberpunk.mp4
[11 Dec 2025] Released DA3-Streaming.
[04 Sep 2025] Released Pi-Long.
[22 Jul 2025] Submitted to arXiv.
[12 Dec 2025] 1. We refactored the original architecture to support arbitrary foundation models, including VGGT, Pi3, and MapAnything; the current pipeline can easily be extended to future 3D foundation models. 2. Leveraging MapAnything's multimodal inputs and its ability to predict metric/real scale, Map-Long achieves strong performance in the metric-scale setting.
[05 Nov 2025] We have uploaded the input images captured by a mobile phone for the demo to Google Drive, as such complex large-scale scenes seem to be quite rare in other public datasets; if you need them for your own demo, see the "Self-Collected Dataset Used in Demo Video" section in README.md.
[08 Oct 2025] 1. We have updated the config.yaml file. Recent developments in 3D models like MapAnything now support metric scale; under such metric scale, 6-DoF alignment can be used (see the vectorized_reservoir_sampling function in loop_utils/sim3utils.py). Special thanks to @Horace89 for the assistance!
[22 Sep 2025] We uploaded the demo video to RedNote (and later to YouTube on 06 Oct 2025).
[04 Sep 2025] We have developed Pi-Long as a complementary project to Pi3 and VGGT-Long. Benefiting from Pi3's outstanding performance, Pi-Long performs even better at the kilometer scale. Feel free to check it out.
[02 Aug 2025] Updated the licensing terms of VGGT-Long to reflect the upstream dependency license (See VGGT for the changes). Please see the License Section for full details.
[30 Jul 2025] Chunk alignment sped up (0.273 s/iter → 0.175 s/iter on my machine).
[23 Jul 2025] Fixed some bugs in scripts/download_weights.sh.
[22 Jul 2025] Submitted to arXiv.
[15 Jul 2025] To help you better understand our project, we have updated some visualizations.
[14 Jun 2025] GitHub code release.
This project was developed, tested, and run in the following hardware/system environment:
Hardware Environment:
CPU(s): Intel Xeon(R) Gold 6128 CPU @ 3.40GHz × 12
GPU(s): NVIDIA RTX 4090 (24 GiB VRAM)
RAM: 67.0 GiB (DDR4, 2666 MT/s)
Disk: Dell 8TB 7200RPM HDD (SATA, Seq. Read 220 MiB/s)
System Environment:
Linux System: Ubuntu 22.04.3 LTS
CUDA Version: 11.8
cuDNN Version: 9.1.0
NVIDIA Drivers: 555.42.06
Conda version: 23.9.0 (Miniconda)
Note: This repository contains a significant amount of C++ code, but our goal is to make it as out-of-the-box usable as possible for researchers, as many deep learning researchers may not be familiar with C++ compilation. Currently, the code for VGGT-Long can run in a pure Python environment, which means you can skip all the C++ compilation steps in the README.
Create a virtual environment using conda (or miniconda):
conda create -n vggt-long python=3.10.18
conda activate vggt-long
# pip version created by conda: 25.1
Next, install PyTorch:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
# Verified to work with CUDA 11.8 and torch 2.5.1
Install the other requirements:
pip install -r requirements.txt
Download all the pre-trained weights needed (weights for VGGT, Pi3, and MapAnything are downloaded by default):
bash ./scripts/download_weights.sh
You can skip the next two steps if you would like to run VGGT-Long in pure Python.
We provide a Python-based Sim3 solver, so VGGT-Long can solve the loop-closure correction without compiling any C++ code. However, we still recommend installing the C++ solver, as it is more stable and faster.
python setup.py install
We made a simple figure (below) to help you better understand Sec. 3.2 of the paper.
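For intuition, here is a minimal sketch (not the repository's solver; names are illustrative) of how a Sim(3) transform can be estimated between two adjacent chunks from corresponding 3D points in their overlap, using the closed-form Umeyama method:

```python
# Minimal Umeyama-style Sim(3) estimation (illustrative, not the repo's solver).
# `src` and `dst` are assumed to be (N, 3) arrays of corresponding points
# taken from the frames shared by two neighbouring chunks.
import numpy as np

def umeyama_sim3(src, dst):
    """Estimate scale s, rotation R, translation t such that dst_i ≈ s * R @ src_i + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    x, y = src - mu_s, dst - mu_d
    cov = y.T @ x / len(src)                      # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_x = (x ** 2).sum() / len(src)             # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_x
    t = mu_d - s * R @ mu_s
    return s, R, t

# Applying the estimate to all points of the source chunk:
# aligned = (s * (R @ pts.T)).T + t
```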
The DBoW-based VPR model is used to run visual place recognition (VPR) inference on the CPU only. You can skip this step.
See details
Install the OpenCV C++ API.
sudo apt-get install -y libopencv-dev
Install DBoW2:
cd DBoW2
mkdir -p build && cd build
cmake ..
make
sudo make install
cd ../..
Install the image retrieval:
pip install ./DPRetrieval
Step 5 (Optional): Install mapanything as a package into the vggt-long environment if you want to use MapAnything. Link
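To make the role of the retrieval step above more concrete, here is a generic, self-contained illustration (not the DPRetrieval or DBoW2 API) of how loop-closure candidates can be selected from per-image global descriptors by cosine similarity:

```python
# Generic loop-candidate retrieval from global image descriptors
# (illustrative only; real descriptors would come from a VPR model).
import numpy as np

def loop_candidates(descriptors, min_gap=50, threshold=0.9):
    """Return (i, j) frame pairs that look alike but are far apart in time."""
    d = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sim = d @ d.T                               # pairwise cosine similarity
    pairs = []
    for i in range(len(d)):
        for j in range(i + min_gap, len(d)):    # skip temporal neighbours
            if sim[i, j] > threshold:
                pairs.append((i, j))
    return pairs

# Toy usage with random 256-D descriptors for 200 frames.
print(loop_candidates(np.random.randn(200, 256)))
```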
python vggt_long.py --image_dir ./path_of_images
You can modify the parameters in the configs/base_config.yaml file. If you have created multiple yaml files to explore the effects of different parameters, you can specify the file path by adding --config to the command. For example:
python vggt_long.py --image_dir ./path_of_images --config ./configs/base_config.yaml
You can change the 'model' in config to use a different foundation model.
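As a hedged example (only the 'model' key and the --config flag are confirmed by this README; the output file name and the value "pi3" below are illustrative), you can create a variant of the base config programmatically and pass it via --config:

```python
# Create a config variant that switches the foundation model (illustrative sketch).
import yaml  # PyYAML

with open("configs/base_config.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["model"] = "pi3"  # hypothetical value; use whichever model name your config expects

with open("configs/pi3_config.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

# Then run:
#   python vggt_long.py --image_dir ./path_of_images --config ./configs/pi3_config.yaml
```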
If you start from a video, you can run the following commands to extract frames before running python vggt_long.py:
mkdir ./extract_images
ffmpeg -i your_video.mp4 -vf "fps=5,scale=518:-1" ./extract_images/frame_%06d.png
You may encounter some problems. We have collected common questions and their solutions below; if you run into similar issues, please refer to them.
See details
The error comes from opencv-python; please run the following command to install the system dependency.
sudo apt-get install -y libgl1-mesa-glx
For example,
ERROR: Could not find a version that satisfies the requirement faiss-gpu (from versions: none)
ERROR: No matching distribution found for faiss-gpu
To address this issue, you can modify the requirements.txt file as follows:
...
faiss-gpu -> faiss-gpu-cu11 or faiss-gpu-cu12
...
Then reinstall the requirements:
pip install -r requirements.txt
You can also find some alternative solutions at this link (Stack Overflow).
If the problem still remains unsolved, you may consider proceeding to Step 4 of the Environment Setup.
As discussed in #21, you could downgrade the safetensors library to 0.5.3.
This issue is likely caused by either minimal movement in your video or an excessively high frame rate, leading to accumulated drift. We have observed that with very dense input where movement between consecutive frames is small, the model's drift can increase to noticeable levels. You could try extracting video frames at a lower frame rate, such as 1fps (this is similar to keyframe processing in Visual SLAM):
ffmpeg -i your_video.mp4 -vf "fps=1,scale=518:-1" ./extract_images/frame_%06d.pngYou may also consider switching to Pi-Long / Map-Long / DA3-Long, as better base models can help mitigate this issue to some extent.
Please ensure that the videos you record are free from motion blur, as the base model currently handles motion blur with limited stability.
- Record at higher frame rates (e.g., 60 FPS) to reduce exposure time per frame
- Use a camera stabilizer or enable stabilization features
- Prefer wide-angle lenses, which exhibit less apparent blur from camera shake due to their larger field of view
- Ensure adequate lighting in dark environments to prevent the camera from increasing shutter time
- If familiar with photography, use professional mode to increase shutter speed while widening the aperture and increasing ISO
In long-sequence scenarios, CPU and GPU memory limitations have always been a core challenge. VGGT-Long resolves the GPU memory limitations encountered by VGGT through chunk-based input partitioning. As for CPU memory constraints, we keep CPU memory usage low by storing intermediate results on disk (the consequences of CPU memory overflow are far more severe than GPU issues: while GPU OOM may simply terminate the program, CPU OOM can freeze the entire system, which we absolutely want to avoid). VGGT-Long automatically retrieves the locally stored intermediate results when needed, and upon completion these temporary files are automatically deleted to prevent excessive disk space consumption. This implementation implies two key considerations:
- Before running, sufficient disk space must be reserved (approximately 50 GiB for the 4,500-frame KITTI 00 sequence, or ~5 GiB for a 300-frame short sequence);
- The actual runtime depends on your disk I/O speed and memory-disk bandwidth, which may vary significantly across different computer systems.
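As a rough illustration of this disk-backed storage (a minimal sketch under assumed names; the directory, file layout, and functions below are not the actual implementation), per-chunk results can be written to a scratch directory, loaded lazily when a later stage needs them, and removed at the end:

```python
# Minimal sketch of disk-backed chunk storage (illustrative, not the repo code).
import os
import shutil
import numpy as np

TMP_DIR = "./tmp_chunks"  # hypothetical scratch directory; reserve enough disk space here
os.makedirs(TMP_DIR, exist_ok=True)

def save_chunk(idx, points, colors):
    """Offload one chunk's intermediate result to disk to keep CPU RAM usage low."""
    np.savez_compressed(os.path.join(TMP_DIR, f"chunk_{idx:04d}.npz"),
                        points=points, colors=colors)

def load_chunk(idx):
    """Lazily read a chunk back only when a later stage (e.g. alignment) needs it."""
    data = np.load(os.path.join(TMP_DIR, f"chunk_{idx:04d}.npz"))
    return data["points"], data["colors"]

def cleanup():
    """Delete all temporary files once the final map has been written."""
    shutil.rmtree(TMP_DIR, ignore_errors=True)
```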
Our test datasets are all sourced from publicly available autonomous driving datasets, and you can download them according to the official instructions.
Waymo Open Dataset: Main page, V1.4.1. If you encounter any problems with Waymo, the reference code in #33 might be helpful. We have noticed that different people seem to handle this dataset differently.
Virtual KITTI Dataset (V1.3.1): Link
KITTI Dataset Odometry Track: Link
We have uploaded the self-collected scenes used in the demo video to Google Drive, as it might be difficult to find similarly complex scenarios (specifically, long sequences in large scenes) in other datasets. We have extracted RGB frames in PNG format from the original videos, which you can read directly.
Initially, we intended to run COLMAP on these scenes to provide you with a visual reference, as there is no ground truth for the captured scenes. However, we found that COLMAP seems difficult to optimize in such large-scale scenarios. We may add the COLMAP reconstruction results as a visual reference later once we locate the issue.
Download link (~13 GiB): Google Drive
COLMAP failed on my machine 🥺. If you succeed in getting it to work on these scenes with COLMAP, please contact me!
There are 4 scenes in the zip file:
Our project is based on VGGT, DPV-SLAM, and GigaSLAM. Our work would not have been possible without these excellent repositories.
If you find our work helpful, please consider citing:
@misc{deng2025vggtlongchunkitloop,
title={VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences},
author={Kai Deng and Zexin Ti and Jiawei Xu and Jian Yang and Jin Xie},
year={2025},
eprint={2507.16443},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.16443},
}
The VGGT-Long codebase follows VGGT's license; please refer to ./LICENSE.txt for the applicable terms. For commercial use, please follow VGGT's guidance and use the commercial version of the pre-trained weights: VGGT-1B-Commercial.