Official implementation of the ECCV 2024 paper Controllable Navigation Instruction Generation with Chain of Thought Prompting [Link].
- 12/16/2024: Initial release.
We recommend using our Dockerfile to set up the environment. If you encounter any issues, please refer to Matterport3D Simulator.
- Nvidia GPU with driver >= 396.37
- Install docker
- Install NVIDIA Container Toolkit
- Note: CUDA / cuDNN toolkits do not need to be installed (these are provided by the docker image)
Clone the Matterport3D Simulator repository:
```
# Make sure to clone with --recursive
git clone --recursive https://github.com/peteanderson80/Matterport3DSimulator.git
cd Matterport3DSimulator
```

If you didn't clone with the --recursive flag, then you'll need to manually clone the pybind submodule from the top-level directory:

```
git submodule update --init --recursive
```

To use the simulator you must first download the Matterport3D Dataset, which is available after requesting access here. The download script that will be provided allows for downloading of selected data types. At minimum you must download the matterport_skybox_images and undistorted_camera_parameters. If you wish to use depth outputs then also download undistorted_depth_images (not required for C-Instructor).
Set an environment variable to the location of the unzipped dataset, where <PATH> is the full absolute path (not a relative path or symlink) to the directory containing the individual Matterport scan directories (17DRP5sb8fy, 2t7WUuJeko7, etc.):
```
export MATTERPORT_DATA_DIR=<PATH>
```

Note that if <PATH> is a remote sshfs mount, you will need to mount it with the -o allow_root option or the docker container won't be able to access this directory.
Build the docker image:
```
docker build -t mattersim:9.2-devel-ubuntu20.04 .
```

Run the docker container, mounting both the git repo and the dataset:

```
docker run -it --mount type=bind,source=$MATTERPORT_DATA_DIR,target=/root/mount/Matterport3DSimulator/data/v1/scans --volume {ACTUAL_PATH}:/root/mount/{XXX} mattersim:9.2-devel-ubuntu20.04
```

Now (from inside the docker container), build the simulator code:

```
cd /root/mount/Matterport3DSimulator
mkdir build && cd build
cmake -DEGL_RENDERING=ON ..
make
cd ../
```

Note that there are three rendering options, which are selected using cmake options during the build process (by varying line 3 in the build commands immediately above):
- GPU rendering using OpenGL (requires an X server): `cmake ..` (default)
- Off-screen GPU rendering using EGL: `cmake -DEGL_RENDERING=ON ..`
- Off-screen CPU rendering using OSMesa: `cmake -DOSMESA_RENDERING=ON ..`
The recommended (fast) approach for training agents is using off-screen GPU rendering (EGL).
To make data loading faster and to reduce memory usage we preprocess the matterport_skybox_images by downscaling and combining all cube faces into a single image. While still inside the docker container, run the following script:
```
./scripts/downsize_skybox.py
```

This will take a while depending on the number of processes used (which is a setting in the script).
After completion, the matterport_skybox_images subdirectories in the dataset will contain image files with filename format <PANO_ID>_skybox_small.jpg. By default images are downscaled by 50% and 20 processes are used.
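For reference, the preprocessing is conceptually similar to the sketch below. This is an illustration only, not the actual script: the cube-face filename pattern (<PANO_ID>_skybox<i>_sami.jpg) and the side-by-side layout are assumptions; consult scripts/downsize_skybox.py for the exact behavior.

```python
# Illustrative sketch of the skybox preprocessing (not the actual script).
import os
import cv2
import numpy as np

def combine_skybox(skybox_dir, pano_id, scale=0.5):
    """Downscale the six cube faces of one panorama and save them as a
    single <PANO_ID>_skybox_small.jpg strip. Filenames are assumptions."""
    faces = []
    for i in range(6):
        face_path = os.path.join(skybox_dir, f"{pano_id}_skybox{i}_sami.jpg")
        img = cv2.imread(face_path)
        img = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_AREA)
        faces.append(img)
    strip = np.concatenate(faces, axis=1)  # six faces side by side
    cv2.imwrite(os.path.join(skybox_dir, f"{pano_id}_skybox_small.jpg"), strip)
```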
If you need depth outputs as well as RGB (via sim.setDepthEnabled(True)), precompute matching depth skybox images by running this script:
```
./scripts/depth_to_skybox.py
```

Depth skyboxes are generated from the undistorted_depth_images using a simple blending approach. As the depth images contain many missing values (corresponding to shiny, bright, transparent, and distant surfaces, which are common in the dataset), we apply a simple cross-bilateral filter based on the NYUv2 code to fill all but the largest holes. A couple of things to keep in mind:

- We assume that the undistorted_depth_images are aligned to the matterport_skybox_images, but in fact this alignment is not perfect. For certain applications where better alignment is required (e.g., generating RGB pointclouds), it might be necessary to replace the matterport_skybox_images by stitching together undistorted_color_images (which are perfectly aligned to the undistorted_depth_images).
- In the generated depth skyboxes, the depth value is the Euclidean distance from the camera center (not the distance in the z direction). This is corrected by the simulator (see the Matterport3D Simulator API documentation); a sketch of the correction follows this list.
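For intuition, converting a Euclidean (ray) depth map to z-depth only requires scaling each pixel by the cosine of its viewing angle. The sketch below assumes a pinhole camera with a known vertical FOV; the simulator already applies this correction internally, so it is shown only to illustrate the second point above.

```python
# Illustration of the ray-depth -> z-depth correction (done internally by the simulator).
import numpy as np

def ray_depth_to_z(depth, vfov_deg=60.0):
    """depth: HxW array of Euclidean distances from the camera center."""
    h, w = depth.shape
    f = 0.5 * h / np.tan(0.5 * np.radians(vfov_deg))  # focal length in pixels
    xs = np.arange(w) - (w - 1) / 2.0                 # pixel offsets from the principal point
    ys = np.arange(h) - (h - 1) / 2.0
    xg, yg = np.meshgrid(xs, ys)
    return depth * f / np.sqrt(f ** 2 + xg ** 2 + yg ** 2)
```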
Now (still from inside the docker container), run the unit tests:
```
./build/tests ~Timing
```

Assuming all tests pass, sim_imgs will now contain some test images rendered by the simulator. You may also wish to test the rendering frame rate. The following command will try to load all the Matterport environments into memory (requiring around 50 GB of memory), and then some information about the rendering frame rate (at 640x480 resolution, RGB outputs only) will be printed to stdout:

```
./build/tests Timing
```

The timing test must be run separately from the other tests to get accurate results. Note that the Timing test will fail if there is insufficient memory. As long as all the other tests pass (i.e., ./build/tests ~Timing), the install is good. Refer to the Catch documentation for unit test configuration options.
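Once the tests pass, the simulator can be driven from Python. The following is a minimal sketch assuming the MatterSim bindings built above; the scan and viewpoint IDs are placeholders, and the exact state fields are documented in the Matterport3D Simulator API reference.

```python
# Minimal sketch of driving the simulator from Python (IDs are placeholders).
import math
import numpy as np
import MatterSim  # bindings built in Matterport3DSimulator/build

sim = MatterSim.Simulator()
sim.setCameraResolution(640, 480)
sim.setCameraVFOV(math.radians(60))
sim.setDepthEnabled(True)  # requires the precomputed depth skyboxes above
sim.initialize()

# Replace with a real scan ID (e.g., 17DRP5sb8fy) and a viewpoint ID from that scan.
sim.newEpisode(["<SCAN_ID>"], ["<VIEWPOINT_ID>"], [0.0], [0.0])

state = sim.getState()[0]
rgb = np.array(state.rgb, copy=False)      # rendered view (OpenCV BGR layout)
depth = np.array(state.depth, copy=False)  # z-direction depth map
print(state.scanId, state.location.viewpointId, rgb.shape, depth.shape)
```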
Copy the preprocess folder to Matterport3DSimulator/tasks and use precompute_img_features_clip.py to extract CLIP features.
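As a rough sketch of what the feature extraction involves, the snippet below encodes a batch of rendered views with CLIP. The ViT-B/16 variant, batch layout, and output handling are assumptions; precompute_img_features_clip.py defines the actual model, view discretization, and storage format.

```python
# Hedged sketch of CLIP view encoding (see precompute_img_features_clip.py for the real pipeline).
import clip   # pip install git+https://github.com/openai/CLIP.git
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # model variant is an assumption

def encode_views(view_images):
    """Encode a list of PIL images (e.g., the discretized views of one viewpoint)."""
    batch = torch.stack([preprocess(img) for img in view_images]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)                 # [num_views, feat_dim]
        feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize
    return feats.cpu().numpy()
```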
Obtain the LLaMA backbone weights using this form. Please note that checkpoints from unofficial sources (e.g., BitTorrent) may contain malicious code and should be used with care. Organize the downloaded files in the following structure:
```
/path/to/llama_model_weights
├── 7B
│   ├── checklist.chk
│   ├── consolidated.00.pth
│   └── params.json
└── tokenizer.model
```
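If you want a quick sanity check that the weights are laid out as above, a trivial sketch (the directory path is the placeholder from the tree):

```python
# Optional sanity check that the LLaMA weights match the layout above.
import os

llama_dir = "/path/to/llama_model_weights"  # placeholder path from the tree above
expected = ["7B/checklist.chk", "7B/consolidated.00.pth", "7B/params.json", "tokenizer.model"]
for rel in expected:
    path = os.path.join(llama_dir, rel)
    print(("OK      " if os.path.isfile(path) else "MISSING ") + path)
```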
The weights of LLaMA-Adapter can be obtained through the GitHub Release.
Download the annotations from HAMT Dropbox.
Extract landmarks using the scripts under the landmark directory.
We pre-train the model on the PREVALENT dataset using the following command until convergence:
```
bash exps/finetune.sh {path_to_llama}/LLaMA-7B/ {path_to_llama_adapter}/7fa55208379faf2dd862565284101b0e4a2a72114d6490a95e432cf9d9b6c813_BIAS-7B.pth config/data/pretrain_r2r.json {results_dir}
```

Note that you will need to specify the arguments in exps/finetune.sh and config/data/pretrain_r2r.json.
We fine-tune the model on other VLN datasets using the following command until convergence:
```
bash exps/finetune.sh {path_to_llama}/LLaMA-7B/ {path_to_ckpts}/{filename}-7B.pth config/data/pretrain_{dataset_name}.json {results_dir}
```

Note that you will need to specify the arguments in exps/finetune.sh and config/data/pretrain_{dataset_name}.json.
Please refer to demo_r2r.py for inference and navigation path visualization.
Please refer to pycocoevalcap/eval.py for evaluation. To run the evaluation script, please install Java and prepare the required dependencies according to this link.
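For reference, the pycocoevalcap scorers can also be called directly on {instruction_id: [captions]} dictionaries, as in the sketch below (toy data; the repo's eval.py defines the actual input format). The METEOR and SPICE scorers are the parts that require Java.

```python
# Toy example of scoring generated instructions with pycocoevalcap.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # requires Java

# Ground-truth and generated instructions, keyed by path/instruction ID (toy data).
gts = {"path_0001": ["walk past the couch and stop at the bottom of the stairs"]}
res = {"path_0001": ["go past the sofa and wait near the staircase"]}

for name, scorer in [("Bleu", Bleu(4)), ("ROUGE_L", Rouge()),
                     ("CIDEr", Cider()), ("METEOR", Meteor())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```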
If you are using C-Instructor for your research, please cite the following paper:
```
@inproceedings{kong2025controllable,
  title={Controllable navigation instruction generation with chain of thought prompting},
  author={Kong, Xianghao and Chen, Jinyu and Wang, Wenguan and Su, Hang and Hu, Xiaolin and Yang, Yi and Liu, Si},
  booktitle={European Conference on Computer Vision},
  pages={37--54},
  year={2025},
  organization={Springer}
}
```

This project is built upon LLaMA-Adapter, Matterport3D Simulator, HAMT, and Microsoft COCO Caption Evaluation.
