We address the task of Vision-Language Navigation in Continuous Environments (VLN-CE) under the zero-shot setting. Zero-shot VLN-CE is particularly challenging due to the absence of expert demonstrations for training and the minimal structural priors about the environment available to guide navigation. To confront these challenges, we propose the Constraint-Aware Navigator (CA-Nav), which reframes zero-shot VLN-CE as a sequential, constraint-aware sub-instruction completion process. CA-Nav continuously translates sub-instructions into navigation plans using two core modules: the Constraint-Aware Sub-instruction Manager (CSM) and the Constraint-Aware Value Mapper (CVM). CSM defines the completion criteria of decomposed sub-instructions as constraints and tracks navigation progress by switching sub-instructions in a constraint-aware manner. CVM, guided by CSM's constraints, generates a value map on the fly and refines it via superpixel clustering to improve navigation stability. CA-Nav achieves state-of-the-art performance on two VLN-CE benchmarks, surpassing the previous best method by 12% and 13% in Success Rate on the validation unseen splits of R2R-CE and RxR-CE, respectively. Moreover, CA-Nav demonstrates its effectiveness in real-world robot deployments across various indoor scenes and instructions.
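The constraint-aware sub-instruction switching described above can be sketched as follows. This is a minimal, hypothetical illustration of the idea behind CSM, not the actual CA-Nav code: the class names, the `completed` predicate, and the toy constraints are all our own.

```python
# Hypothetical sketch of CSM-style constraint-aware switching.
# A sub-instruction is "completed" when all of its constraints hold,
# at which point the manager advances to the next sub-instruction.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubInstruction:
    text: str
    # each constraint maps a name to a predicate over the current observation
    constraints: Dict[str, Callable[[dict], bool]]

    def completed(self, obs: dict) -> bool:
        return all(check(obs) for check in self.constraints.values())


class ConstraintAwareManager:
    """Tracks progress through decomposed sub-instructions (CSM-style)."""

    def __init__(self, sub_instructions: List[SubInstruction]):
        self.subs = sub_instructions
        self.idx = 0

    def current(self) -> SubInstruction:
        return self.subs[self.idx]

    def step(self, obs: dict) -> SubInstruction:
        # switch forward only while the active sub-instruction's constraints hold
        while self.idx < len(self.subs) - 1 and self.subs[self.idx].completed(obs):
            self.idx += 1
        return self.subs[self.idx]


# toy usage with two made-up sub-instructions
subs = [
    SubInstruction("exit the bedroom",
                   {"at_door": lambda o: o.get("near_door", False)}),
    SubInstruction("walk to the sofa",
                   {"sees_sofa": lambda o: "sofa" in o.get("objects", [])}),
]
mgr = ConstraintAwareManager(subs)
```

In the real system, the constraints come from an LLM's decomposition of the instruction and are checked against perception outputs; here they are simple dictionary lookups purely for illustration.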
- Create a virtual environment. We develop this project with Python 3.8:

```bash
conda create -n CA-Nav python==3.8
conda activate CA-Nav
```
- Install `habitat-sim` v0.1.7 for a machine with multiple GPUs or without an attached display (i.e. a cluster):

```bash
git clone https://github.com/facebookresearch/habitat-sim.git
cd habitat-sim
git checkout tags/v0.1.7
pip install -r requirements.txt
python setup.py install --headless
```
- Install `habitat-lab` v0.1.7:

```bash
git clone https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
git checkout tags/v0.1.7
cd habitat_baselines/rl
vi requirements.txt  # delete the line `tensorflow==1.13.1`
cd ../../  # return to the habitat-lab directory
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
python setup.py develop --all  # installs habitat and habitat_baselines; if it fails, try again, as most failures are due to network problems
```
If you encounter problems and fail to install Habitat, please follow the official Habitat installation guide to install `habitat-lab` and `habitat-sim`. We use version `v0.1.7` in our experiments, the same as VLN-CE; please refer to the VLN-CE page for more details.
- Install Grounded-SAM and refine its `phrases2classes` function:

```bash
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
git checkout -q 57535c5a79791cb76e36fdb64975271354f10251
pip install -q -e .
pip install 'git+https://github.com/facebookresearch/segment-anything.git'
```
ATTENTION: We found that optimizing the phrase-to-class mapping logic in Grounded-SAM using minimum edit distance leads to more stable prediction outputs.
Install `nltk` and open `<YOUR PATH>/GroundingDINO/groundingdino/util/inference.py`:

```bash
pip install nltk
```
Find and comment out the original `phrases2classes` function at line 235, then add the refined version:
```python
# @staticmethod
# def phrases2classes(phrases: List[str], classes: List[str]) -> np.ndarray:
#     class_ids = []
#     for phrase in phrases:
#         try:
#             class_ids.append(classes.index(phrase))
#         except ValueError:
#             class_ids.append(None)
#     return np.array(class_ids)

from nltk.metrics import edit_distance

@staticmethod
def phrases2classes(phrases: List[str], classes: List[str]) -> np.ndarray:
    class_ids = []
    for phrase in phrases:
        if phrase in classes:
            class_ids.append(classes.index(phrase))
        else:
            # fall back to the class with minimum edit distance
            distances = np.array([edit_distance(phrase, class_name) for class_name in classes])
            idx = np.argmin(distances)
            class_ids.append(idx)
    return np.array(class_ids)
```
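To see why the fallback helps: GroundingDINO sometimes returns phrases that differ slightly from the prompt classes (plural forms, small misspellings), so an exact `classes.index` lookup fails and the original code emits `None`. The sketch below is a self-contained illustration of the same idea, using a stdlib Levenshtein implementation in place of `nltk.metrics.edit_distance` and plain lists instead of NumPy arrays.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def map_phrases(phrases: List[str], classes: List[str]) -> List[int]:
    """Map each predicted phrase to the index of the closest class name."""
    ids = []
    for phrase in phrases:
        if phrase in classes:
            ids.append(classes.index(phrase))
        else:
            # nearest class by edit distance instead of returning None
            ids.append(min(range(len(classes)),
                           key=lambda k: edit_distance(phrase, classes[k])))
    return ids


classes = ["sofa", "chair", "table"]
print(map_phrases(["sofa", "chairs", "tabel"], classes))  # → [0, 1, 2]
```

With exact matching, `"chairs"` and `"tabel"` would both map to `None`; with the edit-distance fallback they resolve to `chair` and `table`.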
- Install other requirements:

```bash
git clone https://github.com/Chenkehan21/CA-Nav-code.git
cd CA-Nav-code
pip install -r requirements.txt
pip install -r requirements2.txt
```
R2R-CE

- Instructions: Download the `R2R_VLNCE_v1-3_preprocessed` instructions from VLN-CE.
- Scenes: Matterport3D (MP3D) scene reconstructions are used. The official Matterport3D download script (`download_mp.py`) can be accessed by following the instructions on their project webpage. The scene data can then be downloaded:

```bash
# requires running with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
```

Extract such that it has the form `scene_datasets/mp3d/{scene}/{scene}.glb`. There should be 90 scenes. Place the `scene_datasets` folder in `data/`.
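A quick way to verify the extraction is to count the scene folders that contain their `.glb` file. The helper below is our own sanity-check sketch, not part of the release; the default path assumes you placed the data as described above.

```python
# Hypothetical sanity check: count MP3D scenes laid out as
# data/scene_datasets/mp3d/{scene}/{scene}.glb (expects 90).
from pathlib import Path


def check_mp3d(root: str = "data/scene_datasets/mp3d") -> int:
    root_path = Path(root)
    scene_dirs = [d for d in root_path.iterdir() if d.is_dir()] if root_path.is_dir() else []
    ready = [d.name for d in scene_dirs if (d / f"{d.name}.glb").is_file()]
    missing = sorted(set(d.name for d in scene_dirs) - set(ready))
    if missing:
        print(f"scenes missing a .glb: {missing}")
    print(f"{len(ready)} / 90 scenes ready")
    return len(ready)
```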
- CA-Nav LLM replies / BLIP2-ITM / BLIP2-VQA / Grounded-SAM: download them from CA-Nav-Google-Drive.

Overall, the data is organized as follows:
```
CA-Nav-code
├── data
│   ├── blip2
│   ├── datasets
│   ├── LLM_REPLYS_VAL_UNSEEN
│   ├── R2R_VLNCE_v1-3_preprocessed
│   ├── grounded_sam
│   ├── logs
│   ├── scene_datasets
│   └── vqa
└── ...
```
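Before launching, it may help to confirm the `data/` layout matches the tree above. This checker is our own suggestion (the folder list mirrors the tree; it is not shipped with the repository):

```python
# Hypothetical check that the expected data/ subfolders exist.
from pathlib import Path

EXPECTED = ["blip2", "datasets", "LLM_REPLYS_VAL_UNSEEN",
            "R2R_VLNCE_v1-3_preprocessed", "grounded_sam",
            "logs", "scene_datasets", "vqa"]


def missing_data_dirs(root: str = "data") -> list:
    """Return the expected subfolders that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).is_dir()]
```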
Run the R2R-CE experiments:

```bash
cd CA-Nav-code
sh run_r2r/main.sh
```

Our implementation is partially inspired by SemExp and ETPNav. Thanks for their great work!
If you find this repository useful, please consider citing our paper:

```bibtex
@String(TPAMI = {IEEE Trans. Pattern Anal. Mach. Intell.})

@article{chen2025canav,
  title={Constraint-aware zero-shot vision-language navigation in continuous environments},
  author={Chen, Kehan and An, Dong and Huang, Yan and Xu, Rongtao and Su, Yifei and Ling, Yonggen and Reid, Ian and Wang, Liang},
  journal=TPAMI,
  year={2025},
  volume={47},
  number={11},
  pages={10441--10456}
}
```
