A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks. This project contains the finetuning and evaluation code of our RSS 2025 paper.
Contributors: Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, He Wang
[Paper & Appendices] [Project Page]
- Training Code
- Offline Evaluation Code
- Benchmark Evaluation Code
- VLN-CE
- EVT-Bench
- A small split of VLN-CE RxR data
- Install
- Preparation
- Model Preparation
- Data Preparation
- Train
- Evaluation
- Offline Evaluation
- Benchmark Evaluation
- Citation
- Acknowledgments
First, clone this repo:
git clone git@github.com:jzhzhang/Uni-NaVid.git
Then install the package and dependencies:
conda create -n uninavid python=3.10 -y
conda activate uninavid
cd Uni-NaVid
pip install --upgrade pip # enable PEP 660 support
pip install -e .
Finally, install the flash-attn package:
pip install flash-attn==2.5.9.post1
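To verify the installation (an optional sanity check, not part of the original setup), you can confirm that flash-attn imports correctly:
python -c "import flash_attn; print(flash_attn.__version__)"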
To train our model, you need to download the vision encoder and the language model. Below are download links for the models used in our paper:
| Model type | Model name | Download |
|---|---|---|
| Encoder | EVA-CLIP | ckpt |
| Pretrain model | Vicuna-7B | ckpt |
| Finetuned model | Uni-NaVid (7B) | ckpt |
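As an optional sanity check after downloading (a minimal sketch, assuming the EVA-CLIP checkpoint is a standard PyTorch state dict saved at model_zoo/eva_vit_g.pth), you can confirm the encoder weights load:
python -c "import torch; sd = torch.load('model_zoo/eva_vit_g.pth', map_location='cpu'); print(len(sd), 'entries loaded')"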
We provide a small subset of the data used in our paper to facilitate quick reproduction and customization with your own data. The data can be downloaded from here. It is collected from navigation tasks including the training splits of VLN-CE R2R and RxR, EVT-Bench, ObjectNav, and EQA. Note that, due to licensing restrictions, we did not use the L3MVN method for ObjectNav imitation learning, which may result in a slight performance drop in ObjectNav evaluation.
We recommend organizing your project directory as follows:
Uni-NaVid
├── data
│   ├── Nav-Finetune
│   │   ├── nav_videos
│   │   ├── open_uninavid_sampled_500.json
├── model_zoo
│   ├── eva_vit_g.pth
│   ├── <vicuna_weights>   # optional, if you want to finetune from Vicuna
│   ├── <uninavid_weights>
├── scripts
├── uninavid
├── test_cases             # optional, if you want to offline evaluate Uni-NaVid
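Before training, you can take a quick look at the annotation file (a minimal sketch, assuming open_uninavid_sampled_500.json is standard JSON; the exact fields may differ). This prints the top-level type and the number of entries:
python -c "import json; d = json.load(open('data/Nav-Finetune/open_uninavid_sampled_500.json')); print(type(d).__name__, len(d))"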
Please set the DATA_PATH and MODEL_PATH in the uninavid_stage_1.sh and uninavid_stage_2.sh scripts to your data and model paths.
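For example, following the directory layout above (how the scripts consume these variables may differ, so adjust to match your local copies):
DATA_PATH=./data/Nav-Finetune
MODEL_PATH=./model_zoo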
If you want to finetune from Vicuna-7B (make sure you collect sufficient data):
bash uninavid_stage_1.sh
If you want to finetune based on Uni-NaVid:
bash uninavid_stage_2.sh
During evaluation, the model uses online token merging (run_type=eval), achieving an inference speed of approximately 5 Hz on a single A100 GPU. The speed can be further improved with more advanced techniques such as quantization.
We provide offline evaluation code for Uni-NaVid on real-world videos, including a VLN sample (vln_1) and a tracking sample (tracking_1). You can download the sample videos from here.
python offline_eval_uninavid.py test_cases/vln_1 output_dir # or test_cases/tracking_1
vln.1.mp4 (instruction: "move to the chair, then turn left and move forward to the humanoid robot and stop.")
track.1.mp4 (instruction: "follow the man with black top and brown pants.")
We provide the evaluation code of Uni-NaVid on the VLN-CE R2R/RxR and EVT-Bench benchmarks.
Find the VLN-CE benchmark evaluation code here.
| Evaluation Benchmark | TL (m) | NE (m) | OS (%) | SR (%) | SPL (%) |
|---|---|---|---|---|---|
| Uni-NaVid VLN-CE R2R Val. | 9.22 | 4.96 | 57.4 | 51.8 | 47.7 |
| Uni-NaVid VLN-CE RxR Val. | 18.4 | 5.67 | 66.4 | 56.1 | 44.5 |
Find the EVT-bench evaluation code here.
| Evaluation Benchmark | SR | TR | CR |
|---|---|---|---|
| Uni-NaVid EVT-Bench STT | 53.3 | 67.2 | 12.6 |
| Uni-NaVid EVT-Bench DT | 31.9 | 50.1 | 21.3 |
| Uni-NaVid EVT-Bench AT | 15.8 | 41.5 | 26.5 |
If you find this work useful for your research, please consider citing:
@article{zhang2024uni,
title={Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks},
author={Zhang, Jiazhao and Wang, Kunyu and Wang, Shaoan and Li, Minghan and Liu, Haoran and Wei, Songlin and Wang, Zhongyuan and Zhang, Zhizheng and Wang, He},
journal={Robotics: Science and Systems},
year={2025}
}
Our code is based on LLaMA-VID and NaVid.
This is an open-source version of Uni-NaVid; some functions have been rewritten to avoid certain licensing restrictions.
If you have any questions, feel free to email Jiazhao Zhang at [email protected].
