
Uni-NaVid

A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks. This project contains the finetuning and evaluation code of our RSS 2025 paper.

Contributors: Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, He Wang

[Paper & Appendices] [Project Page]

(Figure: Uni-NaVid pipeline overview)

Release

  • Training Code
  • Offline Evaluation Code
  • Benchmark Evaluation Code
    • VLN-CE
    • EVT-Bench
  • A small split of VLN-CE RxR data

Contents

Install

First, clone this repo:

git clone git@github.com:jzhzhang/Uni-NaVid.git

Then install the package and dependencies:

conda create -n uninavid python=3.10 -y
conda activate uninavid
cd Uni-NaVid
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Finally, install the flash-attn package:

pip install flash-attn==2.5.9.post1
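
If the build succeeds, a quick sanity check (a minimal sketch, not part of the official setup) can confirm that PyTorch sees the GPU and that flash-attn imports cleanly:

```python
# Sanity check for the uninavid environment; run inside `conda activate uninavid`.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)  # expect 2.5.9.post1 as pinned above
```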

Preparation

Model

To train our model, you need to download the vision encoder and the language model. Below are the download links for the models used in our paper:

| Model type | Model name | Download |
| --- | --- | --- |
| Encoder | EVA-CLIP | ckpt |
| Pretrained model | Vicuna-7B | ckpt |
| Finetuned model | Uni-NaVid (7B) | ckpt |

Data

We provide a small subset of the data used in our paper to facilitate quick reproduction and customization with your own data. The data can be downloaded from here. It is collected from navigation tasks, including the training splits of VLN-CE R2R and RxR, EVT-Bench, ObjectNav, and EQA. Note that due to licensing restrictions, we did not use the L3MVN method for ObjectNav imitation learning, which may result in a slight performance drop in ObjectNav evaluation.

We recommend organizing your project directory as follows:

Uni-NaVid
├── data
│   └── Nav-Finetune
│       ├── nav_videos
│       └── open_uninavid_sampled_500.json
├── model_zoo
│   ├── eva_vit_g.pth
│   ├── <vicuna_weights>    # optional, if you want to finetune from Vicuna
│   └── <uninavid_weights>
├── scripts
├── uninavid
└── test_cases              # optional, if you want to offline evaluate Uni-NaVid
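
To get a feel for the annotation format before training, a minimal sketch like the one below can be used to inspect the sampled JSON file. The path follows the layout above; the entry schema is an assumption here (it assumes the file is a JSON list), so print one entry to see the actual keys shipped with the release.

```python
import json

# Peek at the released annotation file; path follows the recommended layout above.
with open("data/Nav-Finetune/open_uninavid_sampled_500.json") as f:
    samples = json.load(f)  # assumed to be a list of sample dicts

print(f"loaded {len(samples)} samples")
print(json.dumps(samples[0], indent=2)[:1000])  # inspect one entry to see the actual schema
```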

Train

Please set the DATA_PATH and MODEL_PATH in the uninavid_stage_1.sh and uninavid_stage_2.sh scripts to your data and model paths.

If you want to finetune from Vicuna-7B (make sure you collect sufficient data):

bash uninavid_stage_1.sh

If you want to finetune based on Uni-NaVid:

bash uninavid_stage_2.sh

Evaluation

During evaluation, the model leverages online token merging (run_type=eval), achieving an inference speed of approximately 5 Hz on a single A100 GPU. By employing more advanced techniques, such as quantization, the speed can be further enhanced.
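
If you want to verify the inference rate on your own hardware, a rough timing loop along the following lines works; `run_inference` is a hypothetical stand-in for however you invoke one Uni-NaVid forward pass in the evaluation code, not an API exposed by this repo:

```python
import time

def measure_rate(run_inference, n_steps=50):
    """Return average inference frequency in Hz for a zero-argument callable.

    `run_inference` is a hypothetical wrapper around one forward pass
    (video frames + instruction -> next action); adapt it to the actual
    evaluation code.
    """
    run_inference()  # warm-up step, excluded from timing
    start = time.time()
    for _ in range(n_steps):
        run_inference()
    return n_steps / (time.time() - start)
```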

Offline Evaluation

We provide the offline evaluation code of Uni-NaVid on real-world videos, including a VLN sample vln_1 and a tracking sample tracking_1. You can download the sample videos from here.

python offline_eval_uninavid.py test_cases/vln_1 output_dir # or test_cases/tracking_1
vln.1.mp4

(move to the chair, then turn left and move forward to the humanoid robot and stop.)

track.1.mp4

(follow the man with black top and brown pants.)
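
To run the offline script on your own recordings, you first need a test-case folder analogous to vln_1 / tracking_1. The sketch below only covers the generic step of dumping a video into frames with OpenCV; check the downloaded samples for the exact directory layout and instruction format that offline_eval_uninavid.py expects.

```python
import os
import cv2  # pip install opencv-python

def video_to_frames(video_path, out_dir, stride=1):
    """Dump every `stride`-th frame of `video_path` into `out_dir` as JPEGs."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```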

Benchmark Evaluation

We provide the evaluation code of Uni-NaVid on VLN-CE R2R/RxR and EVT Bench.

Find the VLN-CE benchmark evaluation code here.

| Evaluation | Benchmark | TL | NE | OS | SR | SPL |
| --- | --- | --- | --- | --- | --- | --- |
| Uni-NaVid | VLN-CE R2R Val. | 9.22 | 4.96 | 57.4 | 51.8 | 47.7 |
| Uni-NaVid | VLN-CE RxR Val. | 18.4 | 5.67 | 66.4 | 56.1 | 44.5 |
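
For reference, TL/NE/OS/SR/SPL are the standard VLN-CE metrics (trajectory length, navigation error, oracle success, success rate, and success weighted by path length). A minimal sketch of the SPL definition, independent of the benchmark code in this repo:

```python
def spl(successes, agent_path_lengths, shortest_path_lengths):
    """Success weighted by Path Length (Anderson et al., 2018).

    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    where S_i is 1 on success, p_i is the agent's path length,
    and l_i is the shortest-path length to the goal.
    """
    terms = [
        s * (l / max(p, l))
        for s, p, l in zip(successes, agent_path_lengths, shortest_path_lengths)
    ]
    return sum(terms) / len(terms)
```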

Find the EVT-Bench evaluation code here.

| Evaluation | Benchmark | SR | TR | CR |
| --- | --- | --- | --- | --- |
| Uni-NaVid | EVT-Bench STT | 53.3 | 67.2 | 12.6 |
| Uni-NaVid | EVT-Bench DT | 31.9 | 50.1 | 21.3 |
| Uni-NaVid | EVT-Bench AT | 15.8 | 41.5 | 26.5 |

Citation

If you find this work useful for your research, please consider citing:

@article{zhang2024uni,
    title={Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks},
    author={Zhang, Jiazhao and Wang, Kunyu and Wang, Shaoan and Li, Minghan and Liu, Haoran and Wei, Songlin and Wang, Zhongyuan and Zhang, Zhizheng and Wang, He},
    journal={Robotics: Science and Systems},
    year={2025}
}

Acknowledgments

Our code is based on LLaMA-VID and NaVid.

This is an open-source version of Uni-NaVid; some functions have been rewritten to avoid certain licensing restrictions.

If you have any questions, feel free to email Jiazhao Zhang at [email protected].
