
Uni-NaVid

A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks. This project contains the finetuning and evaluation code of our RSS 2025 paper.

Contributors: Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, He Wang

[Paper & Appendices] [Project Page]

(Figure: Uni-NaVid pipeline overview)

Release

  • Training Code
  • Offline Evaluation Code
  • Benchmark Evaluation Code
    • VLN-CE
    • EVT-Bench
  • A small split of VLN-CE RxR data

Contents

Install

First, clone this repo:

git clone git@github.com:jzhzhang/Uni-NaVid.git

Then install the package and dependencies:

conda create -n uninavid python=3.10 -y
conda activate uninavid
cd Uni-NaVid
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Finally, install the flash-attn package:

pip install flash-attn==2.5.9.post1
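
If the build succeeds, a quick sanity check (a minimal sketch, not part of the official setup) can confirm that PyTorch sees the GPU and that flash-attn imports cleanly:

```python
# Sanity check for the uninavid environment; run inside `conda activate uninavid`.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)  # expect 2.5.9.post1 as pinned above
```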

Preparation

Model

To train our model, you need to download the vision encoder and the language model. Below are the download links for the models used in our paper:

| Model type | Model name | Download |
| --- | --- | --- |
| Encoder | EVA-CLIP | ckpt |
| Pretrained model | Vicuna-7B | ckpt |
| Finetuned model | Uni-NaVid (7B) | ckpt |

Data

We provide a small subset of the data used in our paper to facilitate quick reproduction and customization with your own data. The data can be downloaded from here. It is collected from navigation tasks, including the training splits of VLN-CE R2R and RxR, EVT-Bench, ObjectNav, and EQA. Note that due to licensing restrictions, we did not use the L3MVN method for ObjectNav imitation learning, which may result in a slight performance drop in ObjectNav evaluation.

We recommend organizing your project directory as follows:

Uni-NaVid
├── data
│   └── Nav-Finetune
│       ├── nav_videos
│       └── open_uninavid_sampled_500.json
├── model_zoo
│   ├── eva_vit_g.pth
│   ├── <vicuna_weights>    # optional, if you want to finetune from Vicuna
│   └── <uninavid_weights>
├── scripts
├── uninavid
└── test_cases              # optional, if you want to offline evaluate Uni-NaVid
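
To get a feel for the annotation format before training, a minimal sketch like the one below can be used to inspect the sampled JSON file. The path follows the layout above; the entry schema is an assumption here (it assumes the file is a JSON list), so print one entry to see the actual keys shipped with the release.

```python
import json

# Peek at the released annotation file; path follows the recommended layout above.
with open("data/Nav-Finetune/open_uninavid_sampled_500.json") as f:
    samples = json.load(f)  # assumed to be a list of sample dicts

print(f"loaded {len(samples)} samples")
print(json.dumps(samples[0], indent=2)[:1000])  # inspect one entry to see the actual schema
```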

Train

Please set the DATA_PATH and MODEL_PATH in the uninavid_stage_1.sh and uninavid_stage_2.sh scripts to your data and model paths.

If you want to finetune from Vicuna-7B (make sure you collect sufficient data):

bash uninavid_stage_1.sh

If you want to finetune based on Uni-NaVid:

bash uninavid_stage_2.sh

Evaluation

During evaluation, the model leverages online token merging (run_type=eval), achieving an inference speed of approximately 5 Hz on a single A100 GPU. By employing more advanced techniques, such as quantization, the speed can be further enhanced.
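
If you want to verify the inference rate on your own hardware, a rough timing loop along the following lines works; `run_inference` is a hypothetical stand-in for however you invoke one Uni-NaVid forward pass in the evaluation code, not an API exposed by this repo:

```python
import time

def measure_rate(run_inference, n_steps=50):
    """Return average inference frequency in Hz for a zero-argument callable.

    `run_inference` is a hypothetical wrapper around one forward pass
    (video frames + instruction -> next action); adapt it to the actual
    evaluation code.
    """
    run_inference()  # warm-up step, excluded from timing
    start = time.time()
    for _ in range(n_steps):
        run_inference()
    return n_steps / (time.time() - start)
```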

Offline Evaluation

We provide the offline evaluation code of Uni-NaVid on real-world videos, including a VLN sample vln_1 and a tracking sample tracking_1. You can download the sample videos from here.

python offline_eval_uninavid.py test_cases/vln_1 output_dir # or test_cases/tracking_1
vln.1.mp4

(move to the chair, then turn left and move forward to the humanoid robot and stop.)

track.1.mp4

(follow the man with black top and brown pants.)
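
To run the offline script on your own recordings, you first need a test-case folder analogous to vln_1 / tracking_1. The sketch below only covers the generic step of dumping a video into frames with OpenCV; check the downloaded samples for the exact directory layout and instruction format that offline_eval_uninavid.py expects.

```python
import os
import cv2  # pip install opencv-python

def video_to_frames(video_path, out_dir, stride=1):
    """Dump every `stride`-th frame of `video_path` into `out_dir` as JPEGs."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```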

Benchmark Evaluation

We provide the evaluation code of Uni-NaVid on VLN-CE R2R/RxR and EVT Bench.

Find the VLN-CE benchmark evaluation code here.

| Evaluation | Benchmark | TL | NE | OS | SR | SPL |
| --- | --- | --- | --- | --- | --- | --- |
| Uni-NaVid | VLN-CE R2R Val. | 9.22 | 4.96 | 57.4 | 51.8 | 47.7 |
| Uni-NaVid | VLN-CE RxR Val. | 18.4 | 5.67 | 66.4 | 56.1 | 44.5 |
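
For reference, TL/NE/OS/SR/SPL are the standard VLN-CE metrics (trajectory length, navigation error, oracle success, success rate, and success weighted by path length). A minimal sketch of the SPL definition, independent of the benchmark code in this repo:

```python
def spl(successes, agent_path_lengths, shortest_path_lengths):
    """Success weighted by Path Length (Anderson et al., 2018).

    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    where S_i is 1 on success, p_i is the agent's path length,
    and l_i is the shortest-path length to the goal.
    """
    terms = [
        s * (l / max(p, l))
        for s, p, l in zip(successes, agent_path_lengths, shortest_path_lengths)
    ]
    return sum(terms) / len(terms)
```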

Find the EVT-Bench evaluation code here.

| Evaluation | Benchmark | SR | TR | CR |
| --- | --- | --- | --- | --- |
| Uni-NaVid | EVT-Bench STT | 53.3 | 67.2 | 12.6 |
| Uni-NaVid | EVT-Bench DT | 31.9 | 50.1 | 21.3 |
| Uni-NaVid | EVT-Bench AT | 15.8 | 41.5 | 26.5 |

Citation

If you find this work useful for your research, please consider citing:

@article{zhang2024uni,
    title={Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks},
    author={Zhang, Jiazhao and Wang, Kunyu and Wang, Shaoan and Li, Minghan and Liu, Haoran and Wei, Songlin and Wang, Zhongyuan and Zhang, Zhizheng and Wang, He},
    journal={Robotics: Science and Systems},
    year={2025}
}

Acknowledgments

Our code is based on LLaMA-VID and NaVid.

This is an open-source version of Uni-NaVid; some functions have been rewritten to avoid certain licensing restrictions.

If you have any questions, feel free to email Jiazhao Zhang at [email protected].
