Zongchuang Zhao1*, Haoyu Fu1*, Dingkang Liang1†, Xin Zhou1, Dingyuan Zhang1, Hongwei Xie2, Bing Wang2, Xiang Bai1
1 Huazhong University of Science & Technology, 2 Xiaomi EV
(*) Equal contribution. (†) Project leader.
[2026/03/20] DriveMonkey code and dataset are now released!
NuInteract is constructed on top of nuScenes. It comprises 239K images (six single-view images and one surround-view image per frame) with high-quality dense captions, plus 1.3M samples covering diverse interactive language-based tasks, for a total of 1.5M image-text pairs.
We collect objects and their information from multiple experts: the nuScenes ground truth, GRiT for extracting bounding boxes with corresponding object descriptions, and SAM for segmenting and identifying objects, which are then described with BLIP. We filter the results using Intersection over Union (IoU) and Image-Text Matching (ITM) criteria, and feed the filtered objects and their information into Gemini to generate dense captions.
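The IoU-based filtering step can be sketched as below. The helper names, the `[x1, y1, x2, y2]` box format, and the 0.5 threshold are illustrative assumptions, not the exact criteria used in the pipeline.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_by_iou(candidate_boxes, gt_boxes, thresh=0.5):
    """Keep candidate boxes overlapping at least one GT box above `thresh`."""
    return [b for b in candidate_boxes
            if any(iou(b, g) >= thresh for g in gt_boxes)]
```

An analogous ITM score from BLIP would filter the text side; boxes passing both checks are kept.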
- Please download the dense-caption file of the NuInteract dataset, cap_public.tar.gz, and decompress it:
tar -zxvf cap_public.tar.gz
After decompressing it, the file tree is as follows:
all_caption_public/
├── 0a0d1f7700da446580874d7d1e9fce51.json
├── 0a1b4e0aa3824b0a96bafae7105c58cc.json
├── ...
├── token_name.json
└── fffce4445c964803a12a2d64023fde40.json
- Use the script load_dense_caption.py to load the dense captions and convert them to the InternVL data format. Note: set the file paths to your own.
python tools/load_dense_caption.py
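Conceptually, the conversion wraps each dense caption into an InternVL-style conversation record. The sketch below is hypothetical; the prompt text and key names are assumptions, so inspect tools/load_dense_caption.py for the exact format it emits.

```python
import json

# Hypothetical caption-to-InternVL conversion; the prompt string and the
# image path are placeholders, not the exact output of load_dense_caption.py.
def to_internvl_record(image_path, caption):
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe the scene in detail."},
            {"from": "gpt", "value": caption},
        ],
    }

record = to_internvl_record(
    "nuscenes/samples/CAM_FRONT/demo.jpg",  # placeholder path
    "A busy intersection with pedestrians crossing ahead.",
)
print(json.dumps(record, indent=2))
```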
We use predefined templates combined with object information to create data for diverse interactive driving tasks, including 2D region description, 2D visual grounding, prediction, planning, and 3D visual grounding.
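Template-based construction can be illustrated as follows. The templates, field names, and `<box>` answer tag are made up for this sketch and do not reproduce the dataset's actual templates.

```python
import random

# Illustrative question templates for a 2D visual grounding task;
# hypothetical, not the actual NuInteract templates.
TEMPLATES = [
    "Where is the {name} located in the image?",
    "Please provide the 2D bounding box of the {name}.",
]

def make_grounding_sample(obj, rng=random):
    """Combine a random template with one object's name and 2D box."""
    question = rng.choice(TEMPLATES).format(name=obj["name"])
    answer = "<box>[{}, {}, {}, {}]</box>".format(*obj["bbox"])
    return {"question": question, "answer": answer}

sample = make_grounding_sample({"name": "black SUV", "bbox": (120, 200, 340, 420)})
print(sample["question"])
print(sample["answer"])
```

The other tasks (region description, prediction, planning, 3D grounding) follow the same pattern with different templates and answer payloads.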
Please download the dataset NuInteract.zip and decompress it.
unzip NuInteract.zip
After decompressing it, the file tree is as follows:
NuInteract/
├── train/
│ ├── 2D Visual Grounding.pkl
│ ├── Region Description and Prediction.pkl
│ ├── planning.pkl
│ └── 3D Visual Grounding.pkl
├── test/
│ ├── 2D Region Description Prediction and Visual Grounding.pkl
│ ├── planning_test.pkl
│ └── 3D Visual Grounding.pkl
All files follow the conversation format of InternVL.
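A minimal sketch for loading one split and inspecting its records; the example path and the record keys are assumptions based on the common InternVL conversation layout, so verify them against the actual files.

```python
import pickle

def load_split(path):
    """Load one NuInteract .pkl split (assumed: a list of records)."""
    with open(path, "rb") as f:
        return pickle.load(f)

def summarize(records):
    """Return (number of samples, conversation turns in the first record)."""
    first = records[0]
    return len(records), len(first.get("conversations", []))

# Placeholder path -- point at your decompressed split, e.g.:
# records = load_split("NuInteract/train/planning.pkl")
# print(summarize(records))
```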
git clone https://github.com/zc-zhao/DriveMonkey.git
cd ./DriveMonkey
conda create -n DriveMonkey python=3.9
conda activate DriveMonkey
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
pip install mmdet==2.24.0 mmdet3d==1.0.0rc4 mmsegmentation==0.20.0
pip install mmcv-full==1.7.0 -f https://download.openmmlab.com/mmcv/dist/cu117/torch1.13/index.html
pip install networkx==3.2.1
pip install flash-attn==2.3.6 --no-build-isolation
For evaluation, install language_evaluation following the instructions in the language-evaluation repository.
Following InternVL (code), set the data paths in the JSON files internvl_chat/shell/internvl_nuscene_pretrain.json and internvl_chat/shell/internvl_nuscene_finetune_mixdataset_vgtextx3_rgtextx1_3dx3_capx1_planx1.json.
{
"nuscene_cap_194k": {
"root": "./nuscenes/samples",
"annotation": "your cap json path",
"data_augment": false,
"repeat_time": 1,
"length": "data_length"
}
}
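The "length" field should match the number of samples in the annotation file. A small helper like the one below can fill it in automatically; it assumes the annotation is JSONL (one sample per line), as is common for InternVL meta files — adjust if yours is a single JSON list.

```python
import json

def count_samples(annotation_path):
    """Count samples in an annotation file (JSONL assumed: one per line)."""
    with open(annotation_path) as f:
        return sum(1 for line in f if line.strip())

def fill_length(meta_path, out_path=None):
    """Replace each dataset's "length" with the real annotation size."""
    with open(meta_path) as f:
        meta = json.load(f)
    for entry in meta.values():
        entry["length"] = count_samples(entry["annotation"])
    with open(out_path or meta_path, "w") as f:
        json.dump(meta, f, indent=2)
```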
Note: set --model_name_or_path in the shell script.
bash ./shell_nuscene/internvl2.0/internvl2_2b_internlm2_1_8b_nuscene/internvl2_2b_internlm2_1_8b_dynamic_res_nuscene_pre_loadofficialweight.sh 1 8 pretrain_output_dir
bash ./shell_nuscene/internvl2.0/internvl2_1b_qwen2_0_5b_nuscene/internvl2_1b_qwen2_0_5b_dynamic_res_nuscene_finetune_vgtextx3_rgtextx1_3dx3_capx1_planx1_loadoffcialweight.sh 1 8 output_dir
Note: set the validation data path in internvl_chat/eval/nuscene/evaluate_nuscene_bev.py.
GPUS=8 bash evaluate_nuscene_bev.sh \
your_checkpoint_path \
val_task_type \
--val \
--out-dir output.pkl
To compute the 2D task metrics, run:
python ./eval/nuscene/calculate_metric/evaluate_2d_text_usecoco.py
To compute the 3D task metrics, run:
python ./eval/nuscene/calculate_metric/evaluate_map.py
python ./eval/nuscene/calculate_metric/evaluate_pr3d.py
Download our model checkpoint from Hugging Face.
This project is based on InternVL (code) and nuScenes (homepage, code). Thanks for their wonderful work.
If this work is helpful for your research, please consider citing:
@article{zhao2025extending,
title={Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving},
author={Zhao, Zongchuang and Fu, Haoyu and Liang, Dingkang and Zhou, Xin and Zhang, Dingyuan and Xie, Hongwei and Wang, Bing and Bai, Xiang},
journal={arXiv preprint arXiv:2505.08725},
year={2025}
}
