Zongchuang Zhao1*, Haoyu Fu1*, Dingkang Liang1†, Xin Zhou1, Dingyuan Zhang1, Hongwei Xie2, Bing Wang2, Xiang Bai1
1 Huazhong University of Science & Technology, 2 Xiaomi EV
(*) Equal contribution. (†) Project leader.
[2026/03/20] DriveMonkey code and dataset are now released!
NuInteract is constructed on top of nuScenes. It comprises 239K images (six single-view images and one surround-view image per frame) with high-quality dense captions, plus 1.3M samples covering diverse interactive language-based tasks, for a total of 1.5M image-text pairs.
We collect objects and their information from multiple experts: the nuScenes ground truth, GRiT for extracting bounding boxes with corresponding object descriptions, and SAM for segmenting and identifying objects, which are then described with BLIP. We filter the results using Intersection over Union (IoU) and Image-Text Matching (ITM) criteria, and feed the filtered objects and their information into Gemini to generate dense captions.
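The IoU-based filtering step can be sketched as below. The helper names, the `[x1, y1, x2, y2]` box format, and the 0.5 threshold are illustrative assumptions, not the exact criteria used in the pipeline.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes [x1, y1, x2, y2]."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_by_iou(candidate_boxes, gt_boxes, thresh=0.5):
    """Keep candidate boxes overlapping at least one GT box above `thresh`."""
    return [b for b in candidate_boxes
            if any(iou(b, g) >= thresh for g in gt_boxes)]
```

An analogous ITM score from BLIP would filter the text side; boxes passing both checks are kept.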
- Please download the dense-caption file of the NuInteract dataset, cap_public.tar.gz, and decompress it:
tar -zxvf cap_public.tar.gz
After decompressing it, the file tree is as follows:
all_caption_public/
├── 0a0d1f7700da446580874d7d1e9fce51.json
├── 0a1b4e0aa3824b0a96bafae7105c58cc.json
├── ...
├── token_name.json
└── fffce4445c964803a12a2d64023fde40.json
- Use the script load_dense_caption.py to load the dense captions and convert them to the InternVL data format. Note: set the file paths to your own.
python tools/load_dense_caption.py
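Conceptually, the conversion wraps each dense caption into an InternVL-style conversation record. The sketch below is hypothetical; the prompt text and key names are assumptions, so inspect tools/load_dense_caption.py for the exact format it emits.

```python
import json

# Hypothetical caption-to-InternVL conversion; the prompt string and the
# image path are placeholders, not the exact output of load_dense_caption.py.
def to_internvl_record(image_path, caption):
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe the scene in detail."},
            {"from": "gpt", "value": caption},
        ],
    }

record = to_internvl_record(
    "nuscenes/samples/CAM_FRONT/demo.jpg",  # placeholder path
    "A busy intersection with pedestrians crossing ahead.",
)
print(json.dumps(record, indent=2))
```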
We use predefined templates combined with object information to create data for diverse interactive driving tasks, including 2D region description, 2D visual grounding, prediction, planning, and 3D visual grounding.
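Template-based construction can be illustrated as follows. The templates, field names, and `<box>` answer tag are made up for this sketch and do not reproduce the dataset's actual templates.

```python
import random

# Illustrative question templates for a 2D visual grounding task;
# hypothetical, not the actual NuInteract templates.
TEMPLATES = [
    "Where is the {name} located in the image?",
    "Please provide the 2D bounding box of the {name}.",
]

def make_grounding_sample(obj, rng=random):
    """Combine a random template with one object's name and 2D box."""
    question = rng.choice(TEMPLATES).format(name=obj["name"])
    answer = "<box>[{}, {}, {}, {}]</box>".format(*obj["bbox"])
    return {"question": question, "answer": answer}

sample = make_grounding_sample({"name": "black SUV", "bbox": (120, 200, 340, 420)})
print(sample["question"])
print(sample["answer"])
```

The other tasks (region description, prediction, planning, 3D grounding) follow the same pattern with different templates and answer payloads.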
Please download the dataset NuInteract.zip and decompress it.
unzip NuInteract.zip
After decompressing it, the file tree is as follows:
NuInteract/
├── train/
│ ├── 2D Visual Grounding.pkl
│ ├── Region Description and Prediction.pkl
│ ├── planning.pkl
│ └── 3D Visual Grounding.pkl
├── test/
│ ├── 2D Region Description Prediction and Visual Grounding.pkl
│ ├── planning_test.pkl
│ └── 3D Visual Grounding.pkl
All files follow the conversation format of InternVL.
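A minimal sketch for loading one split and inspecting its records; the example path and the record keys are assumptions based on the common InternVL conversation layout, so verify them against the actual files.

```python
import pickle

def load_split(path):
    """Load one NuInteract .pkl split (assumed: a list of records)."""
    with open(path, "rb") as f:
        return pickle.load(f)

def summarize(records):
    """Return (number of samples, conversation turns in the first record)."""
    first = records[0]
    return len(records), len(first.get("conversations", []))

# Placeholder path -- point at your decompressed split, e.g.:
# records = load_split("NuInteract/train/planning.pkl")
# print(summarize(records))
```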
git clone https://github.com/zc-zhao/DriveMonkey.git
cd ./DriveMonkey
conda create -n DriveMonkey python=3.9
conda activate DriveMonkey
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
pip install mmdet==2.24.0 mmdet3d==1.0.0rc4 mmsegmentation==0.20.0
pip install mmcv-full==1.7.0 -f https://download.openmmlab.com/mmcv/dist/cu117/torch1.13/index.html
pip install networkx==3.2.1
pip install flash-attn==2.3.6 --no-build-isolation
For evaluation, install language_evaluation following the instructions in the language-evaluation repository.
Following InternVL (code), set the data paths in the JSON files internvl_chat/shell/internvl_nuscene_pretrain.json and internvl_chat/shell/internvl_nuscene_finetune_mixdataset_vgtextx3_rgtextx1_3dx3_capx1_planx1.json.
{
"nuscene_cap_194k": {
"root": "./nuscenes/samples",
"annotation": "your cap json path",
"data_augment": false,
"repeat_time": 1,
"length": "data_length"
}
}
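The "length" field should match the number of samples in the annotation file. A small helper like the one below can fill it in automatically; it assumes the annotation is JSONL (one sample per line), as is common for InternVL meta files — adjust if yours is a single JSON list.

```python
import json

def count_samples(annotation_path):
    """Count samples in an annotation file (JSONL assumed: one per line)."""
    with open(annotation_path) as f:
        return sum(1 for line in f if line.strip())

def fill_length(meta_path, out_path=None):
    """Replace each dataset's "length" with the real annotation size."""
    with open(meta_path) as f:
        meta = json.load(f)
    for entry in meta.values():
        entry["length"] = count_samples(entry["annotation"])
    with open(out_path or meta_path, "w") as f:
        json.dump(meta, f, indent=2)
```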
Note: set --model_name_or_path in the shell script.
bash ./shell_nuscene/internvl2.0/internvl2_2b_internlm2_1_8b_nuscene/internvl2_2b_internlm2_1_8b_dynamic_res_nuscene_pre_loadofficialweight.sh 1 8 pretrain_output_dir
bash ./shell_nuscene/internvl2.0/internvl2_1b_qwen2_0_5b_nuscene/internvl2_1b_qwen2_0_5b_dynamic_res_nuscene_finetune_vgtextx3_rgtextx1_3dx3_capx1_planx1_loadoffcialweight.sh 1 8 output_dir
Note: set the validation data path in internvl_chat/eval/nuscene/evaluate_nuscene_bev.py.
GPUS=8 bash evaluate_nuscene_bev.sh \
your_checkpoint_path \
val_task_type \
--val \
--out-dir output.pkl
To compute the 2D task metrics, run:
python ./eval/nuscene/calculate_metric/evaluate_2d_text_usecoco.py
To compute the 3D task metrics, run:
python ./eval/nuscene/calculate_metric/evaluate_map.py
python ./eval/nuscene/calculate_metric/evaluate_pr3d.py
Download our model checkpoint from Hugging Face.
This project is based on InternVL (code) and nuScenes (homepage, code). Thanks for their wonderful work.
If this work is helpful for your research, please consider citing:
@article{zhao2025extending,
title={Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving},
author={Zhao, Zongchuang and Fu, Haoyu and Liang, Dingkang and Zhou, Xin and Zhang, Dingyuan and Xie, Hongwei and Wang, Bing and Bai, Xiang},
journal={arXiv preprint arXiv:2505.08725},
year={2025}
}
