NeurIPS 2025 Submission
Zhiyi Hou1,2,3,*, Enhui Ma1,3,*, Fang Li2,*, Zhiyi Lai2, Kalok Ho2, Zhanqian Wu2,
Lijun Zhou2, Long Chen2, Chitian Sun2, Haiyang Sun2,†, Bing Wang2,
Guang Chen2, Hangjun Ye2, Kaicheng Yu1,‡
1Westlake University, 2Xiaomi EV, 3Zhejiang University
*Equal contribution. †Project leader. ‡Corresponding author.
Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage.
In this work, we introduce:
- DriveMRP-10K: A synthetic dataset of high-risk driving motions built from nuPlan using BEV-based simulation to model risks from ego-vehicle, other agents, and environment
- DriveMRP-Agent: A VLM-agnostic framework that incorporates projection-based visual prompting to bridge numerical coordinates and images
Fine-tuning with DriveMRP-10K substantially improves motion risk prediction: accident recognition accuracy rises from 27.13% to 88.03%. In zero-shot evaluation on real-world high-risk motion data, DriveMRP-Agent improves accuracy from 29.42% to 68.50%.
- Synthetic high-risk motion data generated via BEV-based simulation
- Models risks from three aspects:
- Ego-vehicle maneuvers
- Other vehicle interactions
- Environmental constraints
- Includes:
- Trajectory generation
- Human-in-the-loop labeling
- GPT-4o captions
- 10K multimodal samples for VLM training
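To make the idea of BEV-based high-risk motion synthesis concrete, here is a minimal sketch (our own illustration, not the paper's actual generation pipeline): a nominal ego trajectory is perturbed into a risky one by ramping in a lateral offset (mimicking an aggressive lane change) and stretching longitudinal spacing (mimicking emergency acceleration). The function name and parameters are hypothetical.

```python
def perturb_trajectory(waypoints, lateral_shift=1.5, accel_scale=2.0):
    """Illustrative perturbation of a nominal ego trajectory in BEV metres.

    Ramps a lateral offset from 0 to `lateral_shift` over the trajectory
    and scales longitudinal spacing up to `accel_scale`x to emulate an
    aggressive lane change combined with emergency acceleration.
    waypoints: list of (x, y) tuples, x longitudinal, y lateral.
    """
    n = len(waypoints)
    risky = []
    for i, (x, y) in enumerate(waypoints):
        frac = i / max(n - 1, 1)  # 0 at the start, 1 at the end
        risky.append((x * (1.0 + (accel_scale - 1.0) * frac),
                      y + lateral_shift * frac))
    return risky

# A straight, constant-speed trajectory becomes a risky swerve-and-accelerate one.
nominal = [(float(i), 0.0) for i in range(5)]
print(perturb_trajectory(nominal))
```

Real synthesis would additionally check the perturbed motion against other agents and map constraints before labeling it as a specific risk category.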
- VLM-agnostic architecture, instantiated here with Qwen2.5-VL-7B
- Key components:
- Projection-based visual prompting: Bridges numerical coordinates and images
- Multi-context integration: Combines BEV and front-view contexts
- Chain-of-thought reasoning: For motion risk prediction
- Processes:
- Global context injection
- Ego-vehicle perspective alignment
- Trajectory projection
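The trajectory-projection step can be sketched with a standard pinhole-camera projection: ego-frame trajectory points are transformed into the camera frame and projected to pixels, where they can be drawn as a visual prompt. This is a generic sketch, not the paper's exact implementation; the intrinsics `K` and extrinsics `T_cam_from_ego` are illustrative placeholders.

```python
import numpy as np

def project_to_image(traj_xyz, K, T_cam_from_ego):
    """Project ego-frame trajectory points (N, 3) into pixel coordinates.

    K: 3x3 camera intrinsics; T_cam_from_ego: 4x4 ego-to-camera extrinsics.
    Points behind the camera are dropped before the perspective divide.
    """
    pts = np.hstack([traj_xyz, np.ones((len(traj_xyz), 1))])  # homogeneous
    cam = (T_cam_from_ego @ pts.T).T[:, :3]   # ego frame -> camera frame
    cam = cam[cam[:, 2] > 0]                  # keep points in front of the camera
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]             # perspective divide -> (u, v)
```

The resulting pixel coordinates can then be rasterized onto the front-view image, which is how a projection-based visual prompt bridges numerical coordinates and images.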
```
DriveMRP-10K/
├── train/                  # Training samples (8,000 scenarios)
│   ├── scenario_001/
│   │   ├── bev.png         # BEV representation
│   │   ├── front_view.png  # Ego-vehicle perspective
│   │   ├── trajectory.json # Motion trajectory data
│   │   └── caption.txt     # GPT-4o generated description
│   └── ...
├── val/                    # Validation samples (1,000 scenarios)
├── test/                   # Test samples (1,000 scenarios)
└── metadata.json           # Dataset metadata and statistics
```
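A minimal loader for this layout might look as follows; the file names follow the tree above, while the helper name and returned structure are our own illustration.

```python
import json
from pathlib import Path

def load_scenario(root, split, scenario_id):
    """Load one DriveMRP-10K sample from root/split/scenario_id.

    Returns image paths (left unread) plus the parsed trajectory
    and the GPT-4o caption.
    """
    d = Path(root) / split / scenario_id
    return {
        "bev": d / "bev.png",
        "front_view": d / "front_view.png",
        "trajectory": json.loads((d / "trajectory.json").read_text()),
        "caption": (d / "caption.txt").read_text().strip(),
    }
```

Example: `load_scenario("DriveMRP-10K", "train", "scenario_001")` returns the two image paths alongside the parsed trajectory and caption, ready to be assembled into a multimodal VLM training sample.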
Dataset Statistics:
| Split | Scenarios | Risk Categories |
|---|---|---|
| Train | 8,000 | 4 |
| Val | 1,000 | 4 |
| Test | 1,000 | 4 |
Risk Categories:
- Collision risk
- Emergency acceleration
- Emergency braking
- Illegal lane change
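Scoring accuracy against these categories requires mapping free-form VLM answers onto discrete labels. A simple keyword heuristic (our illustration, not necessarily the paper's evaluation protocol) could look like:

```python
# Ordered keyword -> label map; the first match wins. Labels are illustrative.
RISK_KEYWORDS = {
    "collision": "collision_risk",
    "acceler": "emergency_acceleration",
    "brak": "emergency_braking",
    "lane": "illegal_lane_change",
}

def parse_risk_label(answer: str) -> str:
    """Map a free-form model answer onto one of the four risk categories,
    falling back to 'no_risk' when no keyword matches."""
    text = answer.lower()
    for key, label in RISK_KEYWORDS.items():
        if key in text:
            return label
    return "no_risk"
```

With labels extracted this way, accuracy, recall, and F1-score can be computed against the ground-truth categories of each scenario.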
Main Results on the DriveMRP-10K Benchmark:
| Method | ROUGE-1-F1 | ROUGE-2-F1 | ROUGE-L-F1 | BERTScore | Accuracy | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| EM-VLM4AD-Base | 14.88 | 1.38 | 11.09 | 45.70 | - | - | - |
| Llava-1.5-7B | 42.67 | 11.44 | 27.23 | 65.18 | 22.34 | 1.72 | 0.85 |
| InternVL2-8B | 51.15 | 16.84 | 31.11 | 69.66 | 18.35 | 3.20 | 2.98 |
| InternVL2.5-8B | 49.89 | 15.07 | 29.21 | 68.70 | 26.86 | 9.58 | 4.79 |
| Llama3.2-vision-11B | 23.50 | 7.07 | 15.48 | 57.10 | 11.32 | 1.12 | 0.83 |
| Qwen2.5-VL-7B-Instruct | 48.54 | 15.99 | 30.72 | 68.83 | 27.13 | 13.76 | 6.66 |
| DriveMRP-Agent (Ours) | 69.08 | 42.23 | 52.93 | 81.25 | 88.03 | 89.44 | 89.12 |
Zero-Shot Results on Real-World High-Risk Motion Data:
| Method | ROUGE-1-F1 | ROUGE-2-F1 | ROUGE-L-F1 | BERTScore | Accuracy | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| InternVL2-8B | 52.42 | 18.19 | 32.44 | 70.72 | 22.75 | 13.65 | 9.55 |
| InternVL2.5-8B | 55.14 | 20.58 | 34.45 | 71.87 | 24.28 | 12.18 | 8.34 |
| Qwen2.5-VL-7B-Instruct | 34.36 | 18.58 | 24.83 | 66.50 | 29.42 | 22.06 | 13.61 |
| DriveMRP-Agent (Ours) | 62.74 | 30.82 | 42.35 | 76.69 | 68.50 | 51.37 | 56.18 |
Effect of DriveMRP-10K Fine-Tuning across VLM Backbones:
| Method | ROUGE-1-F1 | ROUGE-2-F1 | ROUGE-L-F1 | BERTScore | Accuracy | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| Llava-1.5-7B | 42.67 | 11.44 | 27.23 | 65.18 | 22.34 | 1.72 | 0.85 |
| + DriveMRP-10K | 63.22 | 34.66 | 45.57 | 77.52 | 59.04 | 24.11 | 25.99 |
| Llama3.2-vision-11B | 23.50 | 7.07 | 15.48 | 57.10 | 11.32 | 1.12 | 0.83 |
| + DriveMRP-10K | 52.43 | 33.63 | 36.47 | 70.65 | 56.05 | 22.04 | 23.03 |
| Qwen2.5-VL-7B-Instruct | 48.54 | 15.99 | 30.72 | 68.83 | 27.13 | 13.76 | 6.66 |
| + DriveMRP-10K | 69.08 | 42.23 | 52.93 | 81.25 | 88.03 | 89.44 | 89.12 |
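ROUGE-L, reported throughout the tables, scores the longest common subsequence (LCS) of tokens between a reference and a candidate caption. A stdlib-only sketch of its F1 variant (production evaluations typically use a dedicated package such as `rouge-score`):

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on whitespace tokens: LCS length scored against
    the lengths of both the reference and the candidate."""
    ref, cand = reference.split(), candidate.split()
    m, n = len(ref), len(cand)
    # Classic O(m*n) dynamic program for LCS length.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ref[i] == cand[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("a b c d", "a b x d")` gives 0.75: the LCS "a b d" has length 3 against two length-4 sequences.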
Qualitative Examples:
- Ground truth: Illegal lane change
- DriveMRP-Agent correctly identifies the risk, while baselines misclassify the scene as "no risk"
- Ground truth: Abnormal deceleration
- DriveMRP detects risk from trajectory color changes
- Ground truth: Collision risk
- DriveMRP identifies threat from trajectory proximity to obstacles
Demo Videos:
| Scenario | Video |
|---|---|
| Emergency Acceleration | acc-1.mp4 |
| Emergency Braking | dec-1.mp4 |
| Collision | col.mp4 |
| Illegal Lane Change | change_lane.mp4 |
@inproceedings{hou2025drivemrp,
title = {DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction},
author = {Hou, Zhiyi and Ma, Enhui and Li, Fang and Lai, Zhiyi and Ho, Kalok and Wu, Zhanqian and Zhou, Lijun and Chen, Long and Sun, Chitian and Sun, Haiyang and Wang, Bing and Chen, Guang and Ye, Hangjun and Yu, Kaicheng},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2025},
note = {Equal contribution between the first three authors. Haiyang Sun is the project leader.},
url = {https://openreview.net/forum?id=anonymous_id}
}

This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
- This project page template was adapted from the Academic Project Page Template
- Built upon the Qwen vision-language models
- Dataset generated using the nuPlan dataset
- Research supported by Zhejiang University, Westlake University, and Xiaomi EV