
DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction

License: CC BY-SA 4.0

NeurIPS 2025 Submission

πŸ‘₯ Authors

Zhiyi Hou<sup>1,2,3,*</sup>, Enhui Ma<sup>1,3,*</sup>, Fang Li<sup>2,*</sup>, Zhiyi Lai<sup>2</sup>, Kalok Ho<sup>2</sup>, Zhanqian Wu<sup>2</sup>,
Lijun Zhou<sup>2</sup>, Long Chen<sup>2</sup>, Chitian Sun<sup>2</sup>, Haiyang Sun<sup>2,†</sup>, Bing Wang<sup>2</sup>,
Guang Chen<sup>2</sup>, Hangjun Ye<sup>2</sup>, Kaicheng Yu<sup>1,✉</sup>

<sup>1</sup>Westlake University, <sup>2</sup>Xiaomi EV, <sup>3</sup>Zhejiang University
<sup>*</sup>Equal contribution. <sup>†</sup>Project leader. <sup>✉</sup>Corresponding author.

Method Overview

📌 Table of Contents

  • 🔍 Abstract
  • 🧠 Method Overview
  • 🗃️ Dataset Structure
  • 📊 Results
  • 📝 Citation
  • 📜 License
  • 🙏 Acknowledgements

πŸ” Abstract

Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle's future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage.

In this work, we introduce:

  1. DriveMRP-10K: A synthetic dataset of high-risk driving motions built from nuPlan using BEV-based simulation to model risks from ego-vehicle, other agents, and environment
  2. DriveMRP-Agent: A VLM-agnostic framework that incorporates projection-based visual prompting to bridge numerical coordinates and images

By fine-tuning with DriveMRP-10K, our framework significantly improves motion risk prediction performance, with accident recognition accuracy soaring from 27.13% to 88.03%. When tested via zero-shot evaluation on real-world high-risk motion data, DriveMRP-Agent boosts accuracy from 29.42% to 68.50%.

🧠 Method Overview

πŸ—‚οΈ 1. DriveMRP-10K Dataset

Dataset Generation

  • Synthetic high-risk motion data generated via BEV-based simulation
  • Models risks from three aspects:
    • Ego-vehicle maneuvers
    • Other vehicle interactions
    • Environmental constraints
  • Includes:
    • Trajectory generation
    • Human-in-the-loop labeling
    • GPT-4o captions
  • 10K multimodal samples for VLM training

πŸ€– 2. DriveMRP-Agent Framework

Framework Architecture

  • VLM-agnostic architecture based on Qwen2.5VL-7B
  • Key components:
    • Projection-based visual prompting: Bridges numerical coordinates and images
    • Multi-context integration: Combines BEV and front-view contexts
    • Chain-of-thought reasoning: For motion risk prediction
  • Processes:
    1. Global context injection
    2. Ego-vehicle perspective alignment
    3. Trajectory projection
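Step 3, trajectory projection, can be sketched with a standard pinhole camera model: ego-frame trajectory points are mapped into front-view pixel coordinates so the VLM sees the motion overlaid on the image rather than as raw numbers. The intrinsics `K` and extrinsics `T` below are hypothetical values for illustration; real calibration comes from the dataset.

```python
import numpy as np

def project_points(points_ego, K, T_cam_from_ego):
    """Project ego-frame 3D trajectory points into front-view pixels via a
    pinhole model. Sketch only: real K and T come from camera calibration."""
    pts = np.asarray(points_ego, dtype=float)      # (N, 3) ego-frame points
    pts_h = np.c_[pts, np.ones(len(pts))]          # homogeneous (N, 4)
    cam = (T_cam_from_ego @ pts_h.T).T[:, :3]      # transform to camera frame
    cam = cam[cam[:, 2] > 0]                       # keep points in front of camera
    uv = (K @ cam.T).T                             # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]                  # perspective divide -> pixels

# Hypothetical calibration for illustration.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)  # assume camera frame coincides with ego frame
pixels = project_points([(0.0, 0.0, 10.0), (1.0, 0.0, 20.0)], K, T)
```

The projected pixel polyline can then be drawn onto the front-view image as the visual prompt.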

πŸ—ƒοΈ Dataset Structure

```
DriveMRP-10K/
├── train/                  # Training samples (8,000 scenarios)
│   ├── scenario_001/
│   │   ├── bev.png         # BEV representation
│   │   ├── front_view.png  # Ego-vehicle perspective
│   │   ├── trajectory.json # Motion trajectory data
│   │   └── caption.txt     # GPT-4o generated description
│   └── ...
├── val/                    # Validation samples (1,000 scenarios)
├── test/                   # Test samples (1,000 scenarios)
└── metadata.json           # Dataset metadata and statistics
```
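A minimal loader for one scenario folder might look like the sketch below. The directory layout follows the tree above, but the JSON schema inside `trajectory.json` is an assumption; the demo writes a throwaway scenario to a temp directory so the snippet runs without the real dataset.

```python
import json
import tempfile
from pathlib import Path

def load_scenario(scenario_dir):
    """Sketch of a loader for one DriveMRP-10K scenario folder.
    The trajectory.json schema used here is an assumption."""
    d = Path(scenario_dir)
    return {
        "trajectory": json.loads((d / "trajectory.json").read_text()),
        "caption": (d / "caption.txt").read_text().strip(),
        "bev_image": d / "bev.png",              # open with PIL/OpenCV as needed
        "front_view_image": d / "front_view.png",
    }

# Demo with a throwaway scenario directory (the real data ships with the dataset).
demo = Path(tempfile.mkdtemp()) / "scenario_001"
demo.mkdir()
(demo / "trajectory.json").write_text(json.dumps({"points": [[0, 0], [1, 0]]}))
(demo / "caption.txt").write_text("Ego vehicle drifts toward the adjacent lane.\n")
sample = load_scenario(demo)
```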

Dataset Statistics:

| Split | Scenarios | Risk Categories |
|-------|-----------|-----------------|
| Train | 8,000     | 4               |
| Val   | 1,000     | 4               |
| Test  | 1,000     | 4               |

Risk Categories:

  1. Collision risk πŸš—πŸ’₯
  2. Emergency acceleration πŸš€
  3. Emergency braking βœ‹
  4. Illegal lane change ↔️
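For downstream code, the four categories above can be represented as an enum with a simple caption-to-label mapping. Both the integer ids and the keyword heuristic are illustrative assumptions, not the dataset's official encoding.

```python
from enum import Enum

class RiskCategory(Enum):
    """The four DriveMRP-10K risk categories; integer ids are illustrative,
    not the dataset's official encoding."""
    COLLISION = 0
    EMERGENCY_ACCELERATION = 1
    EMERGENCY_BRAKING = 2
    ILLEGAL_LANE_CHANGE = 3

def label_from_caption(caption):
    """Naive keyword mapping from a caption string to a category (sketch only)."""
    text = caption.lower()
    if "collision" in text:
        return RiskCategory.COLLISION
    if "brak" in text:                 # matches "brake"/"braking"
        return RiskCategory.EMERGENCY_BRAKING
    if "acceler" in text:              # matches "accelerate"/"acceleration"
        return RiskCategory.EMERGENCY_ACCELERATION
    if "lane change" in text:
        return RiskCategory.ILLEGAL_LANE_CHANGE
    return None
```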

πŸ“Š Results

1. Performance on Synthetic Dataset (DriveMRP-10K)

| Method | ROUGE-1-F1 | ROUGE-2-F1 | ROUGE-L-F1 | BERTScore | Accuracy | Recall | F1-score |
|--------|-----------|-----------|-----------|-----------|----------|--------|----------|
| EM-VLM4AD-Base | 14.88 | 1.38 | 11.09 | 45.70 | - | - | - |
| Llava-1.5-7B | 42.67 | 11.44 | 27.23 | 65.18 | 22.34 | 1.72 | 0.85 |
| InternVL2-8B | 51.15 | 16.84 | 31.11 | 69.66 | 18.35 | 3.20 | 2.98 |
| InternVL2.5-8B | 49.89 | 15.07 | 29.21 | 68.70 | 26.86 | 9.58 | 4.79 |
| Llama3.2-vision-11B | 23.50 | 7.07 | 15.48 | 57.10 | 11.32 | 1.12 | 0.83 |
| Qwen2.5-VL-7B-Instruct | 48.54 | 15.99 | 30.72 | 68.83 | 27.13 | 13.76 | 6.66 |
| DriveMRP-Agent (Ours) | 69.08 | 42.23 | 52.93 | 81.25 | 88.03 | 89.44 | 89.12 |

2. Zero-Shot Performance on Real-World Dataset

| Method | ROUGE-1-F1 | ROUGE-2-F1 | ROUGE-L-F1 | BERTScore | Accuracy | Recall | F1-score |
|--------|-----------|-----------|-----------|-----------|----------|--------|----------|
| InternVL2-8B | 52.42 | 18.19 | 32.44 | 70.72 | 22.75 | 13.65 | 9.55 |
| InternVL2.5-8B | 55.14 | 20.58 | 34.45 | 71.87 | 24.28 | 12.18 | 8.34 |
| Qwen2.5-VL-7B-Instruct | 34.36 | 18.58 | 24.83 | 66.50 | 29.42 | 22.06 | 13.61 |
| DriveMRP-Agent (Ours) | 62.74 | 30.82 | 42.35 | 76.69 | 68.50 | 51.37 | 56.18 |

3. Performance Gains with DriveMRP-10K Fine-tuning

| Method | ROUGE-1-F1 | ROUGE-2-F1 | ROUGE-L-F1 | BERTScore | Accuracy | Recall | F1-score |
|--------|-----------|-----------|-----------|-----------|----------|--------|----------|
| Llava-1.5-7B | 42.67 | 11.44 | 27.23 | 65.18 | 22.34 | 1.72 | 0.85 |
| + DriveMRP-10K | 63.22 | 34.66 | 45.57 | 77.52 | 59.04 | 24.11 | 25.99 |
| Llama3.2-vision-11B | 23.50 | 7.07 | 15.48 | 57.10 | 11.32 | 1.12 | 0.83 |
| + DriveMRP-10K | 52.43 | 33.63 | 36.47 | 70.65 | 56.05 | 22.04 | 23.03 |
| Qwen2.5-VL-7B-Instruct | 48.54 | 15.99 | 30.72 | 68.83 | 27.13 | 13.76 | 6.66 |
| + DriveMRP-10K | 69.08 | 42.23 | 52.93 | 81.25 | 88.03 | 89.44 | 89.12 |
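The classification columns (Accuracy, Recall, F1-score) in the tables above can be reproduced from predicted and ground-truth risk labels along these lines. Whether the paper macro- or micro-averages per-class scores is an assumption; this sketch macro-averages over categories.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged recall and F1 over risk categories.
    Sketch only: the paper's exact averaging convention is an assumption."""
    labels = sorted(set(y_true) | set(y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    recalls, f1s = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        rec = tp / (tp + fn) if tp + fn else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        recalls.append(rec)
        f1s.append(f1)
    return acc, sum(recalls) / len(labels), sum(f1s) / len(labels)
```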

Qualitative Results

Case 1: Illegal Lane Change Risk

Illegal Lane Change

  • Ground truth: Illegal lane change
  • DriveMRP correctly identifies risk while baselines misclassify as "no risk"

Case 2: Abnormal Deceleration Risk

Abnormal Deceleration

  • Ground truth: Abnormal deceleration
  • DriveMRP detects risk from trajectory color changes

Case 3: Collision Risk

Collision Risk

  • Ground truth: Collision risk
  • DriveMRP identifies threat from trajectory proximity to obstacles

Risk Scenario Videos

| Scenario | Video |
|----------|-------|
| Emergency Acceleration | acc-1.mp4 |
| Emergency Braking | dec-1.mp4 |
| Collision | col.mp4 |
| Illegal Lane Change | change_lane.mp4 |

πŸ“ Citation

```bibtex
@inproceedings{hou2025drivemrp,
  title     = {DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction},
  author    = {Hou, Zhiyi and Ma, Enhui and Li, Fang and Lai, Zhiyi and Ho, Kalok and Wu, Zhanqian and Zhou, Lijun and Chen, Long and Sun, Chitian and Sun, Haiyang and Wang, Bing and Chen, Guang and Ye, Hangjun and Yu, Kaicheng},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025},
  note      = {Equal contribution between the first three authors. Haiyang Sun is the project leader.},
  url       = {https://openreview.net/forum?id=anonymous_id}
}
```

πŸ“œ License

This project is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

πŸ™ Acknowledgements

  • This project page template was adapted from the Academic Project Page Template
  • Built upon the Qwen vision-language models
  • Dataset generated using the nuPlan dataset
  • Research supported by Zhejiang University, Westlake University, and Xiaomi EV
