DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Zhe Liu1,
Runhui Huang1,
Rui Yang1,
Siming Yan2,
Zining Wang2,
Lu Hou2,
Di Lin3,
Xiang Bai4,
Hengshuang Zhao1,β
1 The University of Hong Kong,
2 Yinwang Intelligent Technology Co. Ltd.,
3 Tianjin University,
4 Huazhong University of Science and Technology
β Corresponding author.
- Unified Spatial-aware 4D MLLM Framework. DrivePI is the first unified framework that seamlessly integrates coarse-grained linguistic spatial understanding with fine-grained 3D perception capabilities, bridging the gap between vision-action (VA) and vision-language-action (VLA) paradigms in autonomous driving.
- Multi-modal Sensing. DrivePI incorporates LiDAR as a complementary sensing modality alongside camera imagery, providing high-precision 3D geometric information that better elicits the spatial understanding capabilities of MLLMs.
- Fine-grained 3D Perception and Prediction. DrivePI enables accurate 3D perception (e.g., 3D occupancy) and prediction (e.g., occupancy flow), which enhances the interpretability and safety assurance of autonomous driving systems (the data layout of these targets is sketched after this list).
- Strong Performance. Despite using only a compact 0.5B-parameter MLLM backbone (Qwen2.5), DrivePI outperforms existing VA models on 3D occupancy and occupancy flow while maintaining interactive capabilities comparable to existing VLA frameworks.
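For readers less familiar with the perception and prediction targets above, the sketch below shows the kind of data layout they typically use: a per-voxel semantic occupancy grid plus a per-voxel flow field. The grid size, voxel resolution, class ids, and the helper function are illustrative assumptions following common Occ3D/OpenOcc conventions, not the DrivePI implementation.

```python
import numpy as np

# Illustrative shapes only (not from the DrivePI codebase); they follow the common
# Occ3D/OpenOcc convention of a 200 x 200 x 16 voxel grid at 0.4 m resolution around the ego vehicle.
X, Y, Z = 200, 200, 16
NUM_CLASSES = 17           # semantic classes; an extra "free" label marks empty voxels
FREE_LABEL = NUM_CLASSES

# 3D semantic occupancy: one class index per voxel.
occupancy = np.full((X, Y, Z), FREE_LABEL, dtype=np.uint8)

# Occupancy flow: a (vx, vy) velocity in m/s for every voxel (vertical motion is
# usually negligible in driving scenes, so many benchmarks store only 2 channels).
flow = np.zeros((X, Y, Z, 2), dtype=np.float32)

def voxel_index(x_m, y_m, z_m, voxel_size=0.4,
                x_range=(-40.0, 40.0), y_range=(-40.0, 40.0), z_range=(-1.0, 5.4)):
    """Map a metric ego-frame point to a voxel index (illustrative helper)."""
    i = int((x_m - x_range[0]) / voxel_size)
    j = int((y_m - y_range[0]) / voxel_size)
    k = int((z_m - z_range[0]) / voxel_size)
    return i, j, k

# Example: mark a voxel 10 m ahead of the ego vehicle as a moving vehicle at 5 m/s.
i, j, k = voxel_index(10.0, 0.0, 0.5)
occupancy[i, j, k] = 4     # hypothetical class id for "car"
flow[i, j, k] = (5.0, 0.0)
```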
- 2026.03.21: The training and evaluation code for DrivePI has been released! Data preparation guidelines, trained models, and all associated benchmarks will be available within a week!
- 2026.02.21: DrivePI and GenieDrive have been accepted by CVPR 2026!
- 2025.12.15: DrivePI paper released.
- 2025.12.15: GenieDrive (Physics-Aware Driving World Model) paper released.
- 2025.11.04: Our previous work UniLION has been released. Check out the codebase for a unified autonomous driving model with Linear Group RNNs.
- 2024.09.26: Our work LION has been accepted by NeurIPS 2024. Visit the codebase for Linear Group RNN for 3D Object Detection.
- Release the paper.
- Release the code of DrivePI.
- Release checkpoints of DrivePI.
- Release the dataset.
- Support the Waymo E2E dataset.
- Vision-Action (VA) models take visual information (LiDAR point clouds, images) as inputs and output action signals through a modular framework. While these methods achieve promising results through accurate spatial perception, they are limited in language-based scene interaction.
- Vision-Language-Action (VLA) approaches leverage the reasoning capabilities of multimodal large language models (MLLMs). These methods achieve superior interaction capabilities but often struggle due to the absence of fine-grained intermediate 3D perception and prediction.
DrivePI bridges this gap by combining the strengths of both approaches, serving as a unified Vision-Language-Action framework that is also compatible with vision-action models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture.
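As a rough illustration of what "in parallel through end-to-end optimization" can look like, the sketch below wires fused LiDAR, image, and text tokens through one shared backbone and decodes text, occupancy, flow, and planning from separate heads. All module names, dimensions, and the query layout are hypothetical placeholders, not the actual DrivePI architecture.

```python
import torch
import torch.nn as nn

class UnifiedDrivingSketch(nn.Module):
    """Illustrative sketch only (not the DrivePI code): a shared backbone with
    parallel text / occupancy / flow / planning heads over fused multi-modal tokens."""

    def __init__(self, d_model=896, vocab_size=32000, num_occ_classes=17, plan_horizon=6):
        super().__init__()
        # Stand-in for the MLLM backbone (e.g. a Qwen2.5-0.5B-sized decoder).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = nn.Linear(d_model, vocab_size)        # language answers
        self.occ_head = nn.Linear(d_model, num_occ_classes)    # per-voxel-query semantics
        self.flow_head = nn.Linear(d_model, 2)                 # per-voxel-query (vx, vy)
        self.plan_head = nn.Linear(d_model, plan_horizon * 2)  # future (x, y) waypoints
        self.plan_horizon = plan_horizon

    def forward(self, lidar_tokens, image_tokens, text_tokens, voxel_queries, plan_query):
        # One fused sequence: sensor tokens + language tokens + task queries.
        tokens = torch.cat([lidar_tokens, image_tokens, text_tokens, voxel_queries, plan_query], dim=1)
        hidden = self.backbone(tokens)

        n_sensor = lidar_tokens.size(1) + image_tokens.size(1)
        n_text, n_vox = text_tokens.size(1), voxel_queries.size(1)
        text_h = hidden[:, n_sensor:n_sensor + n_text]
        vox_h = hidden[:, n_sensor + n_text:n_sensor + n_text + n_vox]
        plan_h = hidden[:, -1]                                  # the single planning query

        return {
            "text_logits": self.text_head(text_h),
            "occupancy_logits": self.occ_head(vox_h),
            "flow": self.flow_head(vox_h),
            "trajectory": self.plan_head(plan_h).view(-1, self.plan_horizon, 2),
        }
```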
Our multi-stage data pipeline consists of:
- Caption Annotation: We use InternVL3-78B to generate captions of front and back views separately, then merge and polish them to create comprehensive scene descriptions.
- 4D Spatial Understanding Annotation: We leverage ground-truth occupancy and flow data to generate diverse text-occupancy and text-flow QA pairs through multi-turn conversations, enabling fine-grained 3D understanding.
- Planning Reasoning Annotation: We create planning QA pairs based on future trajectory annotations to enhance planning interpretability, enabling the MLLM to predict future actions of the ego-vehicle.
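As a concrete (and heavily simplified) illustration of the planning-reasoning annotation step above, the sketch below turns a future ego-trajectory annotation into one QA pair. The thresholds, question wording, coordinate convention, and field names are assumptions for illustration only, not the released annotation pipeline.

```python
# Illustrative sketch (not the released annotation pipeline): turning a future
# ego trajectory annotation into a planning QA pair.
def trajectory_to_qa(future_xy, dt=0.5):
    """future_xy: list of (x, y) ego-frame waypoints, one every `dt` seconds."""
    dx = future_xy[-1][0] - future_xy[0][0]
    dy = future_xy[-1][1] - future_xy[0][1]

    # Very coarse maneuver label from lateral displacement (assumption: x forward, y left).
    if dy > 2.0:
        maneuver = "turn left"
    elif dy < -2.0:
        maneuver = "turn right"
    else:
        maneuver = "go straight"
    speed = (dx ** 2 + dy ** 2) ** 0.5 / (dt * (len(future_xy) - 1))

    question = "What should the ego vehicle do over the next 3 seconds, and why?"
    answer = (f"The ego vehicle should {maneuver}, travelling roughly {speed:.1f} m/s, "
              f"following the waypoints {[(round(x, 1), round(y, 1)) for x, y in future_xy]}.")
    return {"question": question, "answer": answer}

# Example: a gentle left turn annotated at 2 Hz over 3 seconds.
qa = trajectory_to_qa([(0.0, 0.0), (3.0, 0.3), (6.0, 1.0), (9.0, 2.2),
                       (11.5, 3.8), (13.5, 5.6), (15.0, 7.5)])
print(qa["answer"])
```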
Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models:
- Compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes.
- Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes.
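For reference, the relative improvements quoted above follow directly from the per-method averages reported in the tables below; a short check (ours, not from the repo):

```python
# Relative reductions quoted above, computed from the table averages below.
orion_col, drivepi_col = 0.37, 0.11   # avg. collision rate (%), ORION vs. DrivePI (w/ ego status)
vad_l2, drivepi_l2 = 0.72, 0.49       # avg. L2 error (m), VAD vs. DrivePI (w/o ego status)

print(f"Collision-rate reduction vs. ORION: {(orion_col - drivepi_col) / orion_col:.0%}")  # -> 70%
print(f"L2 reduction vs. VAD: {(vad_l2 - drivepi_l2) / vad_l2:.0%}")                       # -> 32%
```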
| Method | VLM-based | OccScore | RayIoU (3D Occ.) | mAVE (Occ. Flow) | RayIoU (1m) | RayIoU (2m) | RayIoU (4m) |
|---|---|---|---|---|---|---|---|
| OccNeRF | | 28.5 | 31.7 | -- | 16.6 | 29.3 | 49.2 |
| RenderOcc | | 33.0 | 36.7 | -- | 20.3 | 32.7 | 49.9 |
| LetOccFlow | | 36.4 | 40.5 | -- | 25.5 | 39.7 | 56.3 |
| OccNet | | 35.7 | 39.7 | -- | 29.3 | 39.7 | 50.0 |
| BEVDetOcc-SF | | 33.0 | 36.7 | 1.420 | 31.6 | 37.3 | 41.1 |
| FB-Occ | | 39.2 | 39.0 | 0.591 | 32.7 | 39.9 | 44.4 |
| F-Occ | | 41.0 | 39.9 | 0.491 | 33.9 | 40.7 | 45.2 |
| CascadeFlow | | 40.9 | 39.6 | 0.470 | 33.5 | 40.3 | 45.0 |
| ALOcc-Flow-3D | | 43.0 | 41.9 | 0.556 | 35.6 | 42.8 | 47.4 |
| DrivePI (Ours) | ✓ | 49.3 | 49.3 | 0.509 | 45.0 | 50.0 | 52.9 |
| Method | VLM-based | RayIoU | RayIoU (1m) | RayIoU (2m) | RayIoU (4m) |
|---|---|---|---|---|---|
| RenderOcc | | 19.5 | 13.4 | 19.6 | 25.5 |
| SimpleOcc | | 22.5 | 17.0 | 22.7 | 27.9 |
| BEVFormer | | 32.4 | 26.1 | 32.9 | 38.0 |
| BEVDet-Occ | | 32.6 | 26.6 | 33.1 | 38.2 |
| FB-Occ | | 33.5 | 26.7 | 34.1 | 39.7 |
| SparseOcc | | 36.1 | 30.2 | 36.8 | 41.2 |
| OPUS | | 41.2 | 34.7 | 42.1 | 46.7 |
| DrivePI (Ours)* | ✓ | 46.0 | 42.2 | 46.7 | 49.2 |
*DrivePI trained exclusively on the 3D occupancy task of Occ3D-nuScenes.
| Method | VLM-based | Ego Status | L2 (m) 1s | L2 (m) 2s | L2 (m) 3s | L2 (m) avg. | Col. (%) 1s | Col. (%) 2s | Col. (%) 3s | Col. (%) avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| ST-P3 | | | 1.33 | 2.11 | 2.90 | 2.11 | 0.23 | 0.62 | 1.27 | 0.71 |
| FF | | | 0.55 | 1.20 | 2.54 | 1.43 | 0.06 | 0.17 | 1.07 | 0.43 |
| EO | | | 0.67 | 1.36 | 2.78 | 1.60 | 0.04 | 0.09 | 0.88 | 0.33 |
| UniAD | | | 0.48 | 0.96 | 1.65 | 1.03 | 0.05 | 0.17 | 0.71 | 0.31 |
| VAD | | | 0.41 | 0.70 | 1.05 | 0.72 | 0.07 | 0.17 | 0.41 | 0.22 |
| VAD | | ✓ | 0.17 | 0.34 | 0.60 | 0.37 | 0.07 | 0.10 | 0.24 | 0.14 |
| OmniDrive | ✓ | ✓ | 0.14 | 0.29 | 0.55 | 0.33 | 0.00 | 0.13 | 0.78 | 0.30 |
| ORION | ✓ | ✓ | 0.17 | 0.31 | 0.55 | 0.34 | 0.05 | 0.25 | 0.80 | 0.37 |
| OpenDriveVLA-7B | ✓ | ✓ | 0.20 | 0.58 | 1.21 | 0.66 | 0.00 | 0.22 | 0.55 | 0.25 |
| DrivePI (Ours) | ✓ | | 0.24 | 0.46 | 0.78 | 0.49 | 0.38 | 0.27 | 0.48 | 0.38 |
| DrivePI (Ours) | ✓ | ✓ | 0.19 | 0.36 | 0.64 | 0.40 | 0.00 | 0.05 | 0.28 | 0.11 |
| Method | Exist | Count | Object | Status | Comparison | Accuracy |
|---|---|---|---|---|---|---|
| LLaMA-AdapV2 | 19.3 | 2.7 | 7.6 | 10.8 | 1.6 | 9.6 |
| LLaVA1.5 | 45.8 | 7.7 | 7.8 | 9.0 | 52.1 | 26.2 |
| LiDAR-LLM | 74.5 | 15.0 | 37.8 | 45.9 | 57.8 | 48.6 |
| BEVDet+BUTD | 83.7 | 20.9 | 48.8 | 52.0 | 67.7 | 57.0 |
| OpenDriveVLA-0.5B | 83.9 | 22.0 | 50.2 | 57.0 | 68.4 | 58.4 |
| OpenDriveVLA-3B | 84.0 | 22.3 | 50.3 | 56.9 | 68.5 | 58.5 |
| OpenDriveVLA-7B | 84.2 | 22.7 | 49.6 | 54.5 | 68.8 | 58.2 |
| DrivePI (Ours) | 85.3 | 22.4 | 57.5 | 59.1 | 68.3 | 60.7 |
| # | Text Head | Vision Head | 3D Occ. RayIoU | Occ. Flow mAVE | Planning L2 (m) | Planning Col. (%) | QA Acc. |
|---|---|---|---|---|---|---|---|
| I | ✓ | -- | -- | -- | -- | -- | 61.2 |
| II | -- | ✓ | 47.5 | 0.69 | 1.02 | 0.39 | -- |
| III | ✓ | ✓ | 49.3 | 0.51 | 0.49 | 0.38 | 60.7 |
```bash
# Create conda environment
conda create -n drivepi python==3.10.18

# Install requirements
pip install -r requirements.txt

# Install EMOVA
# Reference: https://github.com/emova-ollm/EMOVA
pip install -e .

pip install flash-attn --no-build-isolation
```
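A quick way to confirm the environment is usable (this check is ours, not part of the repo):

```python
# Sanity-check the installed dependencies (assumes a CUDA-capable machine).
import torch
import flash_attn

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"flash-attn {flash_attn.__version__}")
```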
- BEV features generation:
  - Use UniLION to generate and save the BEV features.
  - Save path: /path/DrivePI_Data/unilion_bev_feats_train/
  - Name the features using token names (see the loading sketch after this list).
- QA datasets:
  - Save at: /path/DrivePI_Data/drivepi_captions/
  - Save at:
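Assuming the features are dumped as one file per nuScenes sample token (the exact file format and extension depend on your UniLION export; the `.pt` naming below is an assumption), loading them back might look like the following sketch:

```python
import os
import torch  # assumption: features are stored as PyTorch tensors; adapt if UniLION saves .npy instead

BEV_FEAT_DIR = "/path/DrivePI_Data/unilion_bev_feats_train"

def load_bev_feature(sample_token: str):
    """Load the pre-extracted UniLION BEV feature for one sample.

    The `<token>.pt` naming follows the "name the features using token names"
    convention above; adjust the extension to match your own dump.
    """
    path = os.path.join(BEV_FEAT_DIR, f"{sample_token}.pt")
    return torch.load(path, map_location="cpu")

# Example (hypothetical token):
# feat = load_bev_feature("fd8420396768425eabec9bdddf7e64b6")
# print(feat.shape)
```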
```bash
bash run_train_occ_action_llm_8_gpus_final.sh
```

```bash
# For occupancy and action testing
bash run_test_occ_llm_occ_action.sh

# For text understanding testing
bash run_test_occ_llm__text.sh
```

```bibtex
@article{liu2025drivepi,
  title={DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning},
  author={Liu, Zhe and Huang, Runhui and Yang, Rui and Yan, Siming and Wang, Zining and Hou, Lu and Lin, Di and Bai, Xiang and Zhao, Hengshuang},
  journal={CVPR},
  year={2026}
}
```

We thank these great works and open-source repositories: UniLION, MMDetection3D, InternVL3, LLaVA, and EMOVA.


