DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
Zhe Liu1,
Runhui Huang1,
Rui Yang1,
Siming Yan2,
Zining Wang2,
Lu Hou2,
Di Lin3,
Xiang Bai4,
Hengshuang Zhao1,β
1 The University of Hong Kong,
2 Yinwang Intelligent Technology Co. Ltd.,
3 Tianjin University,
4 Huazhong University of Science and Technology
β Corresponding author.
- Unified Spatial-aware 4D MLLM Framework. DrivePI is the first unified framework that seamlessly integrates coarse-grained linguistic spatial understanding with fine-grained 3D perception capabilities, bridging the gap between vision-action (VA) and vision-language-action (VLA) paradigms in autonomous driving.
- Multi-modal Sensing. DrivePI incorporates LiDAR as a complementary sensing modality alongside camera imagery, providing high-precision 3D geometric information that better elicits the spatial understanding capabilities of MLLMs.
- Fine-grained 3D Perception and Prediction. DrivePI enables accurate 3D perception (e.g., 3D occupancy) and prediction (e.g., occupancy flow), which enhances the interpretability and safety assurance of autonomous driving systems (the data layout of these targets is sketched after this list).
- Strong Performance. Despite using only a compact 0.5B-parameter MLLM backbone (Qwen2.5), DrivePI outperforms existing VA models on 3D occupancy and occupancy flow while maintaining interactive capabilities comparable to existing VLA frameworks.
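For readers less familiar with the perception and prediction targets above, the sketch below shows the kind of data layout they typically use: a per-voxel semantic occupancy grid plus a per-voxel flow field. The grid size, voxel resolution, class ids, and the helper function are illustrative assumptions following common Occ3D/OpenOcc conventions, not the DrivePI implementation.

```python
import numpy as np

# Illustrative shapes only (not from the DrivePI codebase); they follow the common
# Occ3D/OpenOcc convention of a 200 x 200 x 16 voxel grid at 0.4 m resolution around the ego vehicle.
X, Y, Z = 200, 200, 16
NUM_CLASSES = 17           # semantic classes; an extra "free" label marks empty voxels
FREE_LABEL = NUM_CLASSES

# 3D semantic occupancy: one class index per voxel.
occupancy = np.full((X, Y, Z), FREE_LABEL, dtype=np.uint8)

# Occupancy flow: a (vx, vy) velocity in m/s for every voxel (vertical motion is
# usually negligible in driving scenes, so many benchmarks store only 2 channels).
flow = np.zeros((X, Y, Z, 2), dtype=np.float32)

def voxel_index(x_m, y_m, z_m, voxel_size=0.4,
                x_range=(-40.0, 40.0), y_range=(-40.0, 40.0), z_range=(-1.0, 5.4)):
    """Map a metric ego-frame point to a voxel index (illustrative helper)."""
    i = int((x_m - x_range[0]) / voxel_size)
    j = int((y_m - y_range[0]) / voxel_size)
    k = int((z_m - z_range[0]) / voxel_size)
    return i, j, k

# Example: mark a voxel 10 m ahead of the ego vehicle as a moving vehicle at 5 m/s.
i, j, k = voxel_index(10.0, 0.0, 0.5)
occupancy[i, j, k] = 4     # hypothetical class id for "car"
flow[i, j, k] = (5.0, 0.0)
```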
- 2026.03.21: The training and evaluation code for DrivePI has been released! Data preparation guidelines, trained models, and all associated benchmarks will be available within a week!
- 2026.02.21: DrivePI and GenieDrive have been accepted by CVPR 2026!
- 2025.12.15: DrivePI paper released.
- 2025.12.15: GenieDrive (Physics-Aware Driving World Model) paper released.
- 2025.11.04: Our previous work UniLION has been released. Check out the codebase for a unified autonomous driving model with Linear Group RNNs.
- 2024.09.26: Our work LION has been accepted by NeurIPS 2024. Visit the codebase for Linear Group RNN for 3D Object Detection.
- Release the paper.
- Release the code of DrivePI.
- Release checkpoints of DrivePI.
- Release the dataset.
- Support the Waymo E2E dataset.
- Vision-Action (VA) models take visual information (LiDAR point clouds, images) as inputs and output action signals through a modular framework. While these methods achieve promising results through accurate spatial perception, they are limited in language-based scene interaction.
- Vision-Language-Action (VLA) approaches leverage the reasoning capabilities of multimodal large language models (MLLMs). These methods achieve superior interaction capabilities but often struggle due to the absence of fine-grained intermediate 3D perception and prediction.
DrivePI bridges this gap by combining the strengths of both approaches, serving as a unified Vision-Language-Action framework that is also compatible with vision-action models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture.
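As a rough illustration of what "in parallel through end-to-end optimization" can look like, the sketch below wires fused LiDAR, image, and text tokens through one shared backbone and decodes text, occupancy, flow, and planning from separate heads. All module names, dimensions, and the query layout are hypothetical placeholders, not the actual DrivePI architecture.

```python
import torch
import torch.nn as nn

class UnifiedDrivingSketch(nn.Module):
    """Illustrative sketch only (not the DrivePI code): a shared backbone with
    parallel text / occupancy / flow / planning heads over fused multi-modal tokens."""

    def __init__(self, d_model=896, vocab_size=32000, num_occ_classes=17, plan_horizon=6):
        super().__init__()
        # Stand-in for the MLLM backbone (e.g. a Qwen2.5-0.5B-sized decoder).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = nn.Linear(d_model, vocab_size)        # language answers
        self.occ_head = nn.Linear(d_model, num_occ_classes)    # per-voxel-query semantics
        self.flow_head = nn.Linear(d_model, 2)                 # per-voxel-query (vx, vy)
        self.plan_head = nn.Linear(d_model, plan_horizon * 2)  # future (x, y) waypoints
        self.plan_horizon = plan_horizon

    def forward(self, lidar_tokens, image_tokens, text_tokens, voxel_queries, plan_query):
        # One fused sequence: sensor tokens + language tokens + task queries.
        tokens = torch.cat([lidar_tokens, image_tokens, text_tokens, voxel_queries, plan_query], dim=1)
        hidden = self.backbone(tokens)

        n_sensor = lidar_tokens.size(1) + image_tokens.size(1)
        n_text, n_vox = text_tokens.size(1), voxel_queries.size(1)
        text_h = hidden[:, n_sensor:n_sensor + n_text]
        vox_h = hidden[:, n_sensor + n_text:n_sensor + n_text + n_vox]
        plan_h = hidden[:, -1]                                  # the single planning query

        return {
            "text_logits": self.text_head(text_h),
            "occupancy_logits": self.occ_head(vox_h),
            "flow": self.flow_head(vox_h),
            "trajectory": self.plan_head(plan_h).view(-1, self.plan_horizon, 2),
        }
```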
Our multi-stage data pipeline consists of:
- Caption Annotation: We use InternVL3-78B to generate captions of front and back views separately, then merge and polish them to create comprehensive scene descriptions.
- 4D Spatial Understanding Annotation: We leverage ground-truth occupancy and flow data to generate diverse text-occupancy and text-flow QA pairs through multi-turn conversations, enabling fine-grained 3D understanding.
- Planning Reasoning Annotation: We create planning QA pairs based on future trajectory annotations to enhance planning interpretability, enabling the MLLM to predict future actions of the ego-vehicle.
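As a concrete (and heavily simplified) illustration of the planning-reasoning annotation step above, the sketch below turns a future ego-trajectory annotation into one QA pair. The thresholds, question wording, coordinate convention, and field names are assumptions for illustration only, not the released annotation pipeline.

```python
# Illustrative sketch (not the released annotation pipeline): turning a future
# ego trajectory annotation into a planning QA pair.
def trajectory_to_qa(future_xy, dt=0.5):
    """future_xy: list of (x, y) ego-frame waypoints, one every `dt` seconds."""
    dx = future_xy[-1][0] - future_xy[0][0]
    dy = future_xy[-1][1] - future_xy[0][1]

    # Very coarse maneuver label from lateral displacement (assumption: x forward, y left).
    if dy > 2.0:
        maneuver = "turn left"
    elif dy < -2.0:
        maneuver = "turn right"
    else:
        maneuver = "go straight"
    speed = (dx ** 2 + dy ** 2) ** 0.5 / (dt * (len(future_xy) - 1))

    question = "What should the ego vehicle do over the next 3 seconds, and why?"
    answer = (f"The ego vehicle should {maneuver}, travelling roughly {speed:.1f} m/s, "
              f"following the waypoints {[(round(x, 1), round(y, 1)) for x, y in future_xy]}.")
    return {"question": question, "answer": answer}

# Example: a gentle left turn annotated at 2 Hz over 3 seconds.
qa = trajectory_to_qa([(0.0, 0.0), (3.0, 0.3), (6.0, 1.0), (9.0, 2.2),
                       (11.5, 3.8), (13.5, 5.6), (15.0, 7.5)])
print(qa["answer"])
```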
Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models:
- Compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes.
- Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes.
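For reference, the relative improvements quoted above follow directly from the per-method averages reported in the tables below; a short check (ours, not from the repo):

```python
# Relative reductions quoted above, computed from the table averages below.
orion_col, drivepi_col = 0.37, 0.11   # avg. collision rate (%), ORION vs. DrivePI (w/ ego status)
vad_l2, drivepi_l2 = 0.72, 0.49       # avg. L2 error (m), VAD vs. DrivePI (w/o ego status)

print(f"Collision-rate reduction vs. ORION: {(orion_col - drivepi_col) / orion_col:.0%}")  # -> 70%
print(f"L2 reduction vs. VAD: {(vad_l2 - drivepi_l2) / vad_l2:.0%}")                       # -> 32%
```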
| Method | VLM-based | OccScore | RayIoU (3D Occ.) | mAVE (Occ. Flow) | RayIoU (1m) | RayIoU (2m) | RayIoU (4m) |
|---|---|---|---|---|---|---|---|
| OccNeRF | | 28.5 | 31.7 | -- | 16.6 | 29.3 | 49.2 |
| RenderOcc | | 33.0 | 36.7 | -- | 20.3 | 32.7 | 49.9 |
| LetOccFlow | | 36.4 | 40.5 | -- | 25.5 | 39.7 | 56.3 |
| OccNet | | 35.7 | 39.7 | -- | 29.3 | 39.7 | 50.0 |
| BEVDetOcc-SF | | 33.0 | 36.7 | 1.420 | 31.6 | 37.3 | 41.1 |
| FB-Occ | | 39.2 | 39.0 | 0.591 | 32.7 | 39.9 | 44.4 |
| F-Occ | | 41.0 | 39.9 | 0.491 | 33.9 | 40.7 | 45.2 |
| CascadeFlow | | 40.9 | 39.6 | 0.470 | 33.5 | 40.3 | 45.0 |
| ALOcc-Flow-3D | | 43.0 | 41.9 | 0.556 | 35.6 | 42.8 | 47.4 |
| DrivePI (Ours) | ✓ | 49.3 | 49.3 | 0.509 | 45.0 | 50.0 | 52.9 |
| Method | VLM-based | RayIoU | RayIoU (1m) | RayIoU (2m) | RayIoU (4m) |
|---|---|---|---|---|---|
| RenderOcc | | 19.5 | 13.4 | 19.6 | 25.5 |
| SimpleOcc | | 22.5 | 17.0 | 22.7 | 27.9 |
| BEVFormer | | 32.4 | 26.1 | 32.9 | 38.0 |
| BEVDet-Occ | | 32.6 | 26.6 | 33.1 | 38.2 |
| FB-Occ | | 33.5 | 26.7 | 34.1 | 39.7 |
| SparseOcc | | 36.1 | 30.2 | 36.8 | 41.2 |
| OPUS | | 41.2 | 34.7 | 42.1 | 46.7 |
| DrivePI (Ours)* | ✓ | 46.0 | 42.2 | 46.7 | 49.2 |
*DrivePI trained exclusively on the 3D occupancy task of Occ3D-nuScenes.
| Method | VLM-based | Ego Status | L2 (m) 1s | L2 (m) 2s | L2 (m) 3s | L2 (m) avg. | Col. (%) 1s | Col. (%) 2s | Col. (%) 3s | Col. (%) avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| ST-P3 | | | 1.33 | 2.11 | 2.90 | 2.11 | 0.23 | 0.62 | 1.27 | 0.71 |
| FF | | | 0.55 | 1.20 | 2.54 | 1.43 | 0.06 | 0.17 | 1.07 | 0.43 |
| EO | | | 0.67 | 1.36 | 2.78 | 1.60 | 0.04 | 0.09 | 0.88 | 0.33 |
| UniAD | | | 0.48 | 0.96 | 1.65 | 1.03 | 0.05 | 0.17 | 0.71 | 0.31 |
| VAD | | | 0.41 | 0.70 | 1.05 | 0.72 | 0.07 | 0.17 | 0.41 | 0.22 |
| VAD | | ✓ | 0.17 | 0.34 | 0.60 | 0.37 | 0.07 | 0.10 | 0.24 | 0.14 |
| OmniDrive | ✓ | ✓ | 0.14 | 0.29 | 0.55 | 0.33 | 0.00 | 0.13 | 0.78 | 0.30 |
| ORION | ✓ | ✓ | 0.17 | 0.31 | 0.55 | 0.34 | 0.05 | 0.25 | 0.80 | 0.37 |
| OpenDriveVLA-7B | ✓ | ✓ | 0.20 | 0.58 | 1.21 | 0.66 | 0.00 | 0.22 | 0.55 | 0.25 |
| DrivePI (Ours) | ✓ | | 0.24 | 0.46 | 0.78 | 0.49 | 0.38 | 0.27 | 0.48 | 0.38 |
| DrivePI (Ours) | ✓ | ✓ | 0.19 | 0.36 | 0.64 | 0.40 | 0.00 | 0.05 | 0.28 | 0.11 |
| Method | Exist | Count | Object | Status | Comparison | Accuracy |
|---|---|---|---|---|---|---|
| LLaMA-AdapV2 | 19.3 | 2.7 | 7.6 | 10.8 | 1.6 | 9.6 |
| LLaVA1.5 | 45.8 | 7.7 | 7.8 | 9.0 | 52.1 | 26.2 |
| LiDAR-LLM | 74.5 | 15.0 | 37.8 | 45.9 | 57.8 | 48.6 |
| BEVDet+BUTD | 83.7 | 20.9 | 48.8 | 52.0 | 67.7 | 57.0 |
| OpenDriveVLA-0.5B | 83.9 | 22.0 | 50.2 | 57.0 | 68.4 | 58.4 |
| OpenDriveVLA-3B | 84.0 | 22.3 | 50.3 | 56.9 | 68.5 | 58.5 |
| OpenDriveVLA-7B | 84.2 | 22.7 | 49.6 | 54.5 | 68.8 | 58.2 |
| DrivePI (Ours) | 85.3 | 22.4 | 57.5 | 59.1 | 68.3 | 60.7 |
| # | Text Head | Vision Head | 3D Occ. RayIoU | Occ. Flow mAVE | Planning L2 (m) | Planning Col. (%) | QA Acc. |
|---|---|---|---|---|---|---|---|
| I | ✓ | -- | -- | -- | -- | -- | 61.2 |
| II | -- | ✓ | 47.5 | 0.69 | 1.02 | 0.39 | -- |
| III | ✓ | ✓ | 49.3 | 0.51 | 0.49 | 0.38 | 60.7 |
```bash
# Create conda environment
conda create -n drivepi python==3.10.18

# Install requirements
pip install -r requirements.txt

# Install EMOVA
# Reference: https://github.com/emova-ollm/EMOVA
pip install -e .

pip install flash-attn --no-build-isolation
```
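A quick way to confirm the environment is usable (this check is ours, not part of the repo):

```python
# Sanity-check the installed dependencies (assumes a CUDA-capable machine).
import torch
import flash_attn

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"flash-attn {flash_attn.__version__}")
```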
- BEV features generation:
  - Use UniLION to generate and save the BEV features.
  - Save path: /path/DrivePI_Data/unilion_bev_feats_train/
  - Name the features using token names (see the loading sketch after this list).
- QA datasets:
  - Save at: /path/DrivePI_Data/drivepi_captions/
  - Save at:
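Assuming the features are dumped as one file per nuScenes sample token (the exact file format and extension depend on your UniLION export; the `.pt` naming below is an assumption), loading them back might look like the following sketch:

```python
import os
import torch  # assumption: features are stored as PyTorch tensors; adapt if UniLION saves .npy instead

BEV_FEAT_DIR = "/path/DrivePI_Data/unilion_bev_feats_train"

def load_bev_feature(sample_token: str):
    """Load the pre-extracted UniLION BEV feature for one sample.

    The `<token>.pt` naming follows the "name the features using token names"
    convention above; adjust the extension to match your own dump.
    """
    path = os.path.join(BEV_FEAT_DIR, f"{sample_token}.pt")
    return torch.load(path, map_location="cpu")

# Example (hypothetical token):
# feat = load_bev_feature("fd8420396768425eabec9bdddf7e64b6")
# print(feat.shape)
```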
```bash
bash run_train_occ_action_llm_8_gpus_final.sh
```

```bash
# For occupancy and action testing
bash run_test_occ_llm_occ_action.sh

# For text understanding testing
bash run_test_occ_llm__text.sh
```

```bibtex
@article{liu2025drivepi,
  title={DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning},
  author={Liu, Zhe and Huang, Runhui and Yang, Rui and Yan, Siming and Wang, Zining and Hou, Lu and Lin, Di and Bai, Xiang and Zhao, Hengshuang},
  journal={CVPR},
  year={2026}
}
```

We thank these great works and open-source repositories: UniLION, MMDetection3D, InternVL3, LLaVA, and EMOVA.


