Bridging Scene Generation and Planning:
Driving with World Model via Unifying Vision and Motion Representation
Xingtai Gui1, Meijie Zhang2, Tianyi Yan1, Wencheng Han1, Jiahao Gong2, Feiyang Tan2, Cheng-zhong Xu1, Jianbing Shen1
1SKL-IOTSC, CIS, University of Macau, 2Afari Intelligent Drive
[2026.3.17] Release the arXiv paper
[2026.3.15] Release the WorldDrive evaluation and visualization scripts
[2026.3.14] Release the WorldDrive Project!
- News
- Table of Contents
- Abstract
- Overview
- Getting Started
- Checkpoint
- Quick Evaluation
- Visualize WorldDrive
- Quick Training
- Contact
- Acknowledgement
- Citation
End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input, and constructing an effective scene representation is a critical challenge. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a gap between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning by unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We then transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between the motion representation, the visual representation, and ego status generates high-quality, multi-modal trajectories. Furthermore, to exploit the world model's foresight, we propose a Future-aware Rewarder, which distills future latent representations from the frozen world model to evaluate and select optimal trajectories in real time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves state-of-the-art planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.
We provide detailed guides to help you quickly set up and evaluate WorldDrive:
- Getting started from NAVSIM environment preparation
- Preparation of WorldDrive environment
- WorldDrive Training and Evaluation
```shell
# worlddrive_stage1_train.ckpt     planner checkpoint
# worlddrive_stage2_train.ckpt     planner with future-aware rewarder checkpoint
# worldtraj_stage1_1024_tadwm.pkl  TA-DWM pretrain checkpoint
```

Download the pretrained 3D Causal VAE from the official CogVideoX-2B Hugging Face repository:

👉 CogVideoX-2B VAE
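Before launching evaluation, it can help to sanity-check that the downloaded checkpoints are where your scripts expect them. A minimal sketch; the `ckpts/` directory name is an assumption, not prescribed by the repo, so adjust the paths to your local layout:

```shell
# List the expected checkpoint files and report which are present.
# The ckpts/ prefix is illustrative only -- edit to match your setup.
for f in ckpts/worlddrive_stage1_train.ckpt \
         ckpts/worlddrive_stage2_train.ckpt \
         ckpts/worldtraj_stage1_1024_tadwm.pkl; do
  if [ -f "$f" ]; then
    echo "found:   $f"
  else
    echo "missing: $f"
  fi
done
```

Any `missing:` line means the corresponding download step above still needs to be done.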
```shell
sh scripts/cache/run_caching_trajworld_eval.sh  # navtest for eval

# download worlddrive_stage1_train.ckpt
sh scripts/evaluation/run_worlddrive_planner_pdm_score_evaluation_stage1.sh

# download worlddrive_stage2_train.ckpt
sh scripts/evaluation/run_worlddrive_planner_pdm_score_evaluation_stage2.sh
```

Generate planning results and the corresponding future scenes:
```shell
sh scripts/visualization/worlddrive_visual.sh
```

Download the anchors and the corresponding formatted PDMS:

👉 Anchors
```shell
sh scripts/cache/run_caching_trajworld.sh  # navtrain
```

Download the corresponding TA-DWM checkpoint trained on NAVSIM (worldtraj_stage1_1024_tadwm), or use a checkpoint from your own TA-DWM training run.

👉 TA-DWM Model
```shell
sh scripts/training/run_worlddrive_planner.sh
```

If you have any questions, please contact Xingtai via email ([email protected]).
We thank the research community for their valuable support. WorldDrive is built upon the following outstanding open-source projects:
diffusers
WoTE (End-to-End Driving with Online Trajectory Evaluation via BEV World Model, ICCV 2025)
Epona (Autoregressive Diffusion World Model for Autonomous Driving)
RecogDrive (A Reinforced Cognitive Framework for End-to-End Autonomous Driving)
If you find WorldDrive useful in your research or applications, please consider giving us a star 🌟.
