📄 [Arxiv] 🤗 [Model Weights]
Yingyan Li*, Shuyao Shang*, Weisong Liu*, Bing Zhan*, Haochen Wang*, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, Lu Hou, Lue Fan†, Zhaoxiang Zhang†
This paper presents DriveVLA-W0, a training paradigm that employs world modeling to predict future images. The prediction task provides dense, self-supervised signals that compel the model to learn the underlying dynamics of the driving environment, addressing the "supervision deficit" of VLA models and amplifying the data scaling law.
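For intuition, below is a minimal, illustrative sketch of the joint objective this paradigm implies: the usual sparse action-imitation loss combined with a dense world-modeling loss over future image tokens. All names, batch keys, and model outputs in this sketch are assumptions for illustration, not the official implementation.

```python
# Illustrative sketch only; model interfaces and batch keys are hypothetical.
import torch
import torch.nn.functional as F

def joint_loss(model, batch, world_weight: float = 1.0) -> torch.Tensor:
    """Combine sparse action supervision with dense future-image supervision."""
    out = model(batch["obs_tokens"])  # assumed to return both action and image logits

    # Sparse signal: imitation loss on the discretized action / trajectory tokens.
    action_loss = F.cross_entropy(out.action_logits, batch["action_labels"])

    # Dense signal: world modeling, i.e. predicting the VQ tokens of a future frame.
    world_loss = F.cross_entropy(
        out.future_image_logits.flatten(0, 1),   # (B*T, vocab)
        batch["future_image_tokens"].flatten(),  # (B*T,)
    )
    return action_loss + world_weight * world_loss
```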
Due to company policy, only the reviewed part of our codebase is available. Please contact us if you have any questions.
DriveVLA-W0/
├── assets/                          # Project assets (images, docs, etc.)
├── configs/                         # Model configuration files and normalization stats
│   ├── fast/                        # Fast action tokenizer configs
│   ├── normalizer_navsim_test/      # NAVSIM testset normalization config
│   ├── normalizer_navsim_trainval/  # NAVSIM train/val normalization config
│   └── normalizer_nuplan/           # NuPlan dataset normalization config
├── data/                            # Data pipelines and config
│   ├── navsim/                      # NAVSIM-related data
│   └── others/                      # Other datasets
├── inference/                       # Inference scripts
│   ├── navsim/                      # NAVSIM PDMS evaluation
│   ├── qwen/                        # Qwen model inference
│   └── vla/                         # Emu model inference
├── models/                          # Model definitions
│   ├── policy_head/                 # Policy head implementations
│   └── tokenizer/                   # Tokenizer implementations
├── scripts/                         # Training and deployment scripts
├── tools/                           # Utility scripts
│   ├── action_tokenizer/            # Action tokenizer tools
│   └── pickle_gen/                  # Data preprocessing & pickle generation
├── utils/                           # Utility code
│   └── datasets.py                  # Dataset definitions
└── requirements.txt                 # Python dependencies
- Download Pretrained Models
pip install huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
mkdir pretrained_models
bash scripts/misc/download_emu3_pretrain.sh
- Set Up Environment
conda create -n drivevla python=3.10
conda activate drivevla
pip install -r requirements.txt
- Download Model Weights
Download Emu3_Flow_Matching_Action_Expert_PDMS_87.2 and navsim_emu_vla_256_144_test_pre_1s.pkl from Hugging Face.
- Run Inference
# Run inference using pretrained model (update paths as needed)
bash inference/vla/infer_navsim_flow_matching_PDMS_87.2.sh
DriveVLA-W0 uses the NAVSIM (v1.1) dataset for training and evaluation. Steps required:
- Obtain NAVSIM Dataset
- Visit the official NAVSIM repo
- Download the train and test data splits
- The data includes sensor information, scenario metadata, and labels
- Data Preprocessing
# Generate NAVSIM pickle files
python tools/pickle_gen/pickle_generation_navsim_pre_1s.py
# Extract VQ indices
bash scripts/tokenizer/extract_vq_emu3_navsim.sh
- Data Format
- Preprocessed data is saved in data/navsim/processed_data/
- Contains scenario files, metadata, and extracted features (see the inspection sketch below)
- Training: ~100,000 driving frames
- Validation: ~10,000 frames
- Test: NAVSIM test set
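To sanity-check the preprocessed data, a minimal sketch like the one below can be used. The path combines the processed-data directory with the released test pickle, and the record layout printed here is an assumption to verify against your own output.

```python
# Quick look at a preprocessed pickle; the path and record layout are assumptions.
import pickle

pkl_path = "data/navsim/processed_data/navsim_emu_vla_256_144_test_pre_1s.pkl"
with open(pkl_path, "rb") as f:
    data = pickle.load(f)

print(type(data))
if hasattr(data, "__len__"):
    print("num records:", len(data))
# If records are dicts, list the keys of the first one to see what was stored.
if isinstance(data, (list, tuple)) and data and isinstance(data[0], dict):
    print("first record keys:", sorted(data[0].keys()))
```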
8x L20 GPUs (40GB memory each), ~16 hours
If your system does not already have CUDA 12.4+, please install it first:
# Download CUDA 12.8.1 (recommended version)
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
# Install CUDA toolkit
bash cuda_12.8.1_570.124.06_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-12.8
# Add to your ~/.bashrc or shell profile
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# Create Conda environment
conda create -n drivevla python=3.10
conda activate drivevla
# Install PyTorch (CUDA 12.4)
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
# Install core dependencies
pip install -r requirements.txt
pip install "transformers[torch]"
# Install training-related dependencies
pip install deepspeed # Distributed training
pip install scipy # Scientific computing
pip install tensorboard==2.14.0 # Visualization
pip install wandb # Experiment tracking
First, download the model checkpoints from Hugging Face.
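For example, the checkpoints can be fetched with huggingface_hub; the repository id below is a placeholder, so substitute the one linked under "Model Weights" above.

```python
# Placeholder repo id; replace it with the repository linked under "Model Weights".
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<DriveVLA-W0-checkpoints>",
    local_dir="pretrained_models/Emu3_Flow_Matching_Action_Expert_PDMS_87.2",
)
```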
Then, run the following testing script to produce the output actions (as JSON files):
bash inference/vla/infer_navsim_flow_matching_PDMS_87.2.sh
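If you want to sanity-check the generated JSON files before scoring, a small sketch is below. The output directory is an assumption; point the glob at the path configured in the inference script.

```python
# Inspect a few prediction JSONs; the glob pattern is an assumed output location.
import glob
import json

for path in sorted(glob.glob("outputs/navsim_infer/*.json"))[:3]:
    with open(path) as f:
        pred = json.load(f)
    summary = sorted(pred.keys()) if isinstance(pred, dict) else f"list[{len(pred)}]"
    print(path, "->", summary)
```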
Finally, run the script below to compute the PDMS metrics from the generated JSONs (with the conda environment activated and a valid navsim repository available):
bash inference/vla/eval_navsim_metric_from_json.sh
The project uses JSON-formatted configuration files located in configs/:
configs/
├── moe_fast_video.json            # MoE model fast inference config
├── moe_fast_video_pretrain.json   # MoE model pretraining config
├── normalizer_navsim_test/        # NAVSIM test set normalization parameters
├── normalizer_navsim_trainval/    # NAVSIM train+val normalization parameters
└── normalizer_nuplan/             # NuPlan normalization parameters
Normalization parameters are automatically computed from the training datasets:
- normalizer_navsim_trainval/: computed on the NAVSIM training set
- normalizer_navsim_test/: computed on the NAVSIM test set
- normalizer_nuplan/: computed on the NuPlan dataset
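As a rough sketch of how such statistics are typically applied (the file name and the "mean"/"std" keys below are assumptions, not the repository's actual schema), actions are normalized before training and denormalized at inference:

```python
# Hypothetical normalizer usage; file name and JSON keys are assumptions.
import json
import numpy as np

def load_stats(path="configs/normalizer_navsim_trainval/normalizer.json"):
    with open(path) as f:
        stats = json.load(f)
    return np.asarray(stats["mean"]), np.asarray(stats["std"])

def normalize(actions, mean, std):
    return (actions - mean) / std

def denormalize(actions, mean, std):
    return actions * std + mean
```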
Here is a comparison with state-of-the-art methods on the NAVSIM test set, as presented in the paper. Our model, DriveVLA-W0, establishes a new state-of-the-art.
| Method | Reference | Sensors | NC ↑ | DAC ↑ | TTC ↑ | Comf. ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 100.0 | 100.0 | 100.0 | 99.9 | 87.5 | 94.8 |
| BEV-based Methods | | | | | | | | |
| LAW | ICLR'25 | 1x Cam | 96.4 | 95.4 | 88.7 | 99.9 | 81.7 | 84.6 |
| Hydra-MDP | arXiv'24 | 3x Cam + L | 98.3 | 96.0 | 94.6 | 100.0 | 78.7 | 86.5 |
| DiffusionDrive | CVPR'25 | 3x Cam + L | 98.2 | 96.2 | 94.7 | 100.0 | 82.2 | 88.1 |
| WoTE | ICCV'25 | 3x Cam + L | 98.5 | 96.8 | 94.4 | 99.9 | 81.9 | 88.3 |
| VLA-based Methods | | | | | | | | |
| AutoVLA | NeurIPS'25 | 3x Cam | 98.4 | 95.6 | 98.0 | 99.9 | 81.9 | 89.1 |
| ReCogDrive | arXiv'25 | 3x Cam | 98.2 | 97.8 | 95.2 | 99.8 | 83.5 | 89.6 |
| DriveVLA-W0* | Ours | 1x Cam | 98.7 | 99.1 | 95.3 | 99.3 | 83.3 | 90.2 |
| AutoVLA† | NeurIPS'25 | 3x Cam | 99.1 | 97.1 | 97.1 | 100.0 | 87.6 | 92.1 |
| DriveVLA-W0† | Ours | 1x Cam | 99.3 | 97.4 | 97.0 | 99.9 | 88.3 | 93.0 |
If you find our work useful for your research, please consider giving this repository a star ⭐ and citing our paper:
@article{li2025drivevla,
title={DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving},
author={Li, Yingyan and Shang, Shuyao and Liu, Weisong and Zhan, Bing and Wang, Haochen and Wang, Yuqi and Chen, Yuntao and Wang, Xiaoman and An, Yasong and Tang, Chufeng and others},
journal={arXiv preprint arXiv:2510.12796},
year={2025}
}
We would like to acknowledge the following related works:
LAW (ICLR 2025): Using latent world models for self-supervised feature learning in end-to-end autonomous driving.
WoTE (ICCV 2025): Using BEV world models for online trajectory evaluation in end-to-end autonomous driving.
UniVLA: World modeling in the broader field of robotics.
