MBZUAI & Alexandria University

Latent Action Pretraining Through World Modeling

Teaching robots from human videos through self-supervised world modeling — achieving superior performance with efficient, model-agnostic pretraining.

1 MBZUAI, Abu Dhabi, UAE   |   2 Alexandria University, Egypt

Vision-Language-Action (VLA) models enable robots to follow language instructions but often require large labeled datasets and heavy models, limiting real-world use. Recent methods leverage latent action representations for pretraining on unlabeled videos, yet remain resource-intensive. We propose LAWM, a model-agnostic framework that learns latent actions through world modeling from unlabeled robot or human videos. LAWM transfers effectively across tasks and embodiments, outperforming ground-truth and prior pretraining methods on LIBERO and real-world experiments, while being more efficient and practical.

Overview Video

See LAWM in action across different manipulation tasks

Key Contributions

What makes LAWM unique and effective

Model-Agnostic Framework

LAWM can be applied to various imitation learning architectures, making it flexible and widely applicable across different robot learning systems.

Unlabeled Video Pretraining

Learn from unlabeled robot or human videos without expensive teleoperation data collection, enabling scalable learning from diverse sources.

Cross-Embodiment Transfer

Effective transfer across different tasks, environments, and robot embodiments, demonstrating strong generalization capabilities.

Efficient & Practical

Smaller model size and better performance compared to large-scale VLA models, making it practical for real-world deployment.

Superior to Supervised Pretraining

Self-supervised world modeling outperforms standard supervised pretraining on ground-truth robot actions, learning more generalizable dynamics.

Action Chunking Support

Predicts chunks of future actions instead of single steps, reducing error accumulation and improving efficiency in long-horizon tasks (a minimal sketch follows below).
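
To make the chunking idea concrete, here is a minimal PyTorch-style sketch of a policy head that emits a fixed-size chunk of future actions in one forward pass. The chunk size, action dimension, layer choices, and the `ChunkedPolicy` name are illustrative assumptions, not the exact LAWM architecture.

```python
# Illustrative sketch of action chunking: the policy predicts the next
# CHUNK actions at once instead of a single step. All sizes are assumptions.
import torch
import torch.nn as nn

CHUNK, ACT_DIM = 8, 7  # e.g. 8 future steps of a 7-DoF action (assumed)

class ChunkedPolicy(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, CHUNK * ACT_DIM)

    def forward(self, obs):
        # One forward pass yields a whole chunk of future actions.
        return self.head(self.encoder(obs)).view(-1, CHUNK, ACT_DIM)

policy = ChunkedPolicy()
obs = torch.rand(1, 3, 64, 64)            # hypothetical camera observation
with torch.no_grad():
    chunk = policy(obs)[0]                # shape: (CHUNK, ACT_DIM)
for action in chunk:                      # execute the chunk before re-querying
    pass                                  # e.g. robot.step(action) on hardware
```

Executing a whole chunk before querying the policy again is what reduces per-step error accumulation and inference cost on long-horizon tasks.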

Method Overview

Two-stage framework for learning robust action representations

LAWM Framework
Stage 1: Self-Supervised Pretraining

Learn latent action representations from unlabeled videos through world modeling. The model predicts visual changes between frames without requiring ground-truth action labels, enabling large-scale pretraining on diverse video datasets.
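
Below is a minimal PyTorch-style sketch of what this pretraining stage can look like: an encoder infers a latent action from a pair of consecutive frames, and a world model must reproduce the next frame from the current frame plus that latent action, so the only supervision is the video itself. The module names, architectures, dataloader, and pixel-reconstruction loss are our own illustrative assumptions, not the exact LAWM implementation.

```python
# Sketch of stage-1 pretraining on unlabeled video (illustrative assumptions):
# infer a latent action between two frames, then ask a world model to
# predict the next frame from it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionEncoder(nn.Module):
    """Infers a latent 'action' from a pair of consecutive frames."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

class WorldModel(nn.Module):
    """Predicts the next frame from the current frame and a latent action."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU())
        self.film = nn.Linear(latent_dim, 64)   # condition on the latent action
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, frame_t, z):
        h = self.encode(frame_t) + self.film(z)[:, :, None, None]
        return self.decode(h)

encoder, world_model = LatentActionEncoder(), WorldModel()
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(world_model.parameters()), lr=1e-4)

# Hypothetical stand-in for a loader of consecutive frame pairs from unlabeled video.
video_pair_loader = [(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64))
                     for _ in range(4)]

for frame_t, frame_t1 in video_pair_loader:
    z = encoder(frame_t, frame_t1)           # latent action, no labels needed
    pred_t1 = world_model(frame_t, z)        # imagined next frame
    loss = F.mse_loss(pred_t1, frame_t1)     # world-modeling objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```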

Stage 2: Task-Specific Finetuning

Finetune the pretrained model on target robotic manipulation tasks with minimal labeled data, leveraging learned action priors. The world model is discarded, and only the imitation learning model is adapted to downstream tasks.
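
A minimal sketch of this second stage, assuming a shared visual backbone carried over from pretraining and a small behavior-cloning head that maps features to real robot actions; the module names, checkpoint path, and 7-dimensional action space are hypothetical placeholders rather than the authors' exact setup.

```python
# Sketch of stage-2 finetuning (illustrative assumptions throughout):
# the world model is dropped, the pretrained backbone is kept, and a small
# behavior-cloning head is trained on a few labeled demonstrations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyBackbone(nn.Module):
    """Visual backbone assumed to be shared between pretraining and finetuning."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, obs):
        return self.net(obs)

backbone = PolicyBackbone()
# backbone.load_state_dict(torch.load("stage1_backbone.pt"))  # hypothetical stage-1 checkpoint

action_head = nn.Linear(128, 7)   # e.g. 6-DoF end-effector delta + gripper (assumed)
optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(action_head.parameters()), lr=1e-4)

# Hypothetical stand-in for a small set of labeled (image, action) demonstrations.
demo_loader = [(torch.rand(8, 3, 64, 64), torch.rand(8, 7)) for _ in range(4)]

for obs, action in demo_loader:
    pred = action_head(backbone(obs))
    loss = F.mse_loss(pred, action)   # behavior-cloning loss on real robot actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```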

Experimental Results

State-of-the-art performance on LIBERO benchmark and real-world tasks

BAKU Results
Success rates (%) on the LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long suites.

Model                             | Pretraining Method | Spatial | Object | Goal  | Long  | Average
Octo-base                         | Robot Actions      | 78.90   | 85.70  | 84.60 | 51.10 | 75.10
OpenVLA                           | Robot Actions      | 84.70   | 88.40  | 79.20 | 53.70 | 76.50
π₀                                | Robot Actions      | 90.00   | 86.00  | 95.00 | 73.00 | 86.00
π₀.₅                              | Robot Actions      | 98.80   | 98.20  | 98.00 | 92.40 | 96.85
π₀ (Paligemma-3B)                 | VLM Checkpoint     | 87.00   | 63.00  | 89.00 | 48.00 | 71.80
SmolVLA (SmolVLM-2.25B)           | VLM Checkpoint     | 93.00   | 94.00  | 91.00 | 77.00 | 88.75
villa-X w/o latent                | Videos             | 86.00   | 86.50  | 85.00 | 70.00 | 81.90
villa-X                           | Videos             | 97.50   | 97.00  | 91.50 | 74.50 | 90.10
BAKU w/o pretraining              | None               | 94.00   | 100.00 | 96.00 | 92.00 | 95.50
BAKU w/ latent pretraining (Ours) | Videos             | 99.00   | 100.00 | 96.00 | 94.00 | 97.25
Real-world Results

Citation

If you find our work useful, please consider citing

arXiv: 2509.18428 [cs.RO]

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

https://doi.org/10.48550/arXiv.2509.18428

@article{lawm2025,
  title   = {Latent Action Pretraining Through World Modeling},
  author  = {Bahey Tharwat and Yara Nasser and Ali Abouzied and Ian Reid},
  journal = {arXiv preprint arXiv:2509.18428},
  year    = {2025}
}