MBZUAI & Alexandria University

Latent Action Pretraining Through World Modeling

Teaching robots from human videos through self-supervised world modeling — achieving superior performance with efficient, model-agnostic pretraining.

1 MBZUAI, Abu Dhabi, UAE   |   2 Alexandria University, Egypt

Vision-Language-Action (VLA) models enable robots to follow language instructions but often require large labeled datasets and heavy models, limiting real-world use. Recent methods leverage latent action representations for pretraining on unlabeled videos, yet remain resource-intensive. We propose LAWM, a model-agnostic framework that learns latent actions through world modeling from unlabeled robot or human videos. LAWM transfers effectively across tasks and embodiments, outperforming ground-truth and prior pretraining methods on LIBERO and real-world experiments, while being more efficient and practical.

Overview Video

See LAWM in action across different manipulation tasks

Key Contributions

What makes LAWM unique and effective

Model-Agnostic Framework

LAWM can be applied to various imitation learning architectures, making it flexible and widely applicable across different robot learning systems.

Unlabeled Video Pretraining

Learn from unlabeled robot or human videos without expensive teleoperation data collection, enabling scalable learning from diverse sources.

Cross-Embodiment Transfer

Effective transfer across different tasks, environments, and robot embodiments, demonstrating strong generalization capabilities.

Efficient & Practical

Smaller model size and better performance compared to large-scale VLA models, making it practical for real-world deployment.

Superior to Supervised Pretraining

Self-supervised world modeling outperforms standard supervised pretraining on ground-truth robot actions, learning more generalizable dynamics.

Action Chunking Support

Predicts chunks of future actions instead of single steps, reducing error accumulation and improving efficiency in long-horizon tasks (a minimal sketch follows below).
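
To make the chunking idea concrete, here is a minimal PyTorch-style sketch of a policy head that emits a fixed-size chunk of future actions in one forward pass. The chunk size, action dimension, layer choices, and the `ChunkedPolicy` name are illustrative assumptions, not the exact LAWM architecture.

```python
# Illustrative sketch of action chunking: the policy predicts the next
# CHUNK actions at once instead of a single step. All sizes are assumptions.
import torch
import torch.nn as nn

CHUNK, ACT_DIM = 8, 7  # e.g. 8 future steps of a 7-DoF action (assumed)

class ChunkedPolicy(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(feat_dim, CHUNK * ACT_DIM)

    def forward(self, obs):
        # One forward pass yields a whole chunk of future actions.
        return self.head(self.encoder(obs)).view(-1, CHUNK, ACT_DIM)

policy = ChunkedPolicy()
obs = torch.rand(1, 3, 64, 64)            # hypothetical camera observation
with torch.no_grad():
    chunk = policy(obs)[0]                # shape: (CHUNK, ACT_DIM)
for action in chunk:                      # execute the chunk before re-querying
    pass                                  # e.g. robot.step(action) on hardware
```

Executing a whole chunk before querying the policy again is what reduces per-step error accumulation and inference cost on long-horizon tasks.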

Method Overview

Two-stage framework for learning robust action representations

LAWM Framework
Stage 1: Self-Supervised Pretraining

Learn latent action representations from unlabeled videos through world modeling. The model predicts visual changes between frames without requiring ground-truth action labels, enabling large-scale pretraining on diverse video datasets.
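
Below is a minimal PyTorch-style sketch of what this pretraining stage can look like: an encoder infers a latent action from a pair of consecutive frames, and a world model must reproduce the next frame from the current frame plus that latent action, so the only supervision is the video itself. The module names, architectures, dataloader, and pixel-reconstruction loss are our own illustrative assumptions, not the exact LAWM implementation.

```python
# Sketch of stage-1 pretraining on unlabeled video (illustrative assumptions):
# infer a latent action between two frames, then ask a world model to
# predict the next frame from it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionEncoder(nn.Module):
    """Infers a latent 'action' from a pair of consecutive frames."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

class WorldModel(nn.Module):
    """Predicts the next frame from the current frame and a latent action."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU())
        self.film = nn.Linear(latent_dim, 64)   # condition on the latent action
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, frame_t, z):
        h = self.encode(frame_t) + self.film(z)[:, :, None, None]
        return self.decode(h)

encoder, world_model = LatentActionEncoder(), WorldModel()
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(world_model.parameters()), lr=1e-4)

# Hypothetical stand-in for a loader of consecutive frame pairs from unlabeled video.
video_pair_loader = [(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64))
                     for _ in range(4)]

for frame_t, frame_t1 in video_pair_loader:
    z = encoder(frame_t, frame_t1)           # latent action, no labels needed
    pred_t1 = world_model(frame_t, z)        # imagined next frame
    loss = F.mse_loss(pred_t1, frame_t1)     # world-modeling objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```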

Stage 2: Task-Specific Finetuning

Finetune the pretrained model on target robotic manipulation tasks with minimal labeled data, leveraging learned action priors. The world model is discarded, and only the imitation learning model is adapted to downstream tasks.
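
A minimal sketch of this second stage, assuming a shared visual backbone carried over from pretraining and a small behavior-cloning head that maps features to real robot actions; the module names, checkpoint path, and 7-dimensional action space are hypothetical placeholders rather than the authors' exact setup.

```python
# Sketch of stage-2 finetuning (illustrative assumptions throughout):
# the world model is dropped, the pretrained backbone is kept, and a small
# behavior-cloning head is trained on a few labeled demonstrations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyBackbone(nn.Module):
    """Visual backbone assumed to be shared between pretraining and finetuning."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, obs):
        return self.net(obs)

backbone = PolicyBackbone()
# backbone.load_state_dict(torch.load("stage1_backbone.pt"))  # hypothetical stage-1 checkpoint

action_head = nn.Linear(128, 7)   # e.g. 6-DoF end-effector delta + gripper (assumed)
optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(action_head.parameters()), lr=1e-4)

# Hypothetical stand-in for a small set of labeled (image, action) demonstrations.
demo_loader = [(torch.rand(8, 3, 64, 64), torch.rand(8, 7)) for _ in range(4)]

for obs, action in demo_loader:
    pred = action_head(backbone(obs))
    loss = F.mse_loss(pred, action)   # behavior-cloning loss on real robot actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```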

Experimental Results

State-of-the-art performance on LIBERO benchmark and real-world tasks

BAKU Results
Success rates (%) on the LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long suites.

Model                             | Pretraining Method | Spatial | Object | Goal  | Long  | Average
Octo-base                         | Robot Actions      | 78.90   | 85.70  | 84.60 | 51.10 | 75.10
OpenVLA                           | Robot Actions      | 84.70   | 88.40  | 79.20 | 53.70 | 76.50
π₀                                | Robot Actions      | 90.00   | 86.00  | 95.00 | 73.00 | 86.00
π₀.₅                              | Robot Actions      | 98.80   | 98.20  | 98.00 | 92.40 | 96.85
π₀ (Paligemma-3B)                 | VLM Checkpoint     | 87.00   | 63.00  | 89.00 | 48.00 | 71.80
SmolVLA (SmolVLM-2.25B)           | VLM Checkpoint     | 93.00   | 94.00  | 91.00 | 77.00 | 88.75
villa-X w/o latent                | Videos             | 86.00   | 86.50  | 85.00 | 70.00 | 81.90
villa-X                           | Videos             | 97.50   | 97.00  | 91.50 | 74.50 | 90.10
BAKU w/o pretraining              | None               | 94.00   | 100.00 | 96.00 | 92.00 | 95.50
BAKU w/ latent pretraining (Ours) | Videos             | 99.00   | 100.00 | 96.00 | 94.00 | 97.25
Real-world Results

Citation

If you find our work useful, please consider citing

arXiv: 2509.18428 [cs.RO]

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

https://doi.org/10.48550/arXiv.2509.18428

@article{lawm2025,
  title   = {Latent Action Pretraining Through World Modeling},
  author  = {Bahey Tharwat and Yara Nasser and Ali Abouzied and Ian Reid},
  journal = {arXiv preprint arXiv:2509.18428},
  year    = {2025}
}