Teaching robots from human videos through self-supervised world modeling — achieving superior performance with efficient, model-agnostic pretraining.
1 MBZUAI, Abu Dhabi, UAE | 2 Alexandria University, Egypt
See LAWM in action across different manipulation tasks
What makes LAWM unique and effective
LAWM is model-agnostic: it can be applied to a variety of imitation learning architectures, making it broadly usable across different robot learning systems.
Learn from unlabeled robot or human videos without expensive teleoperation data collection, enabling scalable learning from diverse sources.
Transfers effectively across tasks, environments, and robot embodiments, demonstrating strong generalization.
A smaller model that outperforms large-scale VLA models, making it practical for real-world deployment.
Self-supervised world modeling outperforms standard supervised pretraining on ground-truth robot actions, learning more generalizable dynamics.
Predicts chunks of future actions instead of single steps, reducing error accumulation and improving efficiency in long-horizon tasks.
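The efficiency argument behind chunking is simple to quantify: if the policy emits a chunk of k actions per forward pass instead of one, a rollout of a given horizon needs roughly k times fewer inference calls, and there are correspondingly fewer points at which prediction errors can compound. A minimal sketch (the function name and chunk sizes are illustrative, not from the paper):

```python
def inference_calls(horizon: int, chunk_size: int) -> int:
    """Policy forward passes needed to act for `horizon` steps when each
    pass predicts a chunk of `chunk_size` future actions, which are then
    executed open-loop before the policy is queried again."""
    return -(-horizon // chunk_size)  # ceiling division

# Single-step prediction: one inference call per environment step.
print(inference_calls(300, chunk_size=1))   # 300 calls
# Chunked prediction with 10 actions per call: 10x fewer calls, and
# 10x fewer opportunities for per-call errors to accumulate.
print(inference_calls(300, chunk_size=10))  # 30 calls
```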
Two-stage framework for learning robust action representations
Learn latent action representations from unlabeled videos through world modeling. The model predicts visual changes between frames without requiring ground-truth action labels, enabling large-scale pretraining on diverse video datasets.
Fine-tune the pretrained model on target robotic manipulation tasks with minimal labeled data, leveraging the learned action priors. The world model is discarded; only the imitation learning model is adapted to downstream tasks.
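The two stages above can be caricatured with a linear toy model in NumPy. Everything here is an illustrative assumption, not the paper's architecture: the module names (`W_enc`, `W_inv`, `W_fwd`, `W_head`), the toy dimensions, and the choice to predict next-frame features rather than pixels are all placeholders for the real networks.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_FEAT, D_LATENT, D_ACT = 32, 16, 8, 4  # toy sizes (assumed)

# Shared observation encoder: what pretraining shapes and what
# survives into Stage 2.
W_enc = rng.normal(0, 0.1, (D_FEAT, D_OBS))

# Stage-1-only modules, discarded after pretraining.
W_inv = rng.normal(0, 0.1, (D_LATENT, 2 * D_FEAT))       # latent action
W_fwd = rng.normal(0, 0.1, (D_FEAT, D_FEAT + D_LATENT))  # world model

def stage1_loss(frame_t, frame_next):
    """Self-supervised objective on unlabeled video: infer a latent
    action from a frame pair, then predict the next frame's features
    from the current features plus that latent. No action labels."""
    h_t, h_next = W_enc @ frame_t, W_enc @ frame_next
    z = W_inv @ np.concatenate([h_t, h_next])     # latent action
    pred = W_fwd @ np.concatenate([h_t, z])       # predicted next features
    return float(np.mean((pred - h_next) ** 2))

# Stage 2: the world model is dropped; a small action head on top of
# the pretrained encoder is fine-tuned on a few labeled demos.
W_head = rng.normal(0, 0.1, (D_ACT, D_FEAT))

def policy(frame_t):
    """Downstream imitation policy: observation -> robot action."""
    return W_head @ (W_enc @ frame_t)
```

The key property the sketch preserves is that Stage 1's loss touches no ground-truth actions: the latent `z` is whatever the inverse model needs to explain the visual change between frames, which is what makes pretraining on unlabeled human or robot video possible.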
State-of-the-art performance on the LIBERO benchmark (success rate, %) and real-world tasks
| Model | Pretraining Method | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|---|
| Octo-base | Robot Actions | 78.90 | 85.70 | 84.60 | 51.10 | 75.10 |
| OpenVLA | Robot Actions | 84.70 | 88.40 | 79.20 | 53.70 | 76.50 |
| π₀ | Robot Actions | 90.00 | 86.00 | 95.00 | 73.00 | 86.00 |
| π₀.₅ | Robot Actions | 98.80 | 98.20 | 98.00 | 92.40 | 96.85 |
| π₀ (Paligemma-3B) | VLM Checkpoint | 87.00 | 63.00 | 89.00 | 48.00 | 71.80 |
| SmolVLA (SmolVLM-2.25B) | VLM Checkpoint | 93.00 | 94.00 | 91.00 | 77.00 | 88.75 |
| villa-X w/o latent | Videos | 86.00 | 86.50 | 85.00 | 70.00 | 81.90 |
| villa-X | Videos | 97.50 | 97.00 | 91.50 | 74.50 | 90.10 |
| BAKU w/o pretraining | None | 94.00 | 100.00 | 96.00 | 92.00 | 95.50 |
| BAKU w/ latent pretraining (Ours) | Videos | 99.00 | 100.00 | 96.00 | 94.00 | 97.25 |
If you find our work useful, please consider citing:
arXiv: 2509.18428 [cs.RO]
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
@article{lawm2025,
title = {Latent Action Pretraining Through World Modeling},
author = {Bahey Tharwat and Yara Nasser and Ali Abouzied and Ian Reid},
journal = {arXiv preprint arXiv:2509.18428},
year = {2025}
}