📜 A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
Photo Credit: Gemini-Nano-Banana🍌.
- 🎯 Aim
- 📚 VLA Definition | WAM Definition
- 🔍 Survey
Vision-Language-Action (VLA) Models
- 🧠 General VLA
- 💡 VLA with Reasoning
- 🌐 VLA with 3D/4D Modelling
- 🔥 VLA with Reinforcement Learning
- 🪶 Efficient VLA
- 🧪 VLA with Latent Actions
- 🧭 Domain-Specific VLA (e.g., Humanoid, Dexterous, Tactile)
- 🧷 Other Topics in VLA
World Action Models (WAM)
- 🎬 World Action Models from VideoGen
- 🌍 World Action Models from VLM
- ✨ World Action Models from Scratch
Traditional Policies
Resources
🎯 Aim
This is a curated list of VLA and WAM research, systematically organized to provide a comprehensive view of recent advances in robotics foundation models. It is continuously updated and refined, with the goal of clarifying the research landscape for scholars working on robotics foundation models. If you know of new papers worth adding, please feel free to open a pull request or raise an issue. Join us in maintaining a high-quality VLA & WAM & More list.
📚 VLA Definition | WAM Definition
In short, VLA models are a class of robot policies that inherit a pretrained VLM's rich language grounding and visual understanding, offering a scalable route toward general-purpose, language-conditioned robot policies. The origin and formal definition of the VLA can be traced to the work RT-2.
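To make the recipe concrete, here is a minimal PyTorch sketch of the generic VLA pipeline: a vision tower and instruction embeddings (stand-ins for a pretrained VLM) are fused into a joint context, and a lightweight action head decodes a chunk of continuous actions. All module names and dimensions are hypothetical and illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Schematic VLA: a VLM-style backbone fuses image patches and
    instruction tokens; a small head decodes a chunk of continuous actions."""

    def __init__(self, embed_dim=256, action_dim=7, chunk=8, vocab=1000):
        super().__init__()
        # Stand-ins for the pretrained VLM's vision tower and token embeddings.
        self.vision = nn.Sequential(nn.Conv2d(3, embed_dim, 16, 16), nn.Flatten(2))
        self.text = nn.Embedding(vocab, embed_dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Action head: regress a chunk of actions (e.g., end-effector pose + gripper).
        self.action_head = nn.Linear(embed_dim, action_dim * chunk)
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, image, instruction_ids):
        vis = self.vision(image).transpose(1, 2)         # (B, patches, D)
        txt = self.text(instruction_ids)                 # (B, tokens, D)
        fused = self.fuse(torch.cat([vis, txt], dim=1))  # joint vision-language context
        return self.action_head(fused.mean(dim=1)).view(-1, self.chunk, self.action_dim)

policy = ToyVLA()
actions = policy(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(actions.shape)  # torch.Size([1, 8, 7])
```

In real systems the backbone is a billion-parameter pretrained VLM and the head may decode discrete action tokens autoregressively (as in RT-2 and OpenVLA) or generate continuous actions with flow matching or diffusion (as in π0), but the information flow is the same.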
In short, WAMs are a class of robot policies that leverage world-modeling capability (i.e., predicting future states) to inform action prediction. For details, we refer to the great work DreamZero, which formally coined the name World Action Model.
There is an intersection between VLA and WAM: WAMs built upon pretrained VLMs are simultaneously both VLAs and WAMs.
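In the same toy style as the VLA sketch above, a WAM-flavored policy might first imagine the next latent state (the world-modeling step) and then recover the action connecting the current and imagined states with an inverse-dynamics head. Again, every name and dimension here is a placeholder, not a description of any specific system below.

```python
import torch
import torch.nn as nn

class ToyWAM(nn.Module):
    """Schematic WAM: encode the current observation, predict the next
    latent state conditioned on the goal (world modeling), then recover
    the action that connects the two latents (inverse dynamics)."""

    def __init__(self, embed_dim=256, action_dim=7, vocab=1000):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(3, embed_dim, 16, 16),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.goal = nn.Embedding(vocab, embed_dim)
        # World-model head: imagine the next state in latent space.
        self.dynamics = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim),
                                      nn.GELU(), nn.Linear(embed_dim, embed_dim))
        # Inverse dynamics: which action moves s_t to the imagined s_{t+1}?
        self.inverse = nn.Linear(2 * embed_dim, action_dim)

    def forward(self, image, goal_ids):
        s_t = self.encode(image)                              # current latent state
        g = self.goal(goal_ids).mean(dim=1)                   # goal/instruction latent
        s_next = self.dynamics(torch.cat([s_t, g], dim=-1))   # predicted future state
        return self.inverse(torch.cat([s_t, s_next], dim=-1)) # action from the plan

wam = ToyWAM()
action = wam(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```

The world-modeling step can equally be a full video prediction model (as in the VideoGen-based entries below) rather than a single latent vector; the defining ingredient is that the policy predicts future states and conditions its actions on them.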
Vision-Language-Action (VLA) Models
🧠 General VLA
- ABot-M0, ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning.
- SimVLA, SimVLA: A Simple VLA Baseline for Robotic Manipulation.
- ACoT-VLA, ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models.
- [⭐️] Emergence of Human to Robot Transfer in Vision-Language-Action Models.
- AVA-VLA, AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention.
- AsyncVLA, AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models.
- VLA-0, VLA-0: Building State-of-the-Art VLAs with Zero Modification.
- X-VLA, X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model.
- SmolVLA, SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics.
- NORA, NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks.
- CronusVLA, CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling.
- [⭐️] Gemini Robotics, Gemini Robotics: Bringing AI into the Physical World.
- [⭐️] OpenVLA-OFT, Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success.
- [⭐️] FAST, FAST: Efficient Action Tokenization for Vision-Language-Action Models.
- CogACT, CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation.
- RoboVLMs, Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models.
- [⭐️] π0, π0: A Vision-Language-Action Flow Model for General Robot Control.
- [⭐️] OpenVLA, OpenVLA: An Open-Source Vision-Language-Action Model.
- RoboFlamingo, Vision-Language Foundation Models as Effective Robot Imitators.
- [⭐️] RT-2, RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.
💡 VLA with Reasoning
- GenieReasoner, Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training.
- MolmoAct, MolmoAct: Action Reasoning Models that can Reason in Space.
- [⭐️] π0.5, π0.5: a Vision-Language-Action Model with Open-World Generalization.
- ChatVLA, ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model.
🌐 VLA with 3D/4D Modelling
- 4D-VLA, 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration.
- 3D CAVLA, 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks.
- SpatialVLA, SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model.
- [⭐️] 3D-VLA, 3D-VLA: A 3D Vision-Language-Action Generative World Model.
🔥 VLA with Reinforcement Learning
- EVOLVE-VLA, EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models.
- SRPO, SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models.
- World-Env, World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training.
- SimpleVLA-RL, SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning.
- VLA-Reasoner, VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search. (Not strictly RL; it is planning-based.)
- ThinkAct, ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning.
- TGRPO, TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization.
- VLA-RL, VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning.
- RIPT-VLA, Interactive Post-Training for Vision-Language-Action Models.
- GRAPE, GRAPE: Generalizing Robot Policy via Preference Alignment.
🪶 Efficient VLA
- HBVLA, HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models.
- MergeVLA, MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent.
- VLA-Adapter, VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model.
- FLOWER, FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies.
- [⭐️] TinyVLA, TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation.
🧭 Domain-Specific VLA (e.g., Humanoid, Dexterous, Tactile)
- METIS, METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model.
- Tactile-VLA, Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization.
- CombatVLA, CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games.
- Humanoid-VLA, Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration.
- DynamicVLA, DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation.
- TwinVLA, TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models.
🧷 Other Topics in VLA
- MemoryVLA, MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation.
- ReconVLA, ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver.
- X-ICM, Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization.
- ForceVLA, ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation.
- TraceVLA, TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies.
World Action Models (WAM)
🎬 World Action Models from VideoGen
- [⭐️] Cosmos Policy, Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning.
- mimic-video, mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs.
- Dream2Flow, Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow.
- DreamGen, DreamGen: Unlocking Generalization in Robot Learning through Video World Models.
- Inverse Probabilistic Adaptation, Solving New Tasks by Adapting Internet Video Knowledge.
- VPP, Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations.
- GR-2, GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation.
- [⭐️] GR-1, Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation.
- UniPi, Learning Universal Policies via Text-Guided Video Generation.
🌍 World Action Models from VLM
- VLAW, VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model.
- WoG, World Guidance: World Modeling in Condition Space for Action Generation.
- VLA-JEPA, VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model.
- MM-ACT, MM-ACT: Learn from Multimodal Parallel Generation to Act.
- RynnVLA-002, RynnVLA-002: A Unified Vision-Language-Action and World Model.
- F1, F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions.
- FlowVLA, FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models.
- DreamVLA, DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge.
- WorldVLA, WorldVLA: Towards Autoregressive Action World Model.
- CoT-VLA, CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models.
- UP-VLA, UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent.
✨ World Action Models from Scratch
- [⭐️] Unified World Models, Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets.
- LPS, Latent Policy Steering with Embodiment-Agnostic Pretrained World Models.
- Seer, Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation.
- AVDC, Learning to Act from Actionless Videos through Dense Correspondences.
- DayDreamer, DayDreamer: World Models for Physical Robot Learning.
- [⭐️] Diffuser, Planning with Diffusion for Flexible Behavior Synthesis.
Traditional Policies
- State-free Policy, Do You Need Proprioceptive States in Visuomotor Policies?
- RDP, Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation.
- [⭐️] RDT-1B, RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation.
- ReKep, ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation.
- HPT, Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers.
- MDT, Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals.
- ManiGaussian, ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation.
- [⭐️] DP3, 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations.
- VoxPoser, VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models.
- [⭐️] Diffusion Policy, Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.
- [⭐️] RT-1, RT-1: Robotics Transformer for Real-World Control at Scale.
- [⭐️] PerAct, Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation.
- [⭐️] Zero-Shot Planner, Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents.
Resources
- GM100, The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents.
- RoboMIND 2.0, RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence.
- RoboCOIN, RoboCOIN: An Open-Sourced Bimanual Robotic Data COllection for INtegrated Manipulation.
- Galaxea, Galaxea Open-World Dataset and G0 Dual-System VLA Model.
- [⭐️] AgiBot World, AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems.
- RoboMIND, RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation.
- [⭐️] DROID, DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.
- [⭐️] Open X-Embodiment, Open X-Embodiment: Robotic Learning Datasets and RT-X Models.
- RH20T, RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot.
- LIBERO-Plus, LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models.
- [⭐️] RoboTwin 2.0, RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation.
- [⭐️] RoboCasa, RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots.
- [⭐️] SimplerEnv, Evaluating Real-World Robot Manipulation Policies in Simulation.
- [⭐️] LIBERO, LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning.
- FurnitureBench, FurnitureBench: Reproducible Real-World Benchmark for Long-Horizon Complex Manipulation.
- [⭐️] CALVIN, CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks.
- [⭐️] SAPIEN, SAPIEN: A SimulAted Part-based Interactive ENvironment.
- [⭐️] RLBench, RLBench: The Robot Learning Benchmark & Learning Environment.
- GELLO, GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators.
- [⭐️] ALOHA, Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.
Thanks to Awesome World Models for the template.
