📜 A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond
Photo Credit: Gemini-Nano-Banana🍌.
- 🎯 Aim
- 📚 VLA Definition | WAM Definition
- 🔍 Survey
Vision-Language-Action (VLA) Models
- 🧠 General VLA
- 💡 VLA with Reasoning
- 🌐 VLA with 3D/4D Modelling
- 🔥 VLA with Reinforcement Learning
- 🪶 Efficient VLA
- 🧪 VLA with Latent Actions
- 🧭 Domain-Specific VLA (e.g., Humanoid, Dexterous, Tactile)
- 🧷 Other Topics in VLA
World Action Models (WAM)
- 🎬 World Action Models from VideoGen
- 🌍 World Action Models from VLM
- ✨ World Action Models from Scratch
Traditional Policies
Resources
🎯 Aim
This is a curated list of VLA and WAM research, systematically organized to provide a comprehensive view of recent advances in robotics foundation models. It is continuously updated and refined, with the goal of clarifying the research landscape for scholars working on robotics foundation models. If you know of new papers worth adding, please feel free to open a pull request or raise an issue. Join us in maintaining a high-quality VLA & WAM & More list.
📚 VLA Definition | WAM Definition
In short, VLA models are a class of robot policies that inherit a pretrained VLM's rich language grounding and visual understanding, offering a scalable route toward general-purpose, language-conditioned robot policies. The origin and formal definition of the VLA can be traced to the work RT-2.
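To make the recipe concrete, here is a minimal PyTorch sketch of the generic VLA pipeline: a vision tower and instruction embeddings (stand-ins for a pretrained VLM) are fused into a joint context, and a lightweight action head decodes a chunk of continuous actions. All module names and dimensions are hypothetical and illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class ToyVLA(nn.Module):
    """Schematic VLA: a VLM-style backbone fuses image patches and
    instruction tokens; a small head decodes a chunk of continuous actions."""

    def __init__(self, embed_dim=256, action_dim=7, chunk=8, vocab=1000):
        super().__init__()
        # Stand-ins for the pretrained VLM's vision tower and token embeddings.
        self.vision = nn.Sequential(nn.Conv2d(3, embed_dim, 16, 16), nn.Flatten(2))
        self.text = nn.Embedding(vocab, embed_dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Action head: regress a chunk of actions (e.g., end-effector pose + gripper).
        self.action_head = nn.Linear(embed_dim, action_dim * chunk)
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, image, instruction_ids):
        vis = self.vision(image).transpose(1, 2)         # (B, patches, D)
        txt = self.text(instruction_ids)                 # (B, tokens, D)
        fused = self.fuse(torch.cat([vis, txt], dim=1))  # joint vision-language context
        return self.action_head(fused.mean(dim=1)).view(-1, self.chunk, self.action_dim)

policy = ToyVLA()
actions = policy(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(actions.shape)  # torch.Size([1, 8, 7])
```

In real systems the backbone is a billion-parameter pretrained VLM and the head may decode discrete action tokens autoregressively (as in RT-2 and OpenVLA) or generate continuous actions with flow matching or diffusion (as in π0), but the information flow is the same.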
In short, WAMs are a class of robot policies that leverage world-modeling capability (i.e., predicting future states) to inform action prediction. For details, we refer to the great work DreamZero, which formally coined the name World Action Model.
There is an intersection between VLA and WAM: WAMs built upon pretrained VLMs are simultaneously both VLAs and WAMs.
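In the same toy style as the VLA sketch above, a WAM-flavored policy might first imagine the next latent state (the world-modeling step) and then recover the action connecting the current and imagined states with an inverse-dynamics head. Again, every name and dimension here is a placeholder, not a description of any specific system below.

```python
import torch
import torch.nn as nn

class ToyWAM(nn.Module):
    """Schematic WAM: encode the current observation, predict the next
    latent state conditioned on the goal (world modeling), then recover
    the action that connects the two latents (inverse dynamics)."""

    def __init__(self, embed_dim=256, action_dim=7, vocab=1000):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(3, embed_dim, 16, 16),
                                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.goal = nn.Embedding(vocab, embed_dim)
        # World-model head: imagine the next state in latent space.
        self.dynamics = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim),
                                      nn.GELU(), nn.Linear(embed_dim, embed_dim))
        # Inverse dynamics: which action moves s_t to the imagined s_{t+1}?
        self.inverse = nn.Linear(2 * embed_dim, action_dim)

    def forward(self, image, goal_ids):
        s_t = self.encode(image)                              # current latent state
        g = self.goal(goal_ids).mean(dim=1)                   # goal/instruction latent
        s_next = self.dynamics(torch.cat([s_t, g], dim=-1))   # predicted future state
        return self.inverse(torch.cat([s_t, s_next], dim=-1)) # action from the plan

wam = ToyWAM()
action = wam(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(action.shape)  # torch.Size([1, 7])
```

The world-modeling step can equally be a full video prediction model (as in the VideoGen-based entries below) rather than a single latent vector; the defining ingredient is that the policy predicts future states and conditions its actions on them.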
Vision-Language-Action (VLA) Models
🧠 General VLA
- ABot-M0, ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning.
- SimVLA, SimVLA: A Simple VLA Baseline for Robotic Manipulation.
- ACoT-VLA, ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models.
- [⭐️] Emergence of Human to Robot Transfer in Vision-Language-Action Models.
- AVA-VLA, AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention.
- AsyncVLA, AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models.
- VLA-0, VLA-0: Building State-of-the-Art VLAs with Zero Modification.
- X-VLA, X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model.
- SmolVLA, SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics.
- NORA, NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks.
- CronusVLA, CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling.
- [⭐️] Gemini Robotics, Gemini Robotics: Bringing AI into the Physical World.
- [⭐️] OpenVLA-OFT, Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success.
- [⭐️] FAST, FAST: Efficient Action Tokenization for Vision-Language-Action Models.
- CogACT, CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation.
- RoboVLMs, Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models.
- [⭐️] π0, π0: A Vision-Language-Action Flow Model for General Robot Control.
- [⭐️] OpenVLA, OpenVLA: An Open-Source Vision-Language-Action Model.
- RoboFlamingo, Vision-Language Foundation Models as Effective Robot Imitators.
- [⭐️] RT-2, RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control.
💡 VLA with Reasoning
- GenieReasoner, Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training.
- MolmoAct, MolmoAct: Action Reasoning Models that can Reason in Space.
- [⭐️] π0.5, π0.5: a Vision-Language-Action Model with Open-World Generalization.
- ChatVLA, ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model.
🌐 VLA with 3D/4D Modelling
- 4D-VLA, 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration.
- 3D CAVLA, 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks.
- SpatialVLA, SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model.
- [⭐️] 3D-VLA, 3D-VLA: A 3D Vision-Language-Action Generative World Model.
🔥 VLA with Reinforcement Learning
- EVOLVE-VLA, EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models.
- SRPO, SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models.
- World-Env, World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training.
- SimpleVLA-RL, SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning.
- VLA-Reasoner, VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search. (Not strictly RL; it is planning-based.)
- ThinkAct, ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning.
- TGRPO, TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization.
- VLA-RL, VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning.
- RIPT-VLA, Interactive Post-Training for Vision-Language-Action Models.
- GRAPE, GRAPE: Generalizing Robot Policy via Preference Alignment.
🪶 Efficient VLA
- HBVLA, HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models.
- MergeVLA, MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent.
- VLA-Adapter, VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model.
- FLOWER, FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies.
- [⭐️] TinyVLA, TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation.
🧭 Domain-Specific VLA (e.g., Humanoid, Dexterous, Tactile)
- METIS, METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model.
- Tactile-VLA, Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization.
- CombatVLA, CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games.
- Humanoid-VLA, Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration.
- DynamicVLA, DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation.
- TwinVLA, TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models.
🧷 Other Topics in VLA
- MemoryVLA, MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation.
- ReconVLA, ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver.
- X-ICM, Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization.
- ForceVLA, ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation.
- TraceVLA, TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies.
World Action Models (WAM)
🎬 World Action Models from VideoGen
- [⭐️] Cosmos Policy, Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning.
- mimic-video, mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs.
- Dream2Flow, Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow.
- DreamGen, DreamGen: Unlocking Generalization in Robot Learning through Video World Models.
- Inverse Probabilistic Adaptation, Solving New Tasks by Adapting Internet Video Knowledge.
- VPP, Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations.
- GR-2, GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation.
- [⭐️] GR-1, Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation.
- UniPi, Learning Universal Policies via Text-Guided Video Generation.
🌍 World Action Models from VLM
- VLAW, VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model.
- WoG, World Guidance: World Modeling in Condition Space for Action Generation.
- VLA-JEPA, VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model.
- MM-ACT, MM-ACT: Learn from Multimodal Parallel Generation to Act.
- RynnVLA-002, RynnVLA-002: A Unified Vision-Language-Action and World Model.
- F1, F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions.
- FlowVLA, FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models.
- DreamVLA, DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge.
- WorldVLA, WorldVLA: Towards Autoregressive Action World Model.
- CoT-VLA, CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models.
- UP-VLA, UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent.
✨ World Action Models from Scratch
- [⭐️] Unified World Models, Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets.
- LPS, Latent Policy Steering with Embodiment-Agnostic Pretrained World Models.
- Seer, Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation.
- AVDC, Learning to Act from Actionless Videos through Dense Correspondences.
- DayDreamer, DayDreamer: World Models for Physical Robot Learning.
- [⭐️] Diffuser, Planning with Diffusion for Flexible Behavior Synthesis.
Traditional Policies
- State-free Policy, Do You Need Proprioceptive States in Visuomotor Policies?
- RDP, Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation.
- [⭐️] RDT-1B, RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation.
- ReKep, ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation.
- HPT, Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers.
- MDT, Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals.
- ManiGaussian, ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation.
- [⭐️] DP3, 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations.
- VoxPoser, VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models.
- [⭐️] Diffusion Policy, Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.
- [⭐️] RT-1, RT-1: Robotics Transformer for Real-World Control at Scale.
- [⭐️] PerAct, Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation.
- [⭐️] Zero-Shot Planner, Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents.
Resources
- GM100, The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents.
- RoboMIND 2.0, RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence.
- RoboCOIN, RoboCOIN: An Open-Sourced Bimanual Robotic Data COllection for INtegrated Manipulation.
- Galaxea, Galaxea Open-World Dataset and G0 Dual-System VLA Model.
- [⭐️] AgiBot World, AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems.
- RoboMIND, RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation.
- [⭐️] DROID, DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.
- [⭐️] Open X-Embodiment, Open X-Embodiment: Robotic Learning Datasets and RT-X Models.
- RH20T, RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot.
- LIBERO-Plus, LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models.
- [⭐️] RoboTwin 2.0, RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation.
- [⭐️] RoboCasa, RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots.
- [⭐️] SimplerEnv, Evaluating Real-World Robot Manipulation Policies in Simulation.
- [⭐️] LIBERO, LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning.
- FurnitureBench, FurnitureBench: Reproducible Real-World Benchmark for Long-Horizon Complex Manipulation.
- [⭐️] CALVIN, CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks.
- [⭐️] SAPIEN, SAPIEN: A SimulAted Part-based Interactive ENvironment.
- [⭐️] RLBench, RLBench: The Robot Learning Benchmark & Learning Environment.
- GELLO, GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators.
- [⭐️] ALOHA, Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware.
Thanks to Awesome World Models for the template.
