🤖 Awesome VLA & WAM

📜 A Curated List of Vision-Language-Action (VLA) and World Action Models (WAM) Research and Beyond

[Banner image] Photo Credit: Gemini-Nano-Banana🍌.

Overview

  • Vision-Language-Action (VLA) Models

  • World Action Models (WAM)

  • Traditional Policies

  • Resources

Aim

This is a curated list of VLA and WAM research, systematically organized to provide a comprehensive view of recent advances in robotics foundation models. It will be continuously updated and refined, with the goal of clarifying the research context for scholars working on robotics foundation models. If you know of new papers worth adding, please feel free to open a pull request or raise an issue. Join us in maintaining a high-quality list of VLA, WAM, and related research.

VLA Definition

In short, VLA models are a class of robot policies that inherit a pretrained VLM's rich language grounding and visual understanding, offering a scalable route toward general-purpose, language-conditioned robot policies. The origin and formal definition of the VLA can be traced to RT-2; a minimal interface sketch follows the reference below.

  • [⭐️] RT-2, RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv Website
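
To make the definition concrete, here is a minimal, hypothetical sketch of what a VLA policy looks like at inference time. The class and method names (`VLAPolicy`, `vlm_backbone`, `action_head`, `encode`, `decode`) are illustrative assumptions, not the API of any specific codebase: a pretrained VLM encodes the camera image and the language instruction, and an action head decodes robot actions from the fused representation.

```python
import numpy as np


class VLAPolicy:
    """Hypothetical VLA sketch: a pretrained VLM backbone plus an action head.

    `vlm_backbone` and `action_head` are illustrative placeholders,
    not a specific library API.
    """

    def __init__(self, vlm_backbone, action_head):
        self.vlm = vlm_backbone          # pretrained vision-language model
        self.action_head = action_head   # maps VLM features to robot actions

    def act(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # The VLM fuses the visual observation and the language command into
        # one representation, inheriting its web-scale grounding.
        features = self.vlm.encode(image=image, text=instruction)
        # The action head decodes low-level robot actions (e.g., end-effector
        # deltas plus a gripper command), often as a short action chunk.
        return self.action_head.decode(features)
```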

WAM Definition

In short, WAMs are a class of robot policies that leverage world modeling (i.e., predicting future states) to drive action prediction. We refer to the excellent work DreamZero, which formally coined the name World Action Model, for details; a minimal sketch follows the reference below.

  • [⭐️] DreamZero, World Action Models are Zero-shot Policies. arXiv Website
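
The following minimal sketch illustrates the predict-then-act structure that distinguishes WAMs, under stated assumptions: the names (`WorldActionPolicy`, `world_model`, `inverse_dynamics`, `predict`, `infer`) are hypothetical, and inferring actions from predicted observations via an inverse dynamics model is one common design among several.

```python
import numpy as np


class WorldActionPolicy:
    """Hypothetical WAM sketch: imagine the future, then infer the action.

    `world_model` and `inverse_dynamics` are illustrative placeholders.
    """

    def __init__(self, world_model, inverse_dynamics):
        self.world_model = world_model            # predicts future observations
        self.inverse_dynamics = inverse_dynamics  # recovers actions from observation pairs

    def act(self, obs: np.ndarray, instruction: str) -> np.ndarray:
        # 1. Imagine how the scene should evolve to satisfy the instruction.
        future_obs = self.world_model.predict(obs, instruction)
        # 2. Infer the action that moves the robot from the current
        #    observation toward the imagined one.
        return self.inverse_dynamics.infer(obs, future_obs)
```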

There is an intersection between VLA and WAM: WAMs built upon pretrained VLMs are simultaneously both VLAs and WAMs.

Survey

  • Vision-Language-Action (VLA) Models: Concepts, Progress, Applications and Challenges. arXiv Website

  • A Survey on Vision-Language-Action Models for Embodied AI. arXiv Website

General VLA

  • VLANeXt, VLANeXt: Recipes for Building Strong VLA Models. arXiv Website

  • HoloBrain-0, HoloBrain-0 Technical Report. arXiv Website

  • ABot-M0, ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning. arXiv Website

  • SimVLA, SimVLA: A Simple VLA Baseline for Robotic Manipulation. arXiv Website

  • Lingbot-VLA, A Pragmatic VLA Foundation Model. arXiv Website

  • ACoT-VLA, ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models. arXiv Website

  • [⭐️] Emergence of Human to Robot Transfer in Vision-Language-Action Models. arXiv Website

  • [⭐️] π∗0.6, π∗0.6: a VLA That Learns From Experience. arXiv Website

  • AVA-VLA, AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention. arXiv

  • AsyncVLA, AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models. arXiv Website

  • VLA-0, VLA-0: Building State-of-the-Art VLAs with Zero Modification. arXiv Website

  • X-VLA, X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model. arXiv Website

  • UniVLA, Unified Vision-Language-Action Model. arXiv Website

  • SmolVLA, SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. arXiv Website

  • NORA, NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks. arXiv Website

  • CronusVLA, CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling. arXiv Website

  • [⭐️] Gemini Robotics, Gemini Robotics: Bringing AI into the Physical World. arXiv Website

  • [⭐️] OpenVLA-OFT, Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. arXiv Website

  • [⭐️] FAST, FAST: Efficient Action Tokenization for Vision-Language-Action Models. arXiv Website

  • CogACT, CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation. arXiv Website

  • RoboVLMs, Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models. arXiv Website

  • [⭐️] π0, π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv Website

  • [⭐️] OpenVLA, OpenVLA: An Open-Source Vision-Language-Action Model. arXiv Website

  • RoboFlamingo, Vision-Language Foundation Models as Effective Robot Imitators. arXiv Website

  • [⭐️] RT-2, RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv Website

VLA with Reasoning

  • GenieReasoner, Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training. arXiv Website

  • MolmoAct, MolmoAct: Action Reasoning Models that can Reason in Space. arXiv Website

  • [⭐️] π0.5, π0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv Website

  • ChatVLA, ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model. arXiv Website

VLA with 3D/4D Modelling

  • 4D-VLA, 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration. arXiv Website

  • 3D CAVLA, 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks. arXiv Website

  • SpatialVLA, SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model. arXiv Website

  • [⭐️] 3D-VLA, 3D-VLA: A 3D Vision-Language-Action Generative World Model. arXiv Website

VLA with Reinforcement Learning

  • EVOLVE-VLA, EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models. arXiv Website

  • SRPO, SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models. arXiv

  • World-Env, World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training. arXiv Website

  • SimpleVLA-RL, SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning. arXiv Website

  • VLA-Reasoner, VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search. (not strictly RL; closer to planning) arXiv

  • ThinkAct, ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning. arXiv Website

  • TGRPO, TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization. arXiv

  • VLA-RL, VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning. arXiv Website

  • RIPT-VLA, Interactive Post-Training for Vision-Language-Action Models. arXiv Website

  • GRAPE, GRAPE: Generalizing Robot Policy via Preference Alignment. arXiv Website

Efficient VLA

  • HBVLA, HBVLA: Pushing 1-Bit Post-Training Quantization for Vision-Language-Action Models. arXiv

  • MergeVLA, MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent. arXiv Website

  • VLA-Adapter, VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model. arXiv Website

  • FLOWER, FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies. arXiv Website

  • [⭐️] TinyVLA, TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation. arXiv Website

VLA with Latent Actions

  • Motus, Motus: A Unified Latent Action World Model. arXiv Website

  • [⭐️] GR00T N1, GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv Website

  • [⭐️] LAPA, Latent Action Pretraining from Videos. arXiv Website

Domain-Specific VLA (e.g., Humanoid, Dexterous, Tactile)

  • METIS, METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model. arXiv Website

  • Tactile-VLA, Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization. arXiv Website

  • CombatVLA, CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games. arXiv Website

  • Humanoid-VLA, Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration. arXiv

Other Topics in VLA

  • DynamicVLA, DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation. arXiv Website

  • TwinVLA, TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models. arXiv Website

  • MemoryVLA, MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation. arXiv Website

  • ReconVLA, ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver. arXiv Website

  • X-ICM, Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization. arXiv Website

  • ForceVLA, ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation. arXiv Website

  • TraceVLA, TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies. arXiv Website

World Action Models from VideoGen

  • [⭐️] DreamZero, World Action Models are Zero-shot Policies. arXiv Website

  • [⭐️] Cosmos Policy, Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning. arXiv Website

  • Lingbot-VA, Causal World Modeling for Robot Control. arXiv Website

  • mimic-video, mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs. arXiv Website

  • Dream2Flow, Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow. arXiv Website

  • Video Policy, Video Generators are Robot Policies. arXiv Website

  • DreamGen, DreamGen: Unlocking Generalization in Robot Learning through Video World Models. arXiv Website

  • Inverse Probabilistic Adaptation, Solving New Tasks by Adapting Internet Video Knowledge. arXiv Website

  • VPP, Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations. arXiv Website

  • GR-2, GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. arXiv Website

  • [⭐️] GR-1, Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation. arXiv Website

  • VLP, Video Language Planning. arXiv Website

  • UniPi, Learning Universal Policies via Text-Guided Video Generation. arXiv Website

World Action Models from VLM

  • VLAW, VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model. arXiv Website

  • WoG, World Guidance: World Modeling in Condition Space for Action Generation. arXiv Website

  • VLA-JEPA, VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model. arXiv Website

  • MM-ACT, MM-ACT: Learn from Multimodal Parallel Generation to Act. arXiv Website

  • RynnVLA-002, RynnVLA-002: A Unified Vision-Language-Action and World Model. arXiv Website

  • F1, F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions. arXiv Website

  • FlowVLA, FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models. arXiv Website

  • DreamVLA, DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge. arXiv Website

  • WorldVLA, WorldVLA: Towards Autoregressive Action World Model. arXiv Website

  • FLARE, FLARE: Robot Learning with Implicit World Modeling. arXiv Website

  • CoT-VLA, CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models. arXiv Website

  • UP-VLA, UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent. arXiv Website

World Action Models from Scratch

  • [⭐️] Unified World Models, Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets. arXiv Website

  • LPS, Latent Policy Steering with Embodiment-Agnostic Pretrained World Models. arXiv

  • [⭐️] UVAM, Unified Video Action Model. arXiv Website

  • Seer, Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation. arXiv Website

  • AVDC, Learning to Act from Actionless Videos through Dense Correspondences. arXiv Website

  • UniSim, Learning Interactive Real-World Simulators. arXiv Website

  • DayDreamer, DayDreamer: World Models for Physical Robot Learning. arXiv Website

  • [⭐️] Diffuser, Planning with Diffusion for Flexible Behavior Synthesis. arXiv Website

Traditional Policies

  • State-free Policy, Do You Need Proprioceptive States in Visuomotor Policies?. arXiv Website

  • RDP, Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation. arXiv Website

  • [⭐️] RDT-1B, RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation. arXiv Website

  • ReKep, ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation. arXiv Website

  • HPT, Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers. arXiv Website

  • MDT, Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals. arXiv Website

  • [⭐️] Octo, Octo: An Open-Source Generalist Robot Policy. arXiv Website

  • ManiGaussian, ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation. arXiv Website

  • [⭐️] DP3, 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations. arXiv Website

  • VoxPoser, VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. arXiv Website

  • RPT, Robot Learning with Sensorimotor Pre-training. arXiv Website

  • [⭐️] Diffusion Policy, Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv Website

  • [⭐️] RT-1, RT-1: Robotics Transformer for Real-World Control at Scale. arXiv Website

  • [⭐️] PerAct, Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation. arXiv Website

  • [⭐️] Zero-Shot Planner, Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. arXiv Website

Datasets

  • GM100, The Great March 100: 100 Detail-oriented Tasks for Evaluating Embodied AI Agents. arXiv Website

  • RoboMIND 2.0, RoboMIND 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence. arXiv Website

  • RoboCOIN, RoboCOIN: An Open-Sourced Bimanual Robotic Data COllection for INtegrated Manipulation. arXiv Website

  • Galaxea, Galaxea Open-World Dataset and G0 Dual-System VLA Model. arXiv Website

  • [⭐️] AgiBot World, AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems. arXiv Website

  • RoboMIND, RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation. arXiv Website

  • [⭐️] DROID, DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv Website

  • [⭐️] Open X-Embodiment, Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv Website

  • RH20T, RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot. arXiv Website

Benchmark / Environment

  • LIBERO-Plus, LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models. arXiv Website

  • [⭐️] RoboTwin 2.0, RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation. arXiv Website

  • [⭐️] RoboCasa, RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots. arXiv Website

  • [⭐️] SimplerEnv, Evaluating Real-World Robot Manipulation Policies in Simulation. arXiv Website

  • [⭐️] LIBERO, LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. arXiv Website

  • FurnitureBench, FurnitureBench: Reproducible Real-World Benchmark for Long-Horizon Complex Manipulation. arXiv Website

  • [⭐️] CALVIN, CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks. arXiv Website

  • [⭐️] SAPIEN, SAPIEN: A SimulAted Part-based Interactive ENvironment. arXiv Website

  • [⭐️] RLBench, RLBench: The Robot Learning Benchmark & Learning Environment. arXiv Website

Physics Engine

  • [⭐️] PhysX. Website

  • [⭐️] MuJoCo. Website

  • [⭐️] PyBullet. Website

Hardware

  • GELLO, GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators. arXiv Website

  • [⭐️] ALOHA, Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. arXiv Website

Acknowledgements

Thanks to Awesome World Models for the template.
