This repository contains the list of representative VLA works in the survey “A Survey on Vision-Language-Action Models: An Action Tokenization Perspective”, along with relevant reference materials.
- Transformer, Attention is All You Need, 2017.06, NIPS 2017. [📄 Paper]
- USE, Universal Sentence Encoder, 2018.03. [📄 Paper]
- GPT-1, Improving Language Understanding by Generative Pre-Training, 2018.06. [📄 Paper]
- BERT, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018.10, NAACL 2019. [📄 Paper] [💻 Code] [🤗 Model]
- GPT-2, Language Models are Unsupervised Multitask Learners, 2019.02. [📄 Paper] [💻 Code]
- MUSE, Multilingual Universal Sentence Encoder for Semantic Retrieval, 2019.07. [📄 Paper] [💻 Code]
- T5, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019.10, JMLR 2020. [📄 Paper] [💻 Code] [🤗 Model]
- GPT-3, Language Models are Few-Shot Learners, 2020.05, NeurIPS 2020. [📄 Paper]
- InstructGPT, Training Language Models to Follow Instructions with Human Feedback, 2022.03, NeurIPS 2022. [📄 Paper] [🌐 Website]
- Chinchilla, Training Compute-Optimal Large Language Models, 2022.03, NeurIPS 2022. [📄 Paper]
- ChatGPT, 2022.11. [🌐 Website]
- LLaMA, LLaMA: Open and Efficient Foundation Language Models, 2023.02. [📄 Paper] [💻 Code] [🤗 Model]
- GPT-4, 2023.03. [📄 Paper] [🌐 Website]
- Claude, 2023.03. [🌐 Website]
- Llama 2, Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023.07. [📄 Paper] [🤗 Model]
- Claude 2, 2023.07. [🌐 Website]
- Mistral, Mistral 7B, 2023.10. [📄 Paper] [🤗 Model]
- Mamba, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, 2023.12, COLM 2024. [📄 Paper] [💻 Code]
- Mixtral, Mixtral of Experts, 2024.01. [📄 Paper] [🤗 Model]
- Gemma, Gemma: Open Models Based on Gemini Research and Technology, 2024.03. [📄 Paper] [🌐 Website]
- Claude 3, 2024.03. [🌐 Website]
- Llama 3, The Llama 3 Herd of Models, 2024.07. [📄 Paper] [🌐 Website] [🤗 Model]
- Gemma 2, Gemma 2: Improving Open Language Models at a Practical Size, 2024.08. [📄 Paper] [🤗 Model]
- OpenAI o1, 2024.12. [🌐 Website]
- Gemini 2.0 Flash, 2025.01. [🌐 Website]
- DeepSeek-R1, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025.01. [📄 Paper] [🤗 Model]
- Gemini 2.0 Pro, 2025.02. [🌐 Website]
- Gemini 2.5 Pro, 2025.03. [🌐 Website]
- Gemma 3, 2025.03. [📄 Paper] [🌐 Website]
- Gemini 2.5 Flash, 2025.04. [🌐 Website]
- Claude 4, 2025.05. [🌐 Website]
- ViT, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020.10, ICLR 2021. [📄 Paper] [💻 Code]
- CLIP, Learning Transferable Visual Models From Natural Language Supervision, 2021.02, ICML 2021. [📄 Paper] [💻 Code]
- DINO, Emerging Properties in Self-Supervised Vision Transformers, 2021.04, ICCV 2021. [📄 Paper] [💻 Code]
- GLIP, Grounded Language-Image Pre-training, 2021.12, CVPR 2022. [📄 Paper] [💻 Code]
- GLIDE, GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, 2021.12, ICML 2022. [📄 Paper] [💻 Code]
- Stable Diffusion, High-Resolution Image Synthesis with Latent Diffusion Models, 2021.12, CVPR 2022. [📄 Paper] [💻 Code]
- DALL-E 2, Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022.04. [📄 Paper] [🌐 Website]
- Imagen, Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, 2022.05, NeurIPS 2022. [📄 Paper] [🌐 Website]
- Stable Diffusion 2, 2022.11. [🌐 Website]
- ControlNet, Adding Conditional Control to Text-to-Image Diffusion Models, 2023.02, ICCV 2023. [📄 Paper] [💻 Code]
- PVDM, Video Probabilistic Diffusion Models in Projected Latent Space, 2023.02, CVPR 2023. [📄 Paper] [🌐 Website] [💻 Code]
- SigLIP, Sigmoid Loss for Language Image Pre-Training, 2023.03, ICCV 2023. [📄 Paper] [💻 Code]
- Grounding DINO, Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, 2023.03, ECCV 2024. [📄 Paper] [💻 Code]
- DINOv2, DINOv2: Learning Robust Visual Features without Supervision, 2023.04, TMLR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- SAM, Segment Anything, 2023.04, ICCV 2023. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- CoTracker, CoTracker: It is Better to Track Together, 2023.07, ECCV 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Cutie, Putting the Object Back into Video Object Segmentation, 2023.10, CVPR 2024 Highlight. [📄 Paper] [🌐 Website] [💻 Code]
- VideoCrafter1, VideoCrafter1: Open Diffusion Models for High-Quality Video Generation, 2023.10. [📄 Paper] [🌐 Website] [💻 Code]
- FoundationPose, FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects, 2023.12, CVPR 2024 Highlight. [📄 Paper] [🌐 Website] [💻 Code]
- HaMeR, Reconstructing Hands in 3D with Transformers, 2023.12, CVPR 2024. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- Depth Anything, Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, 2024.01, CVPR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Grounded SAM, Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks, 2024.01. [📄 Paper] [💻 Code]
- VideoCrafter2, VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models, 2024.01, CVPR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Sora, Video generation models as world simulators, 2024.02. [🌐 Website]
- Genie, Genie: Generative Interactive Environments, 2024.02, ICML 2024. [📄 Paper] [🌐 Website]
- Stable Diffusion 3, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, 2024.03, ICML 2024. [📄 Paper] [🤗 Model]
- Grounding DINO 1.5, Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection, 2024.05. [📄 Paper] [💻 Code]
- Depth Anything V2, Depth Anything V2, 2024.06, NeurIPS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- SAM 2, SAM 2: Segment Anything in Images and Videos, 2024.08. [📄 Paper] [🌐 Website] [💻 Code]
- Grounded SAM 2, Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks, 2024.08. [📄 Paper] [💻 Code]
- SAMURAI, SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory, 2024.11. [📄 Paper] [🌐 Website] [💻 Code]
- Genie 2, Genie 2: A large-scale foundation world model, 2024.11. [🌐 Website]
- Veo 3, Veo 3, 2025.05. [🌐 Website]
- BLIP, BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022.01, ICML 2022. [📄 Paper] [💻 Code] [🤗 Model]
- Flamingo, Flamingo: a Visual Language Model for Few-Shot Learning, 2022.04, NeurIPS 2022. [📄 Paper]
- BLIP-2, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023.01, ICML 2023. [📄 Paper] [🤗 Model]
- LLaVA, Visual Instruction Tuning, 2023.04, NeurIPS 2023 Oral. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- Qwen-VL, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, 2023.08. [📄 Paper] [💻 Code] [🤗 Model]
- LLaVA 1.5, Improved Baselines with Visual Instruction Tuning, 2023.10, CVPR 2024 Highlight. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model] [📊 Dataset]
- Prismatic, Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models, 2024.02, ICML 2024. [📄 Paper] [💻 Code (training)] [💻 Code (evaluation)] [🤗 Model]
- GPT-4o, 2024.05. [📄 Paper] [🌐 Website]
- PaliGemma, PaliGemma: A versatile 3B VLM for transfer, 2024.07. [📄 Paper] [🌐 Website]
- Qwen2-VL, Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, 2024.09. [📄 Paper] [🌐 Website] [🤗 Model]
- Qwen2.5-VL, 2025.02. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- Gemini 2.5 Pro, 2025.03. [🌐 Website]
- Gemini 2.5 Flash, 2025.04. [🌐 Website]
- Language Planner, Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents, 2022.01, ICML 2022. [📄 Paper] [🌐 Website] [💻 Code]
- Socratic Models, Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, 2022.04, ICLR 2023. [📄 Paper] [🌐 Website] [💻 Code]
- SayCan, Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, 2022.04. [📄 Paper] [🌐 Website] [💻 Code]
- Inner Monologue, Inner Monologue: Embodied Reasoning through Planning with Language Models, 2022.07, CoRL 2022. [📄 Paper] [🌐 Website]
- PaLM-E, PaLM-E: An Embodied Multimodal Language Model, 2023.03, ICML 2023. [📄 Paper] [🌐 Website]
- EmbodiedGPT, EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought, 2023.05, NeurIPS 2023. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- DoReMi, DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment, 2023.07, IROS 2024. [📄 Paper] [🌐 Website]
- ViLa, Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning, 2023.11, ICRA 2024 Workshop on Vision-Language Models for Navigation and Manipulation. [📄 Paper] [🌐 Website]
- 3D-VLA, 3D-VLA: A 3D Vision-Language-Action Generative World Model, 2024.03, ICML 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Bi-VLA, Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations, 2024.05, SMC 2024. [📄 Paper]
- RoboMamba, RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation, 2024.06, NeurIPS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- ReplanVLM, ReplanVLM: Replanning Robotic Tasks with Visual Language Models, 2024.07. [📄 Paper]
- BUMBLE, BUMBLE: Unifying Reasoning and Acting with Vision-Language Models for Building-wide Mobile Manipulation, 2024.10, ICRA 2025. [📄 Paper] [🌐 Website] [💻 Code]
- ReflectVLM, Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation, 2025.02. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset] [🤗 Model]
- Hi Robot, Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models, 2025.02. [📄 Paper] [🌐 Website]
- RoboBrain, RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete, 2025.02, CVPR 2025. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- $\pi_{0.5}$, $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization, 2025.04. [📄 Paper] [🌐 Website]
- RT-H, RT-H: Action Hierarchies Using Language, 2024.03. [📄 Paper] [🌐 Website]
- NaVILA, NaVILA: Legged Robot Vision-Language-Action Model for Navigation, 2024.12, RSS 2025. [📄 Paper] [🌐 Website] [💻 Code]
- Code as Policies, Code as Policies: Language Model Programs for Embodied Control, 2022.09, ICRA 2023. [📄 Paper] [🌐 Website] [💻 Code]
- ProgPrompt, ProgPrompt: Generating Situated Robot Task Plans using Large Language Models, 2022.09, ICRA 2023. [📄 Paper] [🌐 Website] [💻 Code]
- ChatGPT for Robotics, ChatGPT for Robotics: Design Principles and Model Abilities, 2023.02, IEEE Access 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Text2Motion, Text2Motion: From Natural Language Instructions to Feasible Plans, 2023.03, Autonomous Robots 2023. [📄 Paper] [🌐 Website]
- Instruct2Act, Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model, 2023.05. [📄 Paper] [💻 Code]
- RoboScript, RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation, 2024.02. [📄 Paper]
- RoboCodeX, RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis, 2024.02, ICML 2024. [📄 Paper] [🌐 Website] [💻 Code]
- KITE, KITE: Keypoint-Conditioned Policies for Semantic Manipulation, 2023.06, CoRL 2023. [📄 Paper] [🌐 Website] [💻 Code]
- CoPa, CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models, 2024.03, IROS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- RoboPoint, RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics, 2024.06, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code]
- RAM, RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation, 2024.07, CoRL 2024 Oral. [📄 Paper] [🌐 Website] [💻 Code]
- ReKep, ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation, 2024.09, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code]
- OmniManip, OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints, 2025.01, CVPR 2025 Highlight. [📄 Paper] [🌐 Website]
- Magma, Magma: A Foundation Model for Multimodal AI Agents, 2025.02, CVPR 2025. [📄 Paper] [🌐 Website] [💻 Code]
- KUDA, KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation, 2025.03, ICRA 2025. [📄 Paper] [🌐 Website] [💻 Code]
- GPT-4V, GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration, 2023.11, RA-L 2024. [📄 Paper] [🌐 Website] [💻 Code]
- A3VLM, A3VLM: Actionable Articulation-Aware Vision Language Model, 2024.06, CoRL 2024. [📄 Paper] [💻 Code]
- DexGraspVLA, DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping, 2025.02. [📄 Paper] [🌐 Website] [💻 Code]
- MOO, Open-World Object Manipulation using Pre-Trained Vision-Language Models, 2023.03, CoRL 2023. [📄 Paper] [🌐 Website]
- ROCKET-1, ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting, 2024.11, CVPR 2025. [📄 Paper] [🌐 Website]
- SoFar, SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation, 2025.02. [📄 Paper] [🌐 Website] [💻 Code]
- RoboDexVLM, RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation, 2025.03. [📄 Paper] [🌐 Website]
- CLIPort, CLIPort: What and Where Pathways for Robotic Manipulation, 2021.09, CoRL 2021. [📄 Paper] [🌐 Website] [💻 Code]
- VoxPoser, VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, 2023.07, CoRL 2023 Oral. [📄 Paper] [🌐 Website] [💻 Code]
- ManipLLM, ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation, 2023.12, CVPR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- ManiFoundation, ManiFoundation Model for General-Purpose Robotic Manipulation of Contact Synthesis with Arbitrary Objects and Robots, 2024.05, IROS 2024 Oral. [📄 Paper] [🌐 Website] [💻 Code]
- MOKA, MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting, 2024.05, RSS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- AVDC, Learning to Act from Actionless Videos through Dense Correspondences, 2023.10, ICLR 2024 Spotlight. [📄 Paper]
- RT-Trajectory, RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches, 2023.11, ICLR 2024 Spotlight. [📄 Paper] [🌐 Website]
- ATM, Any-point Trajectory Modeling for Policy Learning, 2023.12, RSS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- LLARVA, LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning, 2024.06, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- Im2Flow2Act, Flow as the Cross-Domain Manipulation Interface, 2024.07, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code]
- FLIP, FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model, 2024.12, ICLR 2025. [📄 Paper] [🌐 Website] [💻 Code]
- HAMSTER, HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation, 2025.02, ICLR 2025. [📄 Paper] [🌐 Website]
- ARM4R, Pre-training Auto-regressive Robotic Models with 4D Representations, 2025.02. [📄 Paper]
- Magma, Magma: A Foundation Model for Multimodal AI Agents, 2025.02, CVPR 2025. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- DriveVLM, DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models, 2024.02, CoRL 2024. [📄 Paper] [🌐 Website]
- CoVLA, CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving, 2024.08, WACV 2025 Oral. [📄 Paper] [🌐 Website] [📊 Dataset]
- EMMA, EMMA: End-to-End Multimodal Model for Autonomous Driving, 2024.10. [📄 Paper]
- VLM-E2E, VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion, 2025.02. [📄 Paper]
- SuSIE, Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models, 2023.10, ICLR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- 3D-VLA, 3D-VLA: A 3D Vision-Language-Action Generative World Model, 2024.03, ICML 2024. [📄 Paper] [🌐 Website]
- CoTDiffusion, Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts, 2024.06, CVPR 2024. [📄 Paper] [🌐 Website]
- CoT-VLA, CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models, 2025.03, CVPR 2025. [📄 Paper] [🌐 Website]
- UniPi, Learning Universal Policies via Text-Guided Video Generation, 2023.02, NeurIPS 2023 Spotlight. [📄 Paper] [🌐 Website]
- AVDC, Learning to Act from Actionless Videos through Dense Correspondences, 2023.10, ICLR 2024 Spotlight. [📄 Paper] [🌐 Website] [💻 Code]
- VLP, Video Language Planning, 2023.10. [📄 Paper] [🌐 Website] [💻 Code]
- Gen2Act, Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation, 2024.09, CoRL-X-Embodiment-WS 2024. [📄 Paper] [🌐 Website]
- Video Prediction Policy, Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations, 2024.12, ICML 2025 Spotlight. [📄 Paper] [🌐 Website] [💻 Code]
- FLIP, FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model, 2024.12, ICLR 2025. [📄 Paper] [🌐 Website] [💻 Code]
- GEVRM, GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation, 2025.02, ICLR 2025. [📄 Paper]
- OmniJARVIS, OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents, 2024.07, NeurIPS 2024. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- QueST, QueST: Self-Supervised Skill Abstractions for Learning Continuous Control, 2024.07, NeurIPS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- LAPA, LAPA: Latent Action Pretraining from Videos, 2024.10, ICLR 2025. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- GROOT-2, GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents, 2024.12, ICLR 2025. [📄 Paper]
- GO-1, AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems, 2025.03. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- UniVLA, UniVLA: Learning to Act Anywhere with Task-centric Latent Actions, 2025.05, RSS 2025. [📄 Paper] [💻 Code] [🤗 Model]
- LangLfP, Language-Conditioned Imitation Learning over Unstructured Data, 2020.05, RSS 2021. [📄 Paper] [🌐 Website]
- BC-Z, BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning, 2022.02, CoRL 2021. [📄 Paper] [🌐 Website] [💻 Code]
- Gato, A Generalist Agent, 2022.05, TMLR 2022. [📄 Paper] [🌐 Website]
- VIMA, VIMA: General Robot Manipulation with Multimodal Prompts, 2022.10, ICML 2023. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- RT-1, RT-1: Robotics Transformer for Real-World Control at Scale, 2022.12, RSS 2023. [📄 Paper] [🌐 Website] [💻 Code]
- RT-2, RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, 2023.07, CoRL 2023. [📄 Paper] [🌐 Website]
- RT-X, Open X-Embodiment: Robotic Learning Datasets and RT-X Models, 2023.10, ICRA 2024. [📄 Paper] [🌐 Website] [💻 Code]
- RoboFlamingo, Vision-Language Foundation Models as Effective Robot Imitators, 2023.11, ICLR 2024. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- LEO, An Embodied Generalist Agent in 3D World, 2023.11, ICML 2024. [📄 Paper] [🌐 Website] [💻 Code]
- GR-1, Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation, 2023.12, ICLR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Octo, Octo: An Open-Source Generalist Robot Policy, 2024.05, RSS 2024. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- OpenVLA, OpenVLA: An Open-Source Vision-Language-Action Model, 2024.06, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- TinyVLA, TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation, 2024.09, RA-L 2025. [📄 Paper] [🌐 Website] [💻 Code]
- HiRT, HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers, 2024.09, CoRL 2024. [📄 Paper]
- GR-2, GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation, 2024.10. [📄 Paper] [🌐 Website]
- RDT-1B, RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation, 2024.10, ICLR 2025. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- $\pi_0$, $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control, 2024.10. [📄 Paper] [🌐 Website] [🤗 Model]
- CogACT, CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation, 2024.11. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- $\pi_0$-FAST, FAST: Efficient Action Tokenization for Vision-Language-Action Models, 2025.01. [📄 Paper] [🌐 Website] [🤗 Model]
- UniAct, Universal Actions for Enhanced Embodied Foundation Models, 2025.01, CVPR 2025. [📄 Paper] [🌐 Website] [💻 Code]
- OpenVLA-OFT, Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success, 2025.02. [📄 Paper] [🌐 Website] [💻 Code]
- JARVIS-VLA, Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse, 2025.03. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- HybridVLA, HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model, 2025.03. [📄 Paper] [🌐 Website] [💻 Code]
- MoManipVLA, MoManipVLA: Transferring Vision-Language-Action Models for General Mobile Manipulation, 2025.03, CVPR 2025. [📄 Paper]
- GR00T N1, GR00T N1: An Open Foundation Model for Generalist Humanoid Robots, 2025.03. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- $\pi_0$+KI, Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better, 2025.05. [📄 Paper] [🌐 Website]
- RTC, Real-Time Execution of Action Chunking Flow Policies, 2025.06. [📄 Paper] [🌐 Website]
- Inner Monologue, Inner Monologue: Embodied Reasoning through Planning with Language Models, 2022.07, CoRL 2022. [📄 Paper] [🌐 Website]
- DriveVLM, DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models, 2024.02, CoRL 2024. [📄 Paper] [🌐 Website]
- ECoT, Robotic Control via Embodied Chain-of-Thought Reasoning, 2024.07, CoRL 2024. [📄 Paper] [🌐 Website]
- RAD, Action-Free Reasoning for Policy Generalization, 2025.02. [📄 Paper] [🌐 Website]
- AlphaDrive, AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning, 2025.03. [📄 Paper] [🌐 Website]
- Cosmos-Reason1, Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning, 2025.03. [📄 Paper] [🌐 Website] [💻 Code]
- Something-Something V2, The "Something Something" Video Database for Learning and Evaluating Visual Common Sense, 2017.06. [📄 Paper] [🌐 Website]
- EPIC-KITCHENS-100, Scaling Egocentric Vision: The EPIC-KITCHENS Dataset, 2018.04. [📄 Paper] [🌐 Website]
- Ego4D, Ego4D: Around the World in 3,000 Hours of Egocentric Video, 2021.10. [📄 Paper] [🌐 Website]
- Ego-Exo4D, Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives, 2023.11. [📄 Paper] [🌐 Website]
- MimicGen, MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations, 2023.10, CoRL 2023. [📄 Paper]
- RoboCasa, RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots, 2024.06, RSS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- DexMimicGen, DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning, 2024.10, ICRA 2025. [📄 Paper] [🌐 Website]
- AgiBot DigitalWorld, AgiBot DigitalWorld, 2025.02. [🌐 Website]
- nuScenes, nuScenes: A multimodal dataset for autonomous driving, 2019.03, CVPR 2020. [📄 Paper] [🌐 Website] [💻 Code]
- WOMD, Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset, 2021.04, ICCV 2021. [📄 Paper] [🌐 Website] [💻 Code]
- RT-1, RT-1: Robotics Transformer for Real-World Control at Scale, 2022.12. [📄 Paper] [🌐 Website]
- RH20T, RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot, 2023.06, ICRA 2024. [📄 Paper] [🌐 Website]
- BridgeData V2, BridgeData V2: A Dataset for Robot Learning at Scale, 2023.08, CoRL 2023. [📄 Paper] [🌐 Website]
- OXE, Open X-Embodiment: Robotic Learning Datasets and RT-X Models, 2023.10, ICRA 2024. [📄 Paper] [🌐 Website]
- HoNY, On Bringing Robots Home, 2023.11. [📄 Paper] [🌐 Website]
- DROID, DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset, 2024.03, RSS 2024. [📄 Paper] [🌐 Website]
- CoVLA, CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving, 2024.08, WACV 2025. [📄 Paper] [🌐 Website]
- RoboMIND, RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation, 2024.12, RSS 2025. [📄 Paper] [🌐 Website]
- AgiBot World, AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems, 2025.03. [📄 Paper] [🌐 Website] [💻 Code]
- Robot Learning in the Era of Foundation Models: A Survey, 2023.11, Neurocomputing, Volume 638. [📄 Paper]
- A Survey on Robotics with Foundation Models: toward Embodied AI, 2024.02. [📄 Paper]
- A Survey on Integration of Large Language Models with Intelligent Robots, 2024.04, Intelligent Service Robotics 2024. [📄 Paper]
- What Foundation Models can Bring for Robot Learning in Manipulation: A Survey, 2024.04. [📄 Paper]
- A Survey on Vision-Language-Action Models for Embodied AI, 2024.05. [📄 Paper]
- Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, 2024.07. [📄 Paper]
- Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions, 2025.02, Information Fusion, Volume 122. [📄 Paper]
- Generative Artificial Intelligence in Robotic Manipulation: A Survey, 2025.03. [📄 Paper]
- OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation, 2025.05. [📄 Paper]
- Vision-Language-Action Models: Concepts, Progress, Applications and Challenges, 2025.05. [📄 Paper]