This repository contains the list of representative VLA works in the survey “A Survey on Vision-Language-Action Models: An Action Tokenization Perspective”, along with relevant reference materials.
- Transformer, Attention is All You Need, 2017.06, NIPS 2017. [📄 Paper]
- USE, Universal Sentence Encoder, 2018.03. [📄 Paper]
- GPT-1, Improving Language Understanding by Generative Pre-Training, 2018.06. [📄 Paper]
- BERT, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2018.10, NAACL 2019. [📄 Paper] [💻 Code] [🤗 Model]
- GPT-2, Language Models are Unsupervised Multitask Learners, 2019.02. [📄 Paper] [💻 Code]
- MUSE, Multilingual Universal Sentence Encoder for Semantic Retrieval, 2019.07. [📄 Paper] [💻 Code]
- T5, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019.10, JMLR 2020. [📄 Paper] [💻 Code] [🤗 Model]
- GPT-3, Language Models are Few-Shot Learners, 2020.05, NeurIPS 2020. [📄 Paper]
- InstructGPT, Training Language Models to Follow Instructions with Human Feedback, 2022.03, NeurIPS 2022. [📄 Paper] [🌐 Website]
- Chinchilla, Training Compute-Optimal Large Language Models, 2022.03, NeurIPS 2022. [📄 Paper]
- ChatGPT, 2022.11. [🌐 Website]
- LLaMA, LLaMA: Open and Efficient Foundation Language Models, 2023.02. [📄 Paper] [💻 Code] [🤗 Model]
- GPT-4, 2023.03. [📄 Paper] [🌐 Website]
- Claude, 2023.03. [🌐 Website]
- Llama 2, Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023.07. [📄 Paper] [🤗 Model]
- Claude 2, 2023.07. [🌐 Website]
- Mistral, Mistral 7B, 2023.10. [📄 Paper] [🤗 Model]
- Mamba, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, 2023.12, COLM 2024. [📄 Paper] [💻 Code]
- Mixtral, Mixtral of Experts, 2024.01. [📄 Paper] [🤗 Model]
- Gemma, Gemma: Open Models Based on Gemini Research and Technology, 2024.03. [📄 Paper] [🌐 Website]
- Claude 3, 2024.03. [🌐 Website]
- Llama 3, The Llama 3 Herd of Models, 2024.07. [📄 Paper] [🌐 Website] [🤗 Model]
- Gemma 2, Gemma 2: Improving Open Language Models at a Practical Size, 2024.08. [📄 Paper] [🤗 Model]
- OpenAI o1, 2024.12. [🌐 Website]
- Gemini 2.0 Flash, 2025.01. [🌐 Website]
- DeepSeek-R1, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025.01. [📄 Paper] [🤗 Model]
- Gemini 2.0 Pro, 2025.02. [🌐 Website]
- Gemini 2.5 Pro, 2025.03. [🌐 Website]
- Gemma 3, 2025.03. [📄 Paper] [🌐 Website]
- Gemini 2.5 Flash, 2025.04. [🌐 Website]
- Claude 4, 2025.05. [🌐 Website]
- ViT, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020.10, ICLR 2021. [📄 Paper] [💻 Code]
- CLIP, Learning Transferable Visual Models From Natural Language Supervision, 2021.02, ICML 2021. [📄 Paper] [💻 Code]
- DINO, Emerging Properties in Self-Supervised Vision Transformers, 2021.04, ICCV 2021. [📄 Paper] [💻 Code]
- GLIP, Grounded Language-Image Pre-training, 2021.12, CVPR 2022. [📄 Paper] [💻 Code]
- GLIDE, GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, 2021.12, ICML 2022. [📄 Paper] [💻 Code]
- Stable Diffusion, High-Resolution Image Synthesis with Latent Diffusion Models, 2021.12, CVPR 2022. [📄 Paper] [💻 Code]
- DALL-E 2, Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022.04. [📄 Paper] [🌐 Website]
- Imagen, Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, 2022.05, NeurIPS 2022. [📄 Paper] [🌐 Website]
- Stable Diffusion 2, 2022.11. [🌐 Website]
- ControlNet, Adding Conditional Control to Text-to-Image Diffusion Models, 2023.02, ICCV 2023. [📄 Paper] [💻 Code]
- PVDM, Video Probabilistic Diffusion Models in Projected Latent Space, 2023.02, CVPR 2023. [📄 Paper] [🌐 Website] [💻 Code]
- SigLIP, Sigmoid Loss for Language Image Pre-Training, 2023.03, ICCV 2023. [📄 Paper] [💻 Code]
- Grounding DINO, Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection, 2023.03, ECCV 2024. [📄 Paper] [💻 Code]
- DINOv2, DINOv2: Learning Robust Visual Features without Supervision, 2023.04, TMLR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- SAM, Segment Anything, 2023.04, ICCV 2023. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- CoTracker, CoTracker: It is Better to Track Together, 2023.07, ECCV 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Cutie, Putting the Object Back into Video Object Segmentation, 2023.10, CVPR 2024 Highlight. [📄 Paper] [🌐 Website] [💻 Code]
- VideoCrafter1, VideoCrafter1: Open Diffusion Models for High-Quality Video Generation, 2023.10. [📄 Paper] [🌐 Website] [💻 Code]
- FoundationPose, FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects, 2023.12, CVPR 2024 Highlight. [📄 Paper] [🌐 Website] [💻 Code]
- HaMeR, Reconstructing Hands in 3D with Transformers, 2023.12, CVPR 2024. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- Depth Anything, Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, 2024.01, CVPR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Grounded SAM, Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks, 2024.01. [📄 Paper] [💻 Code]
- VideoCrafter2, VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models, 2024.01, CVPR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Sora, Video generation models as world simulators, 2024.02. [🌐 Website]
- Genie, Genie: Generative Interactive Environments, 2024.02, ICML 2024. [📄 Paper] [🌐 Website]
- Stable Diffusion 3, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, 2024.03, ICML 2024. [📄 Paper] [🤗 Model]
- Grounding DINO 1.5, Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection, 2024.05. [📄 Paper] [💻 Code]
- Depth Anything V2, Depth Anything V2, 2024.06, NeurIPS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- SAM 2, SAM 2: Segment Anything in Images and Videos, 2024.08. [📄 Paper] [🌐 Website] [💻 Code]
- Grounded SAM 2, Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks, 2024.08. [📄 Paper] [💻 Code]
- SAMURAI, SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory, 2024.11. [📄 Paper] [🌐 Website] [💻 Code]
- Genie 2, Genie 2: A large-scale foundation world model, 2024.11. [🌐 Website]
- Veo 3, Veo 3, 2025.05. [🌐 Website]
- BLIP, BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022.01, ICML 2022. [📄 Paper] [💻 Code] [🤗 Model]
- Flamingo, Flamingo: a Visual Language Model for Few-Shot Learning, 2022.04, NeurIPS 2022. [📄 Paper]
- BLIP-2, BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023.01, ICML 2023. [📄 Paper] [🤗 Model]
- LLaVA, Visual Instruction Tuning, 2023.04, NeurIPS 2023 Oral. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- Qwen-VL, Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond, 2023.08. [📄 Paper] [💻 Code] [🤗 Model]
- LLaVA 1.5, Improved Baselines with Visual Instruction Tuning, 2023.10, CVPR 2024 Highlight. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model] [📊 Dataset]
- Prismatic, Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models, 2024.02, ICML 2024. [📄 Paper] [💻 Code (training)] [💻 Code (evaluation)] [🤗 Model]
- GPT-4o, 2024.05. [📄 Paper] [🌐 Website]
- PaliGemma, PaliGemma: A versatile 3B VLM for transfer, 2024.07. [📄 Paper] [🌐 Website]
- Qwen2-VL, Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, 2024.09. [📄 Paper] [🌐 Website] [🤗 Model]
- Qwen2.5-VL, 2025.02. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- Gemini 2.5 Pro, 2025.03. [🌐 Website]
- Gemini 2.5 Flash, 2025.04. [🌐 Website]
- Language Planner, Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents, 2022.01, ICML 2022. [📄 Paper] [🌐 Website] [💻 Code]
- Socratic Models, Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, 2022.04, ICLR 2023. [📄 Paper] [🌐 Website] [💻 Code]
- SayCan, Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, 2022.04. [📄 Paper] [🌐 Website] [💻 Code]
- Inner Monologue, Inner Monologue: Embodied Reasoning through Planning with Language Models, 2022.07, CoRL 2022. [📄 Paper] [🌐 Website]
- PaLM-E, PaLM-E: An Embodied Multimodal Language Model, 2023.03, ICML 2023. [📄 Paper] [🌐 Website]
- EmbodiedGPT, EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought, 2023.05, NeurIPS 2023. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- DoReMi, DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment, 2023.07, IROS 2024. [📄 Paper] [🌐 Website]
- ViLa, Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning, 2023.11, ICRA 2024 Workshop on Vision-Language Models for Navigation and Manipulation. [📄 Paper] [🌐 Website]
- 3D-VLA, 3D-VLA: A 3D Vision-Language-Action Generative World Model, 2024.03, ICML 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Bi-VLA, Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations, 2024.05, SMC 2024. [📄 Paper]
- RoboMamba, RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation, 2024.06, NeurIPS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- ReplanVLM, ReplanVLM: Replanning Robotic Tasks with Visual Language Models, 2024.07. [📄 Paper]
- BUMBLE, BUMBLE: Unifying Reasoning and Acting with Vision-Language Models for Building-wide Mobile Manipulation, 2024.10, ICRA 2025. [📄 Paper] [🌐 Website] [💻 Code]
- ReflectVLM, Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation, 2025.02. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset] [🤗 Model]
- Hi Robot, Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models, 2025.02. [📄 Paper] [🌐 Website]
- RoboBrain, RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete, 2025.02, CVPR 2025. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- $\pi_{0.5}$, $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization, 2025.04. [📄 Paper] [🌐 Website]
- RT-H, RT-H: Action Hierarchies Using Language, 2024.03. [📄 Paper] [🌐 Website]
- NaVILA, NaVILA: Legged Robot Vision-Language-Action Model for Navigation, 2024.12, RSS 2025. [📄 Paper] [🌐 Website] [💻 Code]
- Code as Policies, Code as Policies: Language Model Programs for Embodied Control, 2022.09, ICRA 2023. [📄 Paper] [🌐 Website] [💻 Code]
- ProgPrompt, ProgPrompt: Generating Situated Robot Task Plans using Large Language Models, 2022.09, ICRA 2023. [📄 Paper] [🌐 Website] [💻 Code]
- ChatGPT for Robotics, ChatGPT for Robotics: Design Principles and Model Abilities, 2023.02, IEEE Access 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Text2Motion, Text2Motion: From Natural Language Instructions to Feasible Plans, 2023.03, Autonomous Robots 2023. [📄 Paper] [🌐 Website]
- Instruct2Act, Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model, 2023.05. [📄 Paper] [💻 Code]
- RoboScript, RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation, 2024.02. [📄 Paper]
- RoboCodeX, RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis, 2024.02, ICML 2024. [📄 Paper] [🌐 Website] [💻 Code]
- KITE, KITE: Keypoint-Conditioned Policies for Semantic Manipulation, 2023.06, CoRL 2023. [📄 Paper] [🌐 Website] [💻 Code]
- CoPa, CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models, 2024.03, IROS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- RoboPoint, RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics, 2024.06, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code]
- RAM, RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation, 2024.07, CoRL 2024 Oral. [📄 Paper] [🌐 Website] [💻 Code]
- ReKep, ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation, 2024.09, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code]
- OmniManip, OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints, 2025.01, CVPR 2025 Highlight. [📄 Paper] [🌐 Website]
- Magma, Magma: A Foundation Model for Multimodal AI Agents, 2025.02, CVPR 2025. [📄 Paper] [🌐 Website] [💻 Code]
- KUDA, KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation, 2025.03, ICRA 2025. [📄 Paper] [🌐 Website] [💻 Code]
- GPT-4V, GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration, 2023.11, RA-L 2024. [📄 Paper] [🌐 Website] [💻 Code]
- A3VLM, A3VLM: Actionable Articulation-Aware Vision Language Model, 2024.06, CoRL 2024. [📄 Paper] [💻 Code]
- DexGraspVLA, DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping, 2025.02. [📄 Paper] [🌐 Website] [💻 Code]
- MOO, Open-World Object Manipulation using Pre-Trained Vision-Language Models, 2023.03, CoRL 2023. [📄 Paper] [🌐 Website]
- ROCKET-1, ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting, 2024.11, CVPR 2025. [📄 Paper] [🌐 Website]
- SoFar, SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation, 2025.02. [📄 Paper] [🌐 Website] [💻 Code]
- RoboDexVLM, RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation, 2025.03. [📄 Paper] [🌐 Website]
- CLIPort, CLIPort: What and Where Pathways for Robotic Manipulation, 2021.09, CoRL 2021. [📄 Paper] [🌐 Website] [💻 Code]
- VoxPoser, VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, 2023.07, CoRL 2023 Oral. [📄 Paper] [🌐 Website] [💻 Code]
- ManipLLM, ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation, 2023.12, CVPR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- ManiFoundation, ManiFoundation Model for General-Purpose Robotic Manipulation of Contact Synthesis with Arbitrary Objects and Robots, 2024.05, IROS 2024 Oral. [📄 Paper] [🌐 Website] [💻 Code]
- MOKA, MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting, 2024.05, RSS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- AVDC, Learning to Act from Actionless Videos through Dense Correspondences, 2023.10, ICLR 2024 Spotlight. [📄 Paper]
- RT-Trajectory, RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches, 2023.11, ICLR 2024 Spotlight. [📄 Paper] [🌐 Website]
- ATM, Any-point Trajectory Modeling for Policy Learning, 2023.12, RSS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- LLARVA, LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning, 2024.06, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- Im2Flow2Act, Flow as the Cross-Domain Manipulation Interface, 2024.07, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code]
- FLIP, FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model, 2024.12, ICLR 2025. [📄 Paper] [🌐 Website] [💻 Code]
- HAMSTER, HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation, 2025.02, ICLR 2025. [📄 Paper] [🌐 Website]
- ARM4R, Pre-training Auto-regressive Robotic Models with 4D Representations, 2025.02. [📄 Paper]
- Magma, Magma: A Foundation Model for Multimodal AI Agents, 2025.02, CVPR 2025. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- DriveVLM, DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models, 2024.02, CoRL 2024. [📄 Paper] [🌐 Website]
- CoVLA, CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving, 2024.08, WACV 2025 Oral. [📄 Paper] [🌐 Website] [📊 Dataset]
- EMMA, EMMA: End-to-End Multimodal Model for Autonomous Driving, 2024.10. [📄 Paper]
- VLM-E2E, VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion, 2025.02. [📄 Paper]
- SuSIE, Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models, 2023.10, ICLR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- 3D-VLA, 3D-VLA: A 3D Vision-Language-Action Generative World Model, 2024.03, ICML 2024. [📄 Paper] [🌐 Website]
- CoTDiffusion, Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts, 2024.06, CVPR 2024. [📄 Paper] [🌐 Website]
- CoT-VLA, CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models, 2025.03, CVPR 2025. [📄 Paper] [🌐 Website]
- UniPi, Learning Universal Policies via Text-Guided Video Generation, 2023.02, NeurIPS 2023 Spotlight. [📄 Paper] [🌐 Website]
- AVDC, Learning to Act from Actionless Videos through Dense Correspondences, 2023.10, ICLR 2024 Spotlight. [📄 Paper] [🌐 Website] [💻 Code]
- VLP, Video Language Planning, 2023.10. [📄 Paper] [🌐 Website] [💻 Code]
- Gen2Act, Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation, 2024.09, CoRL-X-Embodiment-WS 2024. [📄 Paper] [🌐 Website]
- Video Prediction Policy, Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations, 2024.12, ICML 2025 Spotlight. [📄 Paper] [🌐 Website] [💻 Code]
- FLIP, FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model, 2024.12, ICLR 2025. [📄 Paper] [🌐 Website] [💻 Code]
- GEVRM, GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation, 2025.02, ICLR 2025. [📄 Paper]
- OmniJARVIS, OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents, 2024.07, NeurIPS 2024. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- QueST, QueST: Self-Supervised Skill Abstractions for Learning Continuous Control, 2024.07, NeurIPS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- LAPA, LAPA: Latent Action Pretraining from Videos, 2024.10, ICLR 2025. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- GROOT-2, GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents, 2024.12, ICLR 2025. [📄 Paper]
- GO-1, AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems, 2025.03. [📄 Paper] [🌐 Website] [💻 Code] [📊 Dataset]
- UniVLA, UniVLA: Learning to Act Anywhere with Task-centric Latent Actions, 2025.05, RSS 2025. [📄 Paper] [💻 Code] [🤗 Model]
- LangLfP, Language-Conditioned Imitation Learning over Unstructured Data, 2020.05, RSS 2021. [📄 Paper] [🌐 Website]
- BC-Z, BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning, 2022.02, CoRL 2021. [📄 Paper] [🌐 Website] [💻 Code]
- Gato, A Generalist Agent, 2022.05, TMLR 2022. [📄 Paper] [🌐 Website]
- VIMA, VIMA: General Robot Manipulation with Multimodal Prompts, 2022.10, ICML 2023. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- RT-1, RT-1: Robotics Transformer for Real-World Control at Scale, 2022.12, RSS 2023. [📄 Paper] [🌐 Website] [💻 Code]
- RT-2, RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, 2023.07, CoRL 2023. [📄 Paper] [🌐 Website]
- RT-X, Open X-Embodiment: Robotic Learning Datasets and RT-X Models, 2023.10, ICRA 2024. [📄 Paper] [🌐 Website] [💻 Code]
- RoboFlamingo, Vision-Language Foundation Models as Effective Robot Imitators, 2023.11, ICLR 2024. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- LEO, An Embodied Generalist Agent in 3D World, 2023.11, ICML 2024. [📄 Paper] [🌐 Website] [💻 Code]
- GR-1, Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation, 2023.12, ICLR 2024. [📄 Paper] [🌐 Website] [💻 Code]
- Octo, Octo: An Open-Source Generalist Robot Policy, 2024.05, RSS 2024. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- OpenVLA, OpenVLA: An Open-Source Vision-Language-Action Model, 2024.06, CoRL 2024. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- TinyVLA, TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation, 2024.09, RA-L 2025. [📄 Paper] [🌐 Website] [💻 Code]
- HiRT, HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers, 2024.09, CoRL 2024. [📄 Paper]
- GR-2, GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation, 2024.10. [📄 Paper] [🌐 Website]
- RDT-1B, RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation, 2024.10, ICLR 2025. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- $\pi_0$, $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control, 2024.10. [📄 Paper] [🌐 Website] [🤗 Model]
- CogACT, CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation, 2024.11. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- $\pi_0$-FAST, FAST: Efficient Action Tokenization for Vision-Language-Action Models, 2025.01. [📄 Paper] [🌐 Website] [🤗 Model]
- UniAct, Universal Actions for Enhanced Embodied Foundation Models, 2025.01, CVPR 2025. [📄 Paper] [🌐 Website] [💻 Code]
- OpenVLA-OFT, Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success, 2025.02. [📄 Paper] [🌐 Website] [💻 Code]
- JARVIS-VLA, Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse, 2025.03. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- HybridVLA, HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model, 2025.03. [📄 Paper] [🌐 Website] [💻 Code]
- MoManipVLA, MoManipVLA: Transferring Vision-Language-Action Models for General Mobile Manipulation, 2025.03, CVPR 2025. [📄 Paper]
- GR00T N1, GR00T N1: An Open Foundation Model for Generalist Humanoid Robots, 2025.03. [📄 Paper] [🌐 Website] [💻 Code] [🤗 Model]
- $\pi_0$+KI, Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better, 2025.05. [📄 Paper] [🌐 Website]
- RTC, Real-Time Execution of Action Chunking Flow Policies, 2025.06. [📄 Paper] [🌐 Website]
- Inner Monologue, Inner Monologue: Embodied Reasoning through Planning with Language Models, 2022.07, CoRL 2022. [📄 Paper] [🌐 Website]
- DriveVLM, DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models, 2024.02, CoRL 2024. [📄 Paper] [🌐 Website]
- ECoT, Robotic Control via Embodied Chain-of-Thought Reasoning, 2024.07, CoRL 2024. [📄 Paper] [🌐 Website]
- RAD, Action-Free Reasoning for Policy Generalization, 2025.02. [📄 Paper] [🌐 Website]
- AlphaDrive, AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning, 2025.03. [📄 Paper] [🌐 Website]
- Cosmos-Reason1, Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning, 2025.03. [📄 Paper] [🌐 Website] [💻 Code]
- Something-Something V2, The "Something Something" Video Database for Learning and Evaluating Visual Common Sense, 2017.06. [📄 Paper] [🌐 Website]
- EPIC-KITCHENS-100, Scaling Egocentric Vision: The EPIC-KITCHENS Dataset, 2018.04. [📄 Paper] [🌐 Website]
- Ego4D, Ego4D: Around the World in 3,000 Hours of Egocentric Video, 2021.10. [📄 Paper] [🌐 Website]
- Ego-Exo4D, Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives, 2023.11. [📄 Paper] [🌐 Website]
- MimicGen, MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations, 2023.10, CoRL 2023. [📄 Paper]
- RoboCasa, RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots, 2024.06, RSS 2024. [📄 Paper] [🌐 Website] [💻 Code]
- DexMimicGen, DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning, 2024.10, ICRA 2025. [📄 Paper] [🌐 Website]
- AgiBot DigitalWorld, AgiBot DigitalWorld, 2025.02. [🌐 Website]
- nuScenes, nuScenes: A multimodal dataset for autonomous driving, 2019.03, CVPR 2020. [📄 Paper] [🌐 Website] [💻 Code]
- WOMD, Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset, 2021.04, ICCV 2021. [📄 Paper] [🌐 Website] [💻 Code]
- RT-1, RT-1: Robotics Transformer for Real-World Control at Scale, 2022.12. [📄 Paper] [🌐 Website]
- RH20T, RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot, 2023.06, ICRA 2024. [📄 Paper] [🌐 Website]
- BridgeData V2, BridgeData V2: A Dataset for Robot Learning at Scale, 2023.08, CoRL 2023. [📄 Paper] [🌐 Website]
- OXE, Open X-Embodiment: Robotic Learning Datasets and RT-X Models, 2023.10, ICRA 2024. [📄 Paper] [🌐 Website]
- HoNY, On Bringing Robots Home, 2023.11. [📄 Paper] [🌐 Website]
- DROID, DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset, 2024.03, RSS 2024. [📄 Paper] [🌐 Website]
- CoVLA, CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving, 2024.08, WACV 2025. [📄 Paper] [🌐 Website]
- RoboMIND, RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation, 2024.12, RSS 2025. [📄 Paper] [🌐 Website]
- AgiBot World, AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems, 2025.03. [📄 Paper] [🌐 Website] [💻 Code]
- Robot Learning in the Era of Foundation Models: A Survey, 2023.11, Neurocomputing, Volume 638. [📄 Paper]
- A Survey on Robotics with Foundation Models: toward Embodied AI, 2024.02. [📄 Paper]
- A Survey on Integration of Large Language Models with Intelligent Robots, 2024.04, Intelligent Service Robotics 2024. [📄 Paper]
- What Foundation Models can Bring for Robot Learning in Manipulation: A Survey, 2024.04. [📄 Paper]
- A Survey on Vision-Language-Action Models for Embodied AI, 2024.05. [📄 Paper]
- Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI, 2024.07. [📄 Paper]
- Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions, 2025.02, Information Fusion, Volume 122. [📄 Paper]
- Generative Artificial Intelligence in Robotic Manipulation: A Survey, 2025.03. [📄 Paper]
- OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation, 2025.05. [📄 Paper]
- Vision-Language-Action Models: Concepts, Progress, Applications and Challenges, 2025.05. [📄 Paper]