- GRUtopia: Dream General Robots in a City at Scale [paper] [code]
- Diffusion for Multi-Embodiment Grasping [paper]
- Gen2Sim: Scaling Up Robot Learning in Simulation with Generative Models (ICRA 2024) [paper] [code] [webpage]
- RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation (ICML 2024) [paper] [code] [webpage]
- Holodeck: Language Guided Generation of 3D Embodied AI Environments (CVPR 2024) [paper] [code] [webpage]
- Video Generation Models as World Simulators [paper] [webpage]
- Learning Interactive Real-World Simulators [paper]
- MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations (CoRL 2023) [paper] [code] [webpage]
- CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation (CVPR 2024) [paper] [code] [webpage]
- Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning [paper] [code] [webpage]
- DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning (ICRA 2025) [paper] [code] [webpage]
- IntervenGen: Interventional Data Generation for Robust and Data-Efficient Robot Imitation Learning [paper] [webpage]
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition (CoRL 2023) [paper] [code] [webpage]
- GenAug: Retargeting behaviors to unseen situations via Generative Augmentation (RSS 2023) [paper] [code] [webpage]
- Scaling Robot Learning with Semantically Imagined Experience (RSS 2023) [paper] [webpage]
- RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning (CoRL 2024) [paper] [code] [webpage]
- Learning Robust Real-World Dexterous Grasping Policies via Implicit Shape Augmentation (CoRL 2022) [paper] [webpage]
- DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics [paper]
- Shadow: Leveraging Segmentation Masks for Cross-Embodiment Policy Transfer (CoRL 2024) [paper] [code]
- Human-to-Robot Imitation in the Wild (RSS 2022) [paper] [webpage]
- Mirage: Cross-Embodiment Zero-Shot Policy Transfer with Cross-Painting (RSS 2024) [paper] [code] [webpage]
- CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning [paper] [webpage]
- RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking (ICRA 2024) [paper] [code] [webpage]
- ExAug: Robot-Conditioned Navigation Policies via Geometric Experience Augmentation (ICRA 2023) [paper] [code] [webpage]
- RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning (ICRA 2025) [paper] [code] [webpage]
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models (RSS 2023) [paper] [webpage]
- Language to Rewards for Robotic Skill Synthesis (CoRL 2023) [paper] [code] [webpage]
- Vision-Language Models as Success Detectors (CoLLAs 2023) [paper]
- Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models (CoRL 2024) [paper] [code] [webpage]
- FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning (ICML 2024) [paper] [code]
- Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning (ICLR 2024) [paper]
- Eureka: Human-Level Reward Design via Coding Large Language Models (NeurIPS 2023) [paper]
- Agentic Skill Discovery (CoRL 2024 workshop & ICRA@40) [paper] [code]
- CLIPort: What and Where Pathways for Robotic Manipulation [paper]
- R3M: A Universal Visual Representation for Robot Manipulation [paper] [code] [webpage]
- LIV: Language-Image Representations and Rewards for Robotic Control (ICML 2023) [paper] [code] [webpage]
- Learning Reward Functions for Robotic Manipulation by Observing Humans [paper]
- Deep Visual Foresight for Planning Robot Motion (ICRA 2017) [paper]
- VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation (RSS 2024) [paper] [code]
- Learning Reward for Robot Skills Using Large Language Models via Self-Alignment (ICML 2024) [paper]
- Video Prediction Models as Rewards for Reinforcement Learning [paper] [code]
- VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training (ICLR 2023) [paper] [code]
- Learning to Understand Goal Specifications by Modelling Reward [paper]
- Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks [paper]
- Policy Improvement Using Language Feedback Models (NeurIPS 2024) [paper]
- Reinforcement Learning with Action-Free Pre-training from Videos (ICML 2022) [paper] [code]
- Mastering Diverse Domains through World Models [paper] [code] [webpage]
- Dream to Control: Learning Behaviors by Latent Imagination [paper]
- Robot Shape and Location Retention in Video Generation Using Diffusion Models [paper] [code]
- Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation Actions [paper] [code] [webpage]
- Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors [paper] [code] [webpage]
- Empowering Embodied Visual Tracking with Visual Foundation Models and Offline RL (ECCV 2024) [paper]
- DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects [paper]
- KISA: A Unified Keyframe Identifier and Skill Annotator for Long-Horizon Robotics Demonstrations (ICML 2024) [paper]
- DynSyn: Dynamical Synergistic Representation for Efficient Learning and Control in Overactuated Embodied Systems (ICML 2024) [paper]
- Symmetry-Aware Robot Design with Structured Subgroups (ICML 2023) [paper]
- Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis (ICCV 2023) [paper] [code & data] [webpage]
- Explore and Tell: Embodied Visual Captioning in 3D Environments (ICCV 2023) [paper] [code & data]
- Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation (ECCV 2024) [paper]
- ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation (ECCV 2024) [paper] [code] [webpage]
- Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics [paper]
- Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training (NeurIPS 2024) [paper] [code] [webpage]
- PreLAR: World Model Pre-training with Learnable Action Representation (ECCV 2024) [paper] [code]
- Octopus: Embodied Vision-Language Programmer from Environmental Feedback [paper] [code] [webpage]
- EC2: Emergent Communication for Embodied Control (CVPR 2023) [paper]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents (ICML 2022) [paper] [code]
- Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition (CoRL 2023) [paper] [code]
- Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks (ICLR 2024) [paper] [code]
- Large Language Models as Commonsense Knowledge for Large-Scale Task Planning (NeurIPS 2023) [paper] [code]
- REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction (CoRL 2023) [paper] [code]
- Gesture-Informed Robot Assistance via Foundation Models (CoRL 2023) [paper]
- Large Language Models for Robotics: Opportunities, Challenges, and Perspectives [paper]
- Embodied Agent Interface (EAI): Benchmarking LLMs for Embodied Decision Making (NeurIPS 2024 Track Datasets and Benchmarks) [paper] [code]
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought (NeurIPS 2023) [paper] [code]
- Chat with the Environment: Interactive Multimodal Perception using Large Language Models (IROS 2023) [paper] [code]
- Embodied CoT Distillation From LLM To Off-the-shelf Agents (ICML 2024) [paper]
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [paper] [code]
- Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents (NeurIPS 2023) [paper]
- Inner Monologue: Embodied Reasoning through Planning with Language Models (CoRL 2022) [paper]
- PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models [paper] [code]
- SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning (CoRL 2023) [paper]
- RoboMP2: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models (ICML 2024) [paper] [code]
- Text2Motion: From Natural Language Instructions to Feasible Plans (Autonomous Robots 2023) [paper]
- STAP: Sequencing Task-Agnostic Policies (ICRA 2023) [paper] [code]
- Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V (arXiv 2024) [paper]
- ProgPrompt: Program Generation for Situated Robot Task Planning Using Large Language Models (Autonomous Robots 2023) [paper]
- See and Think: Embodied Agent in Virtual Environment (arXiv 2023) [paper]
- Octopus: Embodied Vision-Language Programmer from Environmental Feedback (ECCV 2024) [paper] [webpage] [code]
- Demo2Code: From Summarizing Demonstrations to Synthesizing Code via Extended Chain-of-Thought (NeurIPS 2023) [paper] [webpage] [code]
- EC2: Emergent Communication for Embodied Control (CVPR 2023) [paper]
- When Prolog Meets Generative Models: A New Approach for Managing Knowledge and Planning in Robotic Applications (ICRA 2024) [paper]
- Code as Policies: Language Model Programs for Embodied Control (ICRA 2023) [paper] [webpage] [code]
- GenCHiP: Generating Robot Policy Code for High-Precision and Contact-Rich Manipulation Tasks (arXiv 2024) [paper]
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models (CoRL 2023) [paper] [webpage] [code]
- ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation (arXiv 2024) [paper] [webpage] [code]
- RoboScript: Code Generation for Free-Form Manipulation Tasks Across Real and Simulation (arXiv 2024) [paper]
- RobotGPT: Robot Manipulation Learning From ChatGPT (RA-L 2024) [paper]
- RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis (ICML 2024) [paper] [webpage] [code]
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model (arXiv 2023) [paper] [code]
- GenSim: Generating Robotic Simulation Tasks via Large Language Models (ICLR 2024) [paper] [code]
- Learning Universal Policies via Text-Guided Video Generation (NeurIPS 2023) [paper] [webpage]
- SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation (ICLR 2025) [paper] [webpage]
- Using Left and Right Brains Together: Towards Vision and Language Planning (ICML 2024) [paper]
- Compositional Foundation Models for Hierarchical Planning (NeurIPS 2023) [paper] [webpage]
- Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation (NeurIPS 2024) [paper] [code]
- GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation [webpage] [code]
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation [webpage]
- Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models (ICLR 2024) [paper] [webpage] [code]
- Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts (CVPR 2024) [paper] [webpage]
- Surfer: Progressive Reasoning with World Models for Robotic Manipulation [paper]
- TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation (CoRL 2022) [paper] [webpage] [code]
- Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies (CoRL 2024) [paper] [webpage] [code]
- Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation Actions [paper] [webpage] [code]
- Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation (CoRL 2022) [paper] [webpage] [code]
- ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation (ECCV 2024) [paper] [webpage] [code]
- GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields (CoRL 2023) [paper] [webpage] [code]
- WorldVLA: Towards Autoregressive Action World Model [paper] [webpage] [code]
- Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation [webpage]
- Diffusion Policy: Visuomotor Policy Learning via Action Diffusion [webpage]
- 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations [webpage]
- RT-1: Robotics Transformer for Real-World Control at Scale [webpage]
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [webpage]
- RVT: Robotic View Transformer for 3D Object Manipulation [webpage]
- RVT-2: Learning Precise Manipulation from Few Examples [webpage]
- GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation [webpage]
- GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation [webpage]
- ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation [webpage]
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation [webpage]
- OpenVLA: An Open-Source Vision-Language-Action Model [webpage]
- RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [webpage]
- π0: Our First Generalist Policy [webpage]