Towards Generalizable Robotic Manipulation in Dynamic Environments

Heng Fang¹, Shangru Li¹, Shuhan Wang¹, Xuanyang Xi², Dingkang Liang¹, Xiang Bai¹
¹ Huazhong University of Science and Technology, ² Huawei Technologies Co., Ltd.

πŸ” Overview

Dynamic manipulation requires robots to continuously adapt to moving objects and unpredictable environmental changes. Existing Vision-Language-Action (VLA) models rely on static single-frame observations, failing to capture essential spatiotemporal dynamics. We introduce DOMINO, a comprehensive benchmark for this underexplored frontier, and PUMA, a predictive architecture that couples historical motion cues with future state anticipation to achieve highly reactive embodied intelligence.

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks.

πŸŽ₯ Visual Demos

More visual demos can be found on our project homepage.

✨ Key Idea

  • Current VLA models struggle with dynamic manipulation tasks due to a scarcity of dynamic datasets and a reliance on single-frame observations.
  • We introduce DOMINO, a large-scale benchmark for dynamic manipulation comprising 35 tasks and over 110K expert trajectories.
  • We propose PUMA, a dynamics-aware VLA architecture that integrates historical optical flow and world queries to forecast future object states.
  • Training on dynamic data fosters robust spatiotemporal representations, demonstrating enhanced generalization capabilities.

πŸ“… TODO

  • Release the paper
  • Release DOMINO benchmark code
  • Release DOMINO dataset on HuggingFace and ModelScope
  • Release PUMA training code and evaluation code
  • Release PUMA checkpoint
  • Support Huawei Ascend NPUs

πŸ› οΈ Getting Started

This project is divided into two main components that operate in separate environments and communicate via WebSockets:

  • DOMINO: The simulation environment and data generation pipeline.
  • PUMA: The Vision-Language-Action policy framework.

You will need to set up both environments to run the full pipeline.
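The two processes exchange observations and actions over the WebSocket link. As a rough illustration of what such an exchange can look like, here is a minimal sketch of a JSON request/reply cycle; the message field names (`type`, `joint_positions`, `joint_velocities`, etc.) are illustrative assumptions, not DOMINO's or PUMA's actual wire format.

```python
import json

def encode_observation(image_shape, joint_positions, instruction):
    """Pack one simulator step into a JSON message for the policy server.

    The field names here are illustrative; they are not the actual
    DOMINO/PUMA wire format.
    """
    return json.dumps({
        "type": "observation",
        "image_shape": list(image_shape),
        "joint_positions": list(joint_positions),
        "instruction": instruction,
    })

def decode_action(message):
    """Unpack the policy server's reply into a joint command."""
    payload = json.loads(message)
    if payload.get("type") != "action":
        raise ValueError(f"unexpected message type: {payload.get('type')}")
    return payload["joint_velocities"]

# One round trip: simulator -> policy server -> simulator
msg = encode_observation((480, 640, 3), [0.0] * 7, "adjust the bottle")
reply = json.dumps({"type": "action", "joint_velocities": [0.1] * 7})
action = decode_action(reply)
```

In a real deployment, the simulator would send `msg` over the WebSocket connection and block until the server replies; the JSON framing shown here is only one plausible choice.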

1. DOMINO (Simulation & Data Pipeline)

1.0. System Requirements

  • OS: Linux (Windows/macOS have limited or no support)
  • Hardware: NVIDIA GPU (RTX recommended for ray tracing)
  • Software: Python 3.10, CUDA 12.1 (Recommended), NVIDIA Driver >= 520

Note: If running inside a Docker container, you must include the graphics capability to avoid Vulkan-related segmentation faults:

docker run ... -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics

1.1. Installation Steps

Step 1: Install System Dependencies

Ensure Vulkan and FFmpeg are installed on your system:

sudo apt update
sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools ffmpeg

(Verify installations by running vulkaninfo and ffmpeg -version)

Step 2: Create Conda Environment

conda create -n domino python=3.10 -y
conda activate domino

Step 3: Clone and Install

git clone https://github.com/h-embodvis/DOMINO.git
cd DOMINO

# Install basic environments and CuRobo
bash script/_install.sh

Troubleshooting: If you encounter a CuRobo config path issue, run python script/update_embodiment_config_path.py. A failed PyTorch3D installation won't affect core functionality unless you are using 3D data.

Step 4: Download Assets

Download the required assets (RoboTwin-OD, Texture Library, and Embodiments). If you hit rate limits, log in to Hugging Face first (huggingface-cli login).

bash script/_download_assets.sh

1.2. Data Collection

We provide an automated pipeline for data collection. You can collect data by running:

bash collect_data.sh ${task_name} ${task_config} ${gpu_id}
# Example: bash collect_data.sh adjust_bottle demo_clean_dynamic 0

After collection, the data will be stored under data/${task_name}/${task_config} in HDF5 format. For the full data collection process and common issues, please refer to the RoboTwin Data Collection Tutorial.
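To sanity-check a collection run, you can count the stored HDF5 files under that directory. The `data/<task>/<config>` layout follows the text above, but the per-episode file naming is an assumption; adjust the glob pattern to match your actual output.

```python
from pathlib import Path

def count_episodes(task_name, task_config, root="data"):
    """Count collected HDF5 episode files for one task/config pair.

    The directory layout matches data/${task_name}/${task_config};
    the *.hdf5 naming pattern is an assumption for illustration.
    """
    episode_dir = Path(root) / task_name / task_config
    return len(list(episode_dir.glob("**/*.hdf5")))

print(count_episodes("adjust_bottle", "demo_clean_dynamic"))
```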

Dynamic Task Configurations

To enable dynamic environments, we introduce four specific configurations in the task config files (e.g., task_config/demo_clean_dynamic.yml and task_config/demo_random_dynamic.yml):

  • use_dynamic (bool): Whether to enable dynamic motion in the environment (e.g., moving objects).
  • dynamic_level (int): The complexity level of the dynamic motion (1, 2, or 3). Higher levels introduce more challenging dynamic behaviors.
  • dynamic_coefficient (float): A scaling factor (default: 0.1) that controls the speed of the dynamic movements.
  • check_render_success (bool): Whether to verify rendering success during data collection, ensuring that dynamic interactions do not cause visual or physical glitches.

For all other detailed configurations (like domain randomization, cameras, and data types), we maintain the original RoboTwin 2.0 settings. You can find more information in the RoboTwin Configurations Tutorial.
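To make the interplay of these fields concrete, here is a toy sketch of how they might be consumed. The scaling logic below (coefficient times level as a speed multiplier) is an illustrative assumption, not DOMINO's actual implementation.

```python
# Hypothetical consumption of the dynamic-task fields described above.
# The speed-scaling rule is an assumption for illustration only.
config = {
    "use_dynamic": True,         # enable moving objects
    "dynamic_level": 2,          # motion complexity: 1, 2, or 3
    "dynamic_coefficient": 0.1,  # speed scale (default 0.1)
    "check_render_success": True,
}

def object_speed(base_speed, cfg):
    """Scale a nominal object speed by the configured coefficient and level."""
    if not cfg["use_dynamic"]:
        return 0.0
    return base_speed * cfg["dynamic_coefficient"] * cfg["dynamic_level"]

print(object_speed(1.0, config))
```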

1.3. Policy Evaluation

To evaluate a trained policy, use the following command. The task_config field refers to the evaluation environment configuration, while the ckpt_setting field refers to the training data configuration used during policy learning.

bash eval.sh ${task_name} ${task_config} ${ckpt_setting} ${expert_data_num} ${seed} ${gpu_id}

# Example: Evaluate a policy trained on `demo_clean_dynamic` and tested on `demo_clean_dynamic`
# bash eval.sh adjust_bottle demo_clean_dynamic demo_clean_dynamic 50 0 0

To better evaluate dynamic manipulation, we have introduced several modifications in script/eval_policy.py and script/eval_metrics.py:

  • Enhanced Evaluation Metrics: Alongside the standard Success Rate (SR), we introduce the Manipulation Score (MS), a comprehensive metric that evaluates route completion while applying penalties for undesirable behaviors (e.g., collisions or out-of-bounds).
  • Strict Success Conditions: We added rigorous success criteria for dynamic objects, including out-of-bounds detection (failing if the object leaves the workspace before grasping) and lifting verification (ensuring the object is lifted beyond a specific height threshold to prevent false positives from accidental touches).
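The paper defines the Manipulation Score precisely; the toy sketch below only illustrates the general shape of a completion-minus-penalties metric. The penalty weights and clamping are made-up assumptions, not DOMINO's coefficients.

```python
def manipulation_score(route_completion, collisions, out_of_bounds,
                       collision_penalty=0.1, oob_penalty=0.5):
    """Toy completion-minus-penalties score in [0, 1].

    route_completion is in [0, 1]; the penalty weights are illustrative
    assumptions, not the values used by DOMINO's eval_metrics.
    """
    score = route_completion
    score -= collision_penalty * collisions
    if out_of_bounds:
        score -= oob_penalty
    return max(0.0, min(1.0, score))

print(manipulation_score(0.9, collisions=2, out_of_bounds=False))
```

The key property such a metric adds over a binary success rate is partial credit: a policy that tracks a moving object most of the way but collides once is distinguishable from one that never engages the object at all.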

Note: The policy evaluation framework is fully compatible with RoboTwin 2.0. You can seamlessly migrate and evaluate any policies between the two repositories by simply loading a new task configuration within our codebase.

2. PUMA (VLA Policy)

More details about the PUMA architecture can be found in the PUMA README.

PUMA is a predictive VLA architecture that couples historical motion cues with future state anticipation to achieve highly reactive embodied intelligence.

2.1 Installation Steps

The codebase is provided in policy/PUMA. Please set up the environment from this directory.

Step 1: Create Conda Environment

conda create -n puma python=3.10 -y
conda activate puma

Step 2: Install Dependencies and PUMA

Make sure to install a PyTorch version that matches your CUDA toolkit. We recommend CUDA 12.4.

# 1. Install PUMA Core Dependencies
cd policy/PUMA
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation

# 2. Install GroundingDINO for Grounded-SAM-2
cd PUMA/model/modules/grounding_sam/grounding_dino
pip install -r requirements.txt
pip install --no-build-isolation -e .
python setup.py build_ext --inplace
cd ..

# 3. Install SAM2
pip install --no-build-isolation -e .
cd ../../../..

# 4. Install PUMA Package
pip install -e .
Common Issues (Flash-Attn)

flash-attn can be tricky to install because it must match your system’s CUDA toolkit (nvcc) and PyTorch versions. The --no-build-isolation flag resolves most issues, but on newer systems you may need to manually choose a compatible flash-attn version. Ensure your CUDA driver/toolkit and torch versions are aligned. Check your environment:

nvcc -V
pip list | grep -E 'torch|transformers|flash-attn'

If issues persist, pick a flash-attn release that matches your CUDA and torch versions, using the outputs of the commands above to identify them. We have verified that flash-attn==2.7.4.post1 works well with nvcc versions 12.0 and 12.4.
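As a quick pre-flight check before attempting a build, comparing CUDA major versions catches the most common mismatch. This helper is a sketch; it assumes simple "major.minor" version strings like those reported by `torch.version.cuda` and `nvcc -V`.

```python
def cuda_versions_compatible(torch_cuda: str, nvcc_cuda: str) -> bool:
    """Heuristic: flash-attn builds generally require matching CUDA majors.

    Assumes simple "major.minor" strings (e.g. "12.4"); adapt the parsing
    if your nvcc output differs.
    """
    return torch_cuda.split(".")[0] == nvcc_cuda.split(".")[0]

print(cuda_versions_compatible("12.4", "12.0"))  # same major: likely fine
print(cuda_versions_compatible("12.4", "11.8"))  # major mismatch: rebuild risk
```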

2.2 Download Pre-trained Weights

PUMA requires both a Vision-Language-Action base model and grounding models (SAM2 + GroundingDINO). Please download the following weights and place them under policy/PUMA/playground/Pretrained_models.

  1. Base VLM Model

    • Download the Qwen3-VL-4B-Instruct-Action base model from Hugging Face: StarVLA/Qwen3-VL-4B-Instruct-Action
    • Place it at: policy/PUMA/playground/Pretrained_models/Qwen3-VL-4B-Instruct-Action
  2. Grounded-SAM-2 Models

The resulting directory structure should look like this:
policy/PUMA/playground/Pretrained_models/
β”œβ”€β”€ Qwen3-VL-4B-Instruct-Action/
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ model.safetensors.index.json
β”‚   └── ...
└── grounded_sam2/
    β”œβ”€β”€ groundingdino_swint_ogc.pth
    └── sam2.1_hiera_large.pt
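A small check that the expected weight files are in place can save a failed training launch. This sketch mirrors the tree above; only those three files are checked, and the root path is the one named in the text.

```python
from pathlib import Path

# Expected files, mirroring the directory tree shown above.
EXPECTED = [
    "Qwen3-VL-4B-Instruct-Action/config.json",
    "grounded_sam2/groundingdino_swint_ogc.pth",
    "grounded_sam2/sam2.1_hiera_large.pt",
]

def missing_weights(root):
    """Return the expected weight files that are absent under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

missing = missing_weights("policy/PUMA/playground/Pretrained_models")
if missing:
    print("Missing pretrained files:", missing)
```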

2.3 Training PUMA

We provide the main training launch script inside policy/PUMA/scripts/run_scripts/run_lerobot_robotwin_puma.sh.

  1. Review and modify the environment variables in scripts/run_scripts/run_lerobot_robotwin_puma.sh (e.g., DATA_ROOT_DIR, RUN_ROOT_DIR) to match your system settings.
  2. Launch the training:
cd policy/PUMA
bash scripts/run_scripts/run_lerobot_robotwin_puma.sh

2.4 Evaluation

The evaluation involves communication between the PUMA policy server and the DOMINO simulation environment via WebSockets.

Step 1: Start the PUMA Policy Server

Open a new terminal, activate the puma environment, and launch the server:

conda activate puma
cd policy/PUMA
# Make sure to edit your checkpoint path in `examples/Robotwin/eval_files/deploy_policy.yml` and `run_policy_server.sh` first!
bash examples/Robotwin/eval_files/run_policy_server.sh

Step 2: Start the DOMINO Simulation

In another terminal, activate your simulation environment (domino) and launch the evaluation loop:

conda activate domino
cd policy/PUMA/examples/Robotwin/eval_files
# Example: Evaluate on adjust_bottle
bash eval.sh adjust_bottle demo_clean_dynamic puma_demo 0 0

πŸ‘ Acknowledgement

We build upon the following great works and open-source repositories.

πŸ“– Citation

@article{fang2026towards,
      title={Towards Generalizable Robotic Manipulation in Dynamic Environments},
      author={Fang, Heng and Li, Shangru and Wang, Shuhan and Xi, Xuanyang and Liang, Dingkang and Bai, Xiang},
      journal={arXiv preprint arXiv:2603.15620},
      year={2026}
}
