BiManiBench

A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

¹Tsinghua University    ²The University of Hong Kong    ³HKUST    ⁴Beijing Innovation Center of Humanoid Robotics
*Equal Contribution    Corresponding Authors

📢 News

  • [2026-02] BiManiBench project page and preprint are released!
  • [2026-02] Evaluation code and benchmark assets are now available!

💡 About BiManiBench

BiManiBench is the first hierarchical benchmark specifically designed to systematically evaluate the bimanual coordination capabilities of Multimodal Large Language Models (MLLMs).

While current research in embodied AI has made significant strides in single-arm manipulation, bimanual coordination remains a formidable challenge. It requires more than just parallel execution; it demands rigorous spatiotemporal synchronization and dynamic role assignment to navigate complex kinematic constraints and prevent self-collisions.

🌟 Key Features

  • Hierarchical Evaluation Framework: Deconstructs bimanual tasks into three levels:
    • Tier 1 (Dual-Arm Spatial Reasoning): Fundamental workspace awareness and arm allocation.
    • Tier 2 (High-Level Action Planning): Long-horizon reasoning under diverse coordination modes (parallel & sequential).
    • Tier 3 (Low-Level End-Effector Control): Direct generation of fine-grained, 16-DoF continuous poses.
  • Vision-Driven Agent Pipeline: A structured closed-loop reasoning framework where the MLLM functions as a central "brain" for iterative perception, reasoning, and action.
  • Extensive Empirical Study: Analysis of over 30 state-of-the-art models, revealing a significant "reasoning-actuation gap" in current foundation models.
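The "iterative perception, reasoning, and action" loop described above can be sketched as follows. This is a minimal illustration of the closed-loop idea, not BiManiBench's actual API; the class and method names are hypothetical.

```python
# Hypothetical sketch of a closed-loop perceive-reason-act cycle.
# `env` and `mllm` are placeholder objects, not BiManiBench's real interfaces.

def run_episode(env, mllm, max_steps=5):
    """Iterate perception -> reasoning -> action until the task succeeds."""
    observation = env.reset()
    for _step in range(max_steps):
        # The MLLM acts as the central "brain": it receives the current
        # visual observation and proposes the next bimanual action.
        action = mllm.decide(observation)
        observation, done = env.step(action)
        if done:
            return True
    return False
```

The key point is that the model re-perceives the scene after every action, so later decisions can correct earlier mistakes.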


Figure 1: The hierarchical evaluation framework of BiManiBench.


Figure 2: The vision-driven agent pipeline for multimodal perception and reasoning.


🛠️ Installation and Deployment

The full installation tutorial can be found in the RoboTwin installation guide. Since this project is built upon the RoboTwin framework, the installation process is largely identical.

1. Install Vulkan (if not installed)

This project requires Vulkan for rendering. Please ensure the drivers and tools are installed:

sudo apt update
sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools

Verification: Check the installation by running:

vulkaninfo

2. Basic Environment Setup

First, prepare a Conda environment and clone the repository:

# Create and activate environment
conda create -n bimanibench python=3.10 -y
conda activate bimanibench

# Clone the repository
git clone https://github.com/bimanibench/BiManiBench.git
cd BiManiBench

Next, run the installation script to install basic dependencies and CuRobo:

bash script/_install.sh

Troubleshooting & Notes:

  • CuRobo Config Path: If you encounter a curobo configuration path issue, try running:
    python script/update_embodiment_config_path.py
  • Manual Installation: If you encounter any problems during the automated script, please refer to the manual installation section in the original RoboTwin docs.
  • Pytorch3D: If the installation of pytorch3d fails, this does not affect the core functionality of the project unless you need to work with 3D data.
  • FFmpeg: This project requires FFmpeg. Please ensure it is installed by checking:
    ffmpeg -version
    If it is not installed, please visit https://ffmpeg.org/ for instructions.

3. Download Assets

Download the required assets (RoboTwin-OD, Texture Library, and Embodiments) by running the following command.

Note: If you encounter any rate-limit issues with Hugging Face, please log in to your account first:

# Optional: Login to Hugging Face
huggingface-cli login

# Download assets
bash script/_download_assets.sh

Assets Folder Structure

After downloading, your assets folder should follow this structure:

assets
├── background_texture
├── embodiments
│   ├── embodiment_1
│   │   ├── config.yml
│   │   └── ...
│   └── ...
├── objects
└── ...
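After the download completes, it can be worth verifying that the expected top-level directories are in place before launching an evaluation. The snippet below checks for the directory names shown in the tree above; adjust the list if your layout differs.

```python
# Minimal sanity check for the downloaded assets layout. The directory names
# mirror the tree shown above (an assumption about the full layout).
from pathlib import Path

EXPECTED_DIRS = ["background_texture", "embodiments", "objects"]

def missing_assets(assets_root="assets"):
    """Return the expected top-level asset directories that are absent."""
    root = Path(assets_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

An empty return value means all expected directories are present.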

🚀 Quick Start

This section provides a quick start guide for configuring your models and running the evaluation benchmarks.

1. Model Configuration

Before running the evaluation scripts, you must configure your model settings in script/remote_model.py. This script handles both locally deployed open-source models and remote API-based models.

Local Deployment

By default, the project assumes local models are served via an OpenAI-compatible server (e.g., vLLM or LMDeploy) at: http://localhost:8000/v1
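A quick way to check that a local server is reachable is to send it a standard chat-completions request. The sketch below builds such a request using only the standard library; the model name is a placeholder, and the endpoint path follows the OpenAI-compatible convention.

```python
# Build a minimal OpenAI-compatible chat-completions request with only the
# standard library. "my-local-model" below is a placeholder name.
import json

def build_chat_request(model, prompt, base_url="http://localhost:8000/v1"):
    """Return (url, payload_bytes) for a minimal chat-completions call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return f"{base_url}/chat/completions", json.dumps(payload).encode()

# To actually send it (requires a running server):
# import urllib.request
# url, body = build_chat_request("my-local-model", "Hello")
# req = urllib.request.Request(url, data=body,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```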

API Configuration

If you are using proprietary models (e.g., GPT, Claude, Gemini) or specific remote providers, you need to edit script/remote_model.py to provide your API Key and Base URL.

Locate the following block and update the placeholders:

# script/remote_model.py

if "gpt" in self.model_name:
    self.model = OpenAI(
        api_key="YOUR_API_KEY",
        base_url="YOUR_API_URL"
    )
elif "claude" in self.model_name:
    # Set your Claude API details here
    ...

2. Running Evaluation

We provide evaluation scripts for three different hierarchical levels. Ensure your environment is activated and you are in the project root directory.

Level 1: Dual-Arm Spatial Reasoning

This level evaluates the model's ability to understand spatial relationships in different scene complexities.

Command:

python script/run_eval_spatial.py --setting <SETTING> --gpu <GPU_ID> --model <MODEL_NAME>
  • --setting: Choose from sparse, cluttered, or dense.
  • --gpu: GPU ID(s) to use (e.g., 0).
  • --model: The name/path of the model configured in remote_model.py.

Level 2: High-Level Action Planning

This level evaluates the model's ability to generate long-horizon task plans.

Command:

python script/run_eval_high_level.py --task <TASK_NAME> --gpu <GPU_ID> --model <MODEL_NAME>

Supported Tasks (--task):

| Robotic arm type | Task names |
| --- | --- |
| ARX-X5 | handover_mic, blocks_ranking_size, hanging_mug, place_cans_plasticbox, place_burger_fries, stack_blocks_three, handover_block |
| Franka-Panda | blocks_ranking_rgb, place_object_basket, place_bread_skillet, stack_bowls_three, blocks_tower, blocks_cross_shape |
| Piper | put_bottles_dustbin |
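To make the "parallel & sequential coordination modes" from Tier 2 concrete, a high-level plan can be thought of as a list of timestamped per-arm actions. The representation and task names below are illustrative, not the benchmark's actual plan format; the sketch checks one basic coordination constraint.

```python
# Hypothetical plan representation: each step is (timestep, arm, action).
# A valid plan never assigns two actions to the same arm at the same timestep;
# actions on different arms at the same timestep run in parallel.

def plan_is_valid(plan):
    seen = set()
    for t, arm, _action in plan:
        if (t, arm) in seen:
            return False  # same arm scheduled twice in one timestep
        seen.add((t, arm))
    return True

# Parallel mode: both arms act at t=0; sequential mode would use distinct timesteps.
parallel_plan = [
    (0, "left", "grasp_mug"),
    (0, "right", "hold_rack"),
    (1, "left", "hang_mug"),
]
```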

Level 3: Low-Level End-Effector Control

This level evaluates fine-grained control and precise end-effector positioning.

Command:

python script/run_eval_low_level.py --task <TASK_NAME> --gpu <GPU_ID> --model <MODEL_NAME>

Supported Tasks (--task):

  • place_object_scale
  • place_burger_fries
  • grab_roller
  • stack_blocks_two
  • place_bread_skillet
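The 16-DoF action space mentioned in Tier 3 is not spelled out above. One plausible decomposition (an assumption on our part, not the benchmark's documented format) is 8 DoF per arm: 3D position, a 4D orientation quaternion, and 1 gripper value, concatenated left then right.

```python
# Assumed 16-DoF layout: [x, y, z, qw, qx, qy, qz, gripper] per arm, left arm
# first. This decomposition is an illustration, not BiManiBench's documented format.

def pack_bimanual_pose(left, right):
    """Concatenate two 8-DoF end-effector poses into one 16-DoF action vector."""
    assert len(left) == 8 and len(right) == 8
    return list(left) + list(right)
```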

Config Path Issue: If you encounter an error like `Error setting up environment for round0, episode 0: 'Robot' object has no attribute 'left_planner'`, try running:

python script/update_embodiment_config_path.py

Example Usage

To evaluate a model named gpt-4o on the cluttered spatial reasoning setting using GPU 0:

python script/run_eval_spatial.py --setting cluttered --gpu 0 --model gpt-4o

🖋️ Citation

If you find BiManiBench useful in your research, please cite our work:

@article{wu2026bimanibench,
  author    = {Wu, Xin and Liang, Zhixuan and Ma, Yue and Hu, Mengkang and Qin, Zhiyuan and Li, Xiu},
  title     = {{BiManiBench}: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models},
  journal   = {arXiv preprint},
  year      = {2026},
}

🤝 Acknowledgements

BiManiBench is built upon the great efforts of the open-source community. We would like to express our gratitude to the following projects:

  • RoboTwin 2.0: We utilize RoboTwin 2.0 as our primary simulation platform. While the original framework is designed for evaluating VLA (Vision-Language-Action) policies, we adapted its high-quality bimanual environments, assets, and task configurations to support the direct evaluation of Multimodal Large Language Models (MLLMs). We also customized several task environments to better align with our hierarchical coordination tiers.
  • EmbodiedBench: Our agent's decision-making pipeline is inspired by EmbodiedBench. We adapted its structured evaluation paradigm—originally designed for various single-arm tasks—and re-engineered the top-level logic, including environment-agent interaction and API communication, to facilitate bimanual manipulation within the RoboTwin framework.

We sincerely thank the authors of these projects for their pioneering contributions to the field.

About

Official implementation of "BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models"
