A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models
- [2026-02] BiManiBench project page and preprint are released!
- [2026-02] Evaluation code and benchmark assets are now available!
BiManiBench is the first hierarchical benchmark specifically designed to systematically evaluate the bimanual coordination capabilities of Multimodal Large Language Models (MLLMs).
While current research in embodied AI has made significant strides in single-arm manipulation, bimanual coordination remains a formidable challenge. It requires more than just parallel execution; it demands rigorous spatiotemporal synchronization and dynamic role assignment to navigate complex kinematic constraints and prevent self-collisions.
- Hierarchical Evaluation Framework: Deconstructs bimanual tasks into three levels:
- Tier 1 (Dual-Arm Spatial Reasoning): Fundamental workspace awareness and arm allocation.
- Tier 2 (High-Level Action Planning): Long-horizon reasoning under diverse coordination modes (parallel & sequential).
- Tier 3 (Low-Level End-Effector Control): Direct generation of fine-grained, 16-DoF continuous poses.
- Vision-Driven Agent Pipeline: A structured closed-loop reasoning framework where the MLLM functions as a central "brain" for iterative perception, reasoning, and action.
- Extensive Empirical Study: Analysis of over 30 state-of-the-art models, revealing a significant "reasoning-actuation gap" in current foundation models.
Figure 1: The hierarchical evaluation framework of BiManiBench.
Figure 2: The vision-driven agent pipeline for multimodal perception and reasoning.
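The closed-loop agent pipeline above can be sketched as a simple perceive-reason-act cycle. The class and function names below are illustrative placeholders, not the actual BiManiBench API:

```python
# Illustrative sketch of a closed-loop MLLM agent cycle.
# All names here are hypothetical placeholders, not the BiManiBench API.

def run_episode(env, mllm, max_steps=10):
    """Iterate perception -> reasoning -> action until the task terminates."""
    observation = env.reset()
    history = []
    for _ in range(max_steps):
        # Perception: the MLLM receives the current observation (camera views).
        # Reasoning: it produces the next bimanual action as structured output.
        action = mllm.decide(observation, history)
        # Actuation: the simulator executes the action and returns feedback.
        observation, done = env.step(action)
        history.append(action)
        if done:
            break
    return history
```

In the real pipeline, `mllm.decide` would correspond to a chat-completion call carrying the rendered scene images, and `env.step` to the RoboTwin-based simulation backend.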
The installation tutorial can be found in the RoboTwin installation guide. Since this project is built upon the RoboTwin framework, the installation process is largely identical.
This project requires Vulkan for rendering. Please ensure the drivers and tools are installed:
```shell
sudo apt update
sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools
```
Verification: Check the installation by running:
```shell
vulkaninfo
```
First, prepare a Conda environment and clone the repository:
```shell
# Create and activate environment
conda create -n bimanibench python=3.10 -y
conda activate bimanibench

# Clone the repository
git clone https://github.com/bimanibench/BiManiBench.git
cd BiManiBench
```
Next, run the installation script to install basic dependencies and CuRobo:
```shell
bash script/_install.sh
```
- CuRobo Config Path: If you encounter a `curobo` configuration path issue, try running:
```shell
python script/update_embodiment_config_path.py
```
- Manual Installation: If you encounter any problems during the automated script, please refer to the manual installation section in the original RoboTwin docs.
- Pytorch3D: If the installation of `pytorch3d` fails and you are not using 3D data, it will not affect the core functionality of the project.
- FFmpeg: This project requires FFmpeg. Please ensure it is installed by checking:
```shell
ffmpeg -version
```
If it is not installed, please visit https://ffmpeg.org/ for instructions.
Download the required assets (RoboTwin-OD, Texture Library, and Embodiments) by running the following command.
Note: If you encounter any rate-limit issues with Hugging Face, please log in to your account first:
```shell
# Optional: Login to Hugging Face
huggingface-cli login

# Download assets
bash script/_download_assets.sh
```
After downloading, your assets folder should follow this structure:
```
assets
├── background_texture
├── embodiments
│   ├── embodiment_1
│   │   ├── config.yml
│   │   └── ...
│   └── ...
├── objects
└── ...
```
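As a quick sanity check after the download, a short script along these lines can verify that the expected top-level folders are present. The helper below is illustrative and not part of the repository:

```python
from pathlib import Path

# Hypothetical helper: verify the downloaded assets layout.
# Folder names are taken from the directory tree shown above.
REQUIRED = ["background_texture", "embodiments", "objects"]

def missing_assets(root="assets"):
    """Return the required subfolders that are absent under `root`."""
    base = Path(root)
    return [name for name in REQUIRED if not (base / name).is_dir()]
```

Running `missing_assets()` from the project root returns an empty list when the download completed correctly.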
This section provides a quick start guide for configuring your models and running the evaluation benchmarks.
Before running the evaluation scripts, you must configure your model settings in `script/remote_model.py`. This script handles both locally deployed open-source models and remote API-based models.
By default, the project assumes local models are served via an OpenAI-compatible server (e.g., vLLM or LMDeploy) at:
http://localhost:8000/v1
If you are using proprietary models (e.g., GPT, Claude, Gemini) or specific remote providers, you need to edit script/remote_model.py to provide your API Key and Base URL.
Locate the following block and update the placeholders:
```python
# script/remote_model.py
if "gpt" in self.model_name:
    self.model = OpenAI(
        api_key="YOUR_API_KEY",
        base_url="YOUR_API_URL"
    )
elif "claude" in self.model_name:
    # Set your Claude API details here
    ...
```
We provide evaluation scripts for three different hierarchical levels. Ensure your environment is activated and you are in the project root directory.
This level evaluates the model's ability to understand spatial relationships in different scene complexities.
Command:
```shell
python script/run_eval_spatial.py --setting <SETTING> --gpu <GPU_ID> --model <MODEL_NAME>
```
- `--setting`: Choose from `sparse`, `cluttered`, or `dense`.
- `--gpu`: GPU ID(s) to use (e.g., `0`).
- `--model`: The name/path of the model configured in `remote_model.py`.
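To sweep all three scene settings for one model, the invocations can be scripted. This is a minimal sketch, not a utility shipped with the repository:

```python
import subprocess

# Scene settings accepted by --setting in run_eval_spatial.py.
SETTINGS = ["sparse", "cluttered", "dense"]

def build_commands(model, gpu=0):
    """Build one run_eval_spatial.py invocation per scene setting."""
    return [
        ["python", "script/run_eval_spatial.py",
         "--setting", setting, "--gpu", str(gpu), "--model", model]
        for setting in SETTINGS
    ]

# Example (run from the project root):
# for cmd in build_commands("gpt-4o"):
#     subprocess.run(cmd, check=True)
```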
This level evaluates the model's ability to generate long-horizon task plans.
Command:
```shell
python script/run_eval_high_level.py --task <TASK_NAME> --gpu <GPU_ID> --model <MODEL_NAME>
```
Supported Tasks (`--task`):
| Robotic arm types | Task Names |
|---|---|
| ARX-X5 | handover_mic, blocks_ranking_size, hanging_mug, place_cans_plasticbox, place_burger_fries, stack_blocks_three, handover_block |
| Franka-Panda | blocks_ranking_rgb, place_object_basket, place_bread_skillet, stack_bowls_three, blocks_tower, blocks_cross_shape |
| Piper | put_bottles_dustbin |
This level evaluates fine-grained control and precise end-effector positioning.
Command:
```shell
python script/run_eval_low_level.py --task <TASK_NAME> --gpu <GPU_ID> --model <MODEL_NAME>
```
Supported Tasks (`--task`): `place_object_scale`, `place_burger_fries`, `grab_roller`, `stack_blocks_two`, `place_bread_skillet`
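Tier 3 asks the model to emit 16-DoF continuous poses directly. One plausible decomposition, assumed here purely for illustration (the benchmark's exact encoding may differ), is 8 values per arm: a 3D position, a quaternion orientation, and a gripper opening:

```python
from dataclasses import dataclass

# Illustrative 16-DoF action layout: 8 values per arm
# (xyz position, wxyz quaternion, gripper opening).
# This decomposition is an assumption, not the benchmark's documented format.

@dataclass
class ArmPose:
    position: tuple    # (x, y, z)
    quaternion: tuple  # (w, x, y, z)
    gripper: float     # opening width

    def flatten(self):
        return [*self.position, *self.quaternion, self.gripper]

@dataclass
class BimanualAction:
    left: ArmPose
    right: ArmPose

    def flatten(self):
        # 8 DoF per arm -> 16-DoF continuous action vector
        return self.left.flatten() + self.right.flatten()
```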
Config Path Issue: If you encounter an error like `Error setting up environment for round0, episode 0: 'Robot' object has no attribute 'left_planner'`, try running:
```shell
python script/update_embodiment_config_path.py
```
To evaluate a model named `gpt-4o` on the cluttered spatial reasoning setting using GPU 0:
```shell
python script/run_eval_spatial.py --setting cluttered --gpu 0 --model gpt-4o
```
If you find BiManiBench useful in your research, please cite our work:
```bibtex
@article{wu2026bimanibench,
  author  = {Wu, Xin and Liang, Zhixuan and Ma, Yue and Hu, Mengkang and Qin, Zhiyuan and Li, Xiu},
  title   = {{BiManiBench}: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models},
  journal = {arXiv preprint},
  year    = {2026},
}
```
BiManiBench is built upon the great efforts of the open-source community. We would like to express our gratitude to the following projects:
- RoboTwin 2.0: We utilize RoboTwin 2.0 as our primary simulation platform. While the original framework is designed for evaluating VLA (Vision-Language-Action) policies, we adapted its high-quality bimanual environments, assets, and task configurations to support the direct evaluation of Multimodal Large Language Models (MLLMs). We also customized several task environments to better align with our hierarchical coordination tiers.
- EmbodiedBench: Our agent's decision-making pipeline is inspired by EmbodiedBench. We adapted its structured evaluation paradigm, originally designed for various single-arm tasks, and re-engineered the top-level logic, including environment-agent interaction and API communication, to facilitate bimanual manipulation within the RoboTwin framework.
We sincerely thank the authors of these projects for their pioneering contributions to the field.