A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models
- [2026-02] BiManiBench project page and preprint are released!
- [2026-02] Evaluation code and benchmark assets are now available!
BiManiBench is the first hierarchical benchmark specifically designed to systematically evaluate the bimanual coordination capabilities of Multimodal Large Language Models (MLLMs).
While current research in embodied AI has made significant strides in single-arm manipulation, bimanual coordination remains a formidable challenge. It requires more than just parallel execution; it demands rigorous spatiotemporal synchronization and dynamic role assignment to navigate complex kinematic constraints and prevent self-collisions.
- Hierarchical Evaluation Framework: Deconstructs bimanual tasks into three levels:
- Tier 1 (Dual-Arm Spatial Reasoning): Fundamental workspace awareness and arm allocation.
- Tier 2 (High-Level Action Planning): Long-horizon reasoning under diverse coordination modes (parallel & sequential).
- Tier 3 (Low-Level End-Effector Control): Direct generation of fine-grained, 16-DoF continuous poses.
- Vision-Driven Agent Pipeline: A structured closed-loop reasoning framework where the MLLM functions as a central "brain" for iterative perception, reasoning, and action.
- Extensive Empirical Study: Analysis of over 30 state-of-the-art models, revealing a significant "reasoning-actuation gap" in current foundation models.
Figure 1: The hierarchical evaluation framework of BiManiBench.
Figure 2: The vision-driven agent pipeline for multimodal perception and reasoning.
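The closed-loop agent pipeline above can be sketched as a simple perceive-reason-act cycle. The class and function names below are illustrative placeholders, not the actual BiManiBench API:

```python
# Illustrative sketch of a closed-loop MLLM agent cycle.
# All names here are hypothetical placeholders, not the BiManiBench API.

def run_episode(env, mllm, max_steps=10):
    """Iterate perception -> reasoning -> action until the task terminates."""
    observation = env.reset()
    history = []
    for _ in range(max_steps):
        # Perception: the MLLM receives the current observation (camera views).
        # Reasoning: it produces the next bimanual action as structured output.
        action = mllm.decide(observation, history)
        # Actuation: the simulator executes the action and returns feedback.
        observation, done = env.step(action)
        history.append(action)
        if done:
            break
    return history
```

In the real pipeline, `mllm.decide` would correspond to a chat-completion call carrying the rendered scene images, and `env.step` to the RoboTwin-based simulation backend.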
The installation tutorial can be found in the RoboTwin installation guide. Since this project is built upon the RoboTwin framework, the installation process is largely identical.
This project requires Vulkan for rendering. Please ensure the drivers and tools are installed:
```shell
sudo apt update
sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools
```
Verification: Check the installation by running:
```shell
vulkaninfo
```
First, prepare a Conda environment and clone the repository:
```shell
# Create and activate environment
conda create -n bimanibench python=3.10 -y
conda activate bimanibench

# Clone the repository
git clone https://github.com/bimanibench/BiManiBench.git
cd BiManiBench
```
Next, run the installation script to install basic dependencies and CuRobo:
```shell
bash script/_install.sh
```
- CuRobo Config Path: If you encounter a `curobo` configuration path issue, try running:
```shell
python script/update_embodiment_config_path.py
```
- Manual Installation: If you encounter any problems during the automated script, please refer to the manual installation section in the original RoboTwin docs.
- Pytorch3D: If the installation of `pytorch3d` fails and you are not using 3D data, it will not affect the core functionality of the project.
- FFmpeg: This project requires FFmpeg. Please ensure it is installed by checking:
```shell
ffmpeg -version
```
If it is not installed, please visit https://ffmpeg.org/ for instructions.
Download the required assets (RoboTwin-OD, Texture Library, and Embodiments) by running the following command.
Note: If you encounter any rate-limit issues with Hugging Face, please log in to your account first:
```shell
# Optional: Login to Hugging Face
huggingface-cli login

# Download assets
bash script/_download_assets.sh
```
After downloading, your assets folder should follow this structure:
```
assets
├── background_texture
├── embodiments
│   ├── embodiment_1
│   │   ├── config.yml
│   │   └── ...
│   └── ...
├── objects
└── ...
```
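As a quick sanity check after the download, a short script along these lines can verify that the expected top-level folders are present. The helper below is illustrative and not part of the repository:

```python
from pathlib import Path

# Hypothetical helper: verify the downloaded assets layout.
# Folder names are taken from the directory tree shown above.
REQUIRED = ["background_texture", "embodiments", "objects"]

def missing_assets(root="assets"):
    """Return the required subfolders that are absent under `root`."""
    base = Path(root)
    return [name for name in REQUIRED if not (base / name).is_dir()]
```

Running `missing_assets()` from the project root returns an empty list when the download completed correctly.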
This section provides a quick start guide for configuring your models and running the evaluation benchmarks.
Before running the evaluation scripts, you must configure your model settings in `script/remote_model.py`. This script handles both locally deployed open-source models and remote API-based models.
By default, the project assumes local models are served via an OpenAI-compatible server (e.g., vLLM or LMDeploy) at:
http://localhost:8000/v1
If you are using proprietary models (e.g., GPT, Claude, Gemini) or specific remote providers, you need to edit script/remote_model.py to provide your API Key and Base URL.
Locate the following block and update the placeholders:
```python
# script/remote_model.py
if "gpt" in self.model_name:
    self.model = OpenAI(
        api_key="YOUR_API_KEY",
        base_url="YOUR_API_URL"
    )
elif "claude" in self.model_name:
    # Set your Claude API details here
    ...
```
We provide evaluation scripts for three different hierarchical levels. Ensure your environment is activated and you are in the project root directory.
This level evaluates the model's ability to understand spatial relationships in different scene complexities.
Command:
```shell
python script/run_eval_spatial.py --setting <SETTING> --gpu <GPU_ID> --model <MODEL_NAME>
```
- `--setting`: Choose from `sparse`, `cluttered`, or `dense`.
- `--gpu`: GPU ID(s) to use (e.g., `0`).
- `--model`: The name/path of the model configured in `remote_model.py`.
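To sweep all three scene settings for one model, the invocations can be scripted. This is a minimal sketch, not a utility shipped with the repository:

```python
import subprocess

# Scene settings accepted by --setting in run_eval_spatial.py.
SETTINGS = ["sparse", "cluttered", "dense"]

def build_commands(model, gpu=0):
    """Build one run_eval_spatial.py invocation per scene setting."""
    return [
        ["python", "script/run_eval_spatial.py",
         "--setting", setting, "--gpu", str(gpu), "--model", model]
        for setting in SETTINGS
    ]

# Example (run from the project root):
# for cmd in build_commands("gpt-4o"):
#     subprocess.run(cmd, check=True)
```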
This level evaluates the model's ability to generate long-horizon task plans.
Command:
```shell
python script/run_eval_high_level.py --task <TASK_NAME> --gpu <GPU_ID> --model <MODEL_NAME>
```
Supported Tasks (`--task`):
| Robotic arm types | Task Names |
|---|---|
| ARX-X5 | handover_mic, blocks_ranking_size, hanging_mug, place_cans_plasticbox, place_burger_fries, stack_blocks_three, handover_block |
| Franka-Panda | blocks_ranking_rgb, place_object_basket, place_bread_skillet, stack_bowls_three, blocks_tower, blocks_cross_shape |
| Piper | put_bottles_dustbin |
This level evaluates fine-grained control and precise end-effector positioning.
Command:
```shell
python script/run_eval_low_level.py --task <TASK_NAME> --gpu <GPU_ID> --model <MODEL_NAME>
```
Supported Tasks (`--task`): `place_object_scale`, `place_burger_fries`, `grab_roller`, `stack_blocks_two`, `place_bread_skillet`
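Tier 3 asks the model to emit 16-DoF continuous poses directly. One plausible decomposition, assumed here purely for illustration (the benchmark's exact encoding may differ), is 8 values per arm: a 3D position, a quaternion orientation, and a gripper opening:

```python
from dataclasses import dataclass

# Illustrative 16-DoF action layout: 8 values per arm
# (xyz position, wxyz quaternion, gripper opening).
# This decomposition is an assumption, not the benchmark's documented format.

@dataclass
class ArmPose:
    position: tuple    # (x, y, z)
    quaternion: tuple  # (w, x, y, z)
    gripper: float     # opening width

    def flatten(self):
        return [*self.position, *self.quaternion, self.gripper]

@dataclass
class BimanualAction:
    left: ArmPose
    right: ArmPose

    def flatten(self):
        # 8 DoF per arm -> 16-DoF continuous action vector
        return self.left.flatten() + self.right.flatten()
```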
Config Path Issue: If you encounter an error like `Error setting up environment for round0, episode 0: 'Robot' object has no attribute 'left_planner'`, try running:
```shell
python script/update_embodiment_config_path.py
```
To evaluate a model named `gpt-4o` on the cluttered spatial reasoning setting using GPU 0:
```shell
python script/run_eval_spatial.py --setting cluttered --gpu 0 --model gpt-4o
```
If you find BiManiBench useful in your research, please cite our work:
```bibtex
@article{wu2026bimanibench,
  author  = {Wu, Xin and Liang, Zhixuan and Ma, Yue and Hu, Mengkang and Qin, Zhiyuan and Li, Xiu},
  title   = {{BiManiBench}: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models},
  journal = {arXiv preprint},
  year    = {2026},
}
```
BiManiBench is built upon the great efforts of the open-source community. We would like to express our gratitude to the following projects:
- RoboTwin 2.0: We utilize RoboTwin 2.0 as our primary simulation platform. While the original framework is designed for evaluating VLA (Vision-Language-Action) policies, we adapted its high-quality bimanual environments, assets, and task configurations to support the direct evaluation of Multimodal Large Language Models (MLLMs). We also customized several task environments to better align with our hierarchical coordination tiers.
- EmbodiedBench: Our agent's decision-making pipeline is inspired by EmbodiedBench. We adapted its structured evaluation paradigm, originally designed for various single-arm tasks, and re-engineered the top-level logic, including environment-agent interaction and API communication, to facilitate bimanual manipulation within the RoboTwin framework.
We sincerely thank the authors of these projects for their pioneering contributions to the field.