RoboBrain-2.5 is a next-generation Embodied AI foundation model that significantly advances its predecessor's core capabilities in general perception, spatial reasoning, and temporal modeling through extensive training on high-quality spatiotemporal data. It achieves a paradigm shift in 3D Spatial Reasoning, moving from 2D relative points to predicting 3D coordinates with depth information, understanding absolute metric constraints, and generating complete manipulation trajectories for complex tasks under physical constraints. Furthermore, it establishes a breakthrough in Temporal Value Prediction by constructing a General Reward Model that provides dense progress tracking and multi-granular execution-state estimation across varying viewpoints. This empowers VLA reinforcement learning with immediate, dense feedback signals, enabling robots to achieve high task success rates and robustness in fine-grained manipulation scenarios.

RoboBrain 2.5 Features
RoboBrain 2.5 Results

🚀 Key Highlights

1. Comprehensive Upgrade in Native 3D Spatial Reasoning

Compared to version 2.0, RoboBrain-2.5 achieves a leap in spatial perception and reasoning capabilities:

  • From 2D to 3D: Upgraded from predicting coordinate points on 2D images to predicting coordinate points with depth information in 3D space (3D Spatial Referring).
  • Relative to Absolute: Evolved from understanding relative spatial relationships to measuring absolute 3D spatial metric information (3D Spatial Measuring). The model can comprehend precise physical constraint instructions (e.g., "hovering 1-5 cm above").
  • Point to Trace: Advanced from predicting a single target point for pick-and-place to predicting a series of key points that describe the complete manipulation process (3D Spatial Trace), naturally endowing the model with spatial planning capabilities grounded in absolute 3D metrics (see the sketch after this list).
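To make the trace format concrete, below is a minimal, purely illustrative sketch of how a predicted 3D spatial trace and an absolute metric constraint (e.g., "hover 1-5 cm above") might be represented and checked downstream. The coordinate frame, units, and field names are assumptions for illustration only and do not reflect the model's actual output schema.

# Illustrative sketch only: a hypothetical representation of a 3D spatial trace
# (a sequence of metric waypoints with depth) and a check of an absolute metric
# constraint such as "hover 1-5 cm above the target surface".
from dataclasses import dataclass
from typing import List

@dataclass
class TracePoint:
    x: float  # meters, in an assumed camera/world frame
    y: float
    z: float  # height above the support surface (assumed convention)

# Hypothetical trace for "place the cup 3 cm above the coaster"
trace: List[TracePoint] = [
    TracePoint(0.42, -0.10, 0.25),  # approach
    TracePoint(0.45, -0.08, 0.12),  # descend
    TracePoint(0.46, -0.08, 0.03),  # hover above the target
]

def satisfies_hover_constraint(point: TracePoint, surface_z: float,
                               min_gap: float = 0.01, max_gap: float = 0.05) -> bool:
    """Check that a waypoint hovers 1-5 cm above the target surface."""
    gap = point.z - surface_z
    return min_gap <= gap <= max_gap

print(satisfies_hover_constraint(trace[-1], surface_z=0.0))  # True: 3 cm gap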

2. Breakthrough in Dense Temporal Value Estimation

RoboBrain-2.5 makes significant progress in temporal modeling by constructing a General Reward Model (GRM):

  • Dense Progress Prediction: Capable of multi-granularity task progress prediction across different tasks, viewpoints, and embodiments.
  • Execution State Estimation: Understands task goals and estimates various states during execution (e.g., success, failure, error occurrence).
  • Empowering VLA Reinforcement Learning: Provides real-time, dense feedback signals and rewards for VLA (Vision-Language-Action) reinforcement learning. With only one demonstration, it achieves a task success rate of 95%+ in complex, fine-grained manipulations (see the reward-shaping sketch after this list).
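As one concrete illustration of how such dense progress signals can drive reinforcement learning, the sketch below converts per-frame progress estimates into step-wise rewards. The predict_progress interface is a hypothetical stand-in for a reward-model call, not the model's actual API.

# Illustrative sketch only: turning dense progress predictions into per-step
# rewards for VLA reinforcement learning. `predict_progress` is a hypothetical
# stand-in that maps (frame, instruction) to a progress value in [0, 1].
from typing import Callable, List
import numpy as np

def dense_rewards(frames: List, instruction: str,
                  predict_progress: Callable[[object, str], float]) -> np.ndarray:
    """Reward at step t is the change in predicted task progress, so actions that
    advance the task are rewarded immediately and regressions (e.g., a failed
    insertion) are penalized."""
    progress = np.array([predict_progress(f, instruction) for f in frames])
    return np.diff(progress, prepend=progress[0])

# Example with a dummy progress predictor (progress dips after a mistake at step 3):
dummy = [0.0, 0.2, 0.45, 0.3, 0.6, 1.0]
rewards = dense_rewards(list(range(len(dummy))), "insert the square block",
                        lambda f, _instr: dummy[f])
print(rewards)  # 0.0, 0.2, 0.25, -0.15, 0.3, 0.4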

3. More Powerful Core Capabilities Inherited from Version 2.0

RoboBrain 2.5 also retains the core capabilities of version 2.0: interactive reasoning with long-horizon planning and closed-loop feedback, spatial perception for precise point and bounding-box prediction from complex instructions, temporal perception for future trajectory estimation, and scene reasoning through real-time structured memory construction and updating.


Demos: Dense Temporal Value Estimation

Evaluation on Different Data Sources

A Challenging Real-world Rollout

We plot the reference reward from human annotations, the VLAC baseline, and our RoboBrain 2.5 along the same trajectory. Our model tracks the reference signal more faithfully, sharply penalizing incorrect insertions, low positions, and misalignments, and only assigning high reward near successful task completion.

Real-World RL Demos

Insert the Square Block.

Trigger the Circuit.

Cap the Pen.

Robustness to Artificial Disturbance

We visualize a rollout of the converged policy (Insert the Square Block, success rate > 95%) under human interference using RoboBrain 2.5. Each subfigure shows the third-person view, the ego-centric view, and the real-time inference (Top: Hop, Bottom: Progress). (a) Artificial Disturbance Position: A human hand intervenes and shifts the target board while the robot attempts to approach. (b) Fall Into Misalignment: The robot misses the new position. Note that the Progress curve drops significantly (indicated by the red dot in the bottom inset), reflecting the failure state. (c) Misalignment Recovery: The policy reacts to the visual feedback and the drop in reward, adjusting the end-effector position. (d) Move to the Top: The robot realigns directly above the target slot. (e) Align with the Slot: Precise fine-tuning before insertion. (f) Successful Insertion: The task is completed, with the progress estimation reaching its peak.
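The recovery in panels (b)-(c) can be pictured with a simple detector over the estimated progress signal: a sharp drop flags a failure state and prompts re-approach. The sketch below is purely illustrative (the window and threshold are arbitrary) and is not part of the released system.

# Illustrative sketch only: flag a failure state when estimated task progress
# drops sharply within a short window (e.g., after the target board is shifted).
from typing import List

def progress_dropped(progress_history: List[float],
                     window: int = 5, drop_threshold: float = 0.2) -> bool:
    """Return True if progress fell by more than `drop_threshold` within the
    last `window` estimates."""
    if len(progress_history) < 2:
        return False
    recent = progress_history[-window:]
    return max(recent) - recent[-1] > drop_threshold

history = [0.10, 0.30, 0.55, 0.60, 0.25]  # a disturbance causes a sharp drop
print(progress_dropped(history))  # True: the policy should re-approach the slot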


Demos: Native 3D Spatial Reasoning

TraceSpatial-Bench Results

This demo shows the visualization of the performance of RoboBrain 2.5 on TraceSpatial-Bench. Yellow masks mark the target objects, and pink 3D boxes mark correct end regions. Despite similar 2D projections, our model yields more accurate spatial traces than strong general VLMs, which often produce floating or colliding traces due to inaccurate depth estimation. Leveraging richer geometric cues further improves performance.

RoboTwin 2.0 Execution Demo

The demo shows how robotic arms follow 3D spatial traces generated by RoboBrain 2.5 to successfully complete a diverse set of manipulation tasks, demonstrating its strong spatial reasoning ability and effective support for embodied task execution.

Spatial Tracing in Cluttered Scenes

Visualizations of spatial tracing in complex, cluttered environments using RoboBrain 2.5.

More Real-world Demos

Demos below show that RoboBrain 2.5 can handle challenging long-horizon spatial tracing tasks requiring complex multi-step, metric-grounded reasoning in cluttered and dynamic environments by integrating various control policies across diverse robots.


Demos: Spatial Reasoning & Core Capabilities

System Stability

This video demonstrates the model's referential ability in color recognition and its stability in continuous operation.

Real-time Scene Adaptation

This video demonstrates the model's rapid scene adaptation ability and its capability to judge object proximity, recognize orientation, and determine distance.

Real-time Voice Interruption Adjustment

This video demonstrates the model's capabilities in object spatial relationship recognition, multi-step reasoning, rapid interactive reasoning, and real-time interruption adjustment.

Part-level Orientation-related Referring

This video demonstrates the model's capabilities in object spatial height recognition and part-level orientation-related region identification.

Functionality-oriented Referring

This video demonstrates the model's capabilities in object spatial height recognition and illuminated area identification.

Multi-step Spatial Referring with Reasoning

This video demonstrates the model's object spatial relationship recognition and multi-step spatial referring with reasoning capability.

Structured Arrangement

This video demonstrates the model's ability to understand spatial relationships and pattern reasoning between objects.

Mobile Manipulation

This video demonstrates the model's ability to control a humanoid for both tabletop object manipulation and indoor navigation.

Object Attribute Recognition

This video demonstrates the model's ability to accurately recognize and differentiate objects by their sizes and its stability in continuous operation.

Object Affordance Localization

This video demonstrates the model's capability in object affordance prediction (grasping the handle of the mug) as well as locating objects based on their colors and distances.

Spatial Relations Reasoning

This video demonstrates the model's spatial reasoning capabilities, including distance perception (nearest), position awareness (left and front), and free space localization.

Spatial Referencing and Vacancy Detection

This video demonstrates the model's object referencing capability based on spatial relations and its ability to locate vacant areas in 3D space.

Training & Evaluation

We highlight the distributed training framework FlagScale, developed by the BAAI Framework R&D team, and the evaluation framework FlagEvalMM, developed by the BAAI FlagEval team. Both are used for RoboBrain 2.5. Many thanks to both teams for their contributions!

FlagScale is a distributed training framework designed for large-scale models, supporting efficient training and evaluation of models like RoboBrain 2.5. It provides a flexible and scalable solution for training large models across multiple GPUs and nodes.
FlagEvalMM is a comprehensive evaluation framework for multi-modal models, including RoboBrain 2.5. It provides a suite of benchmarks and metrics to assess the performance of multi-modal models in various tasks, ensuring robust evaluation and comparison.

Citation

If you find our model helpful, feel free to cite it:

@article{tan2025robo,
  title={Robo-Dopamine: General Process Reward Modeling for High-Precision Robotic Manipulation},
  author={Tan, Huajie and Chen, Sixiang and Xu, Yijie and Wang, Zixiao and Ji, Yuheng and Chi, Cheng and Lyu, Yaoxu and Zhao, Zhongxia and Chen, Xiansheng and Co, Peterson and others},
  journal={arXiv preprint arXiv:2512.23703},
  year={2025}
}

@article{zhou2025robotracer,
  title={RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics},
  author={Zhou, Enshen and Chi, Cheng and Li, Yibo and An, Jingkun and Zhang, Jiayuan and Rong, Shanyu and Han, Yi and Ji, Yuheng and Liu, Mengzhen and Wang, Pengwei and others},
  journal={arXiv preprint arXiv:2512.13660},
  year={2025}
}

@article{RoboBrain2.0TechnicalReport,
  title={RoboBrain 2.0 Technical Report},
  author={BAAI RoboBrain Team},
  journal={arXiv preprint arXiv:2507.02029},
  year={2025}
}

@article{RoboBrain1.0,
  title={RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete},
  author={Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and others},
  journal={arXiv preprint arXiv:2502.21257},
  year={2025}
}

@article{Reason-RFT,
  title={Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning},
  author={Tan, Huajie and Ji, Yuheng and Hao, Xiaoshuai and Lin, Minglan and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2503.20752},
  year={2025}
}

@article{tan2025roboos,
  title={RoboOS-Next: A Unified Memory-Based Framework for Lifelong, Scalable, and Robust Multi-Robot Collaboration},
  author={Tan, Huajie and Chi, Cheng and Chen, Xiansheng and Ji, Yuheng and Zhao, Zhongxia and Hao, Xiaoshuai and Lyu, Yaoxu and Cao, Mingyu and Zhao, Junkai and Lyu, Huaihai and others},
  journal={arXiv preprint arXiv:2510.26536},
  year={2025}
}

@article{zhou2025roborefer,
  title={RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics},
  author={Zhou, Enshen and An, Jingkun and Chi, Cheng and Han, Yi and Rong, Shanyu and Zhang, Chi and Wang, Pengwei and Wang, Zhongyuan and Huang, Tiejun and Sheng, Lu and others},
  journal={arXiv preprint arXiv:2506.04308},
  year={2025}
}