Xiaomi-Robotics-0 is a state-of-the-art Vision-Language-Action (VLA) model with 4.7B parameters, specifically engineered for high-performance robotic reasoning and seamless real-time execution.
- 🧠 Strong Generalization: Pre-trained on diverse cross-embodiment trajectories and VL data to handle complex, unseen tasks.
- 🚀 Real-Time Ready: Optimized with asynchronous execution to minimize inference latency.
- 🛠️ Flexible Deployment: Fully compatible with the Hugging Face `transformers` ecosystem and optimized for consumer GPUs.
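The asynchronous execution mentioned above can be pictured as a receding-horizon loop: while the robot executes the first few actions of the current chunk, the next inference call runs in the background. The sketch below is illustrative only, with a stubbed `predict_chunk` standing in for model inference (the chunk size and step counts are hypothetical values, not this repo's API):

```python
import threading
import queue

CHUNK_SIZE = 8      # actions per inference call (hypothetical value)
EXECUTE_STEPS = 4   # steps executed before switching to the next chunk

def predict_chunk(step: int) -> list[int]:
    """Stub for model inference: returns a chunk of action indices."""
    return list(range(step, step + CHUNK_SIZE))

def run(total_steps: int) -> list[int]:
    executed = []
    next_chunk: queue.Queue = queue.Queue(maxsize=1)
    # Prime the pipeline with the first prediction.
    chunk = predict_chunk(0)
    while len(executed) < total_steps:
        # Kick off the next inference in the background...
        t = threading.Thread(
            target=lambda s=len(executed) + EXECUTE_STEPS: next_chunk.put(predict_chunk(s))
        )
        t.start()
        # ...while the robot executes the first few actions of the current chunk.
        executed.extend(chunk[:EXECUTE_STEPS])
        t.join()
        chunk = next_chunk.get()
    return executed[:total_steps]

print(run(12))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Because inference overlaps with execution, the control loop never stalls waiting for the model, which is what keeps end-to-end latency low.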
- [Feb 2026] 🎉 Released the Technical Report.
- [Feb 2026] 🔥 Released Pre-trained weights and Fine-tuned weights for LIBERO, CALVIN, and SimplerEnv.
- [Feb 2026] 💻 Inference code and evaluation scripts are now live!
We evaluate Xiaomi-Robotics-0 on three standard simulation benchmarks: CALVIN, LIBERO, and SimplerEnv. The table below summarizes the performance results across different embodiments and datasets. For each setting, we provide the corresponding fine-tuned checkpoint and a guide for running the evaluation.
| Benchmark | 🤗 Name on Hugging Face | Description | Performance | Evaluation Guide |
|---|---|---|---|---|
| LIBERO | Xiaomi-Robotics-0-LIBERO | Fine-tuned on four LIBERO suites. | 98.7% (Avg Success) | LIBERO Eval |
| CALVIN | Xiaomi-Robotics-0-Calvin-ABCD_D | Fine-tuned on ABCD→D split. | 4.80 (Avg Length) | CALVIN Eval |
| | Xiaomi-Robotics-0-Calvin-ABC_D | Fine-tuned on ABC→D split. | 4.75 (Avg Length) | CALVIN Eval |
| SimplerEnv | Xiaomi-Robotics-0-SimplerEnv-Google-Robot | Fine-tuned on Fractal dataset. | 85.5% (VM) / 74.7% (VA) | SimplerEnv Eval |
| | Xiaomi-Robotics-0-SimplerEnv-WidowX | Fine-tuned on Bridge dataset. | 79.2% | SimplerEnv Eval |
| Base | Xiaomi-Robotics-0 | Pre-trained model. | - | - |
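For reference, CALVIN's "Avg Length" metric is the mean number of consecutive subtasks completed across 5-task instruction chains, so a perfect score is 5.0. A minimal sketch of how such a score is aggregated, using a hypothetical list of per-rollout completion counts (the helper below is illustrative, not part of this repo):

```python
def calvin_avg_length(completed_per_chain: list[int], chain_len: int = 5) -> float:
    """Average number of consecutively completed subtasks per evaluation chain."""
    assert all(0 <= c <= chain_len for c in completed_per_chain)
    return sum(completed_per_chain) / len(completed_per_chain)

# Hypothetical rollouts: 5, 5, 4, and 5 consecutive subtasks solved.
print(calvin_avg_length([5, 5, 4, 5]))  # → 4.75
```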
Our project builds primarily on Hugging Face Transformers 🤗, which makes deployment straightforward. Any environment with transformers >= 4.57.1 should work seamlessly. We recommend PyTorch 2.8.0 (paired with torchvision 0.23.0 and torchaudio 2.8.0), as this combination has been fully tested by our team.
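A quick way to sanity-check the transformers requirement before running anything (a simple helper for release-style version strings, not part of the repo):

```python
def version_tuple(v: str) -> tuple[int, ...]:
    """Parse a dotted release version like '4.57.1' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def meets_requirement(installed: str, minimum: str = "4.57.1") -> bool:
    return version_tuple(installed) >= version_tuple(minimum)

# In practice: import transformers; meets_requirement(transformers.__version__)
print(meets_requirement("4.57.1"))  # → True
print(meets_requirement("4.45.0"))  # → False
```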
Here’s a simple installation guide to get you started:
```bash
git clone https://github.com/XiaomiRobotics/Xiaomi-Robotics-0
cd Xiaomi-Robotics-0

# Create a Conda environment with Python 3.12
conda create -n mibot python=3.12 -y
conda activate mibot

# Install PyTorch
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# Install transformers
pip install transformers==4.57.1

# Install flash-attn
pip uninstall -y ninja && pip install ninja
pip install flash-attn==2.8.3 --no-build-isolation
# or: pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl

# Install OpenGL/EGL runtime libraries
sudo apt-get install -y libegl1 libgl1 libgles2
```

Xiaomi-Robotics-0 is built on top of the Hugging Face Transformers 🤗 ecosystem, making it straightforward to deploy for robotic manipulation tasks. By leveraging Flash Attention 2 and bfloat16 precision, the model can be loaded and run efficiently on consumer-grade GPUs.
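As a back-of-the-envelope check on the consumer-GPU claim: 4.7B parameters at 2 bytes each in bfloat16 come to roughly 8.8 GiB of weights, before activations and cache overhead:

```python
NUM_PARAMS = 4.7e9        # parameter count from the model card
BYTES_PER_PARAM = 2       # bfloat16 = 2 bytes per parameter

weights_gib = NUM_PARAMS * BYTES_PER_PARAM / 2**30
print(f"~{weights_gib:.1f} GiB of weights")  # → ~8.8 GiB of weights
```

This is why the model fits comfortably on GPUs with 16 GB or more of VRAM, with headroom left for activations.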
```python
import torch
from transformers import AutoModel, AutoProcessor

# 1. Load model and processor
model_path = "XiaomiRobotics/Xiaomi-Robotics-0-LIBERO"
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    dtype=torch.bfloat16,
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# 2. Construct the prompt with multi-view inputs
language_instruction = "Pick up the red block."
instruction = (
    f"<|im_start|>user\nThe following observations are captured from multiple views.\n"
    f"# Base View\n<|vision_start|><|image_pad|><|vision_end|>\n"
    f"# Left-Wrist View\n<|vision_start|><|image_pad|><|vision_end|>\n"
    f"Generate robot actions for the task:\n{language_instruction} /no_cot<|im_end|>\n"
    f"<|im_start|>assistant\n<cot></cot><|im_end|>\n"
)

# 3. Prepare inputs
# Assuming `image_base`, `image_wrist`, and `proprio_state` are already loaded
inputs = processor(
    text=[instruction],
    images=[image_base, image_wrist],  # [PIL.Image, PIL.Image]
    videos=None,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Add proprioceptive state and action mask
robot_type = "libero_all"
inputs["seed"] = 42
inputs["state"] = torch.from_numpy(proprio_state).to(model.device, model.dtype).view(1, 1, -1)
inputs["action_mask"] = processor.get_action_mask(robot_type).to(model.device, model.dtype)

# 4. Generate action
with torch.no_grad():
    outputs = model(**inputs)

# Decode raw outputs into actionable control commands
action_chunk = processor.decode_action(outputs.actions, robot_type=robot_type)
print(f"Generated Action Chunk Shape: {action_chunk.shape}")
```

If you find this project useful, please consider citing:
```bibtex
@article{cai2026xiaomi,
  title={Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution},
  author={Cai, Rui and Guo, Jun and He, Xinze and Jin, Piaopiao and Li, Jie and Lin, Bingxuan and Liu, Futeng and Liu, Wei and Ma, Fei and Ma, Kun and Qiu, Feng and Qu, Heng and Su, Yifei and Sun, Qiao and Wang, Dong and Wang, Donghao and Wang, Yunhong and Wu, Rujie and Xiang, Diyun and Yang, Yu and Ye, Hangjun and Zhang, Yuan and Zhou, Quanyun},
  journal={arXiv preprint arXiv:2602.12684},
  year={2026}
}
```

This project is licensed under the Apache License 2.0.