
Xiaomi-Robotics-0

An Open-Sourced Vision-Language-Action Model with Real-Time Inference



💡 About Xiaomi-Robotics-0

Xiaomi-Robotics-0 is a state-of-the-art Vision-Language-Action (VLA) model with 4.7B parameters, specifically engineered for high-performance robotic reasoning and seamless real-time execution.

Key Features:

  • 🧠 Strong Generalization: Pre-trained on diverse cross-embodiment trajectories and VL data to handle complex, unseen tasks.
  • 🚀 Real-Time Ready: Optimized with asynchronous execution to minimize inference latency.
  • 🛠️ Flexible Deployment: Fully compatible with the Hugging Face transformers ecosystem and optimized for consumer GPUs.
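The asynchronous execution mentioned above can be sketched as a producer/consumer loop: while the robot executes action chunk N, the model is already computing chunk N+1, so inference latency is hidden behind motion time. This is a minimal illustration only; `infer` and `execute` are placeholder callables, not part of the released API.

```python
import queue
import threading

def async_control_loop(infer, execute, n_chunks=8):
    """Overlap policy inference with robot execution: while the robot
    runs through chunk N, chunk N+1 is already being computed.
    `infer` and `execute` are hypothetical placeholders, not repo APIs."""
    chunks = queue.Queue(maxsize=1)  # at most one pre-computed chunk in flight

    def producer():
        for _ in range(n_chunks):
            chunks.put(infer())  # blocking model forward pass
        chunks.put(None)         # sentinel: rollout finished

    threading.Thread(target=producer, daemon=True).start()
    while (chunk := chunks.get()) is not None:
        for action in chunk:
            execute(action)
```

The `maxsize=1` queue keeps the policy exactly one chunk ahead of the controller, which bounds staleness of the observations the model conditions on.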

📅 Updates

  • [Feb 2026] 🎉 Released the Technical Report.
  • [Feb 2026] 🔥 Released Pre-trained weights and Fine-tuned weights for LIBERO, CALVIN, and SimplerEnv.
  • [Feb 2026] 💻 Inference code and evaluation scripts are now live!

🏆 Benchmark

We evaluate Xiaomi-Robotics-0 on three standard simulation benchmarks: CALVIN, LIBERO, and SimplerEnv. The table below summarizes the performance results across different embodiments and datasets. For each setting, we provide the corresponding fine-tuned checkpoint and a guide for running the evaluation.

| Benchmark | 🤗 Name on Hugging Face | Description | Performance | Evaluation Guide |
|---|---|---|---|---|
| LIBERO | Xiaomi-Robotics-0-LIBERO | Fine-tuned on four LIBERO suites. | 98.7% (Avg Success) | LIBERO Eval |
| CALVIN | Xiaomi-Robotics-0-Calvin-ABCD_D | Fine-tuned on ABCD→D split. | 4.80 (Avg Length) | CALVIN Eval |
| CALVIN | Xiaomi-Robotics-0-Calvin-ABC_D | Fine-tuned on ABC→D split. | 4.75 (Avg Length) | CALVIN Eval |
| SimplerEnv | Xiaomi-Robotics-0-SimplerEnv-Google-Robot | Fine-tuned on Fractal dataset. | 85.5% (VM) / 74.7% (VA) | SimplerEnv Eval |
| SimplerEnv | Xiaomi-Robotics-0-SimplerEnv-WidowX | Fine-tuned on Bridge dataset. | 79.2% | SimplerEnv Eval |
| Base | Xiaomi-Robotics-0 | Pre-trained model. | - | - |

🚀 Quick Start: Installation & Deployment

Our project builds primarily on Hugging Face Transformers 🤗, which makes deployment straightforward. Any environment with transformers >= 4.57.1 should work seamlessly. We recommend PyTorch 2.8.0 (paired with torchvision 0.23.0 and torchaudio 2.8.0), as our team has fully tested this combination and it ensures optimal compatibility.
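Before installing, you can verify whether an existing environment already satisfies the pins above. This is a small stdlib-only sketch (the `MIN_VERSIONS` table simply restates the versions from this README):

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned minimums from this README; adjust if you target newer releases.
MIN_VERSIONS = {"transformers": "4.57.1", "torch": "2.8.0"}

def parse(v: str) -> tuple:
    """Turn '2.8.0+cu128' into (2, 8, 0); local build tags are ignored."""
    return tuple(int(p) for p in v.split("+")[0].split(".") if p.isdigit())

def check_environment() -> dict:
    """Map each pinned package to (installed_version, meets_minimum)."""
    report = {}
    for pkg, minimum in MIN_VERSIONS.items():
        try:
            installed = version(pkg)
            report[pkg] = (installed, parse(installed) >= parse(minimum))
        except PackageNotFoundError:
            report[pkg] = (None, False)
    return report
```

If either entry reports `False`, follow the installation guide below.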

1️⃣ Installation Guides

Here’s a simple installation guide to get you started:

git clone https://github.com/XiaomiRobotics/Xiaomi-Robotics-0 
cd Xiaomi-Robotics-0

# Create a Conda environment with Python 3.12
conda create -n mibot python=3.12 -y
conda activate mibot

# Install PyTorch
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
# Install transformers
pip install transformers==4.57.1
# Install flash-attn
pip uninstall -y ninja && pip install ninja
pip install flash-attn==2.8.3 --no-build-isolation
# or pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp312-cp312-linux_x86_64.whl

sudo apt-get install -y libegl1 libgl1 libgles2

2️⃣ Deployment Guides

Xiaomi-Robotics-0 is deployed on top of the HuggingFace Transformers 🤗 ecosystem, enabling straightforward deployment for robotic manipulation tasks. By leveraging Flash Attention 2 and bfloat16 precision, the model can be loaded and run efficiently on consumer-grade GPUs.
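A quick back-of-envelope check of why bfloat16 fits consumer hardware (the 4.7B parameter count comes from this README; the GPU sizes are common retail configurations, not a tested claim):

```python
def bf16_weight_gib(n_params: float) -> float:
    """Weight-only memory footprint at bfloat16: 2 bytes per parameter."""
    return n_params * 2 / 1024**3

# A 4.7B-parameter model needs roughly 8.75 GiB for weights alone,
# before activations and KV cache, so bf16 plus Flash Attention 2
# leaves headroom on typical 16-24 GB consumer GPUs.
print(f"{bf16_weight_gib(4.7e9):.2f} GiB")
```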

import torch
from transformers import AutoModel, AutoProcessor

# 1. Load model and processor 
model_path = "XiaomiRobotics/Xiaomi-Robotics-0-LIBERO"
model = AutoModel.from_pretrained(
    model_path, 
    trust_remote_code=True, 
    attn_implementation="flash_attention_2", 
    dtype=torch.bfloat16
).cuda().eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, use_fast=False)


# 2. Construct the prompt with multi-view inputs
language_instruction = "Pick up the red block."
instruction = (
    f"<|im_start|>user\nThe following observations are captured from multiple views.\n"
    f"# Base View\n<|vision_start|><|image_pad|><|vision_end|>\n"
    f"# Left-Wrist View\n<|vision_start|><|image_pad|><|vision_end|>\n"
    f"Generate robot actions for the task:\n{language_instruction} /no_cot<|im_end|>\n"
    f"<|im_start|>assistant\n<cot></cot><|im_end|>\n"
)

# 3. Prepare inputs
# Assuming `image_base`, `image_wrist`, and `proprio_state` are already loaded
inputs = processor(
    text=[instruction],
    images=[image_base, image_wrist], # [PIL.Image, PIL.Image]
    videos=None,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Add proprioceptive state and action mask
robot_type = "libero_all"
inputs["seed"] = 42 
inputs["state"] = torch.from_numpy(proprio_state).to(model.device, model.dtype).view(1, 1, -1)
inputs["action_mask"] = processor.get_action_mask(robot_type).to(model.device, model.dtype)

# 4. Generate action 
with torch.no_grad():
    outputs = model(**inputs)
    
# Decode raw outputs into actionable control commands
action_chunk = processor.decode_action(outputs.actions, robot_type=robot_type)
print(f"Generated Action Chunk Shape: {action_chunk.shape}")
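One common way to consume a decoded action chunk is a receding-horizon loop: execute only the first few actions, then re-infer from the latest observation. This is a hypothetical sketch, not the repo's evaluation code; `infer_chunk` and `step_env` are placeholder callables standing in for the model call above and your environment step.

```python
import numpy as np

def receding_horizon_rollout(infer_chunk, step_env, obs,
                             n_steps=40, execute_k=4):
    """Hypothetical receding-horizon loop: predict a full action chunk,
    execute only its first `execute_k` actions, then replan from the
    latest observation. `infer_chunk(obs)` returns a (T, action_dim)
    array; `step_env(action)` applies one command and returns the
    next observation. Both are placeholders, not repo APIs."""
    executed = []
    while len(executed) < n_steps:
        chunk = np.asarray(infer_chunk(obs))  # shape (T, action_dim)
        for action in chunk[:execute_k]:
            obs = step_env(action)            # apply one control command
            executed.append(action)
            if len(executed) >= n_steps:
                break
    return np.stack(executed)
```

Executing only a prefix of each chunk trades throughput for reactivity: smaller `execute_k` replans more often and tracks a changing scene more closely.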

📚 Citation

If you find this project useful, please consider citing:

@article{cai2026xiaomi,
  title={Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution},
  author={Cai, Rui and Guo, Jun and He, Xinze and Jin, Piaopiao and Li, Jie and Lin, Bingxuan and Liu, Futeng and Liu, Wei and Ma, Fei and Ma, Kun and Qiu, Feng and Qu, Heng and Su, Yifei and Sun, Qiao and Wang, Dong and Wang, Donghao and Wang, Yunhong and Wu, Rujie and Xiang, Diyun and Yang, Yu and Ye, Hangjun and Zhang, Yuan and Zhou, Quanyun},
  journal={arXiv preprint arXiv:2602.12684},
  year={2026}
}

📄 License

This project is licensed under the Apache License 2.0.
