- [2026/02] Technical Report released! Read the paper
(Intro video: intro.mp4)
ABot-N0 is a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core embodied navigation tasks:
| Task | Description |
|---|---|
| Point-Goal | Reach precise metric coordinates with robust locomotion and obstacle avoidance |
| Object-Goal | Search for and navigate to a specific object category in unseen environments |
| Instruction-Following | Execute complex natural-language navigation instructions |
| POI-Goal | Navigate to specific Points of Interest (POIs) and their physical entrances |
| Person-Following | Track and follow dynamic human targets in real time |
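The five paradigms differ mainly in how the goal is specified. As a minimal sketch of how such heterogeneous goals could live behind one dispatchable interface (all class and function names here are hypothetical, not from the ABot-N0 codebase):

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical goal types covering the five paradigms; names are
# illustrative only, not taken from the ABot-N0 implementation.
@dataclass
class PointGoal:      # Point-Goal: metric target in the robot frame
    x: float
    y: float

@dataclass
class TextGoal:       # Object-Goal / Instruction-Following / POI-Goal
    text: str

@dataclass
class TrackGoal:      # Person-Following: ID of the tracked human
    target_id: int

Goal = Union[PointGoal, TextGoal, TrackGoal]

def describe(goal: Goal) -> str:
    """Dispatch on goal type, as a unified policy head might."""
    if isinstance(goal, PointGoal):
        return f"navigate to ({goal.x:.1f}, {goal.y:.1f})"
    if isinstance(goal, TextGoal):
        return f"follow instruction: {goal.text}"
    return f"follow person #{goal.target_id}"

print(describe(PointGoal(3.0, -1.5)))  # navigate to (3.0, -1.5)
```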
ABot-N0 adopts a hierarchical "Brain-Action" architecture:
- Universal Multi-Modal Encoder: unifies heterogeneous inputs (RGB, visual history, goals) into a shared latent space
- Cognitive Brain: a pre-trained LLM (Qwen3-4B) for deep semantic understanding and spatial reasoning
- Action Expert: a Flow Matching-based trajectory generator for precise, continuous control
| Highlight | Details |
|---|---|
| Unified Tasks | 5 core navigation paradigms in a single model |
| SOTA Benchmarks | New state of the art on 7 authoritative benchmarks |
| Data Scale | 16.9M expert trajectories + 5.0M reasoning samples |
| 3D Scenes | 7,802 high-fidelity scenes covering 10.3 km² |
| Real-world Deployment | Deployed on a Unitree Go2 with NVIDIA Jetson Orin NX, achieving 2 Hz VLA inference |
ABot-N0 follows a hierarchical "Brain-Action" design comprising three pillars:
- Universal Multi-Modal Encoder: supports flexible vision inputs (panoramic / front-view), heterogeneous goal definitions (text-based semantic goals and point-based geometric goals), and reasoning task encoding.
- Cognitive Brain: built upon a pre-trained LLM, it supports dual-mode operation with a Reasoning Head for high-level semantic understanding and an Action Head for motion planning.
- Action Expert: employs Flow Matching to generate multi-modal trajectory distributions (5 waypoints with position + yaw), enabling precise continuous control.
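To make the Flow Matching step concrete, here is a minimal sampling sketch: Euler integration of a velocity field from Gaussian noise toward a 5-waypoint (x, y, yaw) trajectory. In the actual Action Expert the velocity field is learned and conditioned on the Brain's latent state; the closed-form conditional velocity below is a stand-in purely to illustrate the sampling loop.

```python
import numpy as np

# Dummy "expert" trajectory standing in for a model prediction target:
# 5 waypoints x (x, y, yaw).
TARGET = np.tile([1.0, 0.5, 0.1], (5, 1))

def velocity(x, t):
    # Conditional OT velocity field v(x, t) = (x1 - x) / (1 - t).
    # A trained Action Expert would predict this quantity instead.
    return (TARGET - x) / (1.0 - t)

def sample_trajectory(steps=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((5, 3))        # start from Gaussian noise
    for i in range(steps):
        t = i / steps
        x = x + velocity(x, t) / steps     # Euler step along the flow
    return x

traj = sample_trajectory()
print(np.allclose(traj, TARGET, atol=1e-6))  # True: the flow reaches the target
```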
The ABot-N0 Data Engine is the largest embodied navigation data pipeline, integrating three synergistic layers:
- High-Fidelity 3D Scene Ecosystem: 7,802 scenes (indoor: homes, offices, malls, stations; outdoor: intersections, parks, a virtual city) covering 10.3 km²
- Universal Trajectories Dataset: ~16.9M expert trajectories across 5 navigation paradigms
- Cognitive Reasoning Dataset: ~5.0M reasoning samples grounding decision-making in spatial-social logic
ABot-N0 is trained via a three-stage curriculum:
- Phase 1 – Cognitive Warm-up: fine-tune the LLM backbone on reasoning tasks to learn "what to see" and "how to reason"
- Phase 2 – Unified Sensorimotor SFT: joint multi-task training with dual-head optimization (autoregressive reasoning + Flow Matching actions)
- Phase 3 – SAFE-GRPO: post-training value alignment via socially-aware reinforcement learning for social compliance
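The Phase-2 dual-head objective can be sketched as a weighted sum of an autoregressive cross-entropy term (Reasoning Head) and a flow-matching regression term (Action Head). The loss weight and function names below are assumptions for illustration, not the report's actual formulation:

```python
import numpy as np

def cross_entropy(logits, target_idx):
    # Standard softmax cross-entropy for next-token prediction.
    z = logits - logits.max()                  # stabilize before exp
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_idx]

def flow_matching_loss(pred_velocity, x0, x1):
    # Conditional FM target: for the linear path x_t = (1-t)*x0 + t*x1,
    # the target velocity is simply x1 - x0.
    return np.mean((pred_velocity - (x1 - x0)) ** 2)

logits = np.array([2.0, 0.5, -1.0])            # toy next-token logits
x0 = np.zeros(3)                               # noise sample
x1 = np.ones(3)                                # expert action chunk
# Hypothetical 0.5 weight on the action term (an assumption, not from the report).
loss = cross_entropy(logits, 0) + 0.5 * flow_matching_loss(x1 - x0, x0, x1)
print(round(float(loss), 4))
```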
Beyond the foundation model, we propose an Agentic Navigation System for real-world deployment:
- Agentic Planner: VLM-powered intent decomposition with chain-of-thought (CoT) reasoning and closed-loop self-reflection
- Topo-Memory (Map-as-Memory): Hierarchical topological memory for cross-scale spatial knowledge (Block → Road → Function → Object/POI layers)
- Neural Controller: High-speed reactive control (>10Hz) bridging strategic waypoints and real-time execution
- Hardware: Unitree Go2 quadrupedal robot + NVIDIA Jetson Orin NX (157 TOPS)
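A minimal sketch of the Topo-Memory hierarchy, assuming a simple parent-linked node structure (the layer names follow the report; the class and its API are hypothetical):

```python
# Sketch of a Map-as-Memory topological hierarchy. Layer names come from
# the report (Block -> Road -> Function -> Object/POI); everything else
# is an illustrative assumption.
class TopoMemory:
    LAYERS = ["block", "road", "function", "object"]  # coarse -> fine

    def __init__(self):
        self.nodes = {}  # name -> (layer, parent name or None)

    def add(self, name, layer, parent=None):
        assert layer in self.LAYERS
        self.nodes[name] = (layer, parent)

    def lineage(self, name):
        """Walk from a fine-grained node up to its block-level ancestor."""
        chain = []
        while name is not None:
            layer, parent = self.nodes[name]
            chain.append((layer, name))
            name = parent
        return chain[::-1]  # report coarse-to-fine

mem = TopoMemory()
mem.add("downtown", "block")
mem.add("main_st", "road", parent="downtown")
mem.add("cafe", "function", parent="main_st")
mem.add("cafe_entrance", "object", parent="cafe")
print(mem.lineage("cafe_entrance"))
# [('block', 'downtown'), ('road', 'main_st'), ('function', 'cafe'), ('object', 'cafe_entrance')]
```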
ABot-N0 achieves new SOTA on 7 benchmarks:
- CityWalker (Point-Goal, Open-Loop)
- SocNav (Point-Goal, Closed-Loop)
- VLN-CE R2R (Instruction-Following)
- VLN-CE RxR (Instruction-Following)
- HM3D-OVON (Object-Goal)
- BridgeNav (POI-Goal)
- EVT-Bench (Person-Following)
We are committed to progressively open-sourcing resources to support the research community:
| Phase | Content | Status |
|---|---|---|
| Phase 1 | Technical Report | Released |
| Phase 2 | Data | Coming Soon |
| Phase 3 | Code | Coming Soon |
⚠️ Note on Data Release: Due to privacy and security concerns associated with certain data, we will conduct thorough data cleaning and de-identification before releasing a compliant version for community research use. We prioritize data compliance over release speed; thank you for your patience and understanding.
If you find this work useful, please consider citing:
```bibtex
@misc{chu2026abotn0technicalreportvla,
  title={ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation},
  author={Zedong Chu and Shichao Xie and Xiaolong Wu and Yanfen Shen and Minghua Luo and Zhengbo Wang and Fei Liu and Xiaoxu Leng and Junjun Hu and Mingyang Yin and Jia Lu and Yingnan Guo and Kai Yang and Jiawei Han and Xu Chen and Yanqing Zhu and Yuxiang Zhao and Xin Liu and Yirong Yang and Ye He and Jiahang Wang and Yang Cai and Tianlin Zhang and Li Gao and Liu Liu and Mingchao Sun and Fan Jiang and Chiyu Wang and Zhicheng Liu and Hongyu Pan and Honglin Han and Zhining Gu and Kuan Yang and Jianfang Zhang and Di Jing and Zihao Guan and Wei Guo and Guoqing Liu and Di Yang and Xiangpo Yang and Menglin Yang and Hongguang Xing and Weiguo Li and Mu Xu},
  year={2026},
  eprint={2602.11598},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.11598},
}
```

This project is released under the Apache 2.0 License.
This work is developed by AMAP CV Lab. See the Technical Report for a full list of contributors.