Learning Geometrically-Grounded
3D Visual Representations for View-Generalizable Robotic Manipulation
Di Zhang*, Weicheng Duan*, Dasen Gu, Hongye Lu, Hai Zhang, Hang Yu, Junqiao Zhao†, Guang Chen
Real-world robotic manipulation demands visuomotor policies capable of robust spatial scene understanding and strong generalization across diverse camera viewpoints. While recent advances in 3D-aware visual representations have shown promise, they still suffer from several key limitations: (i) reliance on multi-view observations during inference, which is impractical in single-view restricted scenarios; (ii) incomplete scene modeling that fails to capture the holistic and fine-grained geometric structures essential for precise manipulation; and (iii) lack of effective policy training strategies to retain and exploit the acquired 3D knowledge. To address these challenges, we present GEM3D (Geometrically-Grounded 3D Manipulation), a unified representation-policy learning framework for view-generalizable robotic manipulation. GEM3D introduces a single-view 3D pretraining paradigm that leverages point cloud reconstruction and feed-forward Gaussian splatting under multi-view supervision to learn holistic geometric representations. During policy learning, GEM3D performs multi-step distillation to preserve the pretrained geometric understanding and effectively transfer it to manipulation skills. We conduct experiments on 12 RLBench tasks, where our approach outperforms the previous state-of-the-art (SOTA) method by 12.7% in average success rate. Further evaluation on six representative tasks demonstrates the strong zero-shot view generalization of our approach, with success rates dropping by only 22.0% and 29.7% under moderate and large viewpoint shifts, respectively, whereas the SOTA method suffers larger decreases of 41.6% and 51.5%.
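To make the pretraining paradigm concrete, here is a minimal sketch of how a single-view 3D pretraining objective of this kind can combine a point-cloud reconstruction term with a multi-view photometric term. This is an illustration only, not the paper's actual implementation: the function names, the loss weights `w_pc`/`w_rgb`, and the use of a symmetric Chamfer distance with an L1 photometric loss are all assumptions, and the renderer producing the novel views is abstracted away.

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets,
    a common supervision signal for point-cloud reconstruction."""
    # Pairwise Euclidean distances, shape (N, M)
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Nearest-neighbor distance in both directions, averaged
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def pretraining_loss(pred_points, gt_points, rendered_views, target_views,
                     w_pc: float = 1.0, w_rgb: float = 1.0) -> float:
    """Combine geometric supervision (reconstructed point cloud vs. ground
    truth) with multi-view supervision (views rendered from predicted
    Gaussians vs. captured target images)."""
    l_pc = chamfer_distance(pred_points, gt_points)
    # Per-view L1 photometric loss, averaged over the supervision views
    l_rgb = float(np.mean([np.abs(r - t).mean()
                           for r, t in zip(rendered_views, target_views)]))
    return w_pc * l_pc + w_rgb * l_rgb
```

In a real pipeline both terms would be differentiable (e.g. in PyTorch) and `rendered_views` would come from splatting the predicted Gaussians into each supervision camera; the sketch only shows how the two supervision signals are weighted and summed.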
- [2026-02-02]: 📄 Initial version of the code is now available! Please stay tuned for further updates!
This repository is released under the MIT license.
If you find our work helpful for your research, please consider citing:
@misc{zhang2026learninggeometricallygrounded3dvisual,
title={Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation},
author={Di Zhang and Weicheng Duan and Dasen Gu and Hongye Lu and Hai Zhang and Hang Yu and Junqiao Zhao and Guang Chen},
year={2026},
eprint={2601.22988},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2601.22988},
}
This work builds upon several excellent open-source projects:
- PerAct - Agent Training and Evaluation Framework
- ManiGaussian - Feed-forward Gaussian Splatting Reference
- SnowflakeNet - Point cloud reconstruction module
- DUNE - Vision transformer backbone
- The broader open-source computer vision and robotics communities
For questions, discussions, or collaborations:
- Issues: Open an issue on GitHub
- Email: Contact authors Di Zhang ([email protected]) or Weicheng Duan ([email protected]) with any questions about this project!
