Learning Geometrically-Grounded
3D Visual Representations for View-Generalizable Robotic Manipulation
Di Zhang*, Weicheng Duan*, Dasen Gu, Hongye Lu, Hai Zhang, Hang Yu, Junqiao Zhao†, Guang Chen
Real-world robotic manipulation demands visuomotor policies capable of robust spatial scene understanding and strong generalization across diverse camera viewpoints. While recent advances in 3D-aware visual representations have shown promise, they still suffer from several key limitations: (i) reliance on multi-view observations during inference, which is impractical in single-view restricted scenarios; (ii) incomplete scene modeling that fails to capture the holistic and fine-grained geometric structures essential for precise manipulation; and (iii) lack of effective policy training strategies to retain and exploit the acquired 3D knowledge. To address these challenges, we present GEM3D (Geometrically-Grounded 3D Manipulation), a unified representation-policy learning framework for view-generalizable robotic manipulation. GEM3D introduces a single-view 3D pretraining paradigm that leverages point cloud reconstruction and feed-forward Gaussian splatting under multi-view supervision to learn holistic geometric representations. During policy learning, GEM3D performs multi-step distillation to preserve the pretrained geometric understanding and effectively transfer it to manipulation skills. We conduct experiments on 12 RLBench tasks, where our approach outperforms the previous state-of-the-art (SOTA) method by 12.7% in average success rate. Further evaluation on six representative tasks demonstrates the strong zero-shot view generalization of our approach, with success rates dropping by only 22.0% and 29.7% under moderate and large viewpoint shifts, respectively, whereas the SOTA method suffers larger decreases of 41.6% and 51.5%.
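To make the pretraining paradigm concrete, here is a minimal sketch of how a single-view 3D pretraining objective of this kind can combine a point-cloud reconstruction term with a multi-view photometric term. This is an illustration only, not the paper's actual implementation: the function names, the loss weights `w_pc`/`w_rgb`, and the use of a symmetric Chamfer distance with an L1 photometric loss are all assumptions, and the renderer producing the novel views is abstracted away.

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets,
    a common supervision signal for point-cloud reconstruction."""
    # Pairwise Euclidean distances, shape (N, M)
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Nearest-neighbor distance in both directions, averaged
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def pretraining_loss(pred_points, gt_points, rendered_views, target_views,
                     w_pc: float = 1.0, w_rgb: float = 1.0) -> float:
    """Combine geometric supervision (reconstructed point cloud vs. ground
    truth) with multi-view supervision (views rendered from predicted
    Gaussians vs. captured target images)."""
    l_pc = chamfer_distance(pred_points, gt_points)
    # Per-view L1 photometric loss, averaged over the supervision views
    l_rgb = float(np.mean([np.abs(r - t).mean()
                           for r, t in zip(rendered_views, target_views)]))
    return w_pc * l_pc + w_rgb * l_rgb
```

In a real pipeline both terms would be differentiable (e.g. in PyTorch) and `rendered_views` would come from splatting the predicted Gaussians into each supervision camera; the sketch only shows how the two supervision signals are weighted and summed.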
- [2026-02-02]: 📄 Initial version of the code is now available! Please stay tuned for further updates!
This repository is released under the MIT license.
If you find our work helpful for your research, please consider citing:
@misc{zhang2026learninggeometricallygrounded3dvisual,
title={Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation},
author={Di Zhang and Weicheng Duan and Dasen Gu and Hongye Lu and Hai Zhang and Hang Yu and Junqiao Zhao and Guang Chen},
year={2026},
eprint={2601.22988},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2601.22988},
}
This work builds upon several excellent open-source projects:
- PerAct - Agent Training and Evaluation Framework
- ManiGaussian - Feed-forward Gaussian Splatting Reference
- SnowflakeNet - Point cloud reconstruction module
- DUNE - Vision transformer backbone
- The broader open-source computer vision and robotics communities
For questions, discussions, or collaborations:
- Issues: Open an issue on GitHub
- Email: Contact authors Di Zhang ([email protected]) or Weicheng Duan ([email protected]) with any questions about this project!
