Haoyang He1,2 Jay Patrikar2 Dong-Ki Kim2 Max Smith2 Daniel McGann2
Ali-akbar Agha-mohammadi2 Shayegan Omidshafiei2 Sebastian Scherer1,2
1 Carnegie Mellon University 2 FieldAI
Recent advances in video world modeling have enabled large-scale generative models to simulate embodied environments with high visual fidelity, providing strong priors for prediction, planning, and control. Yet, despite their realism, these models often lack geometric grounding, limiting their use in navigation tasks that require spatial coherence and long-horizon stability.
GrndCtrl introduces Reinforcement Learning with World Grounding (RLWG), a self-supervised post-training framework that aligns pretrained world models with physically verifiable structure through geometric and perceptual rewards. Analogous to reinforcement learning with verifiable rewards (RLVR) in language models, RLWG uses multiple rewards that measure pose cycle-consistency, depth reprojection, and temporal coherence. GrndCtrl instantiates this framework using Group Relative Policy Optimization (GRPO), yielding world models that maintain stable trajectories, consistent geometry, and reliable rollouts for embodied navigation.
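The abstract only names the reward terms and the optimizer, so the sketch below illustrates how a composite geometric reward and GRPO-style group-relative advantages could fit together. All function names, weights, and numbers here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def composite_reward(pose_cycle_err, depth_reproj_err, temporal_err,
                     weights=(1.0, 1.0, 1.0)):
    """Combine the three verifiable error terms into one scalar reward.

    Each argument is a non-negative error measured on a generated rollout
    (lower is better); the reward is the negative weighted sum of errors.
    The weights are placeholders, not values from the paper.
    """
    w_pose, w_depth, w_temp = weights
    return -(w_pose * pose_cycle_err
             + w_depth * depth_reproj_err
             + w_temp * temporal_err)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within a group of rollouts
    sampled from the same initial condition, with no learned critic."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical example: 4 rollouts generated from the same starting frame
# and action sequence. The tuples stand in for measured pose cycle-consistency,
# depth-reprojection, and temporal-coherence errors.
errors = [
    (0.12, 0.30, 0.05),
    (0.40, 0.55, 0.20),
    (0.08, 0.25, 0.04),
    (0.22, 0.35, 0.10),
]
rewards = [composite_reward(*e) for e in errors]
advantages = group_relative_advantages(rewards)
print("rewards:   ", np.round(rewards, 3))
print("advantages:", np.round(advantages, 3))
# Rollouts with lower geometric error receive positive advantages and are
# reinforced in the policy-gradient update; the rest are down-weighted.
```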
Stay tuned...
If you find our work useful, please cite our paper:
@misc{he2025grndctrlgroundingworldmodels,
      title={GrndCtrl: Grounding World Models via Self-Supervised Reward Alignment},
      author={Haoyang He and Jay Patrikar and Dong-Ki Kim and Max Smith and Daniel McGann and Ali-akbar Agha-mohammadi and Shayegan Omidshafiei and Sebastian Scherer},
      year={2025},
      eprint={2512.01952},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.01952},
}