Describe the bug
When running the Gym example examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml, I encountered a strange bug. The config appears to be outdated: it uses 4096 * 16 = 65,536 samples (instead of 64 * 16 = 1,024), which causes a CPU OOM during rollout collection and silently kills the Gym agent. Training does not crash; the only visible symptom is "empty epochs" in the logs.
Steps/Code to reproduce bug
Run examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml with the examples/nemo_gym/run_grpo_nemo_gym.py script. Tested on 8 DGX H100 nodes.
Expected behavior
A clearer error: training should crash when the Gym agent dies, instead of continuing and producing empty runs.
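To illustrate the kind of fail-fast behavior I have in mind, here is a minimal sketch in plain Python. It is not taken from the NeMo RL or Gym code: `EmptyRolloutError`, `check_rollouts`, and their parameters are hypothetical names, and the real fix may belong somewhere else (for example, in detecting that the agent process was killed).

```python
class EmptyRolloutError(RuntimeError):
    """Hypothetical error raised when a rollout step yields too few samples."""


def check_rollouts(rollouts, expected_count, step):
    """Raise instead of silently continuing when rollout collection fails.

    `rollouts` is assumed to be a list of collected samples and
    `expected_count` the number requested for this step (e.g. 64 * 16).
    """
    if len(rollouts) == 0:
        raise EmptyRolloutError(
            f"step {step}: rollout collection returned 0 samples "
            f"(expected {expected_count}); the rollout worker may have died"
        )
    if len(rollouts) < expected_count:
        raise EmptyRolloutError(
            f"step {step}: got {len(rollouts)} of {expected_count} samples"
        )
    return rollouts
```

Calling something like this right after rollout collection would turn an "empty epoch" into an immediate, descriptive failure.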
Additional context
I believe the config should also be updated and the number of rollouts decreased.