Gym: Empty epochs if Gym agent fails

**Describe the bug**
When running the Gym example `examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml` I encountered a strange bug. The config is outdated: it uses 4096 * 16 = 65,536 samples (instead of 64 * 16 = 1,024), which causes a CPU OOM during rollout collection and silently kills the Gym agent. This does not crash the training; it only shows up as "empty epochs" in the logs.

**Steps/Code to reproduce bug**

Run `examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml` with the `examples/nemo_gym/run_grpo_nemo_gym.py` script. Tested on 8 DGX H100 nodes.

**Expected behavior**
A clearer error: training should crash when the Gym agent dies instead of silently producing empty epochs.
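
Roughly the kind of check I have in mind, as a minimal sketch only. `check_rollouts` and `EmptyRolloutError` are hypothetical names used for illustration, not existing NeMo RL or Gym APIs, and I am assuming the training loop can see the list of rollouts returned for each step:

```python
class EmptyRolloutError(RuntimeError):
    """Raised when a rollout collection step returns no samples."""


def check_rollouts(rollouts, expected, step):
    """Fail fast if rollout collection came back empty.

    Hypothetical helper: `rollouts` is whatever list the training loop
    received from the agent, `expected` is the configured sample count.
    """
    if len(rollouts) == 0:
        # An empty batch most likely means the agent process is gone,
        # so stop here rather than logging an "empty epoch".
        raise EmptyRolloutError(
            f"step {step}: received 0 of {expected} rollouts; "
            "the Gym agent may have died (for example from a CPU OOM)"
        )
    return rollouts
```

Calling something like this right after rollout collection would turn the silent failure into an immediate crash with a message pointing at the agent.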

**Additional context**
I believe the config should also be updated and the number of rollouts decreased.


Gym: Empty epochs if Gym agent fails #2305
