Gym: Empty epochs if Gym agent fails #2305

@hXl3s

Describe the bug
When running the Gym example examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml, I encountered a strange bug. The config is outdated: it uses 4096 * 16 samples (instead of 64 * 16), which causes a CPU OOM during rollout collection and silently kills the Gym agent. This does not crash the training; it only results in "empty epochs" in the logs.
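
For scale, the two sample counts quoted above work out as follows. This is plain arithmetic on the numbers in this report; the variable names are illustrative and are not keys from the actual YAML config.

```python
# Illustrative names only; these are NOT keys from the real config file.
outdated_samples = 4096 * 16   # what the outdated config requests
intended_samples = 64 * 16     # what the report says it should request

# How many times more rollouts the outdated config asks for.
ratio = outdated_samples // intended_samples

print(outdated_samples, intended_samples, ratio)  # prints: 65536 1024 64
```

So the outdated config asks for 64 times as many samples per step, which is consistent with the host running out of CPU memory during rollout collection.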

Steps/Code to reproduce bug

Run examples/nemo_gym/grpo_qwen3_30ba3b_instruct.yaml with the examples/nemo_gym/run_grpo_nemo_gym.py script. Tested on 8 DGX H100 nodes.

Expected behavior
A more reasonable error: training should crash instead of producing empty runs.
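
A minimal sketch of the kind of fail-fast check this expectation describes. Everything here is hypothetical: `RolloutCollectionError`, `ensure_rollouts`, and the `agent_returncode` parameter are names made up for illustration and are not part of the actual training script or the Gym agent API. It assumes the training loop can see the collected rollouts and, optionally, the agent process's exit code.

```python
from typing import Optional, Sequence


class RolloutCollectionError(RuntimeError):
    """Hypothetical error type for a failed or empty rollout collection."""


def ensure_rollouts(rollouts: Sequence, agent_returncode: Optional[int] = None) -> Sequence:
    """Return `rollouts` unchanged, or raise if collection clearly failed.

    `agent_returncode` is assumed to follow subprocess.Popen.poll():
    None while the process is still running, otherwise its exit code
    (negative when it was terminated by a signal on POSIX, e.g. -9 for SIGKILL).
    """
    if agent_returncode is not None:
        # The agent process is gone; stop instead of continuing with no data.
        raise RolloutCollectionError(
            f"Gym agent exited with code {agent_returncode} during rollout collection"
        )
    if len(rollouts) == 0:
        # No process information, but an empty batch still means an empty epoch.
        raise RolloutCollectionError("rollout collection returned no samples")
    return rollouts
```

Called once per step on whatever the collection stage returns, a check like this would turn a silent empty epoch into an immediate, descriptive failure.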

Additional context
I believe the config should also be updated and the number of rollouts decreased.

Labels: bug
