Describe the bug
We have some jobs in our nightly tests (e.g. vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1) that get killed by the Idle Job Reaper because we always request 8 gpus even if we are using less in the run.
Steps/Code to reproduce bug
Please list minimal steps or code snippet for us to be able to reproduce the bug.
A helpful guide on on how to craft a minimal bug report http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.
Describe the bug
We have some jobs in our nightly tests (e.g.
vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1) that get killed by the Idle Job Reaper because we always request 8 gpus even if we are using less in the run.RL/tools/launch
Line 165 in fa379ff
Steps/Code to reproduce bug
Please list minimal steps or code snippet for us to be able to reproduce the bug.
A helpful guide on on how to craft a minimal bug report http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.