Update tools/launch to take into account number of gpus to avoid reaper #1577

@yfw

Describe the bug

Some jobs in our nightly tests (e.g. vlm_grpo-qwen2.5-vl-3b-instruct-clevr-1n2g-dtensor2tp1.v1) get killed by the Idle Job Reaper because tools/launch always requests 8 GPUs, even when the run uses fewer.

RL/tools/launch, line 165 in fa379ff:

--gres=gpu:8 \\
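One possible direction, as a minimal sketch rather than the actual fix: derive the `--gres` value from the GPU count the job really uses instead of hardcoding 8. The helper name `gres_flag` and the variable `GPUS_PER_NODE` are hypothetical and do not come from tools/launch; the default of 8 only mirrors the current hardcoded value.

```shell
# Hypothetical helper: build the --gres flag from a requested GPU count.
# Neither gres_flag nor GPUS_PER_NODE exists in tools/launch; both are
# illustrative assumptions.
gres_flag() {
  gpus="${1:-8}"  # fall back to 8, matching the current hardcoded value
  # Reject anything that is not a positive integer without leading zeros.
  case "$gpus" in
    ''|*[!0-9]*|0*)
      echo "invalid GPU count: $gpus" >&2
      return 1
      ;;
  esac
  printf '%s%s\n' '--gres=gpu:' "$gpus"
}
```

A launcher could then substitute `$(gres_flag "${GPUS_PER_NODE:-8}")` where the hardcoded flag sits today, so a 2-GPU job requests only 2 GPUs.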

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.

Labels: bug (Something isn't working), qa_rcca_done (QA applies this label once RCCA is finished for the issue)