[None][fix] Cap infra-retry budget at 2 attempts total #14415
A failing SLURM stage that consistently hits the partition walltime
(observed: GB300-4_GPUs-PyTorch-Post-Merge-1 in L0_Test-SBSA-Multi-GPU/1219)
burned ~21 hours over 5 attempts. The dispatcher pod's
runLLMTestlistOnSlurm exhausted its SLURM-inner retry budget after 3
attempts, then runKubernetesPodWithInfraRetry's outer K8s pod retry
relaunched the dispatcher pod and restarted a fresh inner retry. Worst
case under the prior SLURM_INFRA_RETRY_MAX=2 + K8S_INFRA_RETRY_MAX=2
configuration was 9 attempts at ~4h each — ~36h.
The dominant retryable failure on these stages is AgentOfflineException
("Unable to create live FilePath") raised when the SLURM-allocated agent
dies. For tests that ran for hours before the agent disappeared this is
effectively a test-timeout proxy, not a transient blip worth multi-retry.
Two changes:
1. SLURM_INFRA_RETRY_MAX 2 → 1; K8S_INFRA_RETRY_MAX 2 → 1. Each layer
gets one retry, two attempts max.
2. SLURM dispatcher closures (x86, SBSA single-node, SBSA multi-node)
pass singleAttempt:true through to runKubernetesPodWithInfraRetry so
the outer K8s pod retry doesn't nest on top of the inner SLURM retry.
Net effect:
SLURM stages: 1 dispatcher pod × 2 SLURM attempts = 2 total
Non-SLURM K8s: 2 pods × 1 attempt = 2 total
runKubernetesPodWithInfraRetry now accepts an opts Map (singleAttempt
bool, default false) as its leading argument. Default-value resolution
keeps existing positional callers (sanity-check sites, the parallelJobs
dispatch) source-compatible. The parallelJobs dispatch reads an optional
third element from each list entry and forwards it as opts.
Trade-off: a real K8s eviction of a SLURM dispatcher pod (rare) now
fails the stage immediately rather than relaunching. Justified by the
~4.5× worst-case-cost reduction (9 attempts down to 2) for the common
test-timeout case. The
follow-up work to make AgentOfflineException-style failures structurally
non-retryable (via SLURM job-status check or PATTERN_CATALOG severity
tuning) is tracked separately.
Signed-off-by: Derek Pitman <[email protected]>
Description
A failing SLURM stage that consistently hits the partition walltime (observed: GB300-4_GPUs-PyTorch-Post-Merge-1 in L0_Test-SBSA-Multi-GPU/1219) burned ~21 hours over 5 attempts. The dispatcher pod's runLLMTestlistOnSlurm exhausted its SLURM-inner retry budget after 3 attempts, then runKubernetesPodWithInfraRetry's outer K8s pod retry relaunched the dispatcher pod and restarted a fresh inner retry. Worst case under the prior SLURM_INFRA_RETRY_MAX=2 + K8S_INFRA_RETRY_MAX=2 configuration was 9 attempts at ~4h each — ~36h.
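The worst-case figure follows from the two retry layers multiplying. A minimal sketch of that arithmetic, written in Python rather than the pipeline's own language, and assuming each `*_RETRY_MAX` value counts retries on top of the first attempt (which matches the "3 attempts" observed with a budget of 2). The function name and the flat 4-hour cost per attempt are illustrative assumptions:

```python
def worst_case(slurm_retry_max, k8s_retry_max, hours_per_attempt):
    """Return (total attempts, total hours) when every attempt fails.

    Assumes each retry budget is the number of retries after the
    first attempt, and that the outer K8s pod retry restarts a fresh
    inner SLURM retry loop each time.
    """
    attempts_per_pod = slurm_retry_max + 1   # inner SLURM layer
    pods = k8s_retry_max + 1                 # outer K8s pod layer
    attempts = attempts_per_pod * pods
    return attempts, attempts * hours_per_attempt


# Prior configuration: SLURM_INFRA_RETRY_MAX=2, K8S_INFRA_RETRY_MAX=2, ~4h each
print(worst_case(2, 2, 4))  # (9, 36)
```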
The dominant retryable failure on these stages is AgentOfflineException ("Unable to create live FilePath") raised when the SLURM-allocated agent dies. For tests that ran for hours before the agent disappeared this is effectively a test-timeout proxy, not a transient blip worth multi-retry.
Two changes:
1. SLURM_INFRA_RETRY_MAX 2 → 1; K8S_INFRA_RETRY_MAX 2 → 1. Each layer gets one retry, two attempts max.
2. SLURM dispatcher closures (x86, SBSA single-node, SBSA multi-node) pass singleAttempt:true through to runKubernetesPodWithInfraRetry so the outer K8s pod retry doesn't nest on top of the inner SLURM retry.
Net effect:
SLURM stages: 1 dispatcher pod × 2 SLURM attempts = 2 total
Non-SLURM K8s: 2 pods × 1 attempt = 2 total
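To illustrate why both changes are needed, here is a small Python model of the two nested retry layers (not the actual Groovy pipeline code). It counts test executions for a stage that always fails with a retryable infra error. `InfraFailure`, `run_with_retry`, and `simulate` are hypothetical names introduced for this sketch only:

```python
class InfraFailure(Exception):
    """Stand-in for a retryable infra error such as AgentOfflineException."""


def run_with_retry(body, retry_max):
    """Run body, allowing up to retry_max retries after the first attempt."""
    for attempt in range(retry_max + 1):
        try:
            return body()
        except InfraFailure:
            if attempt == retry_max:
                raise  # budget exhausted, propagate to the outer layer


def simulate(slurm_stage, single_attempt, slurm_retry_max=1, k8s_retry_max=1):
    """Count test executions for a stage whose every attempt fails."""
    executions = 0

    def run_tests():
        nonlocal executions
        executions += 1
        raise InfraFailure("agent went offline")

    def pod_body():
        # A SLURM dispatcher pod wraps the tests in the inner SLURM retry;
        # a plain K8s stage runs them once per pod.
        if slurm_stage:
            return run_with_retry(run_tests, slurm_retry_max)
        return run_tests()

    # singleAttempt disables the outer K8s pod retry entirely.
    outer_retries = 0 if single_attempt else k8s_retry_max
    try:
        run_with_retry(pod_body, outer_retries)
    except InfraFailure:
        pass  # the stage fails; only the execution count matters here
    return executions


print(simulate(slurm_stage=True, single_attempt=True))    # 2
print(simulate(slurm_stage=False, single_attempt=False))  # 2
print(simulate(slurm_stage=True, single_attempt=False))   # 4
```

The last call shows what lowering the budgets alone would give: without singleAttempt, a SLURM stage would still nest the two layers and reach 4 attempts.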
runKubernetesPodWithInfraRetry now accepts an opts Map (singleAttempt bool, default false) as its leading argument. Default-value resolution keeps existing positional callers (sanity-check sites, the parallelJobs dispatch) source-compatible. The parallelJobs dispatch reads an optional third element from each list entry and forwards it as opts.
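The calling convention described above can be sketched as follows. This is a Python approximation of the idea, not the real Groovy signature, which this description does not show. The positional parameters `stage_name` and `body`, the `dispatch` helper, and the shape of the job entries are assumptions made for illustration; the model only reports how many pod launches would be allowed and launches nothing:

```python
K8S_INFRA_RETRY_MAX = 1  # value after this PR


def run_kubernetes_pod_with_infra_retry(*args):
    """Model of the optional leading opts map (launches nothing)."""
    # A leading dict is treated as opts; otherwise opts defaults to empty,
    # so callers that pass only positional arguments keep working.
    if args and isinstance(args[0], dict):
        opts, positional = args[0], args[1:]
    else:
        opts, positional = {}, args
    stage_name, body = positional  # hypothetical positional parameters
    single_attempt = bool(opts.get("singleAttempt", False))
    max_pods = 1 if single_attempt else K8S_INFRA_RETRY_MAX + 1
    return stage_name, max_pods


def dispatch(parallel_jobs):
    """Forward an optional third element of each entry as opts."""
    allowed = {}
    for entry in parallel_jobs:
        stage_name, body = entry[0], entry[1]
        opts = entry[2] if len(entry) > 2 else {}
        _, max_pods = run_kubernetes_pod_with_infra_retry(opts, stage_name, body)
        allowed[stage_name] = max_pods
    return allowed


jobs = [
    ["k8s-stage", None],                            # existing two-element entry
    ["slurm-stage", None, {"singleAttempt": True}], # SLURM dispatcher entry
]
print(dispatch(jobs))  # {'k8s-stage': 2, 'slurm-stage': 1}
```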
Trade-off: a real K8s eviction of a SLURM dispatcher pod (rare) now fails the stage immediately rather than relaunching. Justified by the ~4.5× worst-case-cost reduction (9 attempts down to 2) for the common test-timeout case. The follow-up work to make AgentOfflineException-style failures structurally non-retryable (via SLURM job-status check or PATTERN_CATALOG severity tuning) is tracked separately (TRTLLMINF-103).
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.