
[None][fix] Cap infra-retry budget at 2 attempts total #14415

Merged: dpitman-nvda merged 2 commits into NVIDIA:main from dpitman-nvda:fix/cap-infra-retry-budget on May 22, 2026

Conversation

@dpitman-nvda
Collaborator

@dpitman-nvda dpitman-nvda commented May 21, 2026

Summary by CodeRabbit

  • Chores
    • Optimized continuous integration infrastructure retry budgets and parameters to improve testing feedback cycles and reduce unnecessary resource waste in CI pipelines.
    • Enhanced Kubernetes pod execution controls with improved retry-loop management and configurable options, providing more granular control over retry behavior in distributed testing environments while preventing unnecessary cascading retry operations.


Description

A failing SLURM stage that consistently hits the partition walltime (observed: GB300-4_GPUs-PyTorch-Post-Merge-1 in L0_Test-SBSA-Multi-GPU/1219) burned ~21 hours over 5 attempts. The dispatcher pod's runLLMTestlistOnSlurm exhausted its SLURM-inner retry budget after 3 attempts, then runKubernetesPodWithInfraRetry's outer K8s pod retry relaunched the dispatcher pod and restarted a fresh inner retry. Worst case under the prior SLURM_INFRA_RETRY_MAX=2 + K8S_INFRA_RETRY_MAX=2 configuration was 9 attempts (3 pod launches × 3 SLURM attempts each) at ~4h each, or ~36h.
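The worst-case figure follows from the two retry layers multiplying: each layer runs one initial attempt plus its retry budget. A minimal sketch of that arithmetic, written in Python rather than the pipeline's Groovy; the function names are illustrative assumptions and the 4-hour figure is the approximate per-attempt walltime quoted above, not code from jenkins/L0_Test.groovy.

```python
def worst_case_attempts(slurm_retry_max: int, k8s_retry_max: int) -> int:
    """Upper bound on test attempts when the K8s pod retry wraps the SLURM retry."""
    slurm_attempts_per_pod = slurm_retry_max + 1  # initial attempt + retries
    pod_launches = k8s_retry_max + 1              # initial pod + relaunches
    return pod_launches * slurm_attempts_per_pod


def worst_case_hours(attempts: int, hours_per_attempt: float = 4.0) -> float:
    # ~4h per attempt is the approximate walltime from the description (assumption).
    return attempts * hours_per_attempt


prior = worst_case_attempts(slurm_retry_max=2, k8s_retry_max=2)
print(prior, worst_case_hours(prior))  # 9 36.0
```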

The dominant retryable failure on these stages is AgentOfflineException ("Unable to create live FilePath"), raised when the SLURM-allocated agent dies. For tests that ran for hours before the agent disappeared, this is effectively a test-timeout proxy, not a transient blip worth multiple retries.

Two changes:

  1. SLURM_INFRA_RETRY_MAX 2 → 1; K8S_INFRA_RETRY_MAX 2 → 1. Each layer gets one retry, two attempts max.

  2. SLURM dispatcher closures (x86, SBSA single-node, SBSA multi-node) pass singleAttempt:true through to runKubernetesPodWithInfraRetry so the outer K8s pod retry doesn't nest on top of the inner SLURM retry. Net effect:
    SLURM stages: 1 dispatcher pod × 2 SLURM attempts = 2 total
    Non-SLURM K8s: 2 pods × 1 attempt = 2 total
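The interaction of the two changes can be checked with a small simulation. This is a hedged sketch in Python, not the Groovy implementation: `run_stage` and its parameters are hypothetical names, and it assumes every attempt fails with a retryable infra error so that both budgets are fully consumed.

```python
def run_stage(slurm_retry_max, k8s_retry_max, single_attempt=False, is_slurm=True):
    """Count test attempts for a stage in which every attempt fails retryably."""
    attempts = 0
    # singleAttempt bypasses the outer K8s pod retry: exactly one pod launch.
    pod_launches = 1 if single_attempt else k8s_retry_max + 1
    for _pod in range(pod_launches):
        # Only SLURM dispatcher pods run the inner SLURM retry loop.
        inner_attempts = slurm_retry_max + 1 if is_slurm else 1
        for _try in range(inner_attempts):
            attempts += 1  # this attempt fails, so the loop continues
    return attempts


print(run_stage(2, 2))                       # 9  (prior configuration)
print(run_stage(1, 1))                       # 4  (change 1 alone, SLURM stage)
print(run_stage(1, 1, single_attempt=True))  # 2  (both changes, SLURM stage)
print(run_stage(1, 1, is_slurm=False))       # 2  (non-SLURM K8s stage)
```

Under these assumptions, lowering the constants alone would still leave SLURM stages at 4 attempts; the singleAttempt flag is what brings them down to 2.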

runKubernetesPodWithInfraRetry now accepts an opts Map (singleAttempt bool, default false) as its leading argument. Default-value resolution keeps existing positional callers (sanity-check sites, the parallelJobs dispatch) source-compatible. The parallelJobs dispatch reads an optional third element from each list entry and forwards it as opts.
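The opts plumbing can be sketched as follows. This is an illustrative Python model, not the Groovy code: Python cannot put a defaulted parameter first, so `opts` is a trailing optional argument here, and `InfraError`, `run_pod_with_infra_retry`, `dispatch`, and the `[label, body, opts]` entry layout are all assumptions made for the example. Only the singleAttempt key and the optional third list element come from the description.

```python
K8S_INFRA_RETRY_MAX = 1  # value after this PR


class InfraError(Exception):
    """Hypothetical stand-in for a retryable infrastructure failure."""


def run_pod_with_infra_retry(label, body, opts=None):
    """Run body, retrying on InfraError unless opts['singleAttempt'] is set."""
    opts = opts or {}  # callers that omit opts keep the old behaviour
    single_attempt = bool(opts.get("singleAttempt", False))
    max_attempts = 1 if single_attempt else K8S_INFRA_RETRY_MAX + 1
    for attempt in range(1, max_attempts + 1):
        try:
            body()
            return attempt  # number of attempts it took to succeed
        except InfraError:
            if attempt == max_attempts:
                raise  # budget exhausted: fail the stage


def dispatch(parallel_jobs):
    """Forward an optional third list element to the retry wrapper as opts."""
    results = {}
    for name, values in parallel_jobs.items():
        label, body = values[0], values[1]
        opts = values[2] if len(values) > 2 else {}
        results[name] = run_pod_with_infra_retry(label, body, opts)
    return results
```

In this model a two-element entry such as `["label", body]` still gets one retry, while `["label", body, {"singleAttempt": True}]` fails on the first InfraError.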

Trade-off: a real K8s eviction of a SLURM dispatcher pod (rare) now fails the stage immediately rather than relaunching. Justified by the ~4.5× worst-case-cost reduction (9 attempts down to 2) for the common test-timeout case. The follow-up work to make AgentOfflineException-style failures structurally non-retryable (via SLURM job-status check or PATTERN_CATALOG severity tuning) is tracked separately (TRTLLMINF-103).

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Derek Pitman <[email protected]>
@coderabbitai
Contributor

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Walkthrough

This PR reduces infrastructure retry budgets from 2 to 1 and adds an opts parameter to runKubernetesPodWithInfraRetry to support a singleAttempt flag that bypasses the outer K8s infra-retry loop. SLURM dispatcher stages (x86 and arm/SBSA) are updated to use this flag, preventing nested retry cascades between outer K8s and inner SLURM retry loops. The option is wired through test job launching to enable per-stage configuration.

Changes

Retry infrastructure budget and control flow

| Layer / File(s) | Summary |
| --- | --- |
| Retry budget constants (jenkins/L0_Test.groovy) | SLURM_INFRA_RETRY_MAX and K8S_INFRA_RETRY_MAX reduced from 2 to 1 with updated documentation explaining retry layer interaction. |
| runKubernetesPodWithInfraRetry opts parameter (jenkins/L0_Test.groovy) | Function signature updated to accept opts map (defaulting to {}), supporting singleAttempt flag to bypass outer K8s infra-retry loop and run with attemptTag="" and isFinalAttempt=true on first call; DEBUG_MODE continues to force single-attempt. |
| Test job launching integration (jenkins/L0_Test.groovy) | launchTestJobs extracts optional third element (values[2]) as opts map and passes it into runKubernetesPodWithInfraRetry, enabling per-stage configuration like singleAttempt. |
| x86 SLURM dispatcher singleAttempt wiring (jenkins/L0_Test.groovy) | x86 SLURM dispatcher pod stages attach [singleAttempt: true] to stage configuration, preventing nested K8s retry budget consumption on top of SLURM's inner retry loop. |
| arm/SBSA SLURM dispatcher singleAttempt wiring (jenkins/L0_Test.groovy) | arm/SBSA SLURM dispatcher pod stages attach [singleAttempt: true], ensuring only the inner SLURM retry loop governs retries. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • mzweilz
  • zeroepoch
  • tburt-nv
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description comprehensively explains the problem, solution, and trade-offs; the Test Coverage section is empty, which is a required section per the template. | Complete the Test Coverage section by specifying which test cases safeguard these retry-budget changes, such as existing L0_Test stages or new test scenarios. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: capping infrastructure retry budget at 2 attempts total by reducing SLURM_INFRA_RETRY_MAX and K8S_INFRA_RETRY_MAX from 2 to 1. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@dpitman-nvda
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #49722 [ run ] triggered by Bot. Commit: 9e203d0 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49722 [ run ] completed with state SUCCESS. Commit: 9e203d0
/LLM/main/L0_MergeRequest_PR pipeline #39328 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #49730 [ run ] triggered by Bot. Commit: 9e203d0 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49730 [ run ] completed with state SUCCESS. Commit: 9e203d0
/LLM/main/L0_MergeRequest_PR pipeline #39335 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda
Collaborator Author

/bot run --disable-fail-fast

@dpitman-nvda
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49768 [ run ] triggered by Bot. Commit: 8839d44 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49768 [ run ] completed with state SUCCESS. Commit: 8839d44
/LLM/main/L0_MergeRequest_PR pipeline #39366 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #49945 [ run ] triggered by Bot. Commit: 8839d44 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49945 [ run ] completed with state SUCCESS. Commit: 8839d44
/LLM/main/L0_MergeRequest_PR pipeline #39516 completed with status: 'SUCCESS'

CI Report

Link to invocation

@dpitman-nvda dpitman-nvda merged commit 25f7ca8 into NVIDIA:main May 22, 2026
7 checks passed
KleinBlueC pushed a commit to KleinBlueC/TensorRT-LLM that referenced this pull request May 26, 2026
bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026
3 participants