[None][fix] Cap infra-retry budget at 2 attempts total by dpitman-nvda · Pull Request #14415 · NVIDIA/TensorRT-LLM

dpitman-nvda · 2026-05-21T13:58:08Z

Summary by CodeRabbit

Chores
- Optimized continuous integration infrastructure retry budgets and parameters to improve testing feedback cycles and reduce unnecessary resource waste in CI pipelines.
- Enhanced Kubernetes pod execution controls with improved retry-loop management and configurable options, providing more granular control over retry behavior in distributed testing environments while preventing unnecessary cascading retry operations.

Description

A failing SLURM stage that consistently hits the partition walltime (observed: GB300-4_GPUs-PyTorch-Post-Merge-1 in L0_Test-SBSA-Multi-GPU/1219) burned ~21 hours over 5 attempts. The dispatcher pod's runLLMTestlistOnSlurm exhausted its SLURM-inner retry budget after 3 attempts, then runKubernetesPodWithInfraRetry's outer K8s pod retry relaunched the dispatcher pod, which started a fresh inner retry cycle. Worst case under the prior SLURM_INFRA_RETRY_MAX=2 + K8S_INFRA_RETRY_MAX=2 configuration was 9 attempts (3 dispatcher pod launches × 3 SLURM attempts each) at ~4h each, i.e. ~36h.

The dominant retryable failure on these stages is AgentOfflineException ("Unable to create live FilePath"), raised when the SLURM-allocated agent dies. For tests that ran for hours before the agent disappeared, this is effectively a proxy for a test timeout, not a transient blip worth multiple retries.

Two changes:

1. SLURM_INFRA_RETRY_MAX 2 → 1; K8S_INFRA_RETRY_MAX 2 → 1. Each layer gets one retry, two attempts max.
2. SLURM dispatcher closures (x86, SBSA single-node, SBSA multi-node) pass singleAttempt:true through to runKubernetesPodWithInfraRetry so the outer K8s pod retry doesn't nest on top of the inner SLURM retry.

Net effect:
- SLURM stages: 1 dispatcher pod × 2 SLURM attempts = 2 total
- Non-SLURM K8s: 2 pods × 1 attempt = 2 total
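
The attempt counts above follow from multiplying the two retry layers. The following is a minimal Python sketch of that arithmetic, not code from the PR: `worst_case_attempts` and `HOURS_PER_ATTEMPT` are hypothetical names, and the model assumes every outer attempt restarts a full inner retry cycle, as the description reports.

```python
def worst_case_attempts(outer_retries: int, inner_retries: int) -> int:
    """Total attempts when each outer attempt restarts a full inner retry cycle."""
    return (outer_retries + 1) * (inner_retries + 1)


HOURS_PER_ATTEMPT = 4  # the "~4h each" figure quoted in the description

# Prior configuration: K8S_INFRA_RETRY_MAX=2 nested over SLURM_INFRA_RETRY_MAX=2.
prior_slurm = worst_case_attempts(outer_retries=2, inner_retries=2)

# New constants alone (both 1), if the two layers were still nested.
nested_new = worst_case_attempts(outer_retries=1, inner_retries=1)

# SLURM stage after this PR: singleAttempt removes the outer retry.
new_slurm = worst_case_attempts(outer_retries=0, inner_retries=1)

# Non-SLURM K8s stage: one pod retry, no inner SLURM loop.
new_k8s = worst_case_attempts(outer_retries=1, inner_retries=0)

print(prior_slurm, prior_slurm * HOURS_PER_ATTEMPT)  # 9 36
print(nested_new)                                    # 4
print(new_slurm, new_k8s)                            # 2 2
```

Under this model, lowering both constants to 1 without singleAttempt would still allow 4 attempts on a SLURM stage, which is why the second change is needed to reach 2.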

runKubernetesPodWithInfraRetry now accepts an opts Map (singleAttempt bool, default false) as its leading argument. Default-value resolution keeps existing positional callers (sanity-check sites, the parallelJobs dispatch) source-compatible. The parallelJobs dispatch reads an optional third element from each list entry and forwards it as opts.
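
As a rough illustration of the opts plumbing, here is a Python model of the behaviour described above. The real code is Groovy in jenkins/L0_Test.groovy and is not reproduced here. The names `run_pod_with_infra_retry`, `launch_test_jobs`, and `InfraFailure`, and the meaning of the first two list elements, are assumptions made for this sketch; only the optional third element and the `singleAttempt` key come from the PR text.

```python
K8S_INFRA_RETRY_MAX = 1  # value after this PR, per the description


class InfraFailure(Exception):
    """Stand-in for a retryable infrastructure error (hypothetical)."""


def run_pod_with_infra_retry(body, opts=None):
    """Run body, retrying on InfraFailure unless opts requests a single attempt."""
    opts = opts or {}  # callers that pass no opts keep the default behaviour
    single = opts.get("singleAttempt", False)
    max_attempts = 1 if single else K8S_INFRA_RETRY_MAX + 1
    for attempt in range(1, max_attempts + 1):
        try:
            return body()
        except InfraFailure:
            if attempt == max_attempts:
                raise  # budget exhausted: fail the stage


def launch_test_jobs(parallel_jobs):
    """Dispatch each job and report how many pod launches it consumed."""
    launches = {}
    for name, values in parallel_jobs.items():
        body = values[1]  # assumed layout: [pod spec, stage body, optional opts]
        opts = values[2] if len(values) > 2 else {}
        count = 0

        def counted_body():
            nonlocal count
            count += 1
            return body()

        try:
            run_pod_with_infra_retry(counted_body, opts)
        except InfraFailure:
            pass  # the stage failed; only the launch count matters here
        launches[name] = count
    return launches


def always_fail():
    raise InfraFailure("agent went offline")


jobs = {
    "slurm-dispatcher": ["spec", always_fail, {"singleAttempt": True}],
    "k8s-stage": ["spec", always_fail],  # two-element entry, no opts
}
print(launch_test_jobs(jobs))  # {'slurm-dispatcher': 1, 'k8s-stage': 2}
```

Two-element entries behave exactly as before, which mirrors the source-compatibility claim for existing callers.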

Trade-off: a real K8s eviction of a SLURM dispatcher pod (rare) now fails the stage immediately rather than relaunching. Justified by the ~10× worst-case-cost reduction for the common test-timeout case. The follow-up work to make AgentOfflineException-style failures structurally non-retryable (via SLURM job-status check or PATTERN_CATALOG severity tuning) is tracked separately (TRTLLMINF-103).

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: Derek Pitman <[email protected]>

coderabbitai · 2026-05-21T14:00:26Z

Walkthrough

This PR reduces infrastructure retry budgets from 2 to 1 and adds an opts parameter to runKubernetesPodWithInfraRetry to support a singleAttempt flag that bypasses the outer K8s infra-retry loop. SLURM dispatcher stages (x86 and arm/SBSA) are updated to use this flag, preventing nested retry cascades between outer K8s and inner SLURM retry loops. The option is wired through test job launching to enable per-stage configuration.

Changes

Retry infrastructure budget and control flow

| Layer / File(s) | Summary |
|---|---|
| Retry budget constants `jenkins/L0_Test.groovy` | SLURM_INFRA_RETRY_MAX and K8S_INFRA_RETRY_MAX reduced from 2 to 1 with updated documentation explaining retry layer interaction. |
| runKubernetesPodWithInfraRetry opts parameter `jenkins/L0_Test.groovy` | Function signature updated to accept `opts` map (defaulting to `{}`), supporting `singleAttempt` flag to bypass outer K8s infra-retry loop and run with `attemptTag=""` and `isFinalAttempt=true` on first call; `DEBUG_MODE` continues to force single-attempt. |
| Test job launching integration `jenkins/L0_Test.groovy` | launchTestJobs extracts optional third element (`values[2]`) as `opts` map and passes it into `runKubernetesPodWithInfraRetry`, enabling per-stage configuration like `singleAttempt`. |
| x86 SLURM dispatcher singleAttempt wiring `jenkins/L0_Test.groovy` | x86 SLURM dispatcher pod stages attach `[singleAttempt: true]` to stage configuration, preventing nested K8s retry budget consumption on top of SLURM's inner retry loop. |
| arm/SBSA SLURM dispatcher singleAttempt wiring `jenkins/L0_Test.groovy` | arm/SBSA SLURM dispatcher pod stages attach `[singleAttempt: true]`, ensuring only the inner SLURM retry loop governs retries. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#13809 ([TRTLLMINF-54][feat] SlurmConfig boundary throws typed InfraFailure): Both PRs adjust nested SLURM-within-Kubernetes retry behavior, #13809 by throwing a typed SLURM InfraFailure to prevent outer K8s retries, and this PR by adding singleAttempt plumbing and lowering retry caps.
NVIDIA/TensorRT-LLM#14269: Changes to runKubernetesPodWithInfraRetry and SLURM dispatcher invocation to control the outer K8s retry layer and avoid nesting with inner SLURM retries at the same coordination points in jenkins/L0_Test.groovy.

Suggested reviewers

mzweilz
zeroepoch
tburt-nv

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description comprehensively explains the problem, solution, and trade-offs; the Test Coverage section is empty, which is a required section per the template. | Complete the Test Coverage section by specifying which test cases safeguard these retry-budget changes, such as existing L0_Test stages or new test scenarios. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: capping infrastructure retry budget at 2 attempts total by reducing SLURM_INFRA_RETRY_MAX and K8S_INFRA_RETRY_MAX from 2 to 1. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


dpitman-nvda · 2026-05-21T15:01:27Z

/bot run

tensorrt-cicd · 2026-05-21T15:08:23Z

PR_Github #49722 [ run ] triggered by Bot. Commit: 9e203d0 Link to invocation

tensorrt-cicd · 2026-05-21T15:58:16Z

PR_Github #49722 [ run ] completed with state SUCCESS. Commit: 9e203d0
/LLM/main/L0_MergeRequest_PR pipeline #39328 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

dpitman-nvda · 2026-05-21T16:12:06Z

/bot run

tensorrt-cicd · 2026-05-21T16:18:51Z

PR_Github #49730 [ run ] triggered by Bot. Commit: 9e203d0 Link to invocation

tensorrt-cicd · 2026-05-21T19:50:56Z

PR_Github #49730 [ run ] completed with state SUCCESS. Commit: 9e203d0
/LLM/main/L0_MergeRequest_PR pipeline #39335 completed with status: 'FAILURE'


dpitman-nvda · 2026-05-21T20:53:37Z

/bot run --disable-fail-fast

Signed-off-by: dpitman-nvda <[email protected]>

dpitman-nvda · 2026-05-21T21:05:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-21T21:11:28Z

PR_Github #49768 [ run ] triggered by Bot. Commit: 8839d44 Link to invocation

tensorrt-cicd · 2026-05-22T04:33:38Z

PR_Github #49768 [ run ] completed with state SUCCESS. Commit: 8839d44
/LLM/main/L0_MergeRequest_PR pipeline #39366 completed with status: 'FAILURE'


dpitman-nvda · 2026-05-22T13:47:31Z

/bot run

tensorrt-cicd · 2026-05-22T13:52:57Z

PR_Github #49945 [ run ] triggered by Bot. Commit: 8839d44 Link to invocation

tensorrt-cicd · 2026-05-22T17:08:15Z

PR_Github #49945 [ run ] completed with state SUCCESS. Commit: 8839d44
/LLM/main/L0_MergeRequest_PR pipeline #39516 completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: Derek Pitman <[email protected]> Signed-off-by: dpitman-nvda <[email protected]>

dpitman-nvda requested review from a team as code owners May 21, 2026 13:58

dpitman-nvda requested review from mlefeb01 and zeroepoch May 21, 2026 13:58

github-actions Bot assigned dpitman-nvda May 21, 2026

zeroepoch approved these changes May 21, 2026


Merge branch 'main' into fix/cap-infra-retry-budget

8839d44

Signed-off-by: dpitman-nvda <[email protected]>

dpitman-nvda merged commit 25f7ca8 into NVIDIA:main May 22, 2026
7 checks passed

KleinBlueC pushed a commit to KleinBlueC/TensorRT-LLM that referenced this pull request May 26, 2026

[None][fix] Cap infra-retry budget at 2 attempts total (NVIDIA#14415)

156d2e6

Signed-off-by: Derek Pitman <[email protected]> Signed-off-by: dpitman-nvda <[email protected]>

bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026

[None][fix] Cap infra-retry budget at 2 attempts total (NVIDIA#14415)

d802f10

Signed-off-by: Derek Pitman <[email protected]> Signed-off-by: dpitman-nvda <[email protected]>
