
[None][infra] Handle sacct error when checking slurm job status#14367

Merged
yuanjingx87 merged 2 commits into NVIDIA:main from yuanjingx87:user/yuanjingx/handle_sacct_error
May 21, 2026


@yuanjingx87
Collaborator

@yuanjingx87 yuanjingx87 commented May 20, 2026

Summary by CodeRabbit

  • Improvements
    • Enhanced test execution robustness by adding transient failure tolerance to job status tracking. The test runner now retries on temporary infrastructure failures instead of immediately failing, reducing false test failures and improving overall test reliability.


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@yuanjingx87 yuanjingx87 requested review from a team as code owners May 20, 2026 17:13
@yuanjingx87 yuanjingx87 marked this pull request as draft May 20, 2026 17:13
@yuanjingx87
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #49462 [ run ] triggered by Bot. Commit: a98a776 Link to invocation

@yuanjingx87 yuanjingx87 force-pushed the user/yuanjingx/handle_sacct_error branch 2 times, most recently from f017770 to 709b0ed Compare May 20, 2026 18:45
@yuanjingx87
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #49471 [ run ] triggered by Bot. Commit: 709b0ed Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49462 [ run ] completed with state ABORTED. Commit: a98a776

Link to invocation

@yuanjingx87 yuanjingx87 force-pushed the user/yuanjingx/handle_sacct_error branch from 709b0ed to 73a7c8b Compare May 20, 2026 19:23
@yuanjingx87
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #49471 [ run ] completed with state FAILURE. Commit: 709b0ed
/LLM/main/L0_MergeRequest_PR pipeline #39112 (Partly Tested) completed with status: 'ABORTED'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49474 [ run ] triggered by Bot. Commit: 73a7c8b Link to invocation

Signed-off-by: Yuanjing Xue <[email protected]>
@yuanjingx87 yuanjingx87 force-pushed the user/yuanjingx/handle_sacct_error branch from 73a7c8b to 3ec344c Compare May 20, 2026 20:52
@yuanjingx87
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #49488 [ run ] triggered by Bot. Commit: 3ec344c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49474 [ run ] completed with state ABORTED. Commit: 73a7c8b

Link to invocation

Signed-off-by: Yuanjing Xue <[email protected]>
@yuanjingx87
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-Post-Merge-1"

@yuanjingx87
Collaborator Author

/bot run --stage-list "DGX_B200-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #49501 [ run ] triggered by Bot. Commit: a0188c5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49500 [ ] completed with state ABORTED. Commit: a0188c5

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49488 [ run ] completed with state ABORTED. Commit: 3ec344c

Link to invocation

@yuanjingx87 yuanjingx87 marked this pull request as ready for review May 20, 2026 23:20
@coderabbitai
Contributor

coderabbitai Bot commented May 20, 2026

Walkthrough

The runLLMTestlistWithSbatch function's Slurm job status polling loop now handles transient sacct command failures by logging a warning and retrying after 60 seconds instead of failing or continuing with invalid state.

Changes

Slurm job status polling resilience

| Layer / File(s) | Summary |
| --- | --- |
| sacct failure handling in Slurm tracking loop (jenkins/L0_Test.groovy) | When sacct exits non-zero during Slurm status polling, the script logs a warning message, sleeps 60 seconds, and continues the polling loop without altering downstream state logic. |
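
The retry behavior summarized above could be sketched roughly as follows. This is an illustrative Python sketch, not the Groovy code from jenkins/L0_Test.groovy, which is not shown in this conversation. The names `run_sacct`, `wait_for_job`, and `ACTIVE_STATES`, the particular sacct flags, the polling interval, and the handling of empty output are all assumptions made for the example; only the "warn, sleep 60 seconds, continue polling" step comes from the summary.

```python
import subprocess
import time
from typing import Callable, Tuple

# States treated as "job still in progress" in this sketch (an assumption,
# not taken from the PR).
ACTIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING"}


def run_sacct(job_id: str) -> Tuple[int, str]:
    """Ask Slurm accounting for the job state; return (exit_code, stdout)."""
    proc = subprocess.run(
        ["sacct", "-j", job_id, "-X", "--format=State",
         "--noheader", "--parsable2"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


def wait_for_job(
    job_id: str,
    query: Callable[[str], Tuple[int, str]] = run_sacct,
    sleep: Callable[[float], None] = time.sleep,
    retry_delay: float = 60,
    poll_interval: float = 60,
) -> str:
    """Poll until the job leaves the active states and return its final state.

    A non-zero sacct exit is treated as transient: warn, wait, poll again.
    Note that this sketch retries without an upper bound.
    """
    while True:
        exit_code, output = query(job_id)
        if exit_code != 0:
            print(f"WARNING: sacct exited with {exit_code} for job {job_id}; "
                  f"retrying in {retry_delay}s")
            sleep(retry_delay)
            continue  # skip state handling; the output cannot be trusted

        # First token of the first line, so "CANCELLED by 1234" -> "CANCELLED".
        state = output.splitlines()[0].split()[0] if output else ""
        if not state or state in ACTIVE_STATES:
            sleep(poll_interval)
            continue
        return state
```

Passing `query` and `sleep` as parameters keeps the loop testable without a Slurm cluster: a stub can return a failing exit code first and a terminal state afterwards, and the recorded sleeps show that the failure path waited instead of aborting.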

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description lacks substantive content in all key sections: no actual description of the issue or solution, no test coverage documentation, and the PR checklist is completed without verification of required items. | Author should provide: (1) a clear explanation of the sacct error issue and why error-handling is needed, (2) specific test cases that verify the new error-handling behavior, and (3) verification that each checklist item has been genuinely addressed. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main change: handling sacct errors when checking Slurm job status, which directly relates to the code modification. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@tensorrt-cicd
Collaborator

PR_Github #49501 [ run ] completed with state SUCCESS. Commit: a0188c5
/LLM/main/L0_MergeRequest_PR pipeline #39138 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@yuanjingx87
Collaborator Author

/bot skip --comment "Single sbatch slurm run is sufficient"

@tensorrt-cicd
Collaborator

PR_Github #49599 [ skip ] triggered by Bot. Commit: a0188c5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49599 [ skip ] completed with state SUCCESS. Commit: a0188c5
Skipping testing for commit a0188c5

Link to invocation

@yuanjingx87 yuanjingx87 merged commit c7d609b into NVIDIA:main May 21, 2026
12 of 14 checks passed
xxi-nv pushed a commit to xxi-nv/TensorRT-LLM that referenced this pull request May 22, 2026
bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026


4 participants