[https://nvbugs/6076767][fix] Add barrier before warmup to prevent PP hang with guided decoding #13132

Merged
ziyixiong-nv merged 1 commit into NVIDIA:main from ziyixiong-nv:repair-bot-bug6076767
May 14, 2026

Conversation

@ziyixiong-nv
Collaborator

@ziyixiong-nv ziyixiong-nv commented Apr 16, 2026

Summary

  • Fix for NVBugs 6076767: [TensorRT-LLM][release/1.2.1]: accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_guided_decoding_4gpus[llguidance] hangs consistently
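  • Root cause: with pipeline parallelism and guided decoding (llguidance backend), ranks on the last PP stage can be delayed by guided-decoder initialization while earlier PP stages skip it. The earlier stages then enter warmup and issue pp_send operations before the later stages have entered warmup, which deadlocks NCCL communication.
  • Fix: add a dist.barrier() call in PyExecutor.__init__ immediately before the warmup phase so that all ranks are synchronized before any PP communication begins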
  • Automated fix generated by repair-bot

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Chores
    • Added global synchronization during initialization for pipeline-parallel execution stages to ensure coordinated startup behavior across all ranks.

@ziyixiong-nv ziyixiong-nv requested a review from a team as a code owner April 16, 2026 19:55
@ziyixiong-nv ziyixiong-nv requested a review from lfr-0531 April 16, 2026 19:55
@coderabbitai
Contributor

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Added a distributed barrier synchronization in PyExecutor.__init__ before warmup initialization. All pipeline-parallel ranks now wait at this point during initialization to ensure synchronized warmup startup across all stages.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Synchronization Coordination: tensorrt_llm/_torch/pyexecutor/py_executor.py | Added self.dist.barrier() call immediately before entering warmup phase to enforce all-rank synchronization in pipeline-parallel execution. |
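
The placement described above can be illustrated with a minimal sketch. It is not the actual PyExecutor code: `ExecutorSketch`, `RecordingDist`, `_setup`, and `_warmup` are hypothetical names used only to show the call order, and the stub `barrier()` merely records the call instead of blocking until every rank arrives.

```python
class RecordingDist:
    """Hypothetical stand-in for the executor's distributed helper."""

    def __init__(self, log):
        self.log = log

    def barrier(self):
        # A real barrier blocks until every rank has reached this point.
        # This stub only records that the call happened.
        self.log.append("barrier")


class ExecutorSketch:
    """Illustrative only; mirrors the ordering, not the real constructor."""

    def __init__(self, dist, log):
        self.dist = dist
        self.log = log
        self._setup()        # rank-dependent initialization; duration may differ per rank
        self.dist.barrier()  # every rank waits here before warmup
        self._warmup()       # pipeline-parallel send/recv starts only after the barrier

    def _setup(self):
        self.log.append("setup")

    def _warmup(self):
        self.log.append("warmup")


log = []
ExecutorSketch(RecordingDist(log), log)
print(log)
```

The point of the ordering is that no rank can reach `_warmup` until all ranks have finished `_setup`, regardless of how long setup takes on any one rank.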

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The description covers the Summary and Test Coverage sections well and includes links, but lacks specific details in some sections and omits verification of PR checklist items. |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: adding a barrier before warmup to prevent pipeline-parallel hangs with guided decoding, which directly addresses the root cause described in the PR objectives. |

@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43855 [ run ] triggered by Bot. Commit: f250fef Link to invocation

@ziyixiong-nv ziyixiong-nv changed the title from "[https://nvbugs/6076767][fix] [TensorRT-LLM][release/1.2.1]: accuracy/test_llm_a" to "[https://nvbugs/6076767][fix] Add barrier before warmup to prevent PP hang with guided decoding" Apr 17, 2026
@tensorrt-cicd
Collaborator

PR_Github #43855 [ run ] completed with state FAILURE. Commit: f250fef
/LLM/main/L0_MergeRequest_PR pipeline #34312 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug6076767 branch from f250fef to abaa191 Compare April 17, 2026 08:00
@ziyixiong-nv
Collaborator Author

/bot run

@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug6076767 branch from abaa191 to 03e073c Compare April 20, 2026 02:23
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44257 [ run ] triggered by Bot. Commit: 03e073c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44257 [ run ] completed with state SUCCESS. Commit: 03e073c
/LLM/main/L0_MergeRequest_PR pipeline #34678 completed with status: 'FAILURE'

Link to invocation

@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44340 [ run ] triggered by Bot. Commit: 03e073c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44340 [ run ] completed with state FAILURE. Commit: 03e073c
/LLM/main/L0_MergeRequest_PR pipeline #34757 completed with status: 'FAILURE'

Link to invocation

@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug6076767 branch from 03e073c to d94ffb1 Compare April 28, 2026 00:20
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45804 [ run ] triggered by Bot. Commit: d94ffb1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45804 [ run ] completed with state SUCCESS. Commit: d94ffb1
/LLM/main/L0_MergeRequest_PR pipeline #35994 completed with status: 'FAILURE'

Link to invocation

@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45891 [ run ] triggered by Bot. Commit: d94ffb1 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45891 [ run ] completed with state FAILURE. Commit: d94ffb1
/LLM/main/L0_MergeRequest_PR pipeline #36059 completed with status: 'FAILURE'

Link to invocation

@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug6076767 branch from d94ffb1 to ab66d5a Compare April 29, 2026 00:23
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46002 [ run ] triggered by Bot. Commit: ab66d5a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46002 [ run ] completed with state FAILURE. Commit: ab66d5a
/LLM/main/L0_MergeRequest_PR pipeline #36151 completed with status: 'FAILURE'

Link to invocation

@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46028 [ run ] triggered by Bot. Commit: ab66d5a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47511 [ run ] completed with state SUCCESS. Commit: a4853e4
/LLM/main/L0_MergeRequest_PR pipeline #37429 completed with status: 'FAILURE'

CI Agent Failure Analysis

Link to invocation

@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug6076767 branch from a4853e4 to 3f15170 Compare May 10, 2026 07:11
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47571 [ run ] triggered by Bot. Commit: 3f15170 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47571 [ run ] completed with state SUCCESS. Commit: 3f15170
/LLM/main/L0_MergeRequest_PR pipeline #37482 completed with status: 'FAILURE'

CI Agent Failure Analysis

Link to invocation

@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47595 [ run ] triggered by Bot. Commit: 3f15170 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47595 [ run ] completed with state SUCCESS. Commit: 3f15170
/LLM/main/L0_MergeRequest_PR pipeline #37503 completed with status: 'FAILURE'

Link to invocation

@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug6076767 branch from 3f15170 to 6680107 Compare May 11, 2026 00:01
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47625 [ run ] triggered by Bot. Commit: 6680107 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47625 [ run ] completed with state FAILURE. Commit: 6680107
/LLM/main/L0_MergeRequest_PR pipeline #37530 completed with status: 'FAILURE'

Link to invocation

@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47660 [ run ] triggered by Bot. Commit: 6680107 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47660 [ run ] completed with state SUCCESS. Commit: 6680107
/LLM/main/L0_MergeRequest_PR pipeline #37562 completed with status: 'FAILURE'

CI Agent Failure Analysis

Link to invocation

@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug6076767 branch from 6680107 to 664b783 Compare May 12, 2026 06:33
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47912 [ run ] triggered by Bot. Commit: 664b783 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47912 [ run ] completed with state SUCCESS. Commit: 664b783
/LLM/main/L0_MergeRequest_PR pipeline #37759 completed with status: 'FAILURE'

CI Agent Failure Analysis

Link to invocation

…th guided decoding

When pipeline parallelism is used with guided decoding (llguidance
backend), ranks on the last PP stage can be delayed by the
LLGuidanceMatcherFactory / LLTokenizer initialization while earlier
PP stages skip guided-decoder creation entirely. Without
synchronization, the earlier stages enter warmup forward passes that
issue pp_send operations expecting matching pp_recv on the later
stages — but those stages have not entered warmup yet, causing a
permanent NCCL communication deadlock.

Add a dist.barrier() call in PyExecutor.__init__ immediately before
the warmup phase so that all ranks are synchronized before any PP
communication begins.

Signed-off-by: Ziyi Xiong <[email protected]>
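
The ordering guarantee the commit message relies on can be sketched with threads standing in for ranks. This is only an analogy under stated assumptions: `threading.Barrier` plays the role of `dist.barrier()`, `NUM_RANKS = 4` is assumed from the 4-GPU test name, the `time.sleep` models the slower guided-decoder initialization on the last rank, and `run_rank` and `record` are hypothetical helpers. No NCCL or pipeline-parallel communication is involved.

```python
import threading
import time

NUM_RANKS = 4  # assumed world size for illustration
events = []
events_lock = threading.Lock()
barrier = threading.Barrier(NUM_RANKS)  # stand-in for dist.barrier()


def record(kind, rank):
    # Append an event atomically so the global order is well defined.
    with events_lock:
        events.append((kind, rank))


def run_rank(rank):
    # Model the last rank being delayed by extra initialization work.
    if rank == NUM_RANKS - 1:
        time.sleep(0.05)
    record("init_done", rank)
    barrier.wait()  # nobody proceeds until every rank has arrived
    record("warmup_start", rank)


threads = [threading.Thread(target=run_rank, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

kinds = [kind for kind, _ in events]
first_warmup = kinds.index("warmup_start")
# Every rank finished init before any rank started warmup.
all_init_before_warmup = (
    first_warmup == NUM_RANKS
    and all(kind == "init_done" for kind in kinds[:first_warmup])
)
print(all_init_before_warmup)
```

Without the `barrier.wait()` call, the fast ranks would record `warmup_start` while the delayed rank is still initializing, which is the interleaving the commit is meant to rule out.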
@ziyixiong-nv ziyixiong-nv force-pushed the repair-bot-bug6076767 branch from 664b783 to ab6eb2d Compare May 12, 2026 23:28
@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #48045 [ run ] triggered by Bot. Commit: ab6eb2d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48045 [ run ] completed with state SUCCESS. Commit: ab6eb2d
/LLM/main/L0_MergeRequest_PR pipeline #37879 completed with status: 'FAILURE'

CI Agent Failure Analysis

Link to invocation

@ziyixiong-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #48114 [ run ] triggered by Bot. Commit: ab6eb2d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48114 [ run ] completed with state SUCCESS. Commit: ab6eb2d
/LLM/main/L0_MergeRequest_PR pipeline #37941 completed with status: 'SUCCESS'

CI Report

Link to invocation

@ziyixiong-nv ziyixiong-nv enabled auto-merge (squash) May 13, 2026 11:38
@ziyixiong-nv ziyixiong-nv merged commit 54c5c4b into NVIDIA:main May 14, 2026
7 checks passed
3 participants