[https://nvbugs/6076767][fix] Add barrier before warmup to prevent PP hang with guided decoding #13132
…th guided decoding

When pipeline parallelism (PP) is used with guided decoding (llguidance backend), ranks on the last PP stage can be delayed by the LLGuidanceMatcherFactory / LLTokenizer initialization, while earlier PP stages skip guided-decoder creation entirely. Without synchronization, the earlier stages enter warmup forward passes that issue pp_send operations expecting a matching pp_recv on the later stages. Those stages have not entered warmup yet, which causes a permanent NCCL communication deadlock.

Add a dist.barrier() call in PyExecutor.__init__ immediately before the warmup phase so that all ranks are synchronized before any PP communication begins.

Signed-off-by: Ziyi Xiong <[email protected]>
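
The ordering guarantee described in the commit message can be illustrated with a small stand-alone sketch. It is not the PyExecutor code: threads stand in for PP ranks, `threading.Barrier` stands in for the collective `dist.barrier()`, and the names `run_stage`, `simulate_startup`, and `slow_init_seconds` are hypothetical. The sleep on the last stage is an assumed stand-in for the slower guided-decoder initialization.

```python
import threading
import time


def run_stage(rank, init_seconds, barrier, log, lock):
    """Simulate one PP rank: setup, then barrier, then warmup."""
    # Hypothetical setup cost; only the last stage is slow in this sketch.
    time.sleep(init_seconds)
    with lock:
        log.append(("init_done", rank))

    # Every rank blocks here until all ranks have finished setup.
    # This plays the role of the barrier added before warmup.
    barrier.wait()

    # Warmup would start issuing send/recv between stages from here on.
    with lock:
        log.append(("warmup_start", rank))


def simulate_startup(world_size=2, slow_init_seconds=0.2):
    """Run all simulated ranks and return the ordered event log."""
    barrier = threading.Barrier(world_size)
    log = []
    lock = threading.Lock()
    threads = []
    for rank in range(world_size):
        # Assumption: only the last stage pays the extra init cost.
        delay = slow_init_seconds if rank == world_size - 1 else 0.0
        thread = threading.Thread(
            target=run_stage, args=(rank, delay, barrier, log, lock)
        )
        threads.append(thread)
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


if __name__ == "__main__":
    for event, rank in simulate_startup():
        print(f"rank {rank}: {event}")
```

In this sketch no rank records `warmup_start` until every rank has recorded `init_done`, which is the property the barrier is meant to provide. Without the `barrier.wait()` call, the fast rank would reach warmup while the slow rank is still initializing.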