[https://nvbugs/6000658][fix] Fix disagg gen-only hang where 10s sleep in can_forward blocks KV transfers and overflows CTX memory by peihu-nv · Pull Request #12640 · NVIDIA/TensorRT-LLM

peihu-nv · 2026-03-31T21:50:18Z

Summary by CodeRabbit

Chores
- Optimized executor polling performance for benchmark operations by reducing sleep intervals during queue readiness assessment, enabling faster detection of scheduling readiness in benchmark "generation-only" mode. This reduces initialization latency while maintaining existing control flow. Logging messages were updated to reflect the new behavior.

Description

In disaggregated gen-only benchmarking mode with ADP enabled, the can_forward gate polls with time.sleep(10) while waiting for enough generation requests to accumulate. During this 10s sleep, GEN does not call _prepare_and_schedule_batch(), so it never posts receiveAsync() for incoming UCX KV transfers from CTX servers.

Since UCX uses a rendezvous protocol, CTX's sendAsync() futures block until GEN posts a matching receive. With CTX running TP=1 (full model on one GPU), the KV cache is small (~32 GiB on GB200 after loading the 132 GiB model). The stalled transfers fill up CTX's KV cache within ~5 seconds, preventing CTX from prefilling new requests. GEN then never reaches the can_forward threshold, causing a deadlock.

This issue does not affect GB300 because its larger GPU memory (~288 GiB vs 184 GiB) allows a much larger CTX KV cache, which can absorb the full 10s backpressure without overflowing. For most benchmark YAML configs in this repo, the CTX KV cache is large enough (e.g., using higher CTX TP, smaller models, or shorter ISL) that the 10s sleep does not trigger the overflow. The issue specifically manifests on GB200 with large models (e.g., Qwen3-235B), long input sequences, and CTX TP=1.

Fix: reduce the ADP waiting sleep from 10s to 0.1s. This allows GEN to post receives frequently enough that CTX's KV cache never overflows. The 0.1s interval provides 49x margin over the theoretical CTX KV overflow threshold (~4.9s for Qwen3-235B with ISL=1024 on GB200).

Test Coverage

Validated on Lyris GB200 with the failing config - Qwen3-235B-FP4, 4 CTX servers, concurrency=4096:

sleep(10): deadlock (73 CTX timeouts, 56 "not enough kvCache" warnings)
sleep(0.1): success (0 warnings, E2EL=57076ms)

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…p in can_forward gate blocks KV transfers and overflows CTX KV cache In disaggregated gen-only benchmarking mode with ADP enabled, the can_forward gate polls with time.sleep(10) while waiting for enough generation requests to accumulate. During this 10s sleep, GEN does not call _prepare_and_schedule_batch(), so it never posts receiveAsync() for incoming UCX KV transfers from CTX servers. Since UCX uses a rendezvous protocol, CTX's sendAsync() futures block until GEN posts a matching receive. With CTX running TP=1 (full model on one GPU), the KV cache is small (~32 GiB on GB200 after loading the 132 GiB model). The stalled transfers fill up CTX's KV cache within ~5 seconds, preventing CTX from prefilling new requests. GEN then never reaches the can_forward threshold, causing a deadlock. This issue does not affect GB300 because its larger GPU memory (~288 GiB vs 184 GiB) allows a much larger CTX KV cache, which can absorb the full 10s backpressure without overflowing. For most benchmark YAML configs in this repo, the CTX KV cache is large enough (e.g., using higher CTX TP, smaller models, or shorter ISL) that the 10s sleep does not trigger the overflow. The issue specifically manifests on GB200 with large models (e.g., Qwen3-235B), long input sequences, and CTX TP=1. Fix: reduce the ADP waiting sleep from 10s to 0.1s. This allows GEN to post receives frequently enough that CTX's KV cache never overflows. The 0.1s interval provides 49x margin over the theoretical CTX KV overflow threshold (~4.9s for Qwen3-235B with ISL=1024 on GB200). Validated on Lyris GB200 with Qwen3-235B-FP4, 4 CTX servers, concurrency=4096: - sleep(10): deadlock (73 CTX timeouts, 56 "not enough kvCache" warnings) - sleep(0.1): success (0 warnings, E2EL=57076ms) The other two time.sleep(10) calls in this function are intentional: - Line 2061: stabilization wait after can_forward becomes True - Line 2076: non-ADP path (no UCX rendezvous issue without ADP) Signed-off-by: peihengh <[email protected]>

coderabbitai · 2026-03-31T21:53:39Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 861eeedc-f576-4208-b741-cde781f189ba

📥 Commits

Reviewing files that changed from the base of the PR and between 6ac5c15 and ab09910.

📒 Files selected for processing (1)

tensorrt_llm/_torch/pyexecutor/py_executor.py

📝 Walkthrough

Walkthrough

Modified the polling wait interval in PyExecutor._executor_loop_overlap benchmark behavior, reducing the sleep duration from 10 seconds to 0.1 seconds when total generation count is below the benchmark request queue size, with corresponding log message updates.

Changes

Cohort / File(s)	Summary
Benchmark polling timing adjustment `tensorrt_llm/_torch/pyexecutor/py_executor.py`	Reduced sleep interval from 10 to 0.1 seconds in the "gen-only" waiting logic and updated the associated log message to reflect the new polling cadence.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main fix: reducing a 10s sleep that was causing a deadlock in disaggregated gen-only mode by blocking KV transfers and overflowing CTX memory.
Description check	✅ Passed	The PR description comprehensively covers the issue, root cause, why it affects specific hardware/configs, the fix rationale with margin calculations, and detailed test validation results demonstrating the fix resolves the deadlock.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

peihu-nv · 2026-03-31T22:01:11Z

/bot run

tensorrt-cicd · 2026-03-31T22:07:07Z

PR_Github #41028 [ run ] triggered by Bot. Commit: ab09910 Link to invocation

tensorrt-cicd · 2026-04-01T01:51:50Z

PR_Github #41028 [ run ] completed with state FAILURE. Commit: ab09910
/LLM/main/L0_MergeRequest_PR pipeline #32008 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…to prevent KV transfer backpressure The benchmark disagg fill loop inside _prepare_and_schedule_batch() retries with time.sleep(1) when the request queue is not yet full. During this sleep, GEN does not process KV transfers or post receiveAsync(), creating the same backpressure risk as the can_forward gate's sleep(10) fixed in the previous commit. For tight configs (e.g., ISL=8192 with CTX TP=1), the CTX KV overflow threshold can be as low as ~0.75s, making sleep(1) borderline unsafe. Reduce to sleep(0.1) for consistency with the can_forward gate fix. Signed-off-by: peihengh <[email protected]>

peihu-nv · 2026-04-02T00:06:52Z

/bot run

tensorrt-cicd · 2026-04-02T00:12:35Z

PR_Github #41285 [ run ] triggered by Bot. Commit: feac977 Link to invocation

tensorrt-cicd · 2026-04-02T00:27:29Z

PR_Github #41285 [ run ] completed with state FAILURE. Commit: feac977
/LLM/main/L0_MergeRequest_PR pipeline #32243 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-04-02T03:25:17Z

/bot run

tensorrt-cicd · 2026-04-02T03:30:54Z

PR_Github #41319 [ run ] triggered by Bot. Commit: feac977 Link to invocation

tensorrt-cicd · 2026-04-02T07:10:48Z

PR_Github #41319 [ run ] completed with state SUCCESS. Commit: feac977
/LLM/main/L0_MergeRequest_PR pipeline #32270 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-04-02T16:15:38Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-02T16:21:22Z

PR_Github #41462 [ run ] triggered by Bot. Commit: feac977 Link to invocation

tensorrt-cicd · 2026-04-02T23:12:20Z

PR_Github #41462 [ run ] completed with state SUCCESS. Commit: feac977
/LLM/main/L0_MergeRequest_PR pipeline #32391 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-04-06T16:13:13Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-06T16:19:52Z

PR_Github #41947 [ run ] triggered by Bot. Commit: feac977 Link to invocation

tensorrt-cicd · 2026-04-06T17:44:32Z

PR_Github #41947 [ run ] completed with state SUCCESS. Commit: feac977
/LLM/main/L0_MergeRequest_PR pipeline #32803 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-04-06T18:27:11Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-06T18:34:22Z

PR_Github #41970 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

tensorrt-cicd · 2026-04-07T03:07:35Z

PR_Github #41970 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #32823 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-04-07T06:11:52Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-07T06:17:39Z

PR_Github #42076 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

tensorrt-cicd · 2026-04-07T12:15:10Z

PR_Github #42076 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #32916 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-04-07T15:39:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-07T15:44:55Z

PR_Github #42160 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

pcastonguay · 2026-04-07T16:58:19Z

As @qiaoxj07 mentionned in the nvbug, I am not convinced this is a proper fix. What if we have a ctx server that fills its KV cache in less than 0.1sec in the future (instead of the 5s it takes for this model)? Won't we run into the same issue?

peihu-nv · 2026-04-07T21:22:41Z

As @qiaoxj07 mentionned in the nvbug, I am not convinced this is a proper fix. What if we have a ctx server that fills its KV cache in less than 0.1sec in the future (instead of the 5s it takes for this model)? Won't we run into the same issue?

In the original config from NVBug, it has Qwen3-235B-FP4, GB200, CTX TP=1, ISL=1024:
184 GiB (GPU total memory) - 132 GiB (Model weight with TP1) = 52 GiB
For KV cache 52 GiB × 0.6 (KV fraction) = 32.65 GiB KV budget
KV per batch: 32 (batch) × 1024 (ISL) × 120 KiB/tok = 3.75 GiB
Number of batches can hold: 32.65 / 3.75 = 8.7
Observed prefill time per batch: ~560 ms
Overflow threshold: 8.7 × 560 ms = 4.9 seconds which is less than the current 10 seconds and causing back pressure

This bug config is actually already a very extreme case. For the configs that we regularly benchmark in https://github.com/NVIDIA/TensorRT-LLM/tree/main/tests/integration/defs/perf/disagg/test_configs/disagg/perf
I am seeing all of them >300s which is why other cases did not having this issue.

As for the question that is 0.1s future proof, if prefill server one day generate tokens 10x+ faster, this 0.1s polling will be close to the limit.

Also in PR #12208 (comment), the refactoring removes the inner fill loop, but the major actual issue is the can forward loop, and that is still there. In other words, the refactoring in the PR will not help because the can forward loop is not removed and still polls with sleep. It is the sleep reduction from 10s that actually fixes this deadlock. So changing it to 0.1s or 1s from 10s is still needed. And I recommend doing 0.1s to be more future proof.

Feel free to LMK if there are more concerns or suggestions. Thx

tensorrt-cicd · 2026-04-07T22:51:45Z

PR_Github #42160 [ run ] completed with state FAILURE. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #32989 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-04-08T00:37:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-08T00:43:40Z

PR_Github #42207 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

tensorrt-cicd · 2026-04-08T03:53:42Z

PR_Github #42207 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #33025 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-04-08T16:14:16Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-08T16:20:13Z

PR_Github #42363 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

tensorrt-cicd · 2026-04-08T20:54:13Z

PR_Github #42363 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #33147 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

peihu-nv · 2026-04-08T21:26:30Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-08T21:33:03Z

PR_Github #42399 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

tensorrt-cicd · 2026-04-09T05:15:29Z

PR_Github #42399 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #33173 completed with status: 'SUCCESS'

CI Report

Link to invocation

@Tabrizian

- Refine ADP dummy skip logic per @Tabrizian's review: only allow dummy insertion in terminal case (all fetched AND benchmark_size < tp_size) where ranks will permanently have no real request - Extract _count_schedulable_active_requests and _should_skip_dummy_for_benchmark_disagg helpers for readability - Align gate sleep to 0.1s consistent with PR NVIDIA#12640 - Update tests for refined dummy logic and sleep value Signed-off-by: Chien-Chun Hung <[email protected]> Made-with: Cursor

@Tabrizian

- Refine ADP dummy skip logic per @Tabrizian's review: only allow dummy insertion in terminal case (all fetched AND benchmark_size < tp_size) where ranks will permanently have no real request - Extract _count_schedulable_active_requests and _should_skip_dummy_for_benchmark_disagg helpers for readability - Align gate sleep to 0.1s consistent with PR NVIDIA#12640 - Update tests for refined dummy logic and sleep value Signed-off-by: Chien-Chun Hung <[email protected]> Made-with: Cursor

- Update status to Merged - Document _benchmark_fill_phase_active flag and fill vs taper-down phases - Replace old fill_phase_complete/some_ranks_permanently_empty logic - Update test count (44 → 40) and descriptions - Align sleep references (1s → 0.1s) with PR NVIDIA#12640 - Add taper-down safety to correctness table Signed-off-by: Chien-Chun Hung <[email protected]> Made-with: Cursor

peihu-nv requested a review from a team as a code owner March 31, 2026 21:50

peihu-nv requested a review from lancelly March 31, 2026 21:50

github-actions Bot assigned peihu-nv Mar 31, 2026

peihu-nv mentioned this pull request Apr 1, 2026

[TRTLLM-11492][fix] Fix benchmark disagg deadlock by eliminating blocking fill loop #12208

Merged

1 task

Merge branch 'main' into fix/nvbug-6000658-disagg-genonly-hang

2478158

pcastonguay approved these changes Apr 8, 2026

View reviewed changes

pcastonguay merged commit 1d24866 into NVIDIA:main Apr 9, 2026
5 checks passed

chienchunhung mentioned this pull request Apr 22, 2026

[https://nvbugs/6093911][fix] Fix disagg gen-only benchmark hang under ADP router imbalance #13347

Merged

1 task

Conversation

peihu-nv commented Mar 31, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Mar 31, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

peihu-nv commented Mar 31, 2026

Uh oh!

tensorrt-cicd commented Mar 31, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

peihu-nv commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

peihu-nv commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

peihu-nv commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

peihu-nv commented Apr 6, 2026

Uh oh!

tensorrt-cicd commented Apr 6, 2026

Uh oh!

tensorrt-cicd commented Apr 6, 2026

Uh oh!

peihu-nv commented Apr 6, 2026

Uh oh!

tensorrt-cicd commented Apr 6, 2026

Uh oh!

tensorrt-cicd commented Apr 7, 2026

Uh oh!

peihu-nv commented Apr 7, 2026

Uh oh!

tensorrt-cicd commented Apr 7, 2026

Uh oh!

tensorrt-cicd commented Apr 7, 2026

Uh oh!

peihu-nv commented Apr 7, 2026

Uh oh!

tensorrt-cicd commented Apr 7, 2026

Uh oh!

pcastonguay commented Apr 7, 2026

Uh oh!

peihu-nv commented Apr 7, 2026

Uh oh!

tensorrt-cicd commented Apr 7, 2026

Uh oh!

peihu-nv commented Apr 8, 2026

Uh oh!

tensorrt-cicd commented Apr 8, 2026

Uh oh!

tensorrt-cicd commented Apr 8, 2026

Uh oh!

peihu-nv commented Apr 8, 2026

Uh oh!

tensorrt-cicd commented Apr 8, 2026

Uh oh!

tensorrt-cicd commented Apr 8, 2026

Uh oh!

peihu-nv commented Apr 8, 2026

peihu-nv commented Mar 31, 2026 •

edited by coderabbitai Bot

Loading