Skip to content

[https://nvbugs/6000658][fix] Fix disagg gen-only hang where 10s sleep in can_forward blocks KV transfers and overflows CTX memory#12640

Merged
pcastonguay merged 3 commits into
NVIDIA:mainfrom
peihu-nv:fix/nvbug-6000658-disagg-genonly-hang
Apr 9, 2026

Conversation

@peihu-nv
Copy link
Copy Markdown
Collaborator

@peihu-nv peihu-nv commented Mar 31, 2026

Summary by CodeRabbit

  • Chores
    • Optimized executor polling performance for benchmark operations by reducing sleep intervals during queue readiness assessment, enabling faster detection of scheduling readiness in benchmark "generation-only" mode. This reduces initialization latency while maintaining existing control flow. Logging messages were updated to reflect the new behavior.

Description

In disaggregated gen-only benchmarking mode with ADP enabled, the can_forward gate polls with time.sleep(10) while waiting for enough generation requests to accumulate. During this 10s sleep, GEN does not call _prepare_and_schedule_batch(), so it never posts receiveAsync() for incoming UCX KV transfers from CTX servers.

Since UCX uses a rendezvous protocol, CTX's sendAsync() futures block until GEN posts a matching receive. With CTX running TP=1 (full model on one GPU), the KV cache is small (~32 GiB on GB200 after loading the 132 GiB model). The stalled transfers fill up CTX's KV cache within ~5 seconds, preventing CTX from prefilling new requests. GEN then never reaches the can_forward threshold, causing a deadlock.

This issue does not affect GB300 because its larger GPU memory (~288 GiB vs 184 GiB) allows a much larger CTX KV cache, which can absorb the full 10s backpressure without overflowing. For most benchmark YAML configs in this repo, the CTX KV cache is large enough (e.g., using higher CTX TP, smaller models, or shorter ISL) that the 10s sleep does not trigger the overflow. The issue specifically manifests on GB200 with large models (e.g., Qwen3-235B), long input sequences, and CTX TP=1.

Fix: reduce the ADP waiting sleep from 10s to 0.1s. This allows GEN to post receives frequently enough that CTX's KV cache never overflows. The 0.1s interval provides 49x margin over the theoretical CTX KV overflow threshold (~4.9s for Qwen3-235B with ISL=1024 on GB200).

Test Coverage

Validated on Lyris GB200 with the failing config - Qwen3-235B-FP4, 4 CTX servers, concurrency=4096:

  • sleep(10): deadlock (73 CTX timeouts, 56 "not enough kvCache" warnings)
  • sleep(0.1): success (0 warnings, E2EL=57076ms)

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…p in can_forward gate blocks KV transfers and overflows CTX KV cache

In disaggregated gen-only benchmarking mode with ADP enabled, the can_forward
gate polls with time.sleep(10) while waiting for enough generation requests to
accumulate. During this 10s sleep, GEN does not call _prepare_and_schedule_batch(),
so it never posts receiveAsync() for incoming UCX KV transfers from CTX servers.

Since UCX uses a rendezvous protocol, CTX's sendAsync() futures block until GEN
posts a matching receive. With CTX running TP=1 (full model on one GPU), the KV
cache is small (~32 GiB on GB200 after loading the 132 GiB model). The stalled
transfers fill up CTX's KV cache within ~5 seconds, preventing CTX from prefilling
new requests. GEN then never reaches the can_forward threshold, causing a deadlock.

This issue does not affect GB300 because its larger GPU memory (~288 GiB vs 184 GiB)
allows a much larger CTX KV cache, which can absorb the full 10s backpressure without
overflowing. For most benchmark YAML configs in this repo, the CTX KV cache is large
enough (e.g., using higher CTX TP, smaller models, or shorter ISL) that the 10s sleep
does not trigger the overflow. The issue specifically manifests on GB200 with large
models (e.g., Qwen3-235B), long input sequences, and CTX TP=1.

Fix: reduce the ADP waiting sleep from 10s to 0.1s. This allows GEN to post
receives frequently enough that CTX's KV cache never overflows. The 0.1s interval
provides 49x margin over the theoretical CTX KV overflow threshold (~4.9s for
Qwen3-235B with ISL=1024 on GB200).

Validated on Lyris GB200 with Qwen3-235B-FP4, 4 CTX servers, concurrency=4096:
- sleep(10): deadlock (73 CTX timeouts, 56 "not enough kvCache" warnings)
- sleep(0.1): success (0 warnings, E2EL=57076ms)

The other two time.sleep(10) calls in this function are intentional:
- Line 2061: stabilization wait after can_forward becomes True
- Line 2076: non-ADP path (no UCX rendezvous issue without ADP)

Signed-off-by: peihengh <[email protected]>
@peihu-nv peihu-nv requested a review from a team as a code owner March 31, 2026 21:50
@peihu-nv peihu-nv requested a review from lancelly March 31, 2026 21:50
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 861eeedc-f576-4208-b741-cde781f189ba

📥 Commits

Reviewing files that changed from the base of the PR and between 6ac5c15 and ab09910.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

📝 Walkthrough

Walkthrough

Modified the polling wait interval in PyExecutor._executor_loop_overlap benchmark behavior, reducing the sleep duration from 10 seconds to 0.1 seconds when total generation count is below the benchmark request queue size, with corresponding log message updates.

Changes

Cohort / File(s) Summary
Benchmark polling timing adjustment
tensorrt_llm/_torch/pyexecutor/py_executor.py
Reduced sleep interval from 10 to 0.1 seconds in the "gen-only" waiting logic and updated the associated log message to reflect the new polling cadence.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main fix: reducing a 10s sleep that was causing a deadlock in disaggregated gen-only mode by blocking KV transfers and overflowing CTX memory.
Description check ✅ Passed The PR description comprehensively covers the issue, root cause, why it affects specific hardware/configs, the fix rationale with margin calculations, and detailed test validation results demonstrating the fix resolves the deadlock.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41028 [ run ] triggered by Bot. Commit: ab09910 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41028 [ run ] completed with state FAILURE. Commit: ab09910
/LLM/main/L0_MergeRequest_PR pipeline #32008 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…to prevent KV transfer backpressure

The benchmark disagg fill loop inside _prepare_and_schedule_batch() retries
with time.sleep(1) when the request queue is not yet full. During this sleep,
GEN does not process KV transfers or post receiveAsync(), creating the same
backpressure risk as the can_forward gate's sleep(10) fixed in the previous
commit.

For tight configs (e.g., ISL=8192 with CTX TP=1), the CTX KV overflow
threshold can be as low as ~0.75s, making sleep(1) borderline unsafe.
Reduce to sleep(0.1) for consistency with the can_forward gate fix.

Signed-off-by: peihengh <[email protected]>
@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 2, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41285 [ run ] triggered by Bot. Commit: feac977 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41285 [ run ] completed with state FAILURE. Commit: feac977
/LLM/main/L0_MergeRequest_PR pipeline #32243 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 2, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41319 [ run ] triggered by Bot. Commit: feac977 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41319 [ run ] completed with state SUCCESS. Commit: feac977
/LLM/main/L0_MergeRequest_PR pipeline #32270 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 2, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41462 [ run ] triggered by Bot. Commit: feac977 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41462 [ run ] completed with state SUCCESS. Commit: feac977
/LLM/main/L0_MergeRequest_PR pipeline #32391 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 6, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41947 [ run ] triggered by Bot. Commit: feac977 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41947 [ run ] completed with state SUCCESS. Commit: feac977
/LLM/main/L0_MergeRequest_PR pipeline #32803 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 6, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41970 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41970 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #32823 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 7, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42076 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42076 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #32916 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 7, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42160 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

@pcastonguay
Copy link
Copy Markdown
Collaborator

As @qiaoxj07 mentionned in the nvbug, I am not convinced this is a proper fix. What if we have a ctx server that fills its KV cache in less than 0.1sec in the future (instead of the 5s it takes for this model)? Won't we run into the same issue?

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 7, 2026

As @qiaoxj07 mentionned in the nvbug, I am not convinced this is a proper fix. What if we have a ctx server that fills its KV cache in less than 0.1sec in the future (instead of the 5s it takes for this model)? Won't we run into the same issue?

In the original config from NVBug, it has Qwen3-235B-FP4, GB200, CTX TP=1, ISL=1024:
184 GiB (GPU total memory) - 132 GiB (Model weight with TP1) = 52 GiB
For KV cache 52 GiB × 0.6 (KV fraction) = 32.65 GiB KV budget
KV per batch: 32 (batch) × 1024 (ISL) × 120 KiB/tok = 3.75 GiB
Number of batches can hold: 32.65 / 3.75 = 8.7
Observed prefill time per batch: ~560 ms
Overflow threshold: 8.7 × 560 ms = 4.9 seconds which is less than the current 10 seconds and causing back pressure

This bug config is actually already a very extreme case. For the configs that we regularly benchmark in https://github.com/NVIDIA/TensorRT-LLM/tree/main/tests/integration/defs/perf/disagg/test_configs/disagg/perf
I am seeing all of them >300s which is why other cases did not having this issue.

As for the question that is 0.1s future proof, if prefill server one day generate tokens 10x+ faster, this 0.1s polling will be close to the limit.

Also in PR #12208 (comment), the refactoring removes the inner fill loop, but the major actual issue is the can forward loop, and that is still there. In other words, the refactoring in the PR will not help because the can forward loop is not removed and still polls with sleep. It is the sleep reduction from 10s that actually fixes this deadlock. So changing it to 0.1s or 1s from 10s is still needed. And I recommend doing 0.1s to be more future proof.

Feel free to LMK if there are more concerns or suggestions. Thx

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42160 [ run ] completed with state FAILURE. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #32989 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42207 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42207 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #33025 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42363 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42363 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #33147 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@peihu-nv
Copy link
Copy Markdown
Collaborator Author

peihu-nv commented Apr 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42399 [ run ] triggered by Bot. Commit: 2478158 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42399 [ run ] completed with state SUCCESS. Commit: 2478158
/LLM/main/L0_MergeRequest_PR pipeline #33173 completed with status: 'SUCCESS'

CI Report

Link to invocation

@pcastonguay pcastonguay merged commit 1d24866 into NVIDIA:main Apr 9, 2026
5 checks passed
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Apr 9, 2026
- Refine ADP dummy skip logic per @Tabrizian's review: only allow
  dummy insertion in terminal case (all fetched AND benchmark_size
  < tp_size) where ranks will permanently have no real request
- Extract _count_schedulable_active_requests and
  _should_skip_dummy_for_benchmark_disagg helpers for readability
- Align gate sleep to 0.1s consistent with PR NVIDIA#12640
- Update tests for refined dummy logic and sleep value

Signed-off-by: Chien-Chun Hung <[email protected]>
Made-with: Cursor
brb-nv pushed a commit to brb-nv/TensorRT-LLM that referenced this pull request Apr 10, 2026
- Refine ADP dummy skip logic per @Tabrizian's review: only allow
  dummy insertion in terminal case (all fetched AND benchmark_size
  < tp_size) where ranks will permanently have no real request
- Extract _count_schedulable_active_requests and
  _should_skip_dummy_for_benchmark_disagg helpers for readability
- Align gate sleep to 0.1s consistent with PR NVIDIA#12640
- Update tests for refined dummy logic and sleep value

Signed-off-by: Chien-Chun Hung <[email protected]>
Made-with: Cursor
chienchunhung added a commit to chienchunhung/TensorRT-LLM that referenced this pull request Apr 13, 2026
- Update status to Merged
- Document _benchmark_fill_phase_active flag and fill vs taper-down phases
- Replace old fill_phase_complete/some_ranks_permanently_empty logic
- Update test count (44 → 40) and descriptions
- Align sleep references (1s → 0.1s) with PR NVIDIA#12640
- Add taper-down safety to correctness table

Signed-off-by: Chien-Chun Hung <[email protected]>
Made-with: Cursor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants