[https://nvbugs/6000658][fix] Fix disagg gen-only hang where 10s sleep in can_forward blocks KV transfers and overflows CTX memory#12640
Conversation
…p in can_forward gate blocks KV transfers and overflows CTX KV cache In disaggregated gen-only benchmarking mode with ADP enabled, the can_forward gate polls with time.sleep(10) while waiting for enough generation requests to accumulate. During this 10s sleep, GEN does not call _prepare_and_schedule_batch(), so it never posts receiveAsync() for incoming UCX KV transfers from CTX servers. Since UCX uses a rendezvous protocol, CTX's sendAsync() futures block until GEN posts a matching receive. With CTX running TP=1 (full model on one GPU), the KV cache is small (~32 GiB on GB200 after loading the 132 GiB model). The stalled transfers fill up CTX's KV cache within ~5 seconds, preventing CTX from prefilling new requests. GEN then never reaches the can_forward threshold, causing a deadlock. This issue does not affect GB300 because its larger GPU memory (~288 GiB vs 184 GiB) allows a much larger CTX KV cache, which can absorb the full 10s backpressure without overflowing. For most benchmark YAML configs in this repo, the CTX KV cache is large enough (e.g., using higher CTX TP, smaller models, or shorter ISL) that the 10s sleep does not trigger the overflow. The issue specifically manifests on GB200 with large models (e.g., Qwen3-235B), long input sequences, and CTX TP=1. Fix: reduce the ADP waiting sleep from 10s to 0.1s. This allows GEN to post receives frequently enough that CTX's KV cache never overflows. The 0.1s interval provides 49x margin over the theoretical CTX KV overflow threshold (~4.9s for Qwen3-235B with ISL=1024 on GB200). Validated on Lyris GB200 with Qwen3-235B-FP4, 4 CTX servers, concurrency=4096: - sleep(10): deadlock (73 CTX timeouts, 56 "not enough kvCache" warnings) - sleep(0.1): success (0 warnings, E2EL=57076ms) The other two time.sleep(10) calls in this function are intentional: - Line 2061: stabilization wait after can_forward becomes True - Line 2076: non-ADP path (no UCX rendezvous issue without ADP) Signed-off-by: peihengh <[email protected]>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughModified the polling wait interval in Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/bot run |
|
PR_Github #41028 [ run ] triggered by Bot. Commit: |
|
PR_Github #41028 [ run ] completed with state
|
…to prevent KV transfer backpressure The benchmark disagg fill loop inside _prepare_and_schedule_batch() retries with time.sleep(1) when the request queue is not yet full. During this sleep, GEN does not process KV transfers or post receiveAsync(), creating the same backpressure risk as the can_forward gate's sleep(10) fixed in the previous commit. For tight configs (e.g., ISL=8192 with CTX TP=1), the CTX KV overflow threshold can be as low as ~0.75s, making sleep(1) borderline unsafe. Reduce to sleep(0.1) for consistency with the can_forward gate fix. Signed-off-by: peihengh <[email protected]>
|
/bot run |
|
PR_Github #41285 [ run ] triggered by Bot. Commit: |
|
PR_Github #41285 [ run ] completed with state
|
|
/bot run |
|
PR_Github #41319 [ run ] triggered by Bot. Commit: |
|
PR_Github #41319 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #41462 [ run ] triggered by Bot. Commit: |
|
PR_Github #41462 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #41947 [ run ] triggered by Bot. Commit: |
|
PR_Github #41947 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #41970 [ run ] triggered by Bot. Commit: |
|
PR_Github #41970 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #42076 [ run ] triggered by Bot. Commit: |
|
PR_Github #42076 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #42160 [ run ] triggered by Bot. Commit: |
|
As @qiaoxj07 mentionned in the nvbug, I am not convinced this is a proper fix. What if we have a ctx server that fills its KV cache in less than 0.1sec in the future (instead of the 5s it takes for this model)? Won't we run into the same issue? |
In the original config from NVBug, it has Qwen3-235B-FP4, GB200, CTX TP=1, ISL=1024: This bug config is actually already a very extreme case. For the configs that we regularly benchmark in https://github.com/NVIDIA/TensorRT-LLM/tree/main/tests/integration/defs/perf/disagg/test_configs/disagg/perf As for the question that is 0.1s future proof, if prefill server one day generate tokens 10x+ faster, this 0.1s polling will be close to the limit. Also in PR #12208 (comment), the refactoring removes the inner fill loop, but the major actual issue is the can forward loop, and that is still there. In other words, the refactoring in the PR will not help because the can forward loop is not removed and still polls with sleep. It is the sleep reduction from 10s that actually fixes this deadlock. So changing it to 0.1s or 1s from 10s is still needed. And I recommend doing 0.1s to be more future proof. Feel free to LMK if there are more concerns or suggestions. Thx |
|
PR_Github #42160 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #42207 [ run ] triggered by Bot. Commit: |
|
PR_Github #42207 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #42363 [ run ] triggered by Bot. Commit: |
|
PR_Github #42363 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #42399 [ run ] triggered by Bot. Commit: |
|
PR_Github #42399 [ run ] completed with state |
- Refine ADP dummy skip logic per @Tabrizian's review: only allow dummy insertion in terminal case (all fetched AND benchmark_size < tp_size) where ranks will permanently have no real request - Extract _count_schedulable_active_requests and _should_skip_dummy_for_benchmark_disagg helpers for readability - Align gate sleep to 0.1s consistent with PR NVIDIA#12640 - Update tests for refined dummy logic and sleep value Signed-off-by: Chien-Chun Hung <[email protected]> Made-with: Cursor
- Refine ADP dummy skip logic per @Tabrizian's review: only allow dummy insertion in terminal case (all fetched AND benchmark_size < tp_size) where ranks will permanently have no real request - Extract _count_schedulable_active_requests and _should_skip_dummy_for_benchmark_disagg helpers for readability - Align gate sleep to 0.1s consistent with PR NVIDIA#12640 - Update tests for refined dummy logic and sleep value Signed-off-by: Chien-Chun Hung <[email protected]> Made-with: Cursor
- Update status to Merged - Document _benchmark_fill_phase_active flag and fill vs taper-down phases - Replace old fill_phase_complete/some_ranks_permanently_empty logic - Update test count (44 → 40) and descriptions - Align sleep references (1s → 0.1s) with PR NVIDIA#12640 - Add taper-down safety to correctness table Signed-off-by: Chien-Chun Hung <[email protected]> Made-with: Cursor
Summary by CodeRabbit
Description
In disaggregated gen-only benchmarking mode with ADP enabled, the can_forward gate polls with time.sleep(10) while waiting for enough generation requests to accumulate. During this 10s sleep, GEN does not call _prepare_and_schedule_batch(), so it never posts receiveAsync() for incoming UCX KV transfers from CTX servers.
Since UCX uses a rendezvous protocol, CTX's sendAsync() futures block until GEN posts a matching receive. With CTX running TP=1 (full model on one GPU), the KV cache is small (~32 GiB on GB200 after loading the 132 GiB model). The stalled transfers fill up CTX's KV cache within ~5 seconds, preventing CTX from prefilling new requests. GEN then never reaches the can_forward threshold, causing a deadlock.
This issue does not affect GB300 because its larger GPU memory (~288 GiB vs 184 GiB) allows a much larger CTX KV cache, which can absorb the full 10s backpressure without overflowing. For most benchmark YAML configs in this repo, the CTX KV cache is large enough (e.g., using higher CTX TP, smaller models, or shorter ISL) that the 10s sleep does not trigger the overflow. The issue specifically manifests on GB200 with large models (e.g., Qwen3-235B), long input sequences, and CTX TP=1.
Fix: reduce the ADP waiting sleep from 10s to 0.1s. This allows GEN to post receives frequently enough that CTX's KV cache never overflows. The 0.1s interval provides 49x margin over the theoretical CTX KV overflow threshold (~4.9s for Qwen3-235B with ISL=1024 on GB200).
Test Coverage
Validated on Lyris GB200 with the failing config - Qwen3-235B-FP4, 4 CTX servers, concurrency=4096:
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.