Skip to content

[https://nvbugs/5991576][fix] fix dummy request crash with PP + ADP + disagg + block reuse#12403

Merged
Tabrizian merged 1 commit into
NVIDIA:mainfrom
Tabrizian:fix/nvbug-5991576-dummy-request-crash
Mar 21, 2026
Merged

[https://nvbugs/5991576][fix] fix dummy request crash with PP + ADP + disagg + block reuse#12403
Tabrizian merged 1 commit into
NVIDIA:mainfrom
Tabrizian:fix/nvbug-5991576-dummy-request-crash

Conversation

@Tabrizian
Copy link
Copy Markdown
Member

@Tabrizian Tabrizian commented Mar 20, 2026

Summary by CodeRabbit

  • Bug Fixes

    • Fixed pipeline-parallel termination coordination to properly skip the termination handler for dummy requests, preventing unnecessary handler invocations during cleanup.
  • Tests

    • Added new integration test configuration and corresponding test function for validating distributed inference scenarios with advanced parallelism configurations and speculative decoding.

Description

When using pipeline parallelism (PP) on the context server in disaggregated serving with attention DP (ADP) and KV cache block reuse enabled, the system crashes with:

RuntimeError: Assertion failed: emplaceDone (kvCacheManager.cpp:2497)

Root cause: The attention DP dummy request (hardcoded request_id=0) is routed through DisaggPPTerminationHandler, which delays sequence removal via a ring protocol across PP ranks. Since the dummy ID is reused every iteration by _pad_attention_dp_dummy_request, the delayed removal causes a collision in the KV cache manager's mSequences map when the dummy is re-added.

Fix: Bypass DisaggPPTerminationHandler for dummy requests since they don't participate in disagg KV cache transfers and must be terminated immediately.

Test Coverage

  • New integration test: test_disaggregated_deepseek_v3_lite_fp8_ctxtp2ep2pp2_gentp4_one_mtp_block_reuse
    • CTX server: TP=2, PP=2, EP=2, ADP, block reuse enabled, MTP
    • GEN server: TP=4, EP=4, ADP, block reuse enabled, MTP
    • Model: DeepSeek-V3-Lite/fp8
    • Requires 8 GPUs (Hopper+)
  • Verified on EOS (8x H100):
    • Without fix: Test fails with emplaceDone assertion (timeout after 5 min)
    • With fix: Test passes in ~3 min

PR Checklist

  • Please check this after reviewing the above items as appropriate for this PR.

@Tabrizian Tabrizian requested review from a team as code owners March 20, 2026 07:24
@Tabrizian Tabrizian requested a review from HuiGao-NV March 20, 2026 07:24
@Tabrizian Tabrizian force-pushed the fix/nvbug-5991576-dummy-request-crash branch from 2d80cc5 to 4de6de9 Compare March 20, 2026 07:27
@Tabrizian
Copy link
Copy Markdown
Member Author

/bot run --disable-fail-fast

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 34df4c98-7ab4-4ee6-854d-2aea236485aa

📥 Commits

Reviewing files that changed from the base of the PR and between 4197e19 and 2d80cc5.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2ep2pp2_gentp4_deepseek_v3_lite_one_mtp_block_reuse.yaml
  • tests/integration/defs/disaggregated/test_disaggregated.py

📝 Walkthrough

Walkthrough

The PR refines disagreggated pipeline parallelism termination logic by conditionally restricting the PP termination handler to non-dummy requests, while adding a new integration test configuration for DeepSeek-V3-Lite with specific parallelism and speculative decoding settings, along with the corresponding test function.

Changes

Cohort / File(s) Summary
Disaggregated PP Termination Logic
tensorrt_llm/_torch/pyexecutor/py_executor.py
Modified _terminate_request to invoke the PP termination handler only for non-dummy requests, allowing dummy requests to bypass PP-termination coordination.
Integration Test Configuration & Test
tests/integration/defs/disaggregated/test_configs/disagg_config_ctxtp2ep2pp2_gentp4_deepseek_v3_lite_one_mtp_block_reuse.yaml, tests/integration/defs/disaggregated/test_disaggregated.py
Added new YAML configuration for distributed DeepSeek-V3-Lite inference with tensor/pipeline/MoE parallelism and MTP speculative decoding, plus corresponding pytest test function with model symlink creation and environment setup.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the core fix: preventing dummy request crashes in disaggregated serving with PP+ADP+block reuse.
Description check ✅ Passed The description provides clear root cause analysis, solution, test coverage with verification results, and follows the required template structure with all necessary sections completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can enforce grammar and style rules using `languagetool`.

Configure the reviews.tools.languagetool setting to enable/disable rules and categories. Refer to the LanguageTool Community to learn more.

@Tabrizian Tabrizian force-pushed the fix/nvbug-5991576-dummy-request-crash branch from 4de6de9 to c16b98b Compare March 20, 2026 07:31
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39708 [ run ] triggered by Bot. Commit: c16b98b Link to invocation

… disagg + block reuse

The attention DP dummy request (hardcoded request_id=0) was being routed
through DisaggPPTerminationHandler, which delays sequence removal via a
ring protocol across PP ranks. Since the dummy ID is reused every
iteration, the delayed removal caused a collision in the KV cache
manager's mSequences map, triggering an emplaceDone assertion failure.

Fix: bypass DisaggPPTerminationHandler for dummy requests since they
don't participate in disagg KV cache transfers.

Add regression test with CTX: TP=2, PP=2, EP=2, ADP + block reuse and
GEN: TP=4, EP=4, ADP using DeepSeek-V3-Lite with MTP.

Signed-off-by: Iman Tabrizian <[email protected]>
@Tabrizian
Copy link
Copy Markdown
Member Author

/bot run --disable-fail-fast

@Tabrizian Tabrizian force-pushed the fix/nvbug-5991576-dummy-request-crash branch from c16b98b to 497b837 Compare March 20, 2026 07:43
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39711 [ run ] triggered by Bot. Commit: 497b837 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39711 [ run ] completed with state SUCCESS. Commit: 497b837
/LLM/main/L0_MergeRequest_PR pipeline #30907 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Tabrizian
Copy link
Copy Markdown
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39746 [ run ] triggered by Bot. Commit: 497b837 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39746 [ run ] completed with state SUCCESS. Commit: 497b837
/LLM/main/L0_MergeRequest_PR pipeline #30941 completed with status: 'SUCCESS'

CI Report

Link to invocation

@Tabrizian Tabrizian requested a review from tburt-nv March 20, 2026 19:30
@Tabrizian Tabrizian merged commit f31b45b into NVIDIA:main Mar 21, 2026
5 checks passed
- accuracy/test_llm_api_pytorch.py::TestQwen3NextInstruct::test_bf16_4gpu[tp4ep4_cudagraph_overlap_adp_on]
- disaggregated/test_disaggregated.py::test_disaggregated_ctxtp2pp2_gentp2pp2[TinyLlama-1.1B-Chat-v1.0]
- disaggregated/test_disaggregated.py::test_disaggregated_ctxpp4_genpp4[TinyLlama-1.1B-Chat-v1.0]
- disaggregated/test_disaggregated.py::test_disaggregated_deepseek_v3_lite_fp8_ctxtp2ep2pp2_gentp4_one_mtp_block_reuse[DeepSeek-V3-Lite-fp8]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add this case into qa test list, too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants