[https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is #14404

Merged
hyukn merged 2 commits into NVIDIA:main from tensorrt-cicd:repair-bot-bug6186880
May 26, 2026

Conversation

@tensorrt-cicd
Collaborator

@tensorrt-cicd tensorrt-cicd commented May 21, 2026

Summary

  • Root cause: DeepEP entered post-quant dispatch whenever its comm strategy declared nvfp4 support. However, the MoE backend's quantize_input returns hidden_states_sf=None for modules excluded from quantization (e.g. MTP MoE), so deep_ep_buffer.dispatch returned a single tensor instead of a (hidden_states, sf) tuple. Additionally, the FP4 DSL kernel asserts that the kv_flat row stride equals phys_block_kv*(D/2+4), which fails when kv_fused is a non-contiguous slice.
  • Fix: In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is None; in cute_dsl_custom_ops.py FP4 runner, append .contiguous() to kv_flat. Remove the four waiver lines for fp4_indexer_dsl_mtp{0,1,2,3} now that the test passes.
  • Automated fix generated by repair-bot
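
The failure mode and the fallback described above can be illustrated with a minimal sketch. It does not reproduce the code in deep_ep.py: fake_dispatch is a hypothetical stand-in that models only the behaviour stated in this description (a payload whose scale factor is None comes back as a single object), and the function names and the supports_post_quant flag are assumptions made for illustration.

```python
from typing import Any, Optional, Tuple


def fake_dispatch(payload: Any) -> Any:
    """Hypothetical stand-in for deep_ep_buffer.dispatch.

    Models only what the PR describes: a (x, sf) payload with sf=None
    comes back as a single object rather than a 2-tuple.
    """
    if isinstance(payload, tuple):
        x, sf = payload
        return (x, sf) if sf is not None else x
    return payload


def unguarded_dispatch(hidden_states: Any, hidden_states_sf: Optional[Any]):
    # Pre-fix behaviour (sketch): always unpack two values from the result.
    recv_x, recv_sf = fake_dispatch((hidden_states, hidden_states_sf))
    return recv_x, recv_sf


def dispatch_with_fallback(
    hidden_states: Any,
    hidden_states_sf: Optional[Any],
    supports_post_quant: bool,
) -> Tuple[Any, Optional[Any]]:
    # Post-quant path only when the strategy supports it AND scale factors exist.
    if supports_post_quant and hidden_states_sf is not None:
        recv_x, recv_sf = fake_dispatch((hidden_states, hidden_states_sf))
        return recv_x, recv_sf
    # Otherwise fall back to the pre-quant (plain) path: no scale factors.
    return fake_dispatch(hidden_states), None
```

With three token rows and hidden_states_sf=None, unguarded_dispatch raises a ValueError when unpacking, while dispatch_with_fallback returns the rows with a None scale factor.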

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

Summary by CodeRabbit

  • Bug Fixes

    • Fixed tensor stride alignment in fused key-value cache operations to correctly handle non-contiguous memory layouts.
    • Improved post-quantization dispatch fallback handling for more robust operation with unquantized states.
  • Tests

    • Re-enabled integration tests for FP4 quantization with multi-GPU configurations.

@coderabbitai
Contributor

coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 21bd25a7-5815-4190-bdc7-b10c7e071ab0

📥 Commits

Reviewing files that changed from the base of the PR and between 97f2dda and 332f9a7.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

📝 Walkthrough

This PR applies three targeted fixes to tensor handling and operator dispatch: it enforces a contiguous memory layout for KV tensors in the CuTe DSL custom op to match TVM-FFI stride expectations, makes DeepEP dispatch fall back to the pre-quant path when the hidden-state scale factor (hidden_states_sf) is None, and removes test waivers for DeepSeek nvfp4 multi-GPU tests that now pass.

Changes

Tensor and Dispatch Fixes

Layer / File(s): Summary

  • KV tensor contiguity fix for TVM-FFI stride compliance (tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py): kv_flat is now created using reshape(...).contiguous() to ensure correct row stride for the TVM-FFI kernel binding, even when kv_fused is a non-contiguous view from cache slicing.
  • DeepEP dispatch condition for None hidden_states_sf (tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py): Post-quant dispatch now falls back to pre-quant behavior when hidden_states_sf is None or post-quant dispatch is unsupported, with inline documentation explaining the fallback path.
  • Test waiver removal for DeepSeek nvfp4 multi-GPU tests (tests/integration/test_lists/waives.txt): Four SKIP waivers are removed for TestDeepSeekV32::test_nvfp4_multi_gpus[fp4_indexer_dsl_mtp0..3], indicating these tests now pass with the operator fixes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#13196: Also removes SKIP waivers for DeepSeek nvfp4 multi-GPU tests in waives.txt.
  • NVIDIA/TensorRT-LLM#14283: Directly related—modifies SKIP waivers for the same TestDeepSeekV32::test_nvfp4_multi_gpus[fp4_indexer_dsl_mtp0..3] test cases.
  • NVIDIA/TensorRT-LLM#14062: Also modifies waives.txt by removing/adding SKIP entries for DeepSeek/PyTorch test cases.

Suggested reviewers

  • chzblych
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Title check (⚠️ Warning): The title is incomplete and cut off mid-sentence, making it unclear what the actual fix entails. Resolution: complete the title to clearly describe the fix, such as '[https://nvbugs/6186880][fix] Fall back to pre-quant dispatch when hidden_states_sf is None'.

✅ Passed checks (4 passed)

  • Description check (✅ Passed): The PR description provides comprehensive details about root cause, fixes, test plan, and links, covering the essential information needed for review.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.


@limin2021
Collaborator

limin2021 commented May 21, 2026

My PR (#14133) changed the CuTe DSL op to accept non-contiguous tensors, so the op changes in this PR can be removed.

@limin2021 limin2021 closed this May 21, 2026
@limin2021 limin2021 reopened this May 21, 2026
Comment thread tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py Outdated
@tensorrt-cicd tensorrt-cicd force-pushed the repair-bot-bug6186880 branch from 332f9a7 to e914689 Compare May 21, 2026 11:15
@limin2021
Collaborator

/bot run

@tensorrt-cicd
Collaborator Author

PR_Github #50119 [ run ] triggered by Bot. Commit: e914689 Link to invocation

…ispatch and ensure FP4 DSL kv stride

Two issues caused TestDeepSeekV32::test_nvfp4_multi_gpus[fp4_indexer_dsl_*]
to fail on B300:

1. DeepEP post-quant dispatch was entered whenever the comm strategy
   declared nvfp4 support, but the MoE backend may still return
   hidden_states_sf=None when the local module is excluded from
   quantization (e.g. an MTP layer's MoE in a partially-quantized
   checkpoint). Passing (hidden_states, None) to deep_ep_buffer.dispatch
   causes recv_x to come back as a single tensor, triggering
   'too many values to unpack (expected 2)'. Route to the pre-quant
   (plain) dispatch path when hidden_states_sf is None.

2. The FP4 paged-MQA-logits CuTe DSL kernel binding asserts
   kv_flat.stride(0) == phys_block_kv * (D//2 + 4). When kv_fused is a
   non-contiguous view of a larger buffer (FP4 indexer-K cache slice),
   reshape preserves the larger row stride and the TVM-FFI bound check
   rejects the call. Force kv_flat to be contiguous so the row stride
   always matches the kernel's compile-time fake stride.

Removes the corresponding waivers now that the test passes.
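
The stride check in item 2 can be worked through with a small pure-Python model. This is a sketch under stated assumptions, not the kernel binding: the helper names are hypothetical, the values phys_block_kv=64, D=128 and the parent row size are made up for illustration, and the model only captures the fact that a row-major view keeps its parent's row stride while a contiguous copy gets a packed one.

```python
def expected_row_stride(phys_block_kv: int, head_dim: int) -> int:
    """Row stride the binding asserts, per the formula in the commit message."""
    return phys_block_kv * (head_dim // 2 + 4)


def row_stride(parent_row_elems: int, slice_row_elems: int,
               contiguous_copy: bool) -> int:
    """Row stride of a 2-D row-major slice buffer[:, :slice_row_elems].

    A view keeps the parent's (larger) row stride; a contiguous copy is
    repacked so the row stride equals the slice's own row length.
    """
    return slice_row_elems if contiguous_copy else parent_row_elems


# Assumed example values (not taken from the PR).
PHYS_BLOCK_KV = 64
HEAD_DIM = 128
expected = expected_row_stride(PHYS_BLOCK_KV, HEAD_DIM)  # 64 * (64 + 4) = 4352

# Hypothetical parent buffer whose rows are twice as wide as the slice.
parent_row = 2 * expected

view_stride = row_stride(parent_row, expected, contiguous_copy=False)
copy_stride = row_stride(parent_row, expected, contiguous_copy=True)

view_passes = view_stride == expected   # False: the view keeps the wider stride
copy_passes = copy_stride == expected   # True: the copy has the packed stride
```

In this model the slice taken as a view fails the equality check, while the contiguous copy passes it at the cost of an extra copy.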

Signed-off-by: tensorrt-cicd <[email protected]>

PR NVIDIA#14133 updated the FP4 paged-MQA-logits CuTe DSL op to accept
non-contiguous kv tensors, so the .contiguous() forcing in this
custom op is no longer needed and was triggering an extra copy
kernel. Revert to the original reshape-only path.

Signed-off-by: tensorrt-cicd <[email protected]>
@tensorrt-cicd tensorrt-cicd force-pushed the repair-bot-bug6186880 branch from e914689 to 7822d38 Compare May 25, 2026 03:40
@limin2021
Collaborator

/bot run

@tensorrt-cicd
Collaborator Author

PR_Github #50139 [ run ] triggered by Bot. Commit: 7822d38 Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #50119 [ run ] completed with state ABORTED. Commit: e914689

Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #50139 [ run ] completed with state SUCCESS. Commit: 7822d38
/LLM/main/L0_MergeRequest_PR pipeline #39689 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@limin2021
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator Author

PR_Github #50216 [ run ] triggered by Bot. Commit: 7822d38 Link to invocation

@tensorrt-cicd
Collaborator Author

PR_Github #50216 [ run ] completed with state SUCCESS. Commit: 7822d38
/LLM/main/L0_MergeRequest_PR pipeline #39753 completed with status: 'SUCCESS'

CI Report

Link to invocation

Collaborator

@Superjomn Superjomn left a comment


LGTM

@limin2021 limin2021 requested a review from hyukn May 26, 2026 03:13
@hyukn hyukn merged commit 9835415 into NVIDIA:main May 26, 2026
7 checks passed
bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request May 28, 2026
6 participants