[https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is by tensorrt-cicd · Pull Request #14404 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-05-21T09:56:27Z

Summary

Root cause: DeepEP entered post-quant dispatch whenever its comm strategy declared nvfp4 support, but the MoE backend's quantize_input returns hidden_states_sf=None for modules excluded from quantization (e.g. MTP MoE), causing deep_ep_buffer.dispatch to return a single tensor instead of a (hidden_states, sf) tuple; additionally the FP4 DSL kernel asserts kv_flat row stride == phys_block_kv*(D/2+4), which fails when kv_fused is a non-contiguous slice.
Fix: In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is None; in cute_dsl_custom_ops.py FP4 runner, append .contiguous() to kv_flat. Remove the four waiver lines for fp4_indexer_dsl_mtp{0,1,2,3} now that the test passes.
Automated fix generated by repair-bot
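
The dispatch fallback described above can be illustrated with a minimal sketch. This is not the code in deep_ep.py: `FakeBuffer`, `dispatch_tokens`, and `supports_post_quant` are hypothetical names, and the stand-in buffer only mimics the behavior stated in the root cause (a missing scale factor makes dispatch return a single tensor rather than a tuple).

```python
from typing import Any, Optional, Tuple


class FakeBuffer:
    """Hypothetical stand-in for the DeepEP buffer; not the real API."""

    def dispatch(self, x: Any) -> Any:
        # Mimic the behavior described in the summary: when the scale
        # factor is missing, a single tensor comes back instead of a tuple.
        if isinstance(x, tuple):
            data, sf = x
            return data if sf is None else (data, sf)
        return x


def dispatch_tokens(
    buffer: FakeBuffer,
    hidden_states: Any,
    hidden_states_sf: Optional[Any],
    supports_post_quant: bool,
) -> Tuple[Any, Optional[Any]]:
    # Enter post-quant dispatch only when the strategy supports it AND
    # the backend actually produced a scale factor.
    if supports_post_quant and hidden_states_sf is not None:
        recv_x, recv_sf = buffer.dispatch((hidden_states, hidden_states_sf))
        return recv_x, recv_sf
    # Fallback: pre-quant (plain) dispatch, with no scale factor to unpack.
    return buffer.dispatch(hidden_states), None


# Module excluded from quantization: takes the fallback path safely.
print(dispatch_tokens(FakeBuffer(), [1, 2], None, True))
# Quantized module: takes the post-quant path.
print(dispatch_tokens(FakeBuffer(), [1, 2], [0.5], True))
```

Without the `is not None` guard, the first call would try to unpack a single value into two names, which is the failure mode the summary describes.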

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6186880

Summary by CodeRabbit

Bug Fixes
- Fixed tensor stride alignment in fused key-value cache operations to correctly handle non-contiguous memory layouts.
- Improved post-quantization dispatch fallback handling for more robust operation with unquantized states.
Tests
- Re-enabled integration tests for FP4 quantization with multi-GPU configurations.

coderabbitai · 2026-05-21T09:59:34Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 21bd25a7-5815-4190-bdc7-b10c7e071ab0

📥 Commits

Reviewing files that changed from the base of the PR and between 97f2dda and 332f9a7.

📒 Files selected for processing (3)

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
tests/integration/test_lists/waives.txt

💤 Files with no reviewable changes (1)

tests/integration/test_lists/waives.txt

📝 Walkthrough

This PR applies three targeted fixes to tensor handling and operator dispatch: enforcing contiguous memory layout for KV tensors in the cute DSL custom op to match TVM-FFI stride expectations, refining DeepEP dispatch to handle None hidden states quantization scaling factors, and removing test waivers for DeepSeek nvfp4 multi-GPU tests that now pass.

Changes

Tensor and Dispatch Fixes

| Layer / File(s) | Summary |
| --- | --- |
| KV tensor contiguity fix for TVM-FFI stride compliance `tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py` | `kv_flat` is now created using `reshape(...).contiguous()` to ensure correct row stride for the TVM-FFI kernel binding, even when `kv_fused` is a non-contiguous view from cache slicing. |
| DeepEP dispatch condition for None hidden_states_sf `tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py` | Post-quant dispatch now falls back to pre-quant behavior when `hidden_states_sf` is `None` or post-quant dispatch is unsupported, with inline documentation explaining the fallback path. |
| Test waiver removal for DeepSeek nvfp4 multi-GPU tests `tests/integration/test_lists/waives.txt` | Four SKIP waivers are removed for `TestDeepSeekV32::test_nvfp4_multi_gpus[fp4_indexer_dsl_mtp0..3]`, indicating these tests now pass with the operator fixes. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#13196: Also removes SKIP waivers for DeepSeek nvfp4 multi-GPU tests in waives.txt.
NVIDIA/TensorRT-LLM#14283: Directly related—modifies SKIP waivers for the same TestDeepSeekV32::test_nvfp4_multi_gpus[fp4_indexer_dsl_mtp0..3] test cases.
NVIDIA/TensorRT-LLM#14062: Also modifies waives.txt by removing/adding SKIP entries for DeepSeek/PyTorch test cases.

Suggested reviewers

chzblych

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The title is incomplete and cut off mid-sentence, making it unclear what the actual fix entails. | Complete the title to clearly describe the fix, such as: '[https://nvbugs/6186880][fix] Fall back to pre-quant dispatch when hidden_states_sf is None' |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description provides comprehensive details about root cause, fixes, test plan, and links, covering the essential information needed for review. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

limin2021 · 2026-05-21T10:07:22Z

My MR (#14133) changed the CuTe DSL op to allow non-contiguous tensors, so the changes to the op in this MR can be removed.

limin2021 · 2026-05-25T01:24:16Z

/bot run

tensorrt-cicd · 2026-05-25T01:30:45Z

PR_Github #50119 [ run ] triggered by Bot. Commit: e914689 Link to invocation

…ispatch and ensure FP4 DSL kv stride

Two issues caused TestDeepSeekV32::test_nvfp4_multi_gpus[fp4_indexer_dsl_*] to fail on B300:

1. DeepEP post-quant dispatch was entered whenever the comm strategy declared nvfp4 support, but the MoE backend may still return hidden_states_sf=None when the local module is excluded from quantization (e.g. an MTP layer's MoE in a partially-quantized checkpoint). Passing (hidden_states, None) to deep_ep_buffer.dispatch causes recv_x to come back as a single tensor, triggering 'too many values to unpack (expected 2)'. Route to the pre-quant (plain) dispatch path when hidden_states_sf is None.

2. The FP4 paged-MQA-logits CuTe DSL kernel binding asserts kv_flat.stride(0) == phys_block_kv * (D//2 + 4). When kv_fused is a non-contiguous view of a larger buffer (FP4 indexer-K cache slice), reshape preserves the larger row stride and the TVM-FFI bound check rejects the call. Force kv_flat to be contiguous so the row stride always matches the kernel's compile-time fake stride.

Removes the corresponding waivers now that the test passes.

Signed-off-by: tensorrt-cicd <[email protected]>
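
The stride mismatch in the second issue can be worked through with plain arithmetic. The sketch below is illustrative only: the sizes (`phys_block_kv = 64`, `D = 128`, 256 extra elements per row, 8 blocks) are assumptions chosen for the example, not values from the PR, and `contiguous_strides` is a hypothetical helper that computes row-major element strides the way a contiguous tensor would have them.

```python
def contiguous_strides(shape):
    """Row-major element strides for a contiguous tensor of `shape`."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


# Assumed illustrative sizes; not taken from the PR.
phys_block_kv = 64
D = 128
num_blocks = 8

# Row length the kernel binding expects as kv_flat.stride(0).
expected_row_stride = phys_block_kv * (D // 2 + 4)

# Hypothetical larger cache buffer whose rows carry extra trailing elements.
padded_row = expected_row_stride + 256
buffer_strides = contiguous_strides((num_blocks, padded_row))

# Slicing the first `expected_row_stride` columns keeps the parent's strides,
# so the view's row stride is the padded one.
view_strides = buffer_strides

# A contiguous copy of the same view gets strides derived from its own shape.
copy_strides = contiguous_strides((num_blocks, expected_row_stride))

print(expected_row_stride)  # 4352
print(view_strides[0])      # 4608, fails the stride equality check
print(copy_strides[0])      # 4352, passes the stride equality check
```

The copy satisfies the check at the cost of an extra copy kernel, which is the trade-off the follow-up commit below reverses.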

PR NVIDIA#14133 updated the FP4 paged-MQA-logits CuTe DSL op to accept non-contiguous kv tensors, so the .contiguous() forcing in this custom op is no longer needed and was triggering an extra copy kernel. Revert to the original reshape-only path.

Signed-off-by: tensorrt-cicd <[email protected]>

limin2021 · 2026-05-25T04:20:41Z

/bot run

tensorrt-cicd · 2026-05-25T04:25:52Z

PR_Github #50139 [ run ] triggered by Bot. Commit: 7822d38 Link to invocation

tensorrt-cicd · 2026-05-25T04:30:22Z

PR_Github #50119 [ run ] completed with state ABORTED. Commit: e914689

Link to invocation

tensorrt-cicd · 2026-05-25T11:18:38Z

PR_Github #50139 [ run ] completed with state SUCCESS. Commit: 7822d38
/LLM/main/L0_MergeRequest_PR pipeline #39689 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

limin2021 · 2026-05-25T12:52:00Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-25T12:58:11Z

PR_Github #50216 [ run ] triggered by Bot. Commit: 7822d38 Link to invocation

tensorrt-cicd · 2026-05-25T16:24:01Z

PR_Github #50216 [ run ] completed with state SUCCESS. Commit: 7822d38
/LLM/main/L0_MergeRequest_PR pipeline #39753 completed with status: 'SUCCESS'

CI Report

Link to invocation

Superjomn

LGTM

…nt dispatch path when hidden_states_sf is (NVIDIA#14404) Signed-off-by: tensorrt-cicd <[email protected]>

tensorrt-cicd requested review from a team as code owners May 21, 2026 09:56

tensorrt-cicd requested review from HuiGao-NV and liji-nv May 21, 2026 09:56

github-actions Bot assigned tensorrt-cicd May 21, 2026

limin2021 closed this May 21, 2026

limin2021 reopened this May 21, 2026

limin2021 reviewed May 21, 2026

View reviewed changes

Comment thread tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py Outdated

tensorrt-cicd force-pushed the repair-bot-bug6186880 branch from 332f9a7 to e914689 Compare May 21, 2026 11:15

tensorrt-cicd added 2 commits May 24, 2026 20:35

tensorrt-cicd force-pushed the repair-bot-bug6186880 branch from e914689 to 7822d38 Compare May 25, 2026 03:40

Superjomn approved these changes May 26, 2026

View reviewed changes

yuxianq approved these changes May 26, 2026

View reviewed changes

limin2021 requested a review from hyukn May 26, 2026 03:13

hyukn approved these changes May 26, 2026

View reviewed changes

hyukn merged commit 9835415 into NVIDIA:main May 26, 2026
7 checks passed

6 participants
