[https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is #14404
Conversation
Walkthrough
This PR applies three targeted fixes to tensor handling and operator dispatch: enforcing contiguous memory layout for KV tensors in the CuTe DSL custom op to match TVM-FFI stride expectations, refining DeepEP dispatch to handle
Changes: Tensor and Dispatch Fixes
My MR (#14133) changed the CuTe DSL op to allow non-contiguous tensors, so the changes to the op in this MR could be removed.
332f9a7 to e914689
/bot run
PR_Github #50119 [ run ] triggered by Bot. Commit:
…ispatch and ensure FP4 DSL kv stride

Two issues caused TestDeepSeekV32::test_nvfp4_multi_gpus[fp4_indexer_dsl_*] to fail on B300:

1. DeepEP post-quant dispatch was entered whenever the comm strategy declared nvfp4 support, but the MoE backend may still return hidden_states_sf=None when the local module is excluded from quantization (e.g. an MTP layer's MoE in a partially-quantized checkpoint). Passing (hidden_states, None) to deep_ep_buffer.dispatch causes recv_x to come back as a single tensor, triggering 'too many values to unpack (expected 2)'. Route to the pre-quant (plain) dispatch path when hidden_states_sf is None.

2. The FP4 paged-MQA-logits CuTe DSL kernel binding asserts kv_flat.stride(0) == phys_block_kv * (D//2 + 4). When kv_fused is a non-contiguous view of a larger buffer (FP4 indexer-K cache slice), reshape preserves the larger row stride and the TVM-FFI bound check rejects the call. Force kv_flat to be contiguous so the row stride always matches the kernel's compile-time fake stride.

Removes the corresponding waivers now that the test passes.

Signed-off-by: tensorrt-cicd <[email protected]>
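The fallback described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code from deep_ep.py: `FakeBuffer`, `dispatch_tokens`, and the `supports_post_quant` flag are hypothetical names, and plain strings stand in for tensors. The fake `dispatch` only mimics the behaviour the commit message describes (a pair comes back only when a real scale-factor tensor is passed).

```python
class FakeBuffer:
    """Hypothetical stand-in for the DeepEP buffer; NOT the real DeepEP API."""

    def dispatch(self, x):
        # A (data, scale) pair with a real scale tensor comes back as a pair.
        if isinstance(x, tuple) and x[1] is not None:
            return (f"recv({x[0]})", f"recv({x[1]})")
        # Anything else comes back as a single tensor, which is what made
        # a two-value unpack fail in the scenario the commit describes.
        if isinstance(x, tuple):
            x = x[0]
        return f"recv({x})"


def dispatch_tokens(buffer, hidden_states, hidden_states_sf, supports_post_quant):
    """Pick the dispatch path; all parameter names are illustrative assumptions."""
    # Take the post-quant path only when scale factors actually exist,
    # not merely because the comm strategy declares nvfp4 support.
    if supports_post_quant and hidden_states_sf is not None:
        recv_x, recv_sf = buffer.dispatch((hidden_states, hidden_states_sf))
        return recv_x, recv_sf
    # Pre-quant (plain) path: dispatch a single tensor, no scale factors.
    recv_x = buffer.dispatch(hidden_states)
    return recv_x, None


quantized = dispatch_tokens(FakeBuffer(), "h", "sf", True)
unquantized = dispatch_tokens(FakeBuffer(), "h", None, True)
```

In this sketch the guard sits on `hidden_states_sf is not None` rather than on the strategy flag alone, so a module excluded from quantization never reaches the two-value unpack.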
PR NVIDIA#14133 updated the FP4 paged-MQA-logits CuTe DSL op to accept non-contiguous kv tensors, so the .contiguous() forcing in this custom op is no longer needed and was triggering an extra copy kernel. Revert to the original reshape-only path.

Signed-off-by: tensorrt-cicd <[email protected]>
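To illustrate why a sliced view can trip a row-stride check, and why a contiguous copy satisfies it at the cost of a copy, here is a small pure-Python sketch of row-major stride arithmetic. The concrete sizes (`phys_block_kv = 64`, `D = 128`, a parent row of 5000 elements) and the helper names are assumptions chosen for illustration; only the formula `phys_block_kv * (D//2 + 4)` comes from the commit messages above.

```python
def row_major_strides(shape):
    """Element strides of a densely packed row-major array of this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def expected_row_stride(phys_block_kv, d):
    # Row stride the kernel binding is described as asserting.
    return phys_block_kv * (d // 2 + 4)


# Hypothetical sizes, chosen only for illustration.
phys_block_kv, d = 64, 128
num_blocks, parent_row = 8, 5000

expected = expected_row_stride(phys_block_kv, d)

# A view keeping the first `expected` elements of each parent row still
# steps by the parent's row length, so its row stride is the parent's.
view_stride = row_major_strides((num_blocks, parent_row))[0]

# A contiguous copy is repacked densely, so its row stride equals its width.
contiguous_stride = row_major_strides((num_blocks, expected))[0]

print(expected, view_stride, contiguous_stride)
```

Under these assumed sizes the view's row stride differs from the expected value while the dense copy matches it, which is the trade-off the revert avoids once the op accepts non-contiguous inputs.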
e914689 to 7822d38
/bot run
PR_Github #50139 [ run ] triggered by Bot. Commit:
PR_Github #50119 [ run ] completed with state
PR_Github #50139 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #50216 [ run ] triggered by Bot. Commit:
PR_Github #50216 [ run ] completed with state
…nt dispatch path when hidden_states_sf is (NVIDIA#14404)

Signed-off-by: tensorrt-cicd <[email protected]>