[TRTLLM-9372][feat] Enable CuteDSL MoE with Large EP #9592

syuoni · 2025-12-01T14:15:58Z

Description

This PR enables CuteDSL MoE with Large EP (based on ConfigurableMoE #9486):

Alltoall communication
EPLB (Expert Parallelism Load Balancer)

It supports NVFP4 on B200/GB200.

cat > extra_llm_api_options.yaml <<EOF
enable_attention_dp: true
enable_lm_head_tp_in_adp: true
cuda_graph_config:
  max_batch_size: 128
  enable_padding: true
moe_config:
  backend: CUTEDSL
  max_num_tokens: 9216
  load_balancer:
    layer_updates_per_iter: 1
    num_slots: 288
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
EOF

trtllm-eval --model nvidia/DeepSeek-R1-FP4 \
    --tp_size 32 \
    --ep_size 32 \
    --max_num_tokens 6144 \
    --max_seq_len 6144 \
    --kv_cache_free_gpu_memory_fraction 0.8 \
    --extra_llm_api_options extra_llm_api_options.yaml \
    gsm8k
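As a quick sanity check on the load-balancer settings above, the slot arithmetic can be sketched as follows. The sketch assumes that DeepSeek-R1 has 256 routed experts per MoE layer and that `num_slots` is split evenly across the EP ranks. The helper name `eplb_slot_layout` is hypothetical and not part of TensorRT-LLM.

```python
def eplb_slot_layout(num_experts: int, num_slots: int, ep_size: int) -> dict:
    """Return slots per EP rank and the number of redundant (replica) slots.

    Assumes slots are divided evenly across EP ranks. This is an assumption
    of the sketch, not something stated in the PR.
    """
    if num_slots < num_experts:
        raise ValueError("num_slots must be at least num_experts")
    if num_slots % ep_size != 0:
        raise ValueError("num_slots must be divisible by ep_size")
    return {
        "slots_per_rank": num_slots // ep_size,
        "redundant_slots": num_slots - num_experts,
    }


# num_slots and ep_size come from the config and command above;
# 256 routed experts is an assumed value for DeepSeek-R1.
layout = eplb_slot_layout(num_experts=256, num_slots=288, ep_size=32)
print(layout)  # {'slots_per_rank': 9, 'redundant_slots': 32}
```

Under these assumptions, each of the 32 ranks hosts 9 expert slots, and 32 slots are available for replicating heavily loaded experts.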

Test Coverage

Summary by CodeRabbit

Release Notes

New Features

Enhanced MOE (Mixture of Experts) kernel optimization with dynamic per-SM block calculation for improved performance
Added output memory initialization capability for MOE operations

Improvements

Modernized CUDA synchronization from inline assembly to runtime APIs
Improved MOE backend selection logic and integration
Added input validation checks for MOE operations

Tests

New test coverage for MOE output memory operations

cpp/include/tensorrt_llm/common/cudaUtils.h

syuoni · 2025-12-02T13:02:08Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-12-02T13:08:18Z

PR_Github #26618 [ run ] triggered by Bot. Commit: 54383d4

...t_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py

cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu

coderabbitai · 2025-12-02T13:15:20Z

📝 Walkthrough

This PR introduces dynamic CUDA occupancy-aware kernel launch sizing, adds MOE output memset functionality, replaces inline assembly synchronization with CUDA runtime APIs, and extends Torch bindings and MoE backend support for CuteDslFusedMoE with load-balancer integration.
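The occupancy-aware launch sizing mentioned above can be illustrated with a small host-side sketch. It is written in Python for readability, whereas the real helper is a C++ template in `cudaUtils.h`. `query_occupancy` is a hypothetical stand-in for `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, the kernel name and SM count are illustrative values, and the cache key and grid-size clamp are assumptions rather than the PR's exact logic.

```python
from typing import Callable, Dict, Tuple

# Cache keyed by (kernel name, threads per block, dynamic shared memory bytes).
_occupancy_cache: Dict[Tuple[str, int, int], int] = {}


def get_max_active_blocks_per_sm(
    kernel_name: str,
    block_size: int,
    dynamic_smem_bytes: int,
    query_occupancy: Callable[[str, int, int], int],
) -> int:
    """Memoize the occupancy query so it runs once per launch configuration."""
    key = (kernel_name, block_size, dynamic_smem_bytes)
    if key not in _occupancy_cache:
        _occupancy_cache[key] = query_occupancy(kernel_name, block_size, dynamic_smem_bytes)
    return _occupancy_cache[key]


def launch_grid_size(work_items: int, block_size: int, sm_count: int, blocks_per_sm: int) -> int:
    """Clamp the grid to one full wave of resident blocks (at least one block)."""
    blocks_needed = (work_items + block_size - 1) // block_size
    return max(1, min(blocks_needed, sm_count * blocks_per_sm))


calls = []


def fake_query(name: str, block_size: int, smem: int) -> int:
    # Stand-in for the CUDA runtime call; records how often it is invoked.
    calls.append((name, block_size, smem))
    return 4


first = get_max_active_blocks_per_sm("moePermuteKernel", 256, 0, fake_query)
second = get_max_active_blocks_per_sm("moePermuteKernel", 256, 0, fake_query)
# 148 is an illustrative SM count, not a value taken from the PR.
grid = launch_grid_size(work_items=1_000_000, block_size=256, sm_count=148, blocks_per_sm=first)
print(first, second, len(calls), grid)  # 4 4 1 592
```

The second lookup is served from the cache, so the stand-in query runs only once. The clamp presumes the kernel iterates over its work with a grid-stride loop, which is also an assumption of this sketch.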

Changes

| Cohort / File(s) | Summary |
|---|---|
| CUDA Occupancy Utility `cpp/include/tensorrt_llm/common/cudaUtils.h` | Added `getMaxActiveBlocksPerSM` template function with static cache (unordered_map) to memoize occupancy results per kernel, reducing redundant `cudaOccupancyMaxActiveBlocksPerMultiprocessor` calls. |
| MOE Kernel Launch & Synchronization `cpp/tensorrt_llm/kernels/cuteDslKernels/moeUtils.cu`, `cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu` | Replaced inline assembly grid-dependency synchronization (`griddepcontrol.*`) with CUDA runtime APIs (`cudaGridDependencySynchronize`, `cudaTriggerProgrammaticLaunchCompletion`). Replaced fixed per-SM block calculations with dynamic sizing using `getMaxActiveBlocksPerSM` in kernel launchers (moePermute, moeActivation, expandInputRowsKernelLauncher, finalizeMoeRoutingKernelLauncher). |
| MOE Output Memset Implementation `cpp/tensorrt_llm/kernels/cuteDslKernels/moeUtils.cu`, `cpp/tensorrt_llm/kernels/cuteDslKernels/moeUtils.h` | Introduced `moeOutputMemsetKernel` (device kernel) and `moeOutputMemset` (launch wrapper) to allocate and zero-initialize MOE output tensors per token. Added instantiation macro and explicit templates for half and bf16 types. |
| Torch MOE Bindings `cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp` | Added `moe_output_memset_inplace` PyTorch operation with input validation (2D tensor, int32 indices). Enhanced `moe_permute` input validation to check `permuted_idx_to_expanded_idx` is int32. Registered operation in CUDA backend and Torch library. |
| Autotuner & Compilation Utilities `tensorrt_llm/_torch/autotuner.py`, `tensorrt_llm/_torch/compilation/utils.py` | Updated fallback log message in autotuner to include operation name. Added inplace tensor mappings for `moe_output_memset_inplace` (input mutated) and `cute_dsl_nvfp4_grouped_gemm_finalize_blackwell` (output mutated). |
| Custom Ops & Fusion `tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py` | Extended finalize and grouped GEMM entry points to accept optional output tensors; propagate output through fusion pipeline. Updated mutates_args to include output. Enhanced shape validations and tuning constraints to accommodate output tensor flow. |
| Blockscaled GEMM Fusion `tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py` | Introduced intermediate-scale computation (`scale_interm_size`) and adjusted c_sf tensor layout dimensions to align with new scale-internal sizing while preserving A/B/C tensor shapes. |
| MoE Backend Selection & Configuration `tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`, `tensorrt_llm/_torch/modules/fused_moe/create_moe.py` | Replaced string-based backend checks with isinstance-based discrimination. Added CuteDslFusedMoE import and handling. Extended backend selection to route to CuteDslFusedMoE when quantization config supports fp8_block_scales or nvfp4. Expanded load-balancer compatibility to include CuteDslFusedMoE. Updated warning/fallback messaging. |
| CuteDsl MoE Implementation `tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py` | Added `init_load_balancer` parameter to constructor. Consolidated forward path by introducing `quantize_input`, `run_moe_nvfp4`, `run_moe_fp8_block_scales`, and `run_moe` dispatcher methods. Removed specialized `forward_chunk_*` variants; unified path now performs routing, quantization, optional DP allgather, then dispatches via `run_moe`. |
| FP4 Weight Loading `tensorrt_llm/_torch/modules/fused_moe/quantization.py` | Added `load_expert_w3_w1_weight` and `load_expert_w3_w1_weight_scale_nvfp4` methods to NVFP4CuteDslFusedMoEMethod. These interleave and store W3/W1 weights and scales into destination tensors for SwiGLU fusion, modifying interleave dimension handling. |
| Eval Command `tensorrt_llm/commands/eval.py` | Added four new BuildConfig parameters (`max_batch_size`, `max_num_tokens`, `max_beam_width`, `max_seq_len`) to llm_args dictionary; propagates config values from BuildConfig into public LLM construction. |
| Tests `tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py` | Added `test_moe_output_memset_inplace` test covering multiple configurations (tile_size, top_k, num_tokens, dtype). Updated `test_nvfp4_grouped_gemm_finalize_blackwell` call site to pass None for new output parameter. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Key areas requiring attention:

Kernel launch sizing logic (moeUtils.cu, moe_kernels.cu): Verify occupancy calculation correctness and that dynamic block counts do not exceed available GPU resources across different hardware SKUs.
Output memset kernel correctness: Validate the moeOutputMemsetKernel implementation and synchronization pattern (grid dependency synchronization replacement).
Torch tensor binding contracts: Verify shape/dtype validation logic in moe_output_memset_inplace and output tensor propagation through fusion pipeline in cute_dsl_custom_ops.py.
FP4 weight interleaving: Review interleave dimension changes (1 → 0) and scale layout transformations in quantization.py to ensure compatibility with SwiGLU fusion kernel expectations.
Backend selection logic: Confirm CuteDslFusedMoE routing conditions (fp8_block_scales/nvfp4 checks) and load-balancer compatibility constraints in create_moe.py.
Forward consolidation: Ensure the unified forward path and dispatcher methods (run_moe*) preserve behavior parity with previous specialized paths.

Suggested reviewers

limin2021
zongfeijing
hyukn
kaiyux
djns99
dongxuy04
xxi-nv
yuxianq

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 21.28% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main feature being introduced: enabling CuteDSL MoE with Large EP support. |
| Description check | ✅ Passed | PR description is mostly complete with clear title, objectives, and test coverage placeholder, but Test Coverage and PR Checklist sections are incomplete or contain generic items. |

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
381-385: Inconsistent error message.

The error message at line 385 only mentions TRTLLMGenFusedMoE, but ConfigurableMoE now also supports CuteDslFusedMoE (as shown at line 352 and in the warning at line 379). Update for consistency.

Apply this diff:
             else:
                 # For other incompatible backends, raise error
                 raise ValueError(
                     f"ENABLE_CONFIGURABLE_MOE is set but backend {moe_cls.__name__} is not supported. "
-                    f"ConfigurableMoE only supports TRTLLMGenFusedMoE backend.")
+                    f"ConfigurableMoE only supports TRTLLMGenFusedMoE and CuteDslFusedMoE backends.")
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (1)
1219-1246: Fake function ignores provided output parameter.

The fake function signature includes output: Optional[torch.Tensor], but the function always creates and returns a new tensor. Because mutates_args=("output",) is declared, the fake function should match those semantics and return the provided output when it is not None:
 def _(
     input: torch.Tensor,
     weight: torch.Tensor,
     input_scale: torch.Tensor,
     weight_scale: torch.Tensor,
     alpha: torch.Tensor,
     output: Optional[torch.Tensor],
     tile_idx_to_group_idx: torch.Tensor,
     tile_idx_to_mn_limit: torch.Tensor,
     permuted_idx_to_expanded_idx: torch.Tensor,
     num_non_exiting_tiles: torch.Tensor,
     token_final_scales: torch.Tensor,
     num_experts: int,
     top_k: int,
     num_local_experts: int,
     local_expert_offset: int,
     tile_size: int,
     output_dtype: torch.dtype,
     scaling_vector_size: int = 16,
 ) -> torch.Tensor:
+    if output is not None:
+        return output
     num_tokens = token_final_scales.size(0)
     n = weight.size(1)
     return torch.empty(num_tokens,
                        n,
                        dtype=output_dtype,
                        device=input.device)

🧹 Nitpick comments (4)

tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py (1)
50-55: Backend kwargs wiring looks good, but TRTLLMGen comment is misleading

The new isinstance-based dispatch in _get_backend_kwargs and the CuteDslFusedMoE branch that forwards enable_alltoall into backend kwargs are consistent with the updated CuteDslFusedMoE.run_moe(...) API and communication handling.

The TRTLLMGen branch correctly passes router_logits only when _supports_load_balancer() is False (fused routing), which matches _forward_chunk_impl where routing is skipped in that case.

The inline comment around router_logits_arg currently says “If backend doesn't support load balancer, routing is done before communication; in that case, router_logits should be None”, which contradicts the actual behavior (for fused routing backends, routing is handled inside the backend and needs router_logits). It would be clearer to rephrase, e.g.:
-            # If backend doesn't support load balancer, routing is done before communication
-            # In that case, router_logits should be None (routing already done)
             router_logits_arg = None
             if not self.backend._supports_load_balancer():
-                # For fused routing backends, router_logits is only needed if routing hasn't been done yet
+                # Fused-routing backend: routing is performed inside the backend,
+                # so we must pass router_logits through.
                 router_logits_arg = router_logits
Please double-check that all CuteDSL backends report _supports_load_balancer() == True (separated routing) so that ConfigurableMoE does not attempt to hand router_logits to them.
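The router_logits rule discussed in this comment can be sketched in isolation. The two backend classes below are hypothetical stand-ins (they are not the real CuteDslFusedMoE or TRTLLMGenFusedMoE), and `get_backend_kwargs` only mirrors the behavior described above under that assumption.

```python
from typing import Any, Dict, List, Optional


class SeparatedRoutingBackend:
    """Hypothetical stand-in for a backend whose routing runs outside the backend."""

    def _supports_load_balancer(self) -> bool:
        return True


class FusedRoutingBackend:
    """Hypothetical stand-in for a backend that performs routing internally."""

    def _supports_load_balancer(self) -> bool:
        return False


def get_backend_kwargs(
    backend: Any,
    router_logits: Optional[List[float]],
    enable_alltoall: bool,
) -> Dict[str, Any]:
    """Build per-backend keyword arguments using isinstance-based dispatch."""
    if isinstance(backend, SeparatedRoutingBackend):
        # Routing already ran before communication, so no logits are forwarded.
        return {"enable_alltoall": enable_alltoall}
    if isinstance(backend, FusedRoutingBackend):
        router_logits_arg = None
        if not backend._supports_load_balancer():
            # Fused routing: the backend performs routing itself and needs the logits.
            router_logits_arg = router_logits
        return {"router_logits": router_logits_arg}
    raise ValueError(f"Unsupported backend: {type(backend).__name__}")


logits = [0.1, 0.7, 0.2]
separated_kwargs = get_backend_kwargs(SeparatedRoutingBackend(), logits, enable_alltoall=True)
fused_kwargs = get_backend_kwargs(FusedRoutingBackend(), logits, enable_alltoall=True)
print(separated_kwargs)  # {'enable_alltoall': True}
print(fused_kwargs)      # {'router_logits': [0.1, 0.7, 0.2]}
```

Dispatching on the class rather than on a backend-name string keeps the check valid for subclasses and avoids silent mismatches when a name changes.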

Also applies to: 910-975
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (1)
230-275: Good test coverage for the new moe_output_memset_inplace functionality.

The test logic is correct: it verifies that tokens with valid permuted indices (i.e., expanded_idx_to_permuted_idx >= 0) get zeroed out while others remain unchanged.

Static analysis correctly identifies unused unpacked variables. Consider prefixing them with underscores to indicate intentional non-use:
     (
-        tile_idx_to_group_idx,
+        _tile_idx_to_group_idx,
         tile_idx_to_mn_limit,
         expanded_idx_to_permuted_idx,
         permuted_idx_to_expanded_idx,
-        total_num_padded_tokens,
+        _total_num_padded_tokens,
         num_non_exiting_tiles,
     ) = torch.ops.trtllm.moe_sort(
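The behavior this test checks can be written as a small pure-Python reference. This is a sketch under assumptions: it uses nested lists instead of tensors, the function name `moe_output_memset_reference` is hypothetical, and it treats a token as "valid" when any of its top_k slots has a non-negative permuted index, which is one reading of the description above rather than a statement about the kernel.

```python
from typing import List


def moe_output_memset_reference(
    output: List[List[float]],
    expanded_idx_to_permuted_idx: List[List[int]],
) -> None:
    """Zero, in place, every token row that has at least one valid permuted index.

    A negative index marks a (token, top_k slot) pair without a valid
    permuted position.
    """
    for token, slots in enumerate(expanded_idx_to_permuted_idx):
        if any(idx >= 0 for idx in slots):
            row = output[token]
            for col in range(len(row)):
                row[col] = 0.0


output = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
expanded_idx_to_permuted_idx = [[0, -1], [-1, -1], [2, 1]]
moe_output_memset_reference(output, expanded_idx_to_permuted_idx)
print(output)  # [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
```

Token 1 has no valid slot, so its row is left untouched, while tokens 0 and 2 are cleared.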
cpp/tensorrt_llm/kernels/cuteDslKernels/moeUtils.h (1)

35-39: Header declaration aligns with implementation; consider harmonizing parameter naming

The new moeOutputMemset declaration matches the .cu implementation in type and parameter order, which is what matters for correctness. There is, however, a minor naming mismatch for the expanded_idx_to_permuted_idx parameter versus the .cu file comment/name — aligning those would make the API easier to follow.

cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp (1)

258-308: moe_output_memset_inplace validation and dispatch match kernel expectations

The new inplace memset helper:

Enforces consistent 2D input, 2D expanded_idx_to_permuted_idx, 1D permuted_idx_to_expanded_idx/tile_idx_to_mn_limit, and single-element num_non_exiting_tiles.

Checks all index tensors are int32 and that max_num_permuted_tokens matches tile_tokens_dim * num_tiles and is ≥ num_tokens * top_k.

Dispatches cleanly to moeOutputMemset for half and bfloat16, with a clear error for other dtypes.

This matches the moeUtils.cu kernel signature and routing outputs, and the inplace contract is correctly expressed via the Torch schema (Tensor(a!) input).
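The shape checks listed above can be summarized in a short sketch that works on plain shape tuples. The function name `check_memset_shapes` is hypothetical, and the sketch assumes that num_tokens is the first input dimension, that top_k is the second dimension of expanded_idx_to_permuted_idx, and that num_tiles is the length of tile_idx_to_mn_limit. Dtype checks are omitted.

```python
from typing import Sequence


def check_memset_shapes(
    input_shape: Sequence[int],
    expanded_idx_shape: Sequence[int],
    permuted_idx_shape: Sequence[int],
    mn_limit_shape: Sequence[int],
    tile_tokens_dim: int,
) -> None:
    """Raise ValueError if the shapes violate the constraints described above."""
    if len(input_shape) != 2:
        raise ValueError("input must be 2D")
    if len(expanded_idx_shape) != 2:
        raise ValueError("expanded_idx_to_permuted_idx must be 2D")
    if len(permuted_idx_shape) != 1 or len(mn_limit_shape) != 1:
        raise ValueError("permuted_idx_to_expanded_idx and tile_idx_to_mn_limit must be 1D")

    num_tokens = input_shape[0]
    if expanded_idx_shape[0] != num_tokens:
        raise ValueError("expanded_idx_to_permuted_idx must have one row per token")
    top_k = expanded_idx_shape[1]
    max_num_permuted_tokens = permuted_idx_shape[0]
    num_tiles = mn_limit_shape[0]

    if max_num_permuted_tokens != tile_tokens_dim * num_tiles:
        raise ValueError("max_num_permuted_tokens must equal tile_tokens_dim * num_tiles")
    if max_num_permuted_tokens < num_tokens * top_k:
        raise ValueError("max_num_permuted_tokens must be >= num_tokens * top_k")


# 4 tokens, top_k = 2, tile size 8 with 2 tiles -> 16 permuted slots >= 8 needed.
check_memset_shapes((4, 64), (4, 2), (16,), (2,), tile_tokens_dim=8)
print("shapes ok")
```

Collecting the checks in one helper like this is also one way to realize the shared-helper idea for moe_permute and moe_output_memset_inplace.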

If you find the shape/dtype checks in moe_permute and moe_output_memset_inplace evolving together, consider a small shared helper to keep them in sync, but it’s not required for this PR.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be48cdf and 54383d4.

📒 Files selected for processing (15)

cpp/include/tensorrt_llm/common/cudaUtils.h (2 hunks)
cpp/tensorrt_llm/kernels/cuteDslKernels/moeUtils.cu (9 hunks)
cpp/tensorrt_llm/kernels/cuteDslKernels/moeUtils.h (1 hunks)
cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu (3 hunks)
cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp (4 hunks)
tensorrt_llm/_torch/autotuner.py (1 hunks)
tensorrt_llm/_torch/compilation/utils.py (1 hunks)
tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py (8 hunks)
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py (2 hunks)
tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py (2 hunks)
tensorrt_llm/_torch/modules/fused_moe/create_moe.py (6 hunks)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (7 hunks)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (1 hunks)
tensorrt_llm/commands/eval.py (1 hunks)
tests/unittest/_torch/thop/parallel/test_cute_dsl_moe.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (5)

**/*.{cpp,h,cu}