[None] [feat] Add densegemm backend for MoE #10479

Merged
zongfeijing merged 15 commits into NVIDIA:main from zongfeijing:dev on Apr 1, 2026

@zongfeijing
Collaborator

@zongfeijing zongfeijing commented Jan 7, 2026

@coderabbitai summary

Description

Add a new DenseGEMM backend for MoE that reshapes all expert weights into a single dense matrix and performs one large GEMM call per FC layer, targeting minimum latency on Blackwell (SM100/SM103) with NVFP4 quantization.

Key design

Instead of the traditional per-expert grouped GEMM approach, DenseGEMM concatenates all expert weights along the N (FC1) or K (FC2) dimension and executes a single dense GEMM. This trades flexibility for maximum GPU utilization at small batch sizes where per-expert GEMMs underutilize SMs.

  • FC1 kernel: Fuses GEMM + SwiGLU activation + FP4 output quantization (with SFC generation) into a single CuTe DSL kernel epilogue
  • FC2 kernel: Fuses GEMM + per-token-per-expert alpha scaling into the epilogue, accumulating expert contributions in a single pass
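  • fc2_alpha fusion optimization: Normalizes fc2_alpha into FC1's alpha_post, reducing FC2 to a simple scalar-alpha nvfp4_gemm call (controlled by the TRTLLM_MOE_FUSED_FC2_ALPHA env var; enabled by default in the initial revision, switched to disabled by default in a later commit of this PR because of an accuracy issue under TP)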
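
For readers unfamiliar with the "MoE as dense GEMM" idea, the following is a minimal sketch in plain floating-point Python, not the NVFP4 CuTe DSL kernels from this PR. It assumes each expert's FC1 weight stores gate columns followed by up columns and that unselected experts carry an alpha of zero; every function and variable name here is hypothetical. It checks that two dense GEMMs over concatenated weights reproduce the per-expert loop, with and without moving a normalized alpha into the FC1 stage.

```python
import math


def matmul(a, b):
    """Row-by-column matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(v):
    return v / (1.0 + math.exp(-v))


def moe_reference(x, w1, w2, alpha):
    """Per-expert loop: out = sum_e alpha[:, e] * (swiglu(x @ w1[e]) @ w2[e])."""
    hidden, inter = len(x[0]), len(w2[0])
    out = [[0.0] * hidden for _ in x]
    for e in range(len(w1)):
        gate_up = matmul(x, w1[e])  # [T, 2*I]: gate columns first, then up
        act = [[silu(r[i]) * r[inter + i] for i in range(inter)] for r in gate_up]
        y = matmul(act, w2[e])  # [T, H]
        for t, row in enumerate(y):
            for h, v in enumerate(row):
                out[t][h] += alpha[t][e] * v
    return out


def moe_dense(x, w1, w2, alpha, alpha_max=None):
    """Two dense GEMMs over concatenated expert weights.

    If alpha_max is given, alpha / alpha_max is applied after SwiGLU and the
    scalar alpha_max is applied to the FC2 output (assumed fused-alpha variant).
    """
    experts, inter = len(w1), len(w2[0])
    fc2_scalar = 1.0 if alpha_max is None else alpha_max
    # FC1: concatenate expert weights along N -> [H, E * 2 * I].
    w1_dense = [sum((w1[e][k] for e in range(experts)), []) for k in range(len(x[0]))]
    gate_up = matmul(x, w1_dense)
    act = []
    for t, row in enumerate(gate_up):
        scaled = []
        for e in range(experts):
            base = e * 2 * inter
            post = alpha[t][e] / fc2_scalar  # zero for experts the token is not routed to
            scaled += [post * silu(row[base + i]) * row[base + inter + i] for i in range(inter)]
        act.append(scaled)
    # FC2: concatenate expert weights along K -> [E * I, H].
    w2_dense = [r for w2_e in w2 for r in w2_e]
    return [[fc2_scalar * v for v in row] for row in matmul(act, w2_dense)]


def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def val(*idx):
    """Deterministic small test values in [-0.75, 0.75]."""
    return ((sum((i + 1) * (k + 2) for k, i in enumerate(idx)) % 7) - 3) / 4.0


T, H, I, E = 2, 3, 2, 3
x = [[val(t, h) for h in range(H)] for t in range(T)]
w1 = [[[val(e, k, n) for n in range(2 * I)] for k in range(H)] for e in range(E)]
w2 = [[[val(e, i, h, 1) for h in range(H)] for i in range(I)] for e in range(E)]
alpha = [[0.6, 0.0, 0.4], [0.0, 0.7, 0.3]]  # top-2 routing weights, zero elsewhere

ref = moe_reference(x, w1, w2, alpha)
dense = moe_dense(x, w1, w2, alpha)
fused = moe_dense(x, w1, w2, alpha, alpha_max=max(max(r) for r in alpha))
```

The dense form does E times the arithmetic of a routed implementation, which is the trade the description makes for small batch sizes; the sketch only demonstrates numerical equivalence, not performance.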

Files added/modified

| Category | Files | Description |
| --- | --- | --- |
| CuTe DSL kernels | cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc1.py, fc2.py | Persistent tile-scheduled Blackwell GEMM kernels with SwiGLU fusion (FC1) and per-expert alpha scaling (FC2) |
| Custom ops | custom_ops/cute_dsl_custom_ops.py (+738 lines) | TunableRunner wrappers with autotuner integration for FC1/FC2 kernels |
| Backend module | fused_moe/fused_moe_densegemm.py (new) | DenseGEMMFusedMoE class: weight transpose, NVFP4 quantization, fc2_alpha fusion, CUDA stream overlap |
| Integration | create_moe.py, configurable_moe.py, llm_args.py, utils.py | Backend registration, DENSEGEMM option in MoeConfig, MoeFc2Alpha aux stream |
| Tests | test_moe_densegemm.py (564 cases), test_moe_backend.py, test_moe_module.py | FC1/FC2 kernel correctness tests + backend/module integration tests |
| Scripts | run_moe_as_dense_gemm_fc1.py, run_moe_as_dense_gemm_fc2.py | Standalone kernel runners for development and benchmarking |

Constraints

  • SM100/SM103 only — CuTe DSL kernels require Blackwell architecture. get_moe_cls() gracefully falls back to CutlassFusedMoE on other architectures.
  • NVFP4 only — requires quant_mode.has_nvfp4()
  • SwiGLU only — FC1 kernel hardcodes SwiGLU fusion. Non-SwiGLU activation_type is rejected at construction time.
  • intermediate_size must be 256-aligned — FC2 kernel tiles K with MMA tile size 256; expert boundaries must align with tile boundaries.
  • AllToAll not supported — DenseGEMM treats all experts as one matrix, incompatible with expert-parallel dispatch.
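
The constraints above can be read as a single eligibility check. The sketch below is illustrative only: `densegemm_skip_reasons` and its parameters are hypothetical and do not correspond to the actual `can_implement()` or `should_skip_densegemm()` signatures in the repository.

```python
def densegemm_skip_reasons(sm_version, has_nvfp4, activation, intermediate_size,
                           enable_alltoall):
    """Return the list of violated constraints; an empty list means eligible."""
    reasons = []
    if sm_version not in (100, 103):
        reasons.append(f"SM{sm_version} is not SM100/SM103")
    if not has_nvfp4:
        reasons.append("NVFP4 quantization is required")
    if activation != "swiglu":
        reasons.append(f"activation {activation!r} is not SwiGLU")
    if intermediate_size % 256 != 0:
        reasons.append(f"intermediate_size={intermediate_size} is not 256-aligned")
    if enable_alltoall:
        reasons.append("AllToAll dispatch is not supported")
    return reasons


eligible = densegemm_skip_reasons(100, True, "swiglu", 2048, False)
rejected = densegemm_skip_reasons(90, False, "gelu", 1000, True)
```

Returning the reasons rather than a bare boolean makes it straightforward to log why a fallback to another backend happened.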

Test Coverage

  • FC1 kernel unit tests: 384 parametrized cases (num_expert × weight_per_expert × num_tokens × hidden_size × alpha_post) in test_moe_densegemm.py
  • FC2 kernel unit tests: 180 parametrized cases (num_expert × weight_per_expert × num_tokens × output_hidden_size) in test_moe_densegemm.py
  • Backend-level tests via test_moe_backend.py (DenseGEMM added to parametrized matrix with skip guards)
  • Module-level tests via test_moe_module.py (ConfigurableMoE + DenseGEMM integration)

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@zongfeijing
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #37813 [ run ] triggered by Bot. Commit: edc2b8f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #37813 [ run ] completed with state SUCCESS. Commit: edc2b8f
/LLM/main/L0_MergeRequest_PR pipeline #29274 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@zongfeijing zongfeijing force-pushed the dev branch 3 times, most recently from 517f8fc to afed3ed, March 12, 2026 07:18
@zongfeijing zongfeijing marked this pull request as ready for review March 12, 2026 08:00
@zongfeijing zongfeijing requested review from a team as code owners March 12, 2026 08:00
@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38695 [ run ] triggered by Bot. Commit: afed3ed Link to invocation

@coderabbitai
Contributor

coderabbitai Bot commented Mar 12, 2026

📝 Walkthrough


This PR introduces Dense GEMM-based MoE support for NVFP4 quantization, adding Blackwell SM100 kernels with SwiGLU fusion (FC1) and FC2 dense projection paths, custom PyTorch operations, a new DenseGEMMFusedMoE backend class, and comprehensive testing infrastructure.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dense GEMM Kernel Implementation<br/>tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/__init__.py, tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py | Added new experimental batched dense block-scaled GEMM kernel for Blackwell SM100 with persistent-tile scheduling, TMA load/store, multi-warp coordination, and extensive epilogue handling for scale factors and alpha-scaling. Includes layout validation, stage computation, and helper methods for SMEM/TMEM/GMEM tensor partitioning. |
| Custom Ops & Wrappers<br/>tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py | Added CuteDSLNVFP4DenseGemmSwigluRunner and CuteDSLNVFP4DenseGemmFC2Runner wrapper classes with kernel caching, tuning, and forward execution. Exposed custom PyTorch ops trtllm::cute_dsl_nvfp4_dense_gemm_swiglu_blackwell (FC1 with SwiGLU) and trtllm::cute_dsl_nvfp4_dense_gemm_fc2_blackwell (FC2) with per-expert scaling and MoE parameter handling. |
| MoE Backend Integration<br/>tensorrt_llm/_torch/modules/fused_moe/fused_moe_densegemm.py, tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py, tensorrt_llm/_torch/modules/fused_moe/create_moe.py | Introduced DenseGEMMFusedMoE class (subclass of CutlassFusedMoE) with NVFP4 quantization paths, feature-flagged fused FC2 alpha generation, weight transformation for min-latency paths, and dual execution paths for fused vs. standard fc2_alpha handling. Integrated into backend selection logic with fallback support. |
| Configuration & API<br/>tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/_torch/utils.py | Added DENSEGEMM as new MoE backend option in MoeConfig.backend Literal type. Introduced MoeFc2Alpha enum member to AuxStreamType and EventType for auxiliary stream/event management. |
| Test Utilities<br/>tests/unittest/_torch/modules/moe/moe_test_utils.py, tests/unittest/_torch/modules/moe/test_moe_backend.py, tests/unittest/_torch/modules/moe/test_moe_module.py | Added MoeBackendType.DENSEGEMM enum member, should_skip_densegemm() validation function enforcing NVFP4 quantization and size alignment constraints, and test environment setup disabling fused fc2_alpha for backend-level testing. |
| Kernel Test Scripts<br/>tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc1.py, tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py | Added comprehensive example/benchmark scripts demonstrating kernel execution with SwiGLU fusion (FC1) and FC2 projection, including scale-factor tensor setup, optional FP4 quantization paths, reference computation workflows, and detailed numeric validation against simulated references. |
| Unit Tests<br/>tests/unittest/_torch/thop/parallel/test_moe_densegemm.py | Added parameterized unit tests for Dense GEMM SwiGLU (FC1) and FC2 kernels on SM100/103 GPUs, validating output shapes, dtypes, FP4 nibble-level accuracy, scale factor correctness with swizzle/unswizzle operations, and per-token-per-expert alpha scaling. |

Sequence Diagram

sequenceDiagram
    participant Input as Input Tensor
    participant Router as Router
    participant Quant as NVFP4 Quantizer
    participant FC1 as Dense GEMM FC1<br/>(SwiGLU Fusion)
    participant FC2Alpha as FC2 Alpha Gen<br/>(Optional Fused)
    participant FC2 as Dense GEMM FC2
    participant Output as Output Tensor

    Input->>Router: token logits
    Router->>Quant: routing decisions<br/>expert assignments
    Quant->>Quant: quantize input to NVFP4<br/>compute scales
    
    alt Fused FC2 Alpha Path
        Quant->>FC1: x, x_sf, weights<br/>alpha, weight_scales
        FC1->>FC1: SwiGLU fusion<br/>per-expert scaling
        FC1->>FC2Alpha: FC1 output<br/>expert scales
        FC2Alpha->>FC2Alpha: gen_fc2_alpha_fused<br/>normalized alpha
        FC2Alpha->>FC2: alpha_max (scalar)
    else Standard Path
        Quant->>FC1: x, x_sf, weights<br/>alpha, weight_scales
        FC1->>FC1: SwiGLU fusion<br/>per-expert scaling
        FC1->>FC2: FC1 output
        FC2->>FC2: per-token-per-expert<br/>alpha computation
    end
    
    FC2->>FC2: dense GEMM projection<br/>per-expert scaling
    FC2->>Output: final MoE output<br/>output_scale (FP4)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Suggested reviewers

  • QiJune
  • syuoni
  • yizhang-nv
  • leslie-fang25
  • StanleySun639
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.63% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a DenseGEMM backend for MoE. It follows the required format with [None][feat] prefix and is specific enough to understand the primary change. |
| Description check | ✅ Passed | PR description is well-structured with clear sections: Design, Files modified, Constraints, Test Coverage, and PR Checklist. All template sections are addressed with concrete details about implementation, architecture decisions, and testing strategy. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)

148-153: ⚠️ Potential issue | 🟠 Major

Add DenseGEMMFusedMoE to the load-balancer allowlist.

The new DenseGEMM branch forwards init_load_balancer, but the upfront assertion still excludes DenseGEMMFusedMoE. Any DenseGEMM call that reaches create_moe_backend() with MoE load balancing enabled will raise before construction, especially when ENABLE_CONFIGURABLE_MOE=0 or callers use create_moe_backend() directly.

🛠️ Suggested fix
     moe_load_balancer = get_moe_load_balancer()
     if moe_load_balancer is not None:
         assert moe_cls in [
             WideEPMoE, CutlassFusedMoE, TRTLLMGenFusedMoE, CuteDslFusedMoE,
-            DeepGemmFusedMoE
-        ], "MoE Load Balance is only supported in WideEPMoE, CutlassFusedMoE, TRTLLMGenFusedMoE, CuteDslFusedMoE, and DeepGemmFusedMoE."
+            DeepGemmFusedMoE, DenseGEMMFusedMoE
+        ], "MoE Load Balance is only supported in WideEPMoE, CutlassFusedMoE, TRTLLMGenFusedMoE, CuteDslFusedMoE, DeepGemmFusedMoE, and DenseGEMMFusedMoE."

Also applies to: 290-305

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 148 - 153,
The assertion that guards MoE load-balancer support excludes the new
DenseGEMMFusedMoE class, causing calls that forward init_load_balancer (e.g.,
via create_moe_backend when moe_load_balancer is set) to raise; update the
allowlist in the assertion that references WideEPMoE, CutlassFusedMoE,
TRTLLMGenFusedMoE, CuteDslFusedMoE, DeepGemmFusedMoE to also include
DenseGEMMFusedMoE, and make the same change for the second identical assertion
later in the file so DenseGEMMFusedMoE is permitted wherever load balancing is
checked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py`:
- Around line 3035-3042: The cache key tuple used when caching DenseGEMM kernels
(the variable named cache_key built from self.weight_per_expert, mma_tiler_mn,
cluster_shape_mn, self.scaling_vector_size, self.expert_count, and the
alpha_post flag) must include the output element type; add output_dtype (or the
name used in the method that represents the requested output element type) into
that tuple so BF16-compiled kernels cannot be reused for FP16/FP32/FP4 calls.
Apply the same change to the other occurrence that builds a DenseGEMM cache_key
(the second cache_key construction in the class) so both cache keys include
output_dtype.
- Around line 2895-2900: The mapping that sets c_cutlass_dtype currently uses
.get(self.output_dtype, cutlass.BFloat16) and silently falls back to BF16;
change this to explicitly validate self.output_dtype and raise a clear exception
(e.g., ValueError) when the dtype is unsupported instead of defaulting to
cutlass.BFloat16. Locate the assignments to c_cutlass_dtype in the DenseGEMM
runners (the tactic-probing and forward paths for the FC1 and FC2 runners) where
the mapping dict is used and replace the .get default with an explicit
lookup+error path that names self.output_dtype and the allowed dtypes. Apply the
same fix to the other occurrences referenced in the review (the FC1/FC2
tactic-probing and forward blocks) so both probing and runtime use fail-fast
behavior.
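
Both findings for this file can be illustrated with a small sketch. It uses plain strings as stand-ins for the cutlass dtype objects, and `resolve_output_dtype` and `KernelCache` are hypothetical names, not the runner classes in `cute_dsl_custom_ops.py`. The point is only that the requested output dtype is validated explicitly and participates in the cache key.

```python
# Stand-in mapping; the real code maps torch dtypes to cutlass dtypes (assumption).
_OUTPUT_DTYPE_NAMES = {"bfloat16": "BFloat16", "float16": "Float16", "float32": "Float32"}


def resolve_output_dtype(output_dtype):
    """Fail fast on unsupported dtypes instead of silently defaulting to BF16."""
    if output_dtype not in _OUTPUT_DTYPE_NAMES:
        allowed = ", ".join(sorted(_OUTPUT_DTYPE_NAMES))
        raise ValueError(f"unsupported output_dtype {output_dtype!r}; expected one of: {allowed}")
    return _OUTPUT_DTYPE_NAMES[output_dtype]


class KernelCache:
    """Caches one placeholder 'kernel' per full configuration, dtype included."""

    def __init__(self):
        self._kernels = {}
        self.compile_count = 0

    def get(self, weight_per_expert, mma_tiler_mn, cluster_shape_mn, expert_count, output_dtype):
        target = resolve_output_dtype(output_dtype)
        # output_dtype is part of the key, so a BF16 kernel is never reused for FP16.
        key = (weight_per_expert, mma_tiler_mn, cluster_shape_mn, expert_count, output_dtype)
        if key not in self._kernels:
            self.compile_count += 1
            self._kernels[key] = f"kernel<{target}>"  # placeholder for a compiled kernel
        return self._kernels[key]


cache = KernelCache()
bf16_kernel = cache.get(512, (128, 128), (1, 1), 8, "bfloat16")
fp16_kernel = cache.get(512, (128, 128), (1, 1), 8, "float16")
bf16_again = cache.get(512, (128, 128), (1, 1), 8, "bfloat16")
```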

In `@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py`:
- Around line 2288-2302: The swizzled SF layouts use floor division (m // 128, n
// 128) which drops partial 128-wide tiles and can produce zero-sized tensors;
replace those with ceil-div tile counts (e.g. compute m_tiles = (m + 127)//128
and n_tiles = (n + 127)//128) and use m_tiles/n_tiles in the calls to
cute.make_ordered_layout for a_sf and b_sf (or add explicit 128-tile guards in
the validation path), so the layouts always account for partial tiles and never
produce zero-length dimensions when m<128 or n not divisible by 128.
- Around line 1-27: The file
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py
currently has a BSD-3-Clause header; replace that entire BSD header block with
the repository-standard NVIDIA Apache-2.0 license header using the latest
modification year (2026) and include any necessary upstream attribution
separately (do not mix BSD text into the file header). Ensure the new header
matches the project template used across *.py sources (Apache-2.0 text and
NVIDIA copyright line), and leave the rest of fc2.py unchanged.
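
The ceil-div tile count suggested for the swizzled scale-factor layouts can be sketched as follows. `ceil_div` and `sf_tile_counts` are hypothetical helper names; the 128 tile width is taken from the finding above.

```python
def ceil_div(a, b):
    """Integer division rounded up, for positive b."""
    return (a + b - 1) // b


def sf_tile_counts(m, n, tile=128):
    """Number of 128-wide tiles needed to cover m and n, counting partial tiles."""
    return ceil_div(m, tile), ceil_div(n, tile)


floor_counts = (64 // 128, 200 // 128)  # partial tiles are dropped
ceil_counts = sf_tile_counts(64, 200)   # partial tiles are counted
```

With floor division an m of 64 yields zero tiles and therefore a zero-sized dimension, which is the failure mode the review describes.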

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 1095-1100: The dispatch logic in configurable_moe.py fails to pass
backend-specific kwargs to DenseGEMMFusedMoE because the earlier tuple check
(CutlassFusedMoE, DeepGemmFusedMoE, CuteDslFusedMoE, DenseGEMMFusedMoE) is a
no-op and later branches use exact class equality (self.backend.__class__ ==) so
DenseGEMMFusedMoE is skipped; fix by either adding an explicit branch for
DenseGEMMFusedMoE that passes enable_alltoall and moe_output (e.g., elif
self.backend.__class__ == DenseGEMMFusedMoE: ...) or, preferably, change the
equality checks to isinstance(self.backend, CutlassFusedMoE) /
isinstance(self.backend, DenseGEMMFusedMoE) (or a common base) so inherited
classes receive the correct kwargs in the dispatch that sets enable_alltoall and
moe_output for the backend.
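
The difference between exact class equality and `isinstance` in this dispatch can be shown with stub classes. The stubs below only mirror the inheritance described at the time of this review (DenseGEMMFusedMoE as a subclass of CutlassFusedMoE); the function names and the kwargs dictionary are hypothetical.

```python
class CutlassFusedMoE:
    pass


class DenseGEMMFusedMoE(CutlassFusedMoE):
    pass


def kwargs_with_exact_match(backend, enable_alltoall, moe_output):
    """Exact class comparison: subclasses fall through and get no kwargs."""
    if backend.__class__ == CutlassFusedMoE:
        return {"enable_alltoall": enable_alltoall, "moe_output": moe_output}
    return {}


def kwargs_with_isinstance(backend, enable_alltoall, moe_output):
    """isinstance comparison: subclasses receive the same kwargs as the base."""
    if isinstance(backend, CutlassFusedMoE):
        return {"enable_alltoall": enable_alltoall, "moe_output": moe_output}
    return {}


dense_backend = DenseGEMMFusedMoE()
skipped = kwargs_with_exact_match(dense_backend, False, None)
passed = kwargs_with_isinstance(dense_backend, False, None)
```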

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_densegemm.py`:
- Around line 67-75: The fallback to compute expert_size as
token_final_scales.shape[1] when alpha is None is wrong because that is top_k
not total experts and causes scatter_ indexing OOB; change the logic in the
block that builds fc2_alpha (around token_selected_experts, alpha,
token_final_scales, fc2_alpha, scatter_) so expert_size is the actual expert
count (either use an explicit expert_count argument or compute expert_count =
int(token_selected_experts.max().item()) + 1) when alpha is None, allocate
fc2_alpha with that expert_size, and then perform the scatter_. Ensure
token_selected_experts.long() still indexes into the resized fc2_alpha.
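
The out-of-bounds concern can be reproduced without torch. The sketch below is a pure-Python analogue of the `scatter_` step under the assumption that `fc2_alpha` is a dense [num_tokens, expert_count] table; `build_fc2_alpha` is a hypothetical name.

```python
def build_fc2_alpha(token_selected_experts, token_final_scales, expert_count):
    """Scatter per-token top-k scales into a dense [num_tokens, expert_count] table."""
    fc2_alpha = [[0.0] * expert_count for _ in token_selected_experts]
    for t, (experts, scales) in enumerate(zip(token_selected_experts, token_final_scales)):
        for expert_id, scale in zip(experts, scales):
            fc2_alpha[t][expert_id] = scale  # IndexError if expert_id >= expert_count
    return fc2_alpha


selected = [[5, 2], [7, 0]]        # top_k = 2 expert ids per token, 8 experts total
scales = [[0.6, 0.4], [0.7, 0.3]]
top_k = len(selected[0])

table = build_fc2_alpha(selected, scales, expert_count=8)

try:
    build_fc2_alpha(selected, scales, expert_count=top_k)  # width 2 is too narrow
    width_error = None
except IndexError as exc:
    width_error = exc
```

Sizing the table by `top_k` only works when every selected expert id happens to be smaller than `top_k`, which is why the review asks for the real expert count.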

In
`@tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc1.py`:
- Around line 180-182: The code computes weight_per_expert = n // expert_count
but does not validate divisibility, causing silent truncation (e.g.,
mnkl=(512,256,256,1) with expert_count=257 yields 0); update the logic in this
module to first assert or raise a clear error if n % expert_count != 0 and
choose consistent default values so defaults are divisible by expert_count
(adjust mnkl and/or expert_count), and apply the same validation to the other
occurrence of the calculation (the block around the second occurrence computing
weight_per_expert at lines referenced in the review); reference the variables
weight_per_expert, n, expert_count and the default mnkl tuple when making these
changes.
- Around line 1-27: The file run_moe_as_dense_gemm_fc1.py contains an incorrect
BSD-3-Clause header and a 2025 copyright year; replace the existing top-of-file
license block with the repository's standard NVIDIA Apache-2.0 Python header and
update the copyright year to 2026 (the latest meaningful modification), ensuring
the new header sits at the very top of run_moe_as_dense_gemm_fc1.py before any
imports or code so it matches the repo policy for .py sources.
- Around line 59-65: The fallback import in the try/except block around
importing tensorrt_llm._torch.cute_dsl_kernels.blackwell.moe_as_dense_gemm (the
code that inserts into sys.path before importing blackwell.moe_as_dense_gemm as
kernel_module) uses Path(__file__).parents[3], which points to tests/, so change
the inserted path computation to use Path(__file__).parents[4] so the
sys.path.insert(0, str(...)) points to the repository
root/tensorrt_llm/_torch/cute_dsl_kernels and the fallback import succeeds.
- Around line 947-952: The parser arguments for the flags "--vectorized_f32" and
"--use_cupti" use action="store_true" together with default=True so they always
parse to True; update their parser.add_argument calls so the flags can be
disabled at the CLI: either set default=False when using action="store_true" (so
passing the flag enables the feature) or add complementary flags using
action="store_false" (e.g., "--no-vectorized_f32" / "--no-use_cupti") with
default=True to allow turning them off; modify the parser.add_argument calls for
the entries that reference "--vectorized_f32" and "--use_cupti" accordingly.
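
Two of the findings for this script lend themselves to a short sketch: boolean flags that can actually be turned off, and a divisibility check before computing `weight_per_expert`. This is one possible shape, assuming Python 3.9+ for `argparse.BooleanOptionalAction`; `build_parser` and `split_weight_per_expert` are hypothetical names and not the script's actual structure.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # BooleanOptionalAction also registers --no-vectorized_f32 and --no-use_cupti,
    # so a default of True can be overridden from the command line.
    parser.add_argument("--vectorized_f32", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--use_cupti", action=argparse.BooleanOptionalAction, default=True)
    return parser


def build_broken_parser():
    parser = argparse.ArgumentParser()
    # store_true with default=True: the value is True whether or not the flag is given.
    parser.add_argument("--vectorized_f32", action="store_true", default=True)
    return parser


def split_weight_per_expert(n, expert_count):
    """Reject shapes where n does not divide evenly across experts."""
    if expert_count <= 0 or n % expert_count != 0:
        raise ValueError(f"n={n} must be a positive multiple of expert_count={expert_count}")
    return n // expert_count


args = build_parser().parse_args(["--no-use_cupti"])
broken_args = build_broken_parser().parse_args([])
```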

In
`@tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py`:
- Around line 1-27: Replace the current BSD-3-Clause header in
run_moe_as_dense_gemm_fc2.py with the repository-standard NVIDIA Apache-2.0
header (using the latest modification year), or if the BSD notice must be
preserved, prepend the official Apache-2.0 header above the existing BSD block;
ensure the file begins with the full Apache-2.0 license text and the NVIDIA
copyright line per project guidelines.
- Around line 62-68: The fallback sys.path insertion uses
Path(__file__).parents[3] which points to tests/ (one directory too shallow)
causing the fallback import of blackwell.moe_as_dense_gemm.fc2 to fail; update
the fallback to insert the correct parent directory by changing
Path(__file__).parents[3] to Path(__file__).parents[4] (so the inserted path is
.../tensorrt_llm/_torch/cute_dsl_kernels) before importing kernel_module,
ensuring the direct python run_moe_as_dense_gemm_fc2.py flow can locate the
module.
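
The off-by-one in the `parents[...]` index is easy to confirm with `pathlib`. The sketch assumes a hypothetical checkout at `/repo` and uses `PurePosixPath` so that it does not touch the filesystem.

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; only the relative layout comes from the PR.
script = PurePosixPath(
    "/repo/tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py"
)

too_shallow = script.parents[3]  # .../tests, one level short of the repo root
repo_root = script.parents[4]    # the repository root
kernel_dir = repo_root / "tensorrt_llm" / "_torch" / "cute_dsl_kernels"
```

`parents[0]` is the script's own directory, so each increment walks up exactly one level.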

In `@tests/unittest/_torch/modules/moe/test_moe_backend.py`:
- Around line 471-473: The test mutates process-wide env var
TRTLLM_MOE_FUSED_FC2_ALPHA when backend_type == MoeBackendType.DENSEGEMM and
never restores it; change the code to isolate this override by using pytest's
monkeypatch (e.g., monkeypatch.setenv("TRTLLM_MOE_FUSED_FC2_ALPHA","0") within
the test or fixture that sets backend_type) or save the original os.environ
value and restore it in a finally/teardown block so the env is returned to its
prior state after the parametrized case; update the code around the backend_type
check (the branch referencing MoeBackendType.DENSEGEMM and
TRTLLM_MOE_FUSED_FC2_ALPHA) to use monkeypatch or explicit restore.

In `@tests/unittest/_torch/modules/moe/test_moe_module.py`:
- Around line 1059-1061: The test currently sets
os.environ["TRTLLM_MOE_FUSED_FC2_ALPHA"]="0" unconditionally which leaks into
the pytest process; replace that with the pytest monkeypatch fixture to localize
the change: when checking moe_backend == MoeBackendType.DENSEGEMM.value, call
monkeypatch.setenv("TRTLLM_MOE_FUSED_FC2_ALPHA", "0") instead of writing to
os.environ so the environment is restored after the test (reference
TRTLLM_MOE_FUSED_FC2_ALPHA, moe_backend, and MoeBackendType.DENSEGEMM.value).
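
The review recommends pytest's `monkeypatch.setenv`. As a standard-library illustration of the same scoping idea, the sketch below uses `unittest.mock.patch.dict` instead; this is a swapped-in technique for demonstration, not what the tests in this PR use, and `read_flag` and `run_with_flag_disabled` are hypothetical helpers.

```python
import os
from unittest import mock

ENV_NAME = "TRTLLM_MOE_FUSED_FC2_ALPHA"


def read_flag():
    """Return the current value of the env var, or None if unset."""
    return os.environ.get(ENV_NAME)


def run_with_flag_disabled(fn):
    """Call fn with the env var forced to "0", then restore the previous state."""
    with mock.patch.dict(os.environ, {ENV_NAME: "0"}):
        return fn()


value_before = read_flag()
value_inside = run_with_flag_disabled(read_flag)
value_after = read_flag()
```

Either approach keeps the override from leaking into later parametrized cases that run in the same process.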

In `@tests/unittest/_torch/thop/parallel/test_moe_densegemm.py`:
- Around line 1-2: The file test_moe_densegemm.py currently contains only an
SPDX short-form header; replace it with the repository’s full NVIDIA Apache-2.0
header block (include the full multi-line copyright/Apache 2.0 license text and
the latest modification year, e.g., 2026) so the top of test_moe_densegemm.py
matches the standard header used across Python sources in the repo.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0fcf4776-2a01-4cd1-a2ac-f6f955ae18ad

📥 Commits

Reviewing files that changed from the base of the PR and between a4e6745 and afed3ed.

📒 Files selected for processing (15)
  • tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/__init__.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc1.py
  • tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py
  • tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/create_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/fused_moe_densegemm.py
  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc1.py
  • tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py
  • tests/unittest/_torch/modules/moe/moe_test_utils.py
  • tests/unittest/_torch/modules/moe/test_moe_backend.py
  • tests/unittest/_torch/modules/moe/test_moe_module.py
  • tests/unittest/_torch/thop/parallel/test_moe_densegemm.py

@tensorrt-cicd
Collaborator

PR_Github #38695 [ run ] completed with state SUCCESS. Commit: afed3ed
/LLM/main/L0_MergeRequest_PR pipeline #30014 completed with status: 'SUCCESS'

CI Report

Link to invocation

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39108 [ run ] triggered by Bot. Commit: 6a25da6 Link to invocation

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@zongfeijing zongfeijing removed the request for review from hchings March 16, 2026 17:09
Add DenseGEMMFusedMoE to the ConfigurableMoE supported backend list
so it follows the same composition-based execution flow as other
backends (routing, quantization, communication, computation).

Signed-off-by: Zongfei Jing <[email protected]>
Add the 2CTA MMA tile (256,256) with cluster shape (2,1) to the
FC1 SwiGLU kernel autotuner candidates. Benchmarks show this config
is optimal for M=144~256 and M=400~512 on B200.

Signed-off-by: Zongfei Jing <[email protected]>
- Add assert for alpha!=None in gen_fc2_alpha_fused fallback path
- Add DenseGEMMFusedMoE to MoE Load Balancer supported list
- Reduce load_weights peak memory by replacing clone() with
  transpose().contiguous()

Signed-off-by: Zongfei Jing <[email protected]>
- Add SM100/103 capability check in get_moe_cls() with graceful fallback
  to CutlassFusedMoE, and assert in DenseGEMMFusedMoE.__init__()
- Accept and validate activation_type parameter to reject non-SwiGLU
  activations at construction time instead of silently using wrong semantics
- Add output_dtype to FC1/FC2 kernel cache keys to prevent cache collisions
  when called with different output types
- Strip {$nv-internal-release} markers from fc1.py, fc2.py, and test scripts
- Fix misleading load_weights comment about peak memory behavior
- Remove stray print() in test_moe_densegemm.py

Signed-off-by: Zongfei Jing <[email protected]>
…type fix

- Replace silent .get(dtype, BFloat16) fallback with explicit validation
  and raise ValueError for unsupported output_dtype in FC1/FC2 runners
- Extract dtype mapping as class-level _CUTLASS_DTYPE_MAP constant to
  eliminate 4x duplication across get_valid_tactics()/forward() methods
- Fix activation_type assertion: accept parameter in DenseGEMMFusedMoE
  __init__ and pass through create_moe_backend() so non-SwiGLU requests
  are properly rejected instead of silently defaulting to SwiGLU

Signed-off-by: Zongfei Jing <[email protected]>
Replace BSD-3-Clause headers with Apache-2.0 to match the rest of the
cute_dsl_kernels/blackwell/ directory convention.

Signed-off-by: Zongfei Jing <[email protected]>
Add DENSEGEMM entries to l0_b200.yml for test_moe_backend and
test_configurable_moe_single_gpu alongside existing CUTLASS/TRTLLM/
CUTEDSL/DEEPGEMM entries.

Signed-off-by: Zongfei Jing <[email protected]>
The FC1 kernel autotuner was using get_last_power_of_2_num_tokens_buckets
which only generated power-of-2 M values (1,2,4,...,256), missing optimal
configs for non-power-of-2 token counts. Switch to deep_gemm_gen_tuning_buckets
(step-8 for M<128, step-128 for M>=128) and increase tune_max_num_tokens
from 256 to 512 to cover the full operating range. Also consolidate
deep_gemm_gen_tuning_buckets into utils.py as a shared utility.

Signed-off-by: Zongfei Jiang <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
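
To make the bucket change in the commit above concrete, here is a rough sketch of a step-8 / step-128 bucket generator compared with power-of-2 buckets. It is an assumption-laden illustration: `gen_tuning_buckets` is a hypothetical function, and the exact start value and boundary handling of the repository's `deep_gemm_gen_tuning_buckets` are not shown in this PR page.

```python
def gen_tuning_buckets(max_num_tokens, fine_step=8, coarse_step=128, boundary=128):
    """Step-8 buckets below the boundary, step-128 buckets from the boundary upward."""
    fine = list(range(fine_step, min(boundary, max_num_tokens + 1), fine_step))
    coarse = list(range(boundary, max_num_tokens + 1, coarse_step))
    return fine + coarse


def power_of_2_buckets(max_num_tokens):
    """1, 2, 4, ... up to and including max_num_tokens if it is a power of two."""
    buckets, m = [], 1
    while m <= max_num_tokens:
        buckets.append(m)
        m *= 2
    return buckets


old_buckets = power_of_2_buckets(256)
new_buckets = gen_tuning_buckets(512)
only_in_new = [m for m in new_buckets if m not in old_buckets]
```

Under these assumptions the new list covers token counts such as 144 or 384 directly, which the power-of-2 list up to 256 never tuned for.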
- Add can_implement() to DenseGEMMFusedMoE to accurately report backend
  capabilities (SM100/103, NVFP4-only, no swiglu_gptoss_style), instead
  of inheriting the overly permissive CutlassFusedMoE implementation.
- Replace magic seed number 1111 with named DEFAULT_RANDOM_SEED constant
  in both FC1 and FC2 test scripts.
- Move nested helper functions out of run() to module level in FC1 test
  script (simulate_f8_quantization, simulate_nvfp4_quantization,
  compute_scale_factor, apply_quantization_scale, unswizzle_kernel_sfc,
  ceil_div). Remove redundant local ceil_div definitions in both scripts.
- Fix fallback import path in both FC1 and FC2 test scripts: parents[3]
  pointed to tests/ instead of repo root. Changed to parents[4].

Signed-off-by: Zongfei Jiang <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
The intermediate_size >= 14336 skip was conservatively copied from the
CuteDSL/TRTLLMGen backends, but DenseGEMM does not have the same FP4
error accumulation issue. Verified Mixtral config (e=8, k=2, h=4096,
i=14336) passes at both seq_len=1 and seq_len=8.

Signed-off-by: Zongfei Jiang <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
The fused fc2_alpha path (TRTLLM_MOE_FUSED_FC2_ALPHA) has a known
accuracy issue under TP where the scalar fc2_alpha_max gets summed
tp_size times during ReduceScatter. Change the default from enabled
to disabled so the non-fused per-expert path is used, which correctly
factors out of the TP reduction.

Also add should_skip_densegemm to the multi-GPU test parameter
generation so DenseGEMM correctly skips EP modes (DEP/TEP) and
validates TP alignment constraints.

Signed-off-by: Zongfei Jing <[email protected]>
…tlassFusedMoE

DenseGEMMFusedMoE was inheriting from CutlassFusedMoE but overriding most of
its core methods while only reusing a few (create_weights, forward_impl,
load_weights). This tight coupling was misleading since the two backends have
fundamentally different architectures: CutlassFusedMoE uses per-expert
scattered GEMM with alltoall support, while DenseGEMM packs all experts into
a single dense matrix for min-latency scenarios (NVFP4, SM100/103 only).

Changes:
- DenseGEMMFusedMoE now inherits from MoE base class directly
- Implements its own create_weights, load_weights, forward_impl, and
  _get_quant_method independently
- Simplified forward_impl without chunking/alltoall logic
- Added isinstance check for DenseGEMMFusedMoE in ConfigurableMoE's
  float32 assertion for token_final_scales
- Fixed env var leak in test_moe_backend.py by using monkeypatch

Signed-off-by: Zongfei Jing <[email protected]>
@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40899 [ run ] triggered by Bot. Commit: e4c5a29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40899 [ run ] completed with state SUCCESS. Commit: e4c5a29
/LLM/main/L0_MergeRequest_PR pipeline #31900 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41037 [ run ] triggered by Bot. Commit: e4c5a29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41037 [ run ] completed with state FAILURE. Commit: e4c5a29
/LLM/main/L0_MergeRequest_PR pipeline #32016 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41062 [ run ] triggered by Bot. Commit: e4c5a29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41062 [ run ] completed with state FAILURE. Commit: e4c5a29
/LLM/main/L0_MergeRequest_PR pipeline #32038 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Collaborator

@QiJune QiJune left a comment


LGTM for the LLM change

@zongfeijing
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41128 [ run ] triggered by Bot. Commit: e4c5a29 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41128 [ run ] completed with state SUCCESS. Commit: e4c5a29
/LLM/main/L0_MergeRequest_PR pipeline #32099 completed with status: 'SUCCESS'

CI Report

Link to invocation

@zongfeijing zongfeijing merged commit 596f57a into NVIDIA:main Apr 1, 2026
5 checks passed
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
6 participants