[None] [feat] Add densegemm backend for MoE by zongfeijing · Pull Request #10479 · NVIDIA/TensorRT-LLM

zongfeijing · 2026-01-07T05:16:16Z

Description

Add a new DenseGEMM backend for MoE that reshapes all expert weights into a single dense matrix and performs one large GEMM call per FC layer, targeting minimum latency on Blackwell (SM100/SM103) with NVFP4 quantization.

Key design

Instead of the traditional per-expert grouped GEMM approach, DenseGEMM concatenates all expert weights along the N (FC1) or K (FC2) dimension and executes a single dense GEMM. This trades flexibility for maximum GPU utilization at small batch sizes where per-expert GEMMs underutilize SMs.
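The equivalence behind this design can be checked with a small pure-Python sketch. It is an illustration only, not the PR's kernel: the helper names (`moe_per_expert`, `moe_dense`), the tiny shapes, and the use of a plain elementwise SiLU in place of the fused SwiGLU epilogue are all assumptions, and quantization is ignored. Routing weights are modelled as a per-token-per-expert `alpha` matrix that is zero for unselected experts.

```python
import math
import random


def matmul(a, b):
    """Row-major product of two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(v):
    """Elementwise stand-in for the fused SwiGLU activation (assumption)."""
    return v / (1.0 + math.exp(-v))


def moe_per_expert(x, w1, w2, alpha):
    """Reference: one small GEMM pair per expert, weighted by alpha[t][e]."""
    tokens, hidden_out = len(x), len(w2[0][0])
    out = [[0.0] * hidden_out for _ in range(tokens)]
    for e in range(len(w1)):
        h = [[silu(v) for v in row] for row in matmul(x, w1[e])]
        y = matmul(h, w2[e])
        for t in range(tokens):
            for j in range(hidden_out):
                out[t][j] += alpha[t][e] * y[t][j]
    return out


def moe_dense(x, w1, w2, alpha):
    """Dense form: FC1 weights concatenated along N, FC2 weights along K."""
    inter = len(w1[0][0])
    hidden = len(x[0])
    # FC1: [hidden, experts * inter], expert blocks laid side by side along N.
    w1_dense = [[v for expert in w1 for v in expert[k]] for k in range(hidden)]
    # FC2: [experts * inter, hidden], expert blocks stacked along K.
    w2_dense = [row for expert in w2 for row in expert]
    h = [[silu(v) for v in row] for row in matmul(x, w1_dense)]
    # Scale each expert's column block by that token's routing weight.
    h = [[v * alpha[t][c // inter] for c, v in enumerate(row)] for t, row in enumerate(h)]
    return matmul(h, w2_dense)


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
TOKENS, HIDDEN, INTER, EXPERTS = 3, 4, 2, 3
x = rand_matrix(TOKENS, HIDDEN)
w1 = [rand_matrix(HIDDEN, INTER) for _ in range(EXPERTS)]
w2 = [rand_matrix(INTER, HIDDEN) for _ in range(EXPERTS)]
alpha = [[0.7, 0.0, 0.3], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]

reference = moe_per_expert(x, w1, w2, alpha)
dense = moe_dense(x, w1, w2, alpha)
max_err = max(abs(r - d) for rr, dd in zip(reference, dense) for r, d in zip(rr, dd))
```

The dense path also computes the blocks whose `alpha` is zero, which is the flexibility-for-utilization trade described above: wasted work is acceptable when the per-expert GEMMs would be too small to fill the GPU.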

FC1 kernel: Fuses GEMM + SwiGLU activation + FP4 output quantization (with SFC generation) into a single CuTe DSL kernel epilogue
FC2 kernel: Fuses GEMM + per-token-per-expert alpha scaling into the epilogue, accumulating expert contributions in a single pass
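fc2_alpha fusion optimization: Normalizes fc2_alpha into FC1's alpha_post, reducing FC2 to a simple scalar-alpha nvfp4_gemm call (controlled by the TRTLLM_MOE_FUSED_FC2_ALPHA env var; enabled by default in the initial revision, changed to disabled by a later commit in this PR because of a TP accuracy issue)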
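One plausible reading of the fc2_alpha fusion is a simple factorization: each per-token-per-expert scale is split into a scalar maximum and a normalized remainder, so the remainder can ride along in FC1's `alpha_post` and FC2 only needs the scalar. The sketch below assumes a single global maximum and ignores FP4 quantization; `split_fc2_alpha` is a hypothetical helper, not a function from the PR.

```python
def split_fc2_alpha(fc2_alpha):
    """Split per-token-per-expert scales into (normalized matrix, scalar max).

    Assumption: the scalar is the global maximum over all entries.
    """
    alpha_max = max(max(row) for row in fc2_alpha)
    if alpha_max <= 0.0:
        raise ValueError("fc2_alpha needs at least one positive entry")
    alpha_post = [[a / alpha_max for a in row] for row in fc2_alpha]
    return alpha_post, alpha_max


fc2_alpha = [[0.7, 0.0, 0.3], [0.0, 1.2, 0.0]]
alpha_post, alpha_max = split_fc2_alpha(fc2_alpha)

# Multiplying the two factors back together recovers the original scales.
recombined = [[p * alpha_max for p in row] for row in alpha_post]
```

With this split, every entry of `alpha_post` lies in [0, 1], and the FC2 call needs only `alpha_max`.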

Files added/modified

| Category | Files | Description |
| --- | --- | --- |
| CuTe DSL kernels | `cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc1.py`, `fc2.py` | Persistent tile-scheduled Blackwell GEMM kernels with SwiGLU fusion (FC1) and per-expert alpha scaling (FC2) |
| Custom ops | `custom_ops/cute_dsl_custom_ops.py` (+738 lines) | `TunableRunner` wrappers with autotuner integration for FC1/FC2 kernels |
| Backend module | `fused_moe/fused_moe_densegemm.py` (new) | `DenseGEMMFusedMoE` class: weight transpose, NVFP4 quantization, fc2_alpha fusion, CUDA stream overlap |
| Integration | `create_moe.py`, `configurable_moe.py`, `llm_args.py`, `utils.py` | Backend registration, `DENSEGEMM` option in `MoeConfig`, `MoeFc2Alpha` aux stream |
| Tests | `test_moe_densegemm.py` (564 cases), `test_moe_backend.py`, `test_moe_module.py` | FC1/FC2 kernel correctness tests + backend/module integration tests |
| Scripts | `run_moe_as_dense_gemm_fc1.py`, `run_moe_as_dense_gemm_fc2.py` | Standalone kernel runners for development and benchmarking |

Constraints

SM100/SM103 only — CuTe DSL kernels require Blackwell architecture. get_moe_cls() gracefully falls back to CutlassFusedMoE on other architectures.
NVFP4 only — requires quant_mode.has_nvfp4()
SwiGLU only — FC1 kernel hardcodes SwiGLU fusion. Non-SwiGLU activation_type is rejected at construction time.
intermediate_size must be 256-aligned — FC2 kernel tiles K with MMA tile size 256; expert boundaries must align with tile boundaries.
AllToAll not supported — DenseGEMM treats all experts as one matrix, incompatible with expert-parallel dispatch.
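The constraints above amount to a handful of cheap checks that can run before the backend is constructed. The sketch below is a hypothetical pre-flight helper, not the PR's `can_implement()` or `get_moe_cls()` logic; the `DenseGemmRequest` fields and the integer encoding of the SM version are assumptions made for illustration.

```python
from dataclasses import dataclass

FC2_K_TILE = 256  # MMA tile size along K quoted in the constraint list


@dataclass
class DenseGemmRequest:
    """Hypothetical bundle of the properties the constraints mention."""
    sm_version: int          # e.g. 100 for SM100, 103 for SM103
    has_nvfp4: bool
    activation: str          # "swiglu" is the only supported value
    intermediate_size: int
    enable_alltoall: bool


def densegemm_unsupported_reasons(req):
    """Return human-readable reasons the request cannot use DenseGEMM."""
    reasons = []
    if req.sm_version not in (100, 103):
        reasons.append("requires SM100/SM103")
    if not req.has_nvfp4:
        reasons.append("requires NVFP4 quantization")
    if req.activation != "swiglu":
        reasons.append("only SwiGLU activation is supported")
    if req.intermediate_size % FC2_K_TILE != 0:
        reasons.append(f"intermediate_size must be a multiple of {FC2_K_TILE}")
    if req.enable_alltoall:
        reasons.append("AllToAll is not supported")
    return reasons
```

An empty list means the request satisfies every listed constraint; a caller could fall back to another backend otherwise.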

Test Coverage

FC1 kernel unit tests: 384 parametrized cases (num_expert × weight_per_expert × num_tokens × hidden_size × alpha_post) in test_moe_densegemm.py
FC2 kernel unit tests: 180 parametrized cases (num_expert × weight_per_expert × num_tokens × output_hidden_size) in test_moe_densegemm.py
Backend-level tests via test_moe_backend.py (DenseGEMM added to parametrized matrix with skip guards)
Module-level tests via test_moe_module.py (ConfigurableMoE + DenseGEMM integration)

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

zongfeijing · 2026-03-05T05:45:18Z

/bot run

tensorrt-cicd · 2026-03-05T05:55:57Z

PR_Github #37813 [ run ] triggered by Bot. Commit: edc2b8f Link to invocation

tensorrt-cicd · 2026-03-05T07:46:59Z

PR_Github #37813 [ run ] completed with state SUCCESS. Commit: edc2b8f
/LLM/main/L0_MergeRequest_PR pipeline #29274 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

zongfeijing · 2026-03-12T08:01:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-12T08:06:43Z

PR_Github #38695 [ run ] triggered by Bot. Commit: afed3ed Link to invocation

coderabbitai · 2026-03-12T08:12:54Z

📝 Walkthrough

Walkthrough

This PR introduces Dense GEMM-based MoE support for NVFP4 quantization, adding Blackwell SM100 kernels with SwiGLU fusion (FC1) and FC2 dense projection paths, custom PyTorch operations, a new DenseGEMMFusedMoE backend class, and comprehensive testing infrastructure.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dense GEMM Kernel Implementation: `tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/__init__.py`, `tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py` | Added new experimental batched dense block-scaled GEMM kernel for Blackwell SM100 with persistent-tile scheduling, TMA load/store, multi-warp coordination, and extensive epilogue handling for scale factors and alpha-scaling. Includes layout validation, stage computation, and helper methods for SMEM/TMEM/GMEM tensor partitioning. |
| Custom Ops & Wrappers: `tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py` | Added `CuteDSLNVFP4DenseGemmSwigluRunner` and `CuteDSLNVFP4DenseGemmFC2Runner` wrapper classes with kernel caching, tuning, and forward execution. Exposed custom PyTorch ops `trtllm::cute_dsl_nvfp4_dense_gemm_swiglu_blackwell` (FC1 with SwiGLU) and `trtllm::cute_dsl_nvfp4_dense_gemm_fc2_blackwell` (FC2) with per-expert scaling and MoE parameter handling. |
| MoE Backend Integration: `tensorrt_llm/_torch/modules/fused_moe/fused_moe_densegemm.py`, `tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`, `tensorrt_llm/_torch/modules/fused_moe/create_moe.py` | Introduced `DenseGEMMFusedMoE` class (subclass of `CutlassFusedMoE`) with NVFP4 quantization paths, feature-flagged fused FC2 alpha generation, weight transformation for min-latency paths, and dual execution paths for fused vs. standard fc2_alpha handling. Integrated into backend selection logic with fallback support. |
| Configuration & API: `tensorrt_llm/llmapi/llm_args.py`, `tensorrt_llm/_torch/utils.py` | Added `DENSEGEMM` as new MoE backend option in `MoeConfig.backend` Literal type. Introduced `MoeFc2Alpha` enum member to `AuxStreamType` and `EventType` for auxiliary stream/event management. |
| Test Utilities: `tests/unittest/_torch/modules/moe/moe_test_utils.py`, `tests/unittest/_torch/modules/moe/test_moe_backend.py`, `tests/unittest/_torch/modules/moe/test_moe_module.py` | Added `MoeBackendType.DENSEGEMM` enum member, `should_skip_densegemm()` validation function enforcing NVFP4 quantization and size alignment constraints, and test environment setup disabling fused fc2_alpha for backend-level testing. |
| Kernel Test Scripts: `tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc1.py`, `tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py` | Added comprehensive example/benchmark scripts demonstrating kernel execution with SwiGLU fusion (FC1) and FC2 projection, including scale-factor tensor setup, optional FP4 quantization paths, reference computation workflows, and detailed numeric validation against simulated references. |
| Unit Tests: `tests/unittest/_torch/thop/parallel/test_moe_densegemm.py` | Added parameterized unit tests for Dense GEMM SwiGLU (FC1) and FC2 kernels on SM100/103 GPUs, validating output shapes, dtypes, FP4 nibble-level accuracy, scale factor correctness with swizzle/unswizzle operations, and per-token-per-expert alpha scaling. |

Sequence Diagram

sequenceDiagram
    participant Input as Input Tensor
    participant Router as Router
    participant Quant as NVFP4 Quantizer
    participant FC1 as Dense GEMM FC1<br/>(SwiGLU Fusion)
    participant FC2Alpha as FC2 Alpha Gen<br/>(Optional Fused)
    participant FC2 as Dense GEMM FC2
    participant Output as Output Tensor

    Input->>Router: token logits
    Router->>Quant: routing decisions<br/>expert assignments
    Quant->>Quant: quantize input to NVFP4<br/>compute scales
    
    alt Fused FC2 Alpha Path
        Quant->>FC1: x, x_sf, weights<br/>alpha, weight_scales
        FC1->>FC1: SwiGLU fusion<br/>per-expert scaling
        FC1->>FC2Alpha: FC1 output<br/>expert scales
        FC2Alpha->>FC2Alpha: gen_fc2_alpha_fused<br/>normalized alpha
        FC2Alpha->>FC2: alpha_max (scalar)
    else Standard Path
        Quant->>FC1: x, x_sf, weights<br/>alpha, weight_scales
        FC1->>FC1: SwiGLU fusion<br/>per-expert scaling
        FC1->>FC2: FC1 output
        FC2->>FC2: per-token-per-expert<br/>alpha computation
    end
    
    FC2->>FC2: dense GEMM projection<br/>per-expert scaling
    FC2->>Output: final MoE output<br/>output_scale (FP4)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Suggested reviewers

QiJune
syuoni
yizhang-nv
leslie-fang25
StanleySun639

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.63% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a DenseGEMM backend for MoE. It follows the required format with [None][feat] prefix and is specific enough to understand the primary change. |
| Description check | ✅ Passed | PR description is well-structured with clear sections: Design, Files modified, Constraints, Test Coverage, and PR Checklist. All template sections are addressed with concrete details about implementation, architecture decisions, and testing strategy. |

coderabbitai

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)

148-153: ⚠️ Potential issue | 🟠 Major

Add DenseGEMMFusedMoE to the load-balancer allowlist.

The new DenseGEMM branch forwards init_load_balancer, but the upfront assertion still excludes DenseGEMMFusedMoE. Any DenseGEMM call that reaches create_moe_backend() with MoE load balancing enabled will raise before construction, especially when ENABLE_CONFIGURABLE_MOE=0 or callers use create_moe_backend() directly.

🛠️ Suggested fix

     moe_load_balancer = get_moe_load_balancer()
     if moe_load_balancer is not None:
         assert moe_cls in [
             WideEPMoE, CutlassFusedMoE, TRTLLMGenFusedMoE, CuteDslFusedMoE,
-            DeepGemmFusedMoE
-        ], "MoE Load Balance is only supported in WideEPMoE, CutlassFusedMoE, TRTLLMGenFusedMoE, CuteDslFusedMoE, and DeepGemmFusedMoE."
+            DeepGemmFusedMoE, DenseGEMMFusedMoE
+        ], "MoE Load Balance is only supported in WideEPMoE, CutlassFusedMoE, TRTLLMGenFusedMoE, CuteDslFusedMoE, DeepGemmFusedMoE, and DenseGEMMFusedMoE."

Also applies to: 290-305

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 148 - 153,
The assertion that guards MoE load-balancer support excludes the new
DenseGEMMFusedMoE class, causing calls that forward init_load_balancer (e.g.,
via create_moe_backend when moe_load_balancer is set) to raise; update the
allowlist in the assertion that references WideEPMoE, CutlassFusedMoE,
TRTLLMGenFusedMoE, CuteDslFusedMoE, DeepGemmFusedMoE to also include
DenseGEMMFusedMoE, and make the same change for the second identical assertion
later in the file so DenseGEMMFusedMoE is permitted wherever load balancing is
checked.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py`:
- Around line 3035-3042: The cache key tuple used when caching DenseGEMM kernels
(the variable named cache_key built from self.weight_per_expert, mma_tiler_mn,
cluster_shape_mn, self.scaling_vector_size, self.expert_count, and the
alpha_post flag) must include the output element type; add output_dtype (or the
name used in the method that represents the requested output element type) into
that tuple so BF16-compiled kernels cannot be reused for FP16/FP32/FP4 calls.
Apply the same change to the other occurrence that builds a DenseGEMM cache_key
(the second cache_key construction in the class) so both cache keys include
output_dtype.
- Around line 2895-2900: The mapping that sets c_cutlass_dtype currently uses
.get(self.output_dtype, cutlass.BFloat16) and silently falls back to BF16;
change this to explicitly validate self.output_dtype and raise a clear exception
(e.g., ValueError) when the dtype is unsupported instead of defaulting to
cutlass.BFloat16. Locate the assignments to c_cutlass_dtype in the DenseGEMM
runners (the tactic-probing and forward paths for the FC1 and FC2 runners) where
the mapping dict is used and replace the .get default with an explicit
lookup+error path that names self.output_dtype and the allowed dtypes. Apply the
same fix to the other occurrences referenced in the review (the FC1/FC2
tactic-probing and forward blocks) so both probing and runtime use fail-fast
behavior.

In `@tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py`:
- Around line 2288-2302: The swizzled SF layouts use floor division (m // 128, n
// 128) which drops partial 128-wide tiles and can produce zero-sized tensors;
replace those with ceil-div tile counts (e.g. compute m_tiles = (m + 127)//128
and n_tiles = (n + 127)//128) and use m_tiles/n_tiles in the calls to
cute.make_ordered_layout for a_sf and b_sf (or add explicit 128-tile guards in
the validation path), so the layouts always account for partial tiles and never
produce zero-length dimensions when m<128 or n not divisible by 128.
- Around line 1-27: The file
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py
currently has a BSD-3-Clause header; replace that entire BSD header block with
the repository-standard NVIDIA Apache-2.0 license header using the latest
modification year (2026) and include any necessary upstream attribution
separately (do not mix BSD text into the file header). Ensure the new header
matches the project template used across *.py sources (Apache-2.0 text and
NVIDIA copyright line), and leave the rest of fc2.py unchanged.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 1095-1100: The dispatch logic in configurable_moe.py fails to pass
backend-specific kwargs to DenseGEMMFusedMoE because the earlier tuple check
(CutlassFusedMoE, DeepGemmFusedMoE, CuteDslFusedMoE, DenseGEMMFusedMoE) is a
no-op and later branches use exact class equality (self.backend.__class__ ==) so
DenseGEMMFusedMoE is skipped; fix by either adding an explicit branch for
DenseGEMMFusedMoE that passes enable_alltoall and moe_output (e.g., elif
self.backend.__class__ == DenseGEMMFusedMoE: ...) or, preferably, change the
equality checks to isinstance(self.backend, CutlassFusedMoE) /
isinstance(self.backend, DenseGEMMFusedMoE) (or a common base) so inherited
classes receive the correct kwargs in the dispatch that sets enable_alltoall and
moe_output for the backend.

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_densegemm.py`:
- Around line 67-75: The fallback to compute expert_size as
token_final_scales.shape[1] when alpha is None is wrong because that is top_k
not total experts and causes scatter_ indexing OOB; change the logic in the
block that builds fc2_alpha (around token_selected_experts, alpha,
token_final_scales, fc2_alpha, scatter_) so expert_size is the actual expert
count (either use an explicit expert_count argument or compute expert_count =
int(token_selected_experts.max().item()) + 1) when alpha is None, allocate
fc2_alpha with that expert_size, and then perform the scatter_. Ensure
token_selected_experts.long() still indexes into the resized fc2_alpha.

In
`@tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc1.py`:
- Around line 180-182: The code computes weight_per_expert = n // expert_count
but does not validate divisibility, causing silent truncation (e.g.,
mnkl=(512,256,256,1) with expert_count=257 yields 0); update the logic in this
module to first assert or raise a clear error if n % expert_count != 0 and
choose consistent default values so defaults are divisible by expert_count
(adjust mnkl and/or expert_count), and apply the same validation to the other
occurrence of the calculation (the block around the second occurrence computing
weight_per_expert at lines referenced in the review); reference the variables
weight_per_expert, n, expert_count and the default mnkl tuple when making these
changes.
- Around line 1-27: The file run_moe_as_dense_gemm_fc1.py contains an incorrect
BSD-3-Clause header and a 2025 copyright year; replace the existing top-of-file
license block with the repository's standard NVIDIA Apache-2.0 Python header and
update the copyright year to 2026 (the latest meaningful modification), ensuring
the new header sits at the very top of run_moe_as_dense_gemm_fc1.py before any
imports or code so it matches the repo policy for .py sources.
- Around line 59-65: The fallback import in the try/except block around
importing tensorrt_llm._torch.cute_dsl_kernels.blackwell.moe_as_dense_gemm (the
code that inserts into sys.path before importing blackwell.moe_as_dense_gemm as
kernel_module) uses Path(__file__).parents[3], which points to tests/, so change
the inserted path computation to use Path(__file__).parents[4] so the
sys.path.insert(0, str(...)) points to the repository
root/tensorrt_llm/_torch/cute_dsl_kernels and the fallback import succeeds.
- Around line 947-952: The parser arguments for the flags "--vectorized_f32" and
"--use_cupti" use action="store_true" together with default=True so they always
parse to True; update their parser.add_argument calls so the flags can be
disabled at the CLI: either set default=False when using action="store_true" (so
passing the flag enables the feature) or add complementary flags using
action="store_false" (e.g., "--no-vectorized_f32" / "--no-use_cupti") with
default=True to allow turning them off; modify the parser.add_argument calls for
the entries that reference "--vectorized_f32" and "--use_cupti" accordingly.

In
`@tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py`:
- Around line 1-27: Replace the current BSD-3-Clause header in
run_moe_as_dense_gemm_fc2.py with the repository-standard NVIDIA Apache-2.0
header (using the latest modification year), or if the BSD notice must be
preserved, prepend the official Apache-2.0 header above the existing BSD block;
ensure the file begins with the full Apache-2.0 license text and the NVIDIA
copyright line per project guidelines.
- Around line 62-68: The fallback sys.path insertion uses
Path(__file__).parents[3] which points to tests/ (one directory too shallow)
causing the fallback import of blackwell.moe_as_dense_gemm.fc2 to fail; update
the fallback to insert the correct parent directory by changing
Path(__file__).parents[3] to Path(__file__).parents[4] (so the inserted path is
.../tensorrt_llm/_torch/cute_dsl_kernels) before importing kernel_module,
ensuring the direct python run_moe_as_dense_gemm_fc2.py flow can locate the
module.

In `@tests/unittest/_torch/modules/moe/test_moe_backend.py`:
- Around line 471-473: The test mutates process-wide env var
TRTLLM_MOE_FUSED_FC2_ALPHA when backend_type == MoeBackendType.DENSEGEMM and
never restores it; change the code to isolate this override by using pytest's
monkeypatch (e.g., monkeypatch.setenv("TRTLLM_MOE_FUSED_FC2_ALPHA","0") within
the test or fixture that sets backend_type) or save the original os.environ
value and restore it in a finally/teardown block so the env is returned to its
prior state after the parametrized case; update the code around the backend_type
check (the branch referencing MoeBackendType.DENSEGEMM and
TRTLLM_MOE_FUSED_FC2_ALPHA) to use monkeypatch or explicit restore.

In `@tests/unittest/_torch/modules/moe/test_moe_module.py`:
- Around line 1059-1061: The test currently sets
os.environ["TRTLLM_MOE_FUSED_FC2_ALPHA"]="0" unconditionally which leaks into
the pytest process; replace that with the pytest monkeypatch fixture to localize
the change: when checking moe_backend == MoeBackendType.DENSEGEMM.value, call
monkeypatch.setenv("TRTLLM_MOE_FUSED_FC2_ALPHA", "0") instead of writing to
os.environ so the environment is restored after the test (reference
TRTLLM_MOE_FUSED_FC2_ALPHA, moe_backend, and MoeBackendType.DENSEGEMM.value).

In `@tests/unittest/_torch/thop/parallel/test_moe_densegemm.py`:
- Around line 1-2: The file test_moe_densegemm.py currently contains only an
SPDX short-form header; replace it with the repository’s full NVIDIA Apache-2.0
header block (include the full multi-line copyright/Apache 2.0 license text and
the latest modification year, e.g., 2026) so the top of test_moe_densegemm.py
matches the standard header used across Python sources in the repo.
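As an illustration of the `--vectorized_f32` / `--use_cupti` finding listed above, the sketch below contrasts a flag that can never be switched off with one way to make it switchable. It uses `argparse.BooleanOptionalAction` from the standard library (Python 3.9+), which is one option among those the review suggests rather than the fix the PR necessarily adopted; the `--always_on` flag is a made-up name used only to demonstrate the problem.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="flag-handling sketch")
    # Problem pattern: store_true with default=True always parses to True,
    # whether or not the flag is passed.
    parser.add_argument("--always_on", action="store_true", default=True)
    # One possible fix: BooleanOptionalAction also registers a
    # --no-vectorized_f32 flag that sets the value to False.
    parser.add_argument(
        "--vectorized_f32",
        action=argparse.BooleanOptionalAction,
        default=True,
    )
    return parser


default_args = build_parser().parse_args([])
disabled_args = build_parser().parse_args(["--no-vectorized_f32"])
```

Passing `--no-vectorized_f32` is the only way to obtain `False` here, while `--always_on` stays `True` for every command line.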

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between a4e6745 and afed3ed.

📒 Files selected for processing (15)

tensorrt_llm/_torch/custom_ops/cute_dsl_custom_ops.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/__init__.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc1.py
tensorrt_llm/_torch/cute_dsl_kernels/blackwell/moe_as_dense_gemm/fc2.py
tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
tensorrt_llm/_torch/modules/fused_moe/create_moe.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_densegemm.py
tensorrt_llm/_torch/utils.py
tensorrt_llm/llmapi/llm_args.py
tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc1.py
tests/scripts/cute_dsl_kernels/moe_as_dense_gemm/run_moe_as_dense_gemm_fc2.py
tests/unittest/_torch/modules/moe/moe_test_utils.py
tests/unittest/_torch/modules/moe/test_moe_backend.py
tests/unittest/_torch/modules/moe/test_moe_module.py
tests/unittest/_torch/thop/parallel/test_moe_densegemm.py

tensorrt-cicd · 2026-03-12T14:45:08Z

PR_Github #38695 [ run ] completed with state SUCCESS. Commit: afed3ed
/LLM/main/L0_MergeRequest_PR pipeline #30014 completed with status: 'SUCCESS'

zongfeijing · 2026-03-16T16:26:05Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-16T16:33:10Z

PR_Github #39108 [ run ] triggered by Bot. Commit: 6a25da6 Link to invocation

zongfeijing · 2026-03-16T17:08:36Z

/bot run --disable-fail-fast

Add DenseGEMMFusedMoE to the ConfigurableMoE supported backend list so it follows the same composition-based execution flow as other backends (routing, quantization, communication, computation). Signed-off-by: Zongfei Jing <[email protected]>

Signed-off-by: Zongfei Jing <[email protected]>

Add the 2CTA MMA tile (256,256) with cluster shape (2,1) to the FC1 SwiGLU kernel autotuner candidates. Benchmarks show this config is optimal for M=144~256 and M=400~512 on B200. Signed-off-by: Zongfei Jing <[email protected]>

- Add assert for alpha!=None in gen_fc2_alpha_fused fallback path
- Add DenseGEMMFusedMoE to MoE Load Balancer supported list
- Reduce load_weights peak memory by replacing clone() with transpose().contiguous()

Signed-off-by: Zongfei Jing <[email protected]>

- Add SM100/103 capability check in get_moe_cls() with graceful fallback to CutlassFusedMoE, and assert in DenseGEMMFusedMoE.__init__()
- Accept and validate activation_type parameter to reject non-SwiGLU activations at construction time instead of silently using wrong semantics
- Add output_dtype to FC1/FC2 kernel cache keys to prevent cache collisions when called with different output types
- Strip {$nv-internal-release} markers from fc1.py, fc2.py, and test scripts
- Fix misleading load_weights comment about peak memory behavior
- Remove stray print() in test_moe_densegemm.py

Signed-off-by: Zongfei Jing <[email protected]>

…type fix

- Replace silent .get(dtype, BFloat16) fallback with explicit validation and raise ValueError for unsupported output_dtype in FC1/FC2 runners
- Extract dtype mapping as class-level _CUTLASS_DTYPE_MAP constant to eliminate 4x duplication across get_valid_tactics()/forward() methods
- Fix activation_type assertion: accept parameter in DenseGEMMFusedMoE __init__ and pass through create_moe_backend() so non-SwiGLU requests are properly rejected instead of silently defaulting to SwiGLU

Signed-off-by: Zongfei Jing <[email protected]>
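A minimal sketch of the fail-fast dtype lookup this commit describes, together with a cache key that includes the output dtype as an earlier commit does. Everything here is illustrative: the string keys and values stand in for the real torch and CUTLASS dtype objects, the set of supported dtypes is an assumption, and `resolve_output_dtype` and `make_cache_key` are hypothetical names.

```python
# Placeholder mapping; the real map would go from torch dtypes to cutlass dtypes.
_DTYPE_MAP = {
    "bfloat16": "BFloat16",
    "float16": "Float16",
}


def resolve_output_dtype(output_dtype):
    """Look up the kernel dtype, raising instead of silently defaulting."""
    try:
        return _DTYPE_MAP[output_dtype]
    except KeyError:
        allowed = ", ".join(sorted(_DTYPE_MAP))
        raise ValueError(
            f"Unsupported output_dtype {output_dtype!r}; expected one of: {allowed}"
        ) from None


def make_cache_key(weight_per_expert, mma_tiler_mn, cluster_shape_mn, output_dtype):
    """Include output_dtype so kernels compiled for one dtype are not reused for another."""
    return (weight_per_expert, mma_tiler_mn, cluster_shape_mn, output_dtype)


bf16_key = make_cache_key(256, (128, 128), (1, 1), "bfloat16")
fp16_key = make_cache_key(256, (128, 128), (1, 1), "float16")
```

Keeping the map in one place means the tactic-probing and forward paths cannot drift apart, and an unsupported dtype surfaces as a `ValueError` at the call site rather than as a kernel compiled for the wrong type.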

Replace BSD-3-Clause headers with Apache-2.0 to match the rest of the cute_dsl_kernels/blackwell/ directory convention. Signed-off-by: Zongfei Jing <[email protected]>

Add DENSEGEMM entries to l0_b200.yml for test_moe_backend and test_configurable_moe_single_gpu alongside existing CUTLASS/TRTLLM/ CUTEDSL/DEEPGEMM entries. Signed-off-by: Zongfei Jing <[email protected]>

The FC1 kernel autotuner was using get_last_power_of_2_num_tokens_buckets which only generated power-of-2 M values (1,2,4,...,256), missing optimal configs for non-power-of-2 token counts. Switch to deep_gemm_gen_tuning_buckets (step-8 for M<128, step-128 for M>=128) and increase tune_max_num_tokens from 256 to 512 to cover the full operating range. Also consolidate deep_gemm_gen_tuning_buckets into utils.py as a shared utility. Signed-off-by: Zongfei Jiang <[email protected]> Signed-off-by: Zongfei Jing <[email protected]>
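The bucket spacing described in this commit can be sketched as follows. This is not the body of `deep_gemm_gen_tuning_buckets`; `gen_tuning_buckets` is a hypothetical stand-in, and the choice to start at 8 and to include the upper bound are assumptions made for illustration.

```python
def gen_tuning_buckets(max_num_tokens):
    """Token-count buckets: step 8 below 128, step 128 from 128 upward (assumed bounds)."""
    fine = list(range(8, min(128, max_num_tokens + 1), 8))
    coarse = list(range(128, max_num_tokens + 1, 128))
    return fine + coarse


def power_of_2_buckets(max_num_tokens):
    """The previous scheme as described in the commit: 1, 2, 4, ... up to the maximum."""
    buckets, m = [], 1
    while m <= max_num_tokens:
        buckets.append(m)
        m *= 2
    return buckets


new_buckets = gen_tuning_buckets(512)
old_buckets = power_of_2_buckets(256)
```

Under these assumptions the new scheme tunes at values such as 24, 120, and 384 that the power-of-2 list never visits.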

- Add can_implement() to DenseGEMMFusedMoE to accurately report backend capabilities (SM100/103, NVFP4-only, no swiglu_gptoss_style), instead of inheriting the overly permissive CutlassFusedMoE implementation.
- Replace magic seed number 1111 with named DEFAULT_RANDOM_SEED constant in both FC1 and FC2 test scripts.
- Move nested helper functions out of run() to module level in FC1 test script (simulate_f8_quantization, simulate_nvfp4_quantization, compute_scale_factor, apply_quantization_scale, unswizzle_kernel_sfc, ceil_div). Remove redundant local ceil_div definitions in both scripts.
- Fix fallback import path in both FC1 and FC2 test scripts: parents[3] pointed to tests/ instead of repo root. Changed to parents[4].

Signed-off-by: Zongfei Jiang <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>

The intermediate_size >= 14336 skip was conservatively copied from the CuteDSL/TRTLLMGen backends, but DenseGEMM does not have the same FP4 error accumulation issue. Verified Mixtral config (e=8, k=2, h=4096, i=14336) passes at both seq_len=1 and seq_len=8. Signed-off-by: Zongfei Jiang <[email protected]> Signed-off-by: Zongfei Jing <[email protected]>

The fused fc2_alpha path (TRTLLM_MOE_FUSED_FC2_ALPHA) has a known accuracy issue under TP where the scalar fc2_alpha_max gets summed tp_size times during ReduceScatter. Change the default from enabled to disabled so the non-fused per-expert path is used, which correctly factors out of the TP reduction. Also add should_skip_densegemm to the multi-GPU test parameter generation so DenseGEMM correctly skips EP modes (DEP/TEP) and validates TP alignment constraints. Signed-off-by: Zongfei Jing <[email protected]>

…tlassFusedMoE

DenseGEMMFusedMoE was inheriting from CutlassFusedMoE but overriding most of its core methods while only reusing a few (create_weights, forward_impl, load_weights). This tight coupling was misleading since the two backends have fundamentally different architectures: CutlassFusedMoE uses per-expert scattered GEMM with alltoall support, while DenseGEMM packs all experts into a single dense matrix for min-latency scenarios (NVFP4, SM100/103 only).

Changes:
- DenseGEMMFusedMoE now inherits from MoE base class directly
- Implements its own create_weights, load_weights, forward_impl, and _get_quant_method independently
- Simplified forward_impl without chunking/alltoall logic
- Added isinstance check for DenseGEMMFusedMoE in ConfigurableMoE's float32 assertion for token_final_scales
- Fixed env var leak in test_moe_backend.py by using monkeypatch

Signed-off-by: Zongfei Jing <[email protected]>
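The env var leak fixed here comes from writing to `os.environ` without restoring it. The commit uses pytest's `monkeypatch`; the sketch below shows the same scoping idea with the standard library's `unittest.mock.patch.dict` instead, so it runs without pytest. The helper name `run_with_fused_fc2_alpha_disabled` is hypothetical.

```python
import os
from unittest import mock

ENV_VAR = "TRTLLM_MOE_FUSED_FC2_ALPHA"


def run_with_fused_fc2_alpha_disabled(fn):
    """Call fn() with the env var forced to "0", then restore the previous state."""
    with mock.patch.dict(os.environ, {ENV_VAR: "0"}):
        return fn()


before = os.environ.get(ENV_VAR)
seen_inside = run_with_fused_fc2_alpha_disabled(lambda: os.environ[ENV_VAR])
after = os.environ.get(ENV_VAR)
```

Inside the call the variable reads `"0"`; afterwards it is back to whatever it was before, including being unset, so later parametrized cases are unaffected.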

zongfeijing · 2026-03-31T07:34:37Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-31T07:40:07Z

PR_Github #40899 [ run ] triggered by Bot. Commit: e4c5a29 Link to invocation

tensorrt-cicd · 2026-03-31T18:42:24Z

PR_Github #40899 [ run ] completed with state SUCCESS. Commit: e4c5a29
/LLM/main/L0_MergeRequest_PR pipeline #31900 completed with status: 'FAILURE'

zongfeijing · 2026-03-31T23:20:06Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-31T23:25:49Z

PR_Github #41037 [ run ] triggered by Bot. Commit: e4c5a29 Link to invocation

tensorrt-cicd · 2026-04-01T01:23:19Z

PR_Github #41037 [ run ] completed with state FAILURE. Commit: e4c5a29
/LLM/main/L0_MergeRequest_PR pipeline #32016 completed with status: 'FAILURE'

zongfeijing · 2026-04-01T01:37:23Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-01T01:43:11Z

PR_Github #41062 [ run ] triggered by Bot. Commit: e4c5a29 Link to invocation

tensorrt-cicd · 2026-04-01T03:27:31Z

PR_Github #41062 [ run ] completed with state FAILURE. Commit: e4c5a29
/LLM/main/L0_MergeRequest_PR pipeline #32038 completed with status: 'FAILURE'

QiJune

LGTM for the LLM change

zongfeijing · 2026-04-01T06:16:18Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-01T06:21:56Z

PR_Github #41128 [ run ] triggered by Bot. Commit: e4c5a29 Link to invocation

tensorrt-cicd · 2026-04-01T10:12:56Z

PR_Github #41128 [ run ] completed with state SUCCESS. Commit: e4c5a29
/LLM/main/L0_MergeRequest_PR pipeline #32099 completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: Zongfei Jing <[email protected]>

zongfeijing force-pushed the dev branch from 8ffbb85 to 3c5f97b Compare January 13, 2026 08:15

zongfeijing force-pushed the dev branch 5 times, most recently from 298c140 to edc2b8f Compare February 6, 2026 05:36

zongfeijing force-pushed the dev branch 3 times, most recently from 517f8fc to afed3ed Compare March 12, 2026 07:18

zongfeijing marked this pull request as ready for review March 12, 2026 08:00

zongfeijing requested review from a team as code owners March 12, 2026 08:00

zongfeijing requested review from hchings, lfr-0531, xxi-nv and yizhang-nv March 12, 2026 08:00

coderabbitai Bot reviewed Mar 12, 2026

zongfeijing removed the request for review from hchings March 16, 2026 17:09

zongfeijing added 13 commits March 31, 2026 00:33

Add unit tests for MoE DenseGEMM

1f4ace3

Signed-off-by: Zongfei Jing <[email protected]>

Add (256,256):(2,1) config to FC1 DenseGEMM autotuner

c2d7ed6

Add the 2CTA MMA tile (256,256) with cluster shape (2,1) to the FC1 SwiGLU kernel autotuner candidates. Benchmarks show this config is optimal for M=144~256 and M=400~512 on B200.

Signed-off-by: Zongfei Jing <[email protected]>

Update fc1.py and fc2.py license headers to Apache-2.0

315414f

Replace BSD-3-Clause headers with Apache-2.0 to match the rest of the cute_dsl_kernels/blackwell/ directory convention.

Signed-off-by: Zongfei Jing <[email protected]>

Add DenseGEMM backend tests to B200 CI test list

c5cd101

Add DENSEGEMM entries to l0_b200.yml for test_moe_backend and test_configurable_moe_single_gpu alongside existing CUTLASS/TRTLLM/CUTEDSL/DEEPGEMM entries.

Signed-off-by: Zongfei Jing <[email protected]>

zongfeijing force-pushed the dev branch from f6f7ec5 to e4c5a29 Compare March 31, 2026 07:33

hyukn approved these changes Apr 1, 2026

QiJune approved these changes Apr 1, 2026

zongfeijing merged commit 596f57a into NVIDIA:main Apr 1, 2026
5 checks passed

karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026

[None] [feat] Add densegemm backend for MoE (NVIDIA#10479)

f558a4d

Signed-off-by: Zongfei Jing <[email protected]>
