[TRTLLM-11127][feat] add W4A8_MXFP4_FP8 MoE unit test support by xxi-nv · Pull Request #13401 · NVIDIA/TensorRT-LLM

xxi-nv · 2026-04-24T01:29:15Z

Add W4A8_MXFP4_FP8 coverage to both test_moe_backend.py and test_moe_module.py, supporting the CUTLASS and TRTLLM-gen fused MoE methods.

Key pieces

- `quantize_utils.py`:
  - extend `get_test_quant_params` for W4A8_MXFP4_FP8 with backend-specific alignment (CUTLASS 128/128, TRTLLM 128/512);
  - loosen `MXFP4MXFP8QuantizeUtil.create_weights` to accept both W4A8_MXFP4_MXFP8 and W4A8_MXFP4_FP8;
  - add `MXFP4FP8QuantizeUtil` and a dedicated bf16 reference `MXFP4FP8RefGatedMLPFusedMoE` that dequantizes MXFP4 weights via `trtllm.mxfp4_dequantize_unswizzled` at load time (avoids the GatedMLP+W4A8MXFP4FP8LinearMethod ref-path scale divergence documented in the PR description).
- `moe_test_utils.py`: wire W4A8_MXFP4_FP8 into `trtllm_gen_quant_algos` and the `is_mxfp4_variant` auto-pad set; extend the 128-alignment skip to both MXFP4 A8 variants.
- `test_moe_backend.py` / `test_moe_module.py`: register the new algo in `QUANT_ALGOS_TO_TEST` / `QUANT_ALGOS` and extend the MXFP4 weight-prep path in `prepare_weights_from_backend`.
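
As a rough illustration of the backend-specific alignment mentioned above, the sketch below rounds two MoE dimensions up to per-backend multiples. The names `ALIGNMENTS`, `round_up`, and `padded_shape` are hypothetical and are not the test-suite API. The PR description does not say which dimensions the two numbers refer to, so mapping them to (hidden_size, intermediate_size) is an assumption.

```python
# Illustrative only: hypothetical names, not the actual quantize_utils.py code.
# The pairs come from the PR description (CUTLASS 128/128, TRTLLM 128/512).
# ASSUMPTION: the pair is (hidden_size alignment, intermediate_size alignment).
ALIGNMENTS = {
    "CUTLASS": (128, 128),
    "TRTLLM": (128, 512),
}


def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple


def padded_shape(backend: str, hidden_size: int, intermediate_size: int):
    """Return the two dimensions rounded up to the backend's alignment."""
    hidden_align, inter_align = ALIGNMENTS[backend]
    return (
        round_up(hidden_size, hidden_align),
        round_up(intermediate_size, inter_align),
    )
```

With these assumptions, `padded_shape("TRTLLM", 2880, 1408)` gives `(2944, 1536)`, while the same shape under `"CUTLASS"` gives `(2944, 1408)`.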

Documented skips (each with rationale in code)

- CUTLASS W4A8_MXFP4_FP8: kernel bug, accuracy gap independent of ref.
- TRTLLM top_k=1: static vs dynamic FP8 scale divergence.
- Module test TRTLLM W4A8_MXFP4_FP8 with num_experts>=60 and intermediate_size>=1408: ConfigurableMoE/autotuner divergence from backend.run_moe (follow-up); backend test still exercises this config.
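
The three skips above can be read as a single predicate that returns a reason string or `None`. The following is a minimal sketch under that reading. The function name, the string-valued arguments, and the `is_module_test` flag are hypothetical; the real checks live in `moe_test_utils.py` and `test_moe_module.py` and may be structured differently.

```python
from typing import Optional


def skip_reason(
    backend: str,
    quant_algo: str,
    top_k: int,
    num_experts: int,
    intermediate_size: int,
    is_module_test: bool,
) -> Optional[str]:
    """Return a skip rationale for known-bad W4A8_MXFP4_FP8 configs, else None.

    Hypothetical helper: conditions are taken from the PR description only.
    """
    if quant_algo != "W4A8_MXFP4_FP8":
        return None
    if backend == "CUTLASS":
        return "CUTLASS kernel bug: accuracy gap independent of ref"
    if backend == "TRTLLM" and top_k == 1:
        return "static vs dynamic FP8 scale divergence"
    if (
        backend == "TRTLLM"
        and is_module_test
        and num_experts >= 60
        and intermediate_size >= 1408
    ):
        return "ConfigurableMoE/autotuner divergence from backend.run_moe"
    return None
```

Note that the last condition is gated on `is_module_test`, matching the statement that the backend test still exercises the large configuration.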

Verified on GB200 (OCI): backend test 4 passed / 230 deselected; module test 16 skipped / 634 deselected / 0 failed.

Summary by CodeRabbit

Release Notes

Tests
- Expanded test coverage for W4A8_MXFP4_FP8 quantization algorithm across MOE module tests
- Added test skips documenting known limitations with specific backend and configuration combinations

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-04-24T01:36:09Z

📝 Walkthrough

The changes introduce support for a new quantization algorithm W4A8_MXFP4_FP8 across MoE test utilities and modules. Updates include backend-specific skip constraints, quantization parameter configuration, FP8-specific reference implementations, and expanded test matrices with targeted skips for known mismatches.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Skip & Constraint Logic `tests/unittest/_torch/modules/moe/moe_test_utils.py` | Added `W4A8_MXFP4_FP8` to TRTLLM Gen quantization-algorithm gating; generalized 128-alignment failure check with dynamic error messaging; introduced TRTLLM skip for `top_k==1` due to reference-vs-fused activation/scale divergence; added unconditional CUTLASS skip for incorrect outputs on GB200; integrated `should_skip_cutlass` into quick-skip chain; expanded MXFP4 auto-padding exception set. |
| Quantization Configuration & Utilities `tests/unittest/_torch/modules/moe/quantize_utils.py` | Extended `get_test_quant_params` to recognize and configure `W4A8_MXFP4_FP8` with FP8 activation scaling via `x_sf_global`; generalized `MXFP4MXFP8QuantizeUtil.create_weights` to support both MXFP4_MXFP8 and FP8 variants; added new `MXFP4FP8RefGatedMLPFusedMoE` reference utility using bf16 GatedMLP forward to align with fused kernel; added new `MXFP4FP8QuantizeUtil` subclass for per-expert per-tensor input-scale population. |
| Test Infrastructure `tests/unittest/_torch/modules/moe/test_moe_backend.py`, `tests/unittest/_torch/modules/moe/test_moe_module.py` | Extended test matrix to include `W4A8_MXFP4_FP8` algorithm; expanded weight-preparation logic to handle both MXFP4_MXFP8 and FP8 variants via `prepare_weights_from_backend`; added targeted TRTLLM skip for large configurations (num_experts ≥ 60, intermediate_size ≥ 1408) due to high-level fused-moe result mismatch with bf16 reference. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The PR title clearly identifies the main change (adding W4A8_MXFP4_FP8 MoE unit test support) with proper JIRA ticket format and feature type indicator. |
| Description check | ✅ Passed | The PR description is comprehensive, explaining what was changed (supporting W4A8_MXFP4_FP8 in test files), why (extending coverage), and includes test coverage details with verification results. Key pieces are clearly documented with technical rationale for each skip condition. |


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/unittest/_torch/modules/moe/quantize_utils.py`:
- Around line 1683-1700: The test's GatedMLP reference path is supposed to use
bf16 but currently passes dtype=dtype into super().__init__ and later calls
.to(dtype=self.dtype,...), which lets fp16 tests reuse the fp16 path; change the
constructor call to force bfloat16 (e.g., pass dtype=torch.bfloat16 or
equivalent bf16 constant) so the experts are built in bf16 (keep ModelConfig()
without quant_config), and ensure any subsequent .to(...) that currently
converts expert submodules does not convert them back to the test dtype (instead
convert only the outer wrapper or skip converting the expert weights). Apply the
same change at the other occurrence noted (lines 1729-1743).

In `@tests/unittest/_torch/modules/moe/test_moe_module.py`:
- Around line 1125-1145: Move the known-bad predicate that skips TRTLLM +
W4A8_MXFP4_FP8 large configs out of the local block guarding
test_configurable_moe_single_gpu() and into the shared test-parameter filtering
used by generate_multi_gpu_test_params(); specifically, extract the conditional
referencing quant_algo == QuantAlgo.W4A8_MXFP4_FP8, moe_backend ==
MoeBackendType.TRTLLM.value, model_config.num_experts >= 60, and
model_config.intermediate_size >= 1408 and apply it in the centralized/shared
skip/filter function so both single-GPU and multi-GPU matrices (including tests
produced by generate_multi_gpu_test_params()) skip this known-bad configuration.
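
One way to read the second suggestion is to keep the known-bad predicate in a single function and have every parameter generator call it. The sketch below shows that shape without pytest. All names (`known_bad_reason`, `annotate`, `single_gpu_params`, `multi_gpu_params`) and the string constants are hypothetical stand-ins for the real helpers and enums.

```python
from itertools import product
from typing import Iterator, Optional, Tuple

Config = Tuple[str, str, int, int]  # (quant_algo, backend, num_experts, intermediate_size)


def known_bad_reason(config: Config) -> Optional[str]:
    """Single source of truth for the known-bad TRTLLM + W4A8_MXFP4_FP8 configs."""
    quant_algo, backend, num_experts, intermediate_size = config
    if (
        quant_algo == "W4A8_MXFP4_FP8"
        and backend == "TRTLLM"
        and num_experts >= 60
        and intermediate_size >= 1408
    ):
        return "fused-moe result mismatch with bf16 reference"
    return None


def annotate(configs) -> Iterator[Tuple[Config, Optional[str]]]:
    """Attach a skip reason (or None) to every config."""
    for config in configs:
        yield config, known_bad_reason(config)


def single_gpu_params(algos, backends, shapes):
    """Hypothetical single-GPU matrix; reuses the shared predicate."""
    return list(annotate((a, b, e, i) for a, b, (e, i) in product(algos, backends, shapes)))


def multi_gpu_params(algos, backends, shapes):
    """Hypothetical multi-GPU matrix; reuses the same shared predicate."""
    return list(annotate((a, b, e, i) for a, b, (e, i) in product(algos, backends, shapes)))
```

In a pytest-based suite the reason would typically be turned into a skip mark on the parameter rather than returned alongside it; that wiring is omitted here.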

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4651f86b-e826-4310-90dd-918dd86a56ec

📥 Commits

Reviewing files that changed from the base of the PR and between 0a27cf9 and 98de5da.

📒 Files selected for processing (4)

tests/unittest/_torch/modules/moe/moe_test_utils.py
tests/unittest/_torch/modules/moe/quantize_utils.py
tests/unittest/_torch/modules/moe/test_moe_backend.py
tests/unittest/_torch/modules/moe/test_moe_module.py

The kernel-faithful MXFP4FP8RefGatedMLPFusedMoE (static FP8 round-trip on FC1/FC2 inputs) brings the generic top_k=1 cases and gpt-oss-style top_k>=2 cases inside tolerance for the TRTLLM Gen backend.

The single remaining gap is the triple intersection W4A8_MXFP4_FP8 + top_k=1 + swiglu_gptoss_style=True: with only one routed token per expert, the load-time static per-tensor FP8 scales no longer cover the FC2 activation range that the gpt-oss SwiGLU '*(linear + 1)' term produces, and ref vs kernel diverge ~92-94%.

Mirror the existing W4A8_MXFP4_MXFP8 + gpt-oss + top_k=1 skip and the original PR NVIDIA#13401 design-limitation rationale, but only on this narrow intersection. The broader top_k=1 coverage and the gpt-oss top_k>=2 coverage stay enabled.

Signed-off-by: xxi <[email protected]>
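
To make the "static scales no longer cover the activation range" point concrete, here is a deliberately simplified sketch of a per-tensor FP8 round trip. It models only saturation at the E4M3 maximum finite value (448) and ignores mantissa rounding. The calibration amax and activation values are invented for illustration, and this is not the reference implementation or the kernel's quantization path.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def fp8_round_trip(x: float, scale: float) -> float:
    """Clamp-only model of quantize -> dequantize (no mantissa rounding)."""
    q = max(-E4M3_MAX, min(E4M3_MAX, x / scale))
    return q * scale


# ASSUMPTION: made-up numbers. The static scale is fixed at load time from a
# calibration amax; the dynamic scale is derived from the actual activations.
calibration_amax = 4.0
activations = [0.5, 3.0, 10.0]

static_scale = calibration_amax / E4M3_MAX
dynamic_scale = max(abs(v) for v in activations) / E4M3_MAX

static_out = [fp8_round_trip(v, static_scale) for v in activations]
dynamic_out = [fp8_round_trip(v, dynamic_scale) for v in activations]
```

Under the static scale, the activation 10.0 saturates to roughly 4.0, while the dynamic scale preserves it. Values inside the calibrated range survive in both cases, which is consistent with the broader coverage staying enabled.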

xxi-nv · 2026-05-15T03:40:46Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-15T03:47:41Z

PR_Github #48507 [ run ] triggered by Bot. Commit: 393947b Link to invocation

tensorrt-cicd · 2026-05-15T21:17:13Z

PR_Github #48507 [ run ] completed with state FAILURE. Commit: 393947b
/LLM/main/L0_MergeRequest_PR pipeline #38302 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again
