[None][feat] Add fused allreduce+RMSNorm op and optional residual in … by lfr-0531 · Pull Request #12201 · NVIDIA/TensorRT-LLM

lfr-0531 · 2026-03-13T15:42:43Z

Summary

Add AllReduceFusionOp.RMS_NORM (value=9) that fuses allreduce + RMSNorm in a single kernel without residual addition. This is useful for models where the residual connection is handled externally (e.g., Mewtwo).

Also makes the residual parameter optional in moe_finalize_allreduce, allowing MOE layers to skip residual addition when not needed.
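
For readers who want the intended semantics spelled out, here is a minimal pure-Python reference for a single token. It is a sketch under stated assumptions, not the kernel: the function names, the list-based "ranks", and the `eps` default are all hypothetical, and the real op works on GPU tensors across processes.

```python
import math
from typing import List, Optional, Sequence


def allreduce_sum(per_rank: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of one hidden vector per rank (models the allreduce)."""
    return [sum(vals) for vals in zip(*per_rank)]


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def allreduce_rms_norm(
    per_rank: Sequence[Sequence[float]],
    weight: Sequence[float],
    residual: Optional[Sequence[float]] = None,
    eps: float = 1e-6,
) -> List[float]:
    """Hypothetical reference for the fused op on one token.

    residual=None models the RMS_NORM pattern (no residual addition);
    passing a residual models the add-then-normalize variants.
    """
    reduced = allreduce_sum(per_rank)
    if residual is not None:
        reduced = [r + v for r, v in zip(residual, reduced)]
    return rms_norm(reduced, weight, eps)


# Two "ranks", hidden size 2, no residual.
out = allreduce_rms_norm([[1.0, 2.0], [2.0, 2.0]], weight=[1.0, 1.0], eps=0.0)
```

With unit weights and `eps=0.0`, the mean of the squared outputs is 1, which is a quick sanity check for any RMSNorm reference.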

Changes

C++ Kernels

- Add kARRMSNorm fusion pattern in allreduce fusion kernels
- Add launchResidualRmsNormKernelDispatch with a bool Residual template parameter to handle the residual-free path without runtime branching
- Make moe_finalize_allreduce fused kernel conditionally skip residual load/add when residual_in is nullptr
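
The "dispatch once, no branch in the hot loop" idea can be sketched in Python as a loose analog of the `bool Residual` template parameter. Everything here (`make_fused_op`, `launch`, the list-based vectors) is hypothetical and only illustrates the control flow; the actual code uses C++ templates and CUDA kernels.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def make_fused_op(has_residual: bool) -> Callable[..., Tuple[Vector, ...]]:
    """Return a variant specialised on `has_residual`.

    The choice is made once here, so neither variant branches on the
    residual while processing elements.
    """

    def norm(x: Vector, gamma: Vector, eps: float) -> Vector:
        inv_rms = (sum(v * v for v in x) / len(x) + eps) ** -0.5
        return [v * inv_rms * g for v, g in zip(x, gamma)]

    def with_residual(val, gamma, residual, eps=1e-6):
        norm_input = [v + r for v, r in zip(val, residual)]
        # Returns (norm_out, residual_out).
        return norm(norm_input, gamma, eps), norm_input

    def without_residual(val, gamma, residual=None, eps=1e-6):
        # Returns (norm_out,) only; the residual is never read.
        return (norm(val, gamma, eps),)

    return with_residual if has_residual else without_residual


def launch(val: Vector, gamma: Vector, residual: Optional[Vector] = None):
    """Pick the specialisation from the presence of a residual buffer."""
    return make_fused_op(residual is not None)(val, gamma, residual)
```

Calling `launch(val, gamma)` yields a one-element tuple, while `launch(val, gamma, residual=...)` yields the normalized output together with the pre-norm sum.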

Python

- Add AllReduceFusionOp.RMS_NORM = 9 enum value
- Update assertions in distributed/ops.py and functional.py to allow RMS_NORM fusion without a residual tensor
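
A rough sketch of what a relaxed assertion could look like is shown below. `FusionOp` and `check_params` are hypothetical stand-ins, not the code in functional.py or distributed/ops.py; only `RMS_NORM = 9` is taken from this PR, and the other member values are assumed for illustration.

```python
from enum import IntEnum
from typing import Optional


class FusionOp(IntEnum):
    """Illustrative stand-in for AllReduceFusionOp."""

    NONE = 0               # assumed value, for illustration only
    RESIDUAL_RMS_NORM = 1  # assumed value, for illustration only
    RMS_NORM = 9           # value stated in the PR description


def check_params(fusion_op: FusionOp, residual: Optional[object]) -> None:
    """Require a residual for every fused op except RMS_NORM (hypothetical helper)."""
    needs_residual = fusion_op not in (FusionOp.NONE, FusionOp.RMS_NORM)
    assert not needs_residual or residual is not None, (
        f"{fusion_op.name} requires a residual tensor"
    )
```

Under this sketch, `check_params(FusionOp.RMS_NORM, None)` passes, while `check_params(FusionOp.RESIDUAL_RMS_NORM, None)` raises an `AssertionError`.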

Op Registration

Update allreduceOp.cpp to handle RMS_NORM dispatch and accept optional residual (Tensor?) in MOE finalize signature

Tests

Add unit tests for RMS_NORM fusion pattern and moe_finalize_allreduce with residual=None

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Release Notes

New Features
- Added RMS_NORM as a new fusion operation option for optimized normalization workflows.
- Extended support for optional residual input in distributed reduction operations, improving flexibility for mixed precision and mixture-of-experts scenarios.
Tests
- Added comprehensive test coverage for new RMS_NORM fusion patterns and no-residual operation flows.

…moe_finalize_allreduce

Add AllReduceFusionOp.RMS_NORM (value=9) that fuses allreduce + RMSNorm in a single kernel without residual addition. This is useful for models where the residual connection is handled externally.

Changes:
- New kARRMSNorm fusion pattern in C++ allreduce kernels
- launchResidualRmsNormKernel now dispatches on Residual template param
- moe_finalize_allreduce accepts optional residual (Tensor? instead of Tensor)
- MOE fused_op kernel skips residual load/add when residual_in is nullptr
- Python AllReduceFusionOp.RMS_NORM enum and updated assertions
- Unit tests for RMS_NORM pattern and moe_finalize with residual=None

Signed-off-by: Fanrong Li <[email protected]>

lfr-0531 · 2026-03-13T17:02:01Z

/bot run

tensorrt-cicd · 2026-03-13T17:08:36Z

PR_Github #38886 [ run ] triggered by Bot. Commit: 17f43b9 Link to invocation

tensorrt-cicd · 2026-03-13T20:03:54Z

PR_Github #38886 [ run ] completed with state SUCCESS. Commit: 17f43b9
/LLM/main/L0_MergeRequest_PR pipeline #30194 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lfr-0531 · 2026-03-15T02:37:15Z

/bot run

tensorrt-cicd · 2026-03-15T02:43:16Z

PR_Github #38958 [ run ] triggered by Bot. Commit: 17f43b9 Link to invocation

tensorrt-cicd · 2026-03-15T09:22:49Z

PR_Github #38958 [ run ] completed with state SUCCESS. Commit: 17f43b9
/LLM/main/L0_MergeRequest_PR pipeline #30240 completed with status: 'SUCCESS'

CI Report

Link to invocation

coderabbitai · 2026-03-16T02:52:45Z

📝 Walkthrough

Walkthrough

Changes introduce a new RMS_NORM fusion operation for AllReduce with RMSNorm, adding corresponding enum values across kernel, C++, and Python layers. Residual input handling is made optional for RMS_NORM paths through updated dispatchers, kernels, and public API signatures. Test coverage is extended to validate the new RMS_NORM fusion path.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Enum Definitions `cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.h`, `cpp/tensorrt_llm/kernels/customAllReduceKernels.h`, `tensorrt_llm/functional.py` | Added new enum values: `kARRMSNorm` (value 6) in AllReduceFusionPattern and `RMS_NORM` (value 9) in AllReduceFusionOp. Added FusionPatternTraits specialization for kARRMSNorm with hasRMSNorm=true, hasResidual=false, and quantType=kNone. |
| Kernel Dispatcher Updates `cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu`, `cpp/tensorrt_llm/kernels/customAllReduceKernels.cu` | Extended DISPATCH_PATTERN dispatcher to handle kARRMSNorm pattern routing. Added template-parameterized `launchResidualRmsNormKernel(Dispatch)` functions to route RMS_NORM operations through residual-aware dispatch based on presence of residual_buffer. |
| MOE Kernel Implementation `cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu` | Refactored fused_op to conditionally handle optional residual input: computes norm_input from val+residual_in when available, otherwise uses val. Relaxed preconditions in moereduction and moefinalize pathways to allow RMS_NORM without mandatory residual. |
| Torch/C++ Operations `cpp/tensorrt_llm/thop/allreduceOp.cpp` | Added RMS_NORM handling in runFusionAllReduce with norm_out allocation and kARRMSNorm pattern routing. Updated moe_finalize_allreduce signature to make residual an optional parameter and refactored output logic to conditionally produce residual_out. Disabled residual addition for RMS_NORM fusion paths. |
| Python API & Assertions `tensorrt_llm/functional.py`, `tensorrt_llm/_torch/distributed/ops.py` | Relaxed AllReduceParams initialization assertion to permit RMS_NORM fusion without residual. Removed token count assertion in MoEAllReduce.forward. Updated docstrings to indicate RMS_NORM as observable fusion_op outcome. |
| Test Coverage `tests/unittest/_torch/multi_gpu/test_allreduce.py` | Added RMS_NORM to fusion pattern test parameterization. Introduced new test helpers (run_moe_finalize_no_residual_single_rank, test_moe_finalize_allreduce_no_residual) to validate RMS_NORM fusion without residual, including numerical correctness checks against reference RMSNorm computation. |

Sequence Diagram

sequenceDiagram
    participant Python as Python API
    participant AllReduceOp as AllReduceOp (C++)
    participant Dispatcher as Kernel Dispatcher
    participant KernelImpl as Kernel Implementation
    participant Output as Result Buffer

    Python->>AllReduceOp: call with fusion_op=RMS_NORM
    AllReduceOp->>AllReduceOp: check residual presence
    AllReduceOp->>Dispatcher: launchResidualRmsNormKernel<T, Residual>()
    Dispatcher->>Dispatcher: template dispatch: Residual=true/false
    Dispatcher->>KernelImpl: launch rms_norm_kernel_launcher<T, Bias, Residual, Weight>()
    KernelImpl->>KernelImpl: compute norm_input = val + residual_in (if Residual=true)
    KernelImpl->>KernelImpl: compute output = rms_norm(norm_input)
    KernelImpl->>Output: write norm_out (and residual_out if Residual=true)
    Output-->>AllReduceOp: return {norm_out} or {norm_out, residual_out}
    AllReduceOp-->>Python: return fused result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 9.68% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title describes the main change - adding a fused allreduce+RMSNorm operation and optional residual support. It is directly related to the primary objective of the PR. |
| Description check | ✅ Passed | The PR description provides a clear summary, explains the changes across C++ kernels, Python, op registration, and tests. It covers the key aspects and includes a completed checklist. |

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/thop/allreduceOp.cpp (1)

1678-1752: ⚠️ Potential issue | 🟠 Major

Validate the optional-input shapes before sizing the MoE finalize launch.

num_tokens is now inferred from whichever optional tensor happens to be present, but the code never checks that residual, shared_expert_output, expert_scale_factor, and expanded_idx_to_permuted_idx agree on that m, or that norm_weight.size(0) matches the hidden dim of the other inputs. A mismatched caller will size norm_out and allreduce_fusion_params.size from one shape and then hand the kernel raw pointers with another, which is an out-of-bounds risk instead of a clean TORCH_CHECK.

Suggested fix

-    int hidden_dim = norm_weight.size(0);
+    TORCH_CHECK(norm_weight.dim() == 1, "norm_weight must be 1D");
+    int hidden_dim = norm_weight.size(0);
+    TORCH_CHECK(input.size(-1) == hidden_dim, "input hidden dim must match norm_weight");
     int top_k = expanded_idx_to_permuted_idx.size(-1);
 
@@
-    int num_tokens;
+    TORCH_CHECK(expanded_idx_to_permuted_idx.dim() == 2, "expanded_idx_to_permuted_idx must be 2D");
+    int num_tokens;
     if (residual.has_value())
     {
+        TORCH_CHECK(residual.value().dim() == 2, "residual must be 2D");
+        TORCH_CHECK(
+            residual.value().size(1) == hidden_dim, "residual hidden dim must match norm_weight");
         num_tokens = residual.value().size(0);
     }
     else if (shared_expert_output.has_value())
     {
+        TORCH_CHECK(shared_expert_output.value().dim() == 2, "shared_expert_output must be 2D");
+        TORCH_CHECK(shared_expert_output.value().size(1) == hidden_dim,
+            "shared_expert_output hidden dim must match norm_weight");
         num_tokens = shared_expert_output.value().size(0);
     }
     else
@@
+    TORCH_CHECK(
+        expanded_idx_to_permuted_idx.size(0) == num_tokens, "expanded_idx_to_permuted_idx token dim mismatch");
+    if (shared_expert_output.has_value())
+    {
+        TORCH_CHECK(
+            shared_expert_output.value().size(0) == num_tokens, "shared_expert_output token dim mismatch");
+    }
+    if (expert_scale_factor.has_value())
+    {
+        TORCH_CHECK(expert_scale_factor.value().dim() == 2, "expert_scale_factor must be 2D");
+        TORCH_CHECK(expert_scale_factor.value().size(0) == num_tokens
+                && expert_scale_factor.value().size(1) == top_k,
+            "expert_scale_factor must have shape [num_tokens, top_k]");
+    }
+
     // size: num_token * hidden_dim
     allreduce_fusion_params.size = num_tokens * hidden_dim;
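
The same consistency checks can be sketched in Python over plain shape tuples, which may help when writing a caller-side guard or a unit test. This is an illustrative sketch only: `infer_num_tokens` is a hypothetical helper, and the assumed layouts (`[num_tokens, hidden_dim]` and `[num_tokens, top_k]`) follow the suggested diff above rather than any confirmed API contract.

```python
from typing import Optional, Tuple

Shape = Tuple[int, ...]


def infer_num_tokens(
    hidden_dim: int,
    top_k: int,
    expanded_idx_shape: Shape,
    residual: Optional[Shape] = None,
    shared_expert_output: Optional[Shape] = None,
    expert_scale_factor: Optional[Shape] = None,
) -> int:
    """Return num_tokens after checking every provided shape agrees on it."""
    if len(expanded_idx_shape) != 2 or expanded_idx_shape[1] != top_k:
        raise ValueError("expanded_idx_to_permuted_idx must be [num_tokens, top_k]")
    num_tokens = expanded_idx_shape[0]

    # Optional [num_tokens, hidden_dim] inputs must match when present.
    for name, shape in (
        ("residual", residual),
        ("shared_expert_output", shared_expert_output),
    ):
        if shape is not None and shape != (num_tokens, hidden_dim):
            raise ValueError(f"{name} must be [num_tokens, hidden_dim], got {shape}")

    if expert_scale_factor is not None and expert_scale_factor != (num_tokens, top_k):
        raise ValueError("expert_scale_factor must be [num_tokens, top_k]")
    return num_tokens
```

Deriving `num_tokens` from an input that is always present, then checking the optional ones against it, avoids sizing the launch from whichever optional tensor happens to be supplied.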

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/thop/allreduceOp.cpp` around lines 1678 - 1752, Compute and
validate a consistent num_tokens and hidden_dim before allocating outputs or
populating allreduce_fusion_params: check that norm_weight.size(0) equals
hidden_dim (input.size(1) or as used), and if residual.has_value() ensure
residual.value().dim() and residual.value().size(0) match num_tokens and
residual.value().size(1) == hidden_dim; if shared_expert_output.has_value()
ensure its size(0) == num_tokens and size(1) == hidden_dim; if
expert_scale_factor.has_value() validate its shape/length aligns with num_tokens
(or is broadcastable) and use TORCH_CHECK to fail with clear messages; also
verify expanded_idx_to_permuted_idx.size(0) matches num_tokens before using it
to set allreduce_fusion_params.size and before creating norm_out/residual_out so
the kernel receives consistent pointers (refer to variables/functions:
num_tokens, hidden_dim, norm_weight, input, residual, shared_expert_output,
expert_scale_factor, expanded_idx_to_permuted_idx, allreduce_fusion_params,
moefinalize_allreduce_fusion_op).

🧹 Nitpick comments (2)

cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu (1)
741-743: Consider adding rms_gamma check for consistency.

The moefinalize_allreduce_fusion_op validates allreduce_in, expanded_idx_to_permuted_idx, and top_k, but unlike moereduction_allreduce_fusion_op (line 454), it doesn't explicitly check rms_gamma.

Since fused_op (line 139) unconditionally dereferences params.rms_gamma, a null rms_gamma would cause a crash. If rms_gamma is guaranteed non-null by the caller, this is fine—but adding an explicit check would improve defensive robustness and consistency with moereduction_allreduce_fusion_op.
♻️ Suggested fix to add rms_gamma check
-    TLLM_CHECK(params.allreduce_in && params.expanded_idx_to_permuted_idx && params.top_k);
+    TLLM_CHECK(params.allreduce_in && params.expanded_idx_to_permuted_idx && params.top_k && params.rms_gamma);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu`
around lines 741 - 743, The moefinalize_allreduce_fusion_op validation is
missing a null check for params.rms_gamma (moereduction_allreduce_fusion_op
performs this check), yet fused_op later dereferences params.rms_gamma; add a
TLLM_CHECK(params.rms_gamma) (or equivalent null/assert) in
moefinalize_allreduce_fusion_op alongside the existing checks to ensure
rms_gamma is non-null before fused_op uses it.
tests/unittest/_torch/multi_gpu/test_allreduce.py (1)
285-286: Cover the actual residual=None RMS_NORM contract.

The new RMS_NORM case still goes through the common harness that always materializes and passes a residual tensor. That only proves this path ignores residual when present; it never exercises the API change this PR is enabling: RMS_NORM with residual=None.

Also applies to: 301-313
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/multi_gpu/test_allreduce.py` around lines 285 - 286,
Add a test that exercises the RMS_NORM API with residual=None instead of only
using the common harness that always materializes a residual: update the param
set to include a pytest.param for AllReduceFusionOp.RMS_NORM that triggers the
path where no residual tensor is passed (or add a separate test that directly
calls the allreduce harness/function with AllReduceFusionOp.RMS_NORM and
residual=None), and ensure the harness invocation for that case does not create
or pass a dummy residual so the code path for residual=None is actually executed
and validated.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/kernels/customAllReduceKernels.cu`:
- Around line 1980-1990: The wrapper launchResidualRmsNormKernel currently
derives whether to use the residual specialization from
params.fusion_params.residual_buffer, which allows RESIDUAL_RMS_PREPOST_NORM to
run without a residual; change it so that if fusionOp ==
RESIDUAL_RMS_PREPOST_NORM you require params.fusion_params.residual_buffer to be
non-null (emit an error/ASSERT/log and dispatch the <T, true> specialization),
otherwise proceed to choose based on the buffer; reference
launchResidualRmsNormKernel, RESIDUAL_RMS_PREPOST_NORM, and
params.fusion_params.residual_buffer when making the check and error-handling so
the residual-mandatory contract is preserved.

In `@tensorrt_llm/functional.py`:
- Around line 3953-3955: The assertion currently allows
AllReduceFusionOp.RMS_NORM with residual == None but create_allreduce_plugin
always reads all_reduce_params.residual.trt_tensor, causing a crash; update
create_allreduce_plugin to guard access to all_reduce_params.residual.trt_tensor
(only read/use it if all_reduce_params.residual is not None) and handle the
RMS_NORM path without a residual (e.g., skip adding the residual input or pass a
null/empty tensor placeholder as the plugin expects); reference
AllReduceFusionOp, create_allreduce_plugin, and
all_reduce_params.residual.trt_tensor when locating the change.

In `@tests/unittest/_torch/multi_gpu/test_allreduce.py`:
- Around line 692-696: The zip(...) call used to pack arguments for
mpi_pool_executor.map (wrapping run_moe_finalize_no_residual_single_rank and
run_moe_finalize_allreduce_no_residual_op with fc2_output, shared_expert_output,
expanded_idx_to_permuted_idx, scale) should be made strict to avoid silent
truncation and satisfy Ruff B905; update the zip invocation to pass strict=True
so mismatched iterator lengths raise immediately when calling
mpi_pool_executor.map with run_moe_finalize_no_residual_single_rank.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dc020840-5c2c-4541-b944-1a845987a5a2

📥 Commits

Reviewing files that changed from the base of the PR and between 390a7fd and 17f43b9.

📒 Files selected for processing (9)

cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/communicationKernels/allReduceFusionKernels.h
cpp/tensorrt_llm/kernels/communicationKernels/moeAllReduceFusionKernels.cu
cpp/tensorrt_llm/kernels/customAllReduceKernels.cu
cpp/tensorrt_llm/kernels/customAllReduceKernels.h
cpp/tensorrt_llm/thop/allreduceOp.cpp
tensorrt_llm/_torch/distributed/ops.py
tensorrt_llm/functional.py
tests/unittest/_torch/multi_gpu/test_allreduce.py

hyukn

LGTM. Just one nit.

…n, add rms_gamma check

- Guard residual.trt_tensor access in create_allreduce_plugin for RMS_NORM
- Re-add max_token assertion in MoEAllReduce with None guard for optional residual
- Add TLLM_CHECK(params.rms_gamma) in moefinalize_allreduce_fusion_op

Signed-off-by: Fanrong Li <[email protected]>

lfr-0531 · 2026-03-16T11:14:57Z

/bot run

tensorrt-cicd · 2026-03-16T11:21:04Z

PR_Github #39083 [ run ] triggered by Bot. Commit: e239caa Link to invocation

tensorrt-cicd · 2026-03-16T15:17:56Z

PR_Github #39083 [ run ] completed with state SUCCESS. Commit: e239caa
/LLM/main/L0_MergeRequest_PR pipeline #30346 completed with status: 'FAILURE'

lfr-0531 · 2026-03-16T15:57:57Z

/bot run

tensorrt-cicd · 2026-03-16T16:04:00Z

PR_Github #39103 [ run ] triggered by Bot. Commit: e239caa Link to invocation

tensorrt-cicd · 2026-03-16T19:09:54Z

PR_Github #39103 [ run ] completed with state SUCCESS. Commit: e239caa
/LLM/main/L0_MergeRequest_PR pipeline #30364 completed with status: 'FAILURE'

lfr-0531 · 2026-03-17T01:06:58Z

/bot run

tensorrt-cicd · 2026-03-17T01:13:48Z

PR_Github #39148 [ run ] triggered by Bot. Commit: e239caa Link to invocation

tensorrt-cicd · 2026-03-17T02:34:28Z

PR_Github #39148 [ run ] completed with state FAILURE. Commit: e239caa
/LLM/main/L0_MergeRequest_PR pipeline #30408 completed with status: 'FAILURE'

lfr-0531 · 2026-03-17T03:12:31Z

/bot run

tensorrt-cicd · 2026-03-17T03:18:48Z

PR_Github #39171 [ run ] triggered by Bot. Commit: e239caa Link to invocation

tensorrt-cicd · 2026-03-17T04:07:14Z

PR_Github #39171 [ run ] completed with state FAILURE. Commit: e239caa
/LLM/main/L0_MergeRequest_PR pipeline #30426 completed with status: 'FAILURE'

lfr-0531 · 2026-03-17T06:05:25Z

/bot run

tensorrt-cicd · 2026-03-17T06:11:32Z

PR_Github #39201 [ run ] triggered by Bot. Commit: c84adc3 Link to invocation

tensorrt-cicd · 2026-03-17T10:15:36Z

PR_Github #39201 [ run ] completed with state SUCCESS. Commit: c84adc3
/LLM/main/L0_MergeRequest_PR pipeline #30454 completed with status: 'FAILURE'

lfr-0531 · 2026-03-17T11:11:52Z

/bot run

tensorrt-cicd · 2026-03-17T11:18:29Z

PR_Github #39247 [ run ] triggered by Bot. Commit: c84adc3 Link to invocation

tensorrt-cicd · 2026-03-17T12:53:23Z

PR_Github #39247 [ run ] completed with state SUCCESS. Commit: c84adc3
/LLM/main/L0_MergeRequest_PR pipeline #30502 completed with status: 'FAILURE'

lfr-0531 · 2026-03-17T14:54:32Z

/bot run

tensorrt-cicd · 2026-03-17T15:02:36Z

PR_Github #39268 [ run ] triggered by Bot. Commit: c84adc3 Link to invocation

tensorrt-cicd · 2026-03-17T16:58:26Z

PR_Github #39268 [ run ] completed with state SUCCESS. Commit: c84adc3
/LLM/main/L0_MergeRequest_PR pipeline #30517 completed with status: 'FAILURE'

lfr-0531 · 2026-03-18T02:18:50Z

/bot run

tensorrt-cicd · 2026-03-18T02:25:17Z

PR_Github #39361 [ run ] triggered by Bot. Commit: c84adc3 Link to invocation

tensorrt-cicd · 2026-03-18T05:36:23Z

PR_Github #39361 [ run ] completed with state SUCCESS. Commit: c84adc3
/LLM/main/L0_MergeRequest_PR pipeline #30606 completed with status: 'FAILURE'

lfr-0531 · 2026-03-18T06:13:06Z

/bot run

tensorrt-cicd · 2026-03-18T06:19:14Z

PR_Github #39402 [ run ] triggered by Bot. Commit: c84adc3 Link to invocation

tensorrt-cicd · 2026-03-18T07:15:58Z

PR_Github #39402 [ run ] completed with state SUCCESS. Commit: c84adc3
/LLM/main/L0_MergeRequest_PR pipeline #30633 completed with status: 'FAILURE'

lfr-0531 · 2026-03-18T07:53:36Z

/bot run

tensorrt-cicd · 2026-03-18T07:59:07Z

PR_Github #39419 [ run ] triggered by Bot. Commit: c84adc3 Link to invocation

tensorrt-cicd · 2026-03-18T11:01:59Z

PR_Github #39419 [ run ] completed with state SUCCESS. Commit: c84adc3
/LLM/main/L0_MergeRequest_PR pipeline #30649 completed with status: 'FAILURE'

lfr-0531 · 2026-03-18T11:11:21Z

/bot run

lfr-0531 · 2026-03-18T11:27:49Z

/bot run

tensorrt-cicd · 2026-03-18T11:33:30Z

PR_Github #39448 [ run ] triggered by Bot. Commit: a89b9eb Link to invocation

tensorrt-cicd · 2026-03-18T18:23:26Z

PR_Github #39448 [ run ] completed with state SUCCESS. Commit: a89b9eb
/LLM/main/L0_MergeRequest_PR pipeline #30675 completed with status: 'SUCCESS'

CI Report

Link to invocation

github-actions Bot assigned lfr-0531 Mar 13, 2026

lfr-0531 marked this pull request as ready for review March 16, 2026 02:40

lfr-0531 requested a review from a team as a code owner March 16, 2026 02:40

lfr-0531 requested review from HuiGao-NV and yilin-void March 16, 2026 02:40

coderabbitai Bot reviewed Mar 16, 2026

Comment thread cpp/tensorrt_llm/kernels/customAllReduceKernels.cu

Comment thread tensorrt_llm/functional.py

Comment thread tests/unittest/_torch/multi_gpu/test_allreduce.py

lfr-0531 requested a review from hyukn March 16, 2026 08:24

hyukn approved these changes Mar 16, 2026

Comment thread tensorrt_llm/_torch/distributed/ops.py

lfr-0531 added 2 commits March 16, 2026 11:08

Merge branch 'main' into user/fanrongl/add-fused-allreduce-rmsnorm

e239caa

lfr-0531 enabled auto-merge (squash) March 16, 2026 11:15

Merge branch 'main' into user/fanrongl/add-fused-allreduce-rmsnorm

c84adc3

Merge branch 'main' into user/fanrongl/add-fused-allreduce-rmsnorm

a89b9eb

lfr-0531 merged commit 7b08677 into NVIDIA:main Mar 18, 2026
5 checks passed

limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026

[None][feat] Add fused allreduce+RMSNorm op and optional residual in … (

5cd1b31

NVIDIA#12201) Signed-off-by: Fanrong Li <[email protected]>

longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026

[None][feat] Add fused allreduce+RMSNorm op and optional residual in … (

5830caa

NVIDIA#12201) Signed-off-by: Fanrong Li <[email protected]>

b8zhong mentioned this pull request May 4, 2026

[Feat] Sync custom allreduce comm kernels with latest TRT-LLM flashinfer-ai/flashinfer#3223

Open
