[None][feat] AutoDeploy: Support Latent MOE for Nemotron #8955

nvchenghaoz · 2025-11-05T21:32:23Z

Summary by CodeRabbit

New Features
- Added latent space routing support to NemotronH Mixture of Experts architectures, enabling enhanced flexibility for expert selection while maintaining compatibility with existing direct routing approaches.

Signed-off-by: Chenghao Zhang <[email protected]>

coderabbitai · 2025-11-05T21:36:17Z

📝 Walkthrough

Walkthrough

Modified the NemotronH MOE forward pass to support latent MOE routing. The implementation now detects the presence of latent projection layers and automatically routes through latent space when available, projecting inputs to latent space before routing and back to original space afterward.

Changes

Cohort / File(s)	Summary
Latent MOE Support `tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py`	Added latent MOE support detection and routing logic to `_nemotron_h_moe_forward`. Implementation checks for `fc1_latent_proj` and `fc2_latent_proj` layers, projects input to latent space, routes through MOE, and projects back to original space. Docstring updated to document the dual-architecture support.

Sequence Diagram

sequenceDiagram
    participant Forward as _nemotron_h_moe_forward
    participant Detect as Detect Latent Projections
    participant Route as MOE Routing
    
    Forward->>Detect: Check for fc1_latent_proj/fc2_latent_proj
    alt Latent MOE Present
        Detect-->>Forward: Latent projections found
        Forward->>Forward: Project input to latent space
        Forward->>Route: Route in latent space via torch_moe
        Route-->>Forward: MOE output (latent space)
        Forward->>Forward: Project back to original space
    else Direct MOE
        Detect-->>Forward: No latent projections
        Forward->>Route: Route directly via torch_moe
        Route-->>Forward: MOE output (original space)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Focus on the conditional logic that detects and handles latent projection layers
Verify the projection operations (forward and backward) maintain correct tensor shapes
Review docstring accuracy for both latent and direct MOE pathways
Confirm compatibility with existing MOE routing behavior when latent projections are absent

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is incomplete, containing only '@coderabbitai summary' placeholder text without actual description, test coverage details, or checklist completion.	Provide a complete description following the template: explain what the Latent MOE support does, list relevant tests, and complete the PR checklist items.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding Latent MOE support for Nemotron in AutoDeploy, which aligns with the code changes.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (1)
93-96: Consider enhancing the docstring for clarity.

The docstring accurately describes the new capability but could benefit from additional detail about the latent MOE flow (projection to latent space, routing, projection back) to help future maintainers understand the implementation.

Example enhancement:
 def _nemotron_h_moe_forward(self, hidden_states: torch.Tensor):
     """
     Uses NemotronH router (returns indices, weights) and dispatches through auto_deploy::torch_moe
-    with act_fn='relu2'. Handles both latent MOE and direct MOE architectures.
+    with act_fn='relu2'. Handles both latent MOE and direct MOE architectures:
+    - Latent MOE: Projects input to latent space via fc1_latent_proj before routing,
+      then projects back via fc2_latent_proj after expert processing.
+    - Direct MOE: Routes directly through experts without latent space projection.
     """

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb7f983 and 08efa7c.

📒 Files selected for processing (1)

tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py

🧠 Learnings (4)

📓 Common learnings

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.

📚 Learning: 2025-10-20T17:07:18.745Z

Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.

Applied to files:

tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py

📚 Learning: 2025-08-14T23:23:27.449Z

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py

📚 Learning: 2025-08-21T02:39:12.009Z

Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (1)

tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (1)

103-125: Code logic is correct; dimension concerns apply to model instantiation, not this patch.

The patch file doesn't define NemotronHMOE—it only patches its forward method at runtime. The original model's initialization determines whether fc1_latent_proj, fc2_latent_proj, and experts exist and what dimensions they have. The patch correctly detects and conditionally applies latent projections. If dimensions are incompatible, PyTorch will raise a shape error during execution, making manual verification redundant. The code handles both latent and non-latent MOE paths correctly and requires no changes.

suyoggupta

That was fairly quick! 👏

nvchenghaoz · 2025-11-06T17:57:56Z

/bot run

tensorrt-cicd · 2025-11-06T18:04:19Z

PR_Github #23761 [ run ] triggered by Bot. Commit: 08efa7c

tensorrt-cicd · 2025-11-06T20:40:17Z

PR_Github #23761 [ run ] completed with state SUCCESS. Commit: 08efa7c
/LLM/main/L0_MergeRequest_PR pipeline #17884 completed with status: 'SUCCESS'

Support Latent MOE for Nemotron

08efa7c

Signed-off-by: Chenghao Zhang <[email protected]>

nvchenghaoz requested review from Fridah-nv and suyoggupta November 5, 2025 21:32

nvchenghaoz requested a review from a team as a code owner November 5, 2025 21:32

github-project-automation bot added this to AutoDeploy Board Nov 5, 2025

github-project-automation bot moved this to Backlog in AutoDeploy Board Nov 5, 2025

coderabbitai bot reviewed Nov 5, 2025

View reviewed changes

suyoggupta requested a review from Naveassaf November 5, 2025 21:45

suyoggupta approved these changes Nov 5, 2025

View reviewed changes

github-project-automation bot moved this from Backlog to In review in AutoDeploy Board Nov 5, 2025

suyoggupta requested review from Wanli-Jiang and tijyojwad November 5, 2025 21:45

lucaslie approved these changes Nov 5, 2025

View reviewed changes

nvchenghaoz enabled auto-merge (squash) November 6, 2025 17:57

nvchenghaoz merged commit 1a78e7a into NVIDIA:main Nov 6, 2025
7 checks passed

github-project-automation bot moved this from In review to Done in AutoDeploy Board Nov 6, 2025

nvchenghaoz self-assigned this Nov 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[None][feat] AutoDeploy: Support Latent MOE for Nemotron #8955

[None][feat] AutoDeploy: Support Latent MOE for Nemotron #8955

Uh oh!

nvchenghaoz commented Nov 5, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Nov 5, 2025

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Uh oh!

suyoggupta left a comment

Uh oh!

nvchenghaoz commented Nov 6, 2025

Uh oh!

tensorrt-cicd commented Nov 6, 2025

Uh oh!

tensorrt-cicd commented Nov 6, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[None][feat] AutoDeploy: Support Latent MOE for Nemotron #8955

[None][feat] AutoDeploy: Support Latent MOE for Nemotron #8955

Uh oh!

Conversation

nvchenghaoz commented Nov 5, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 5, 2025

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

suyoggupta left a comment

Choose a reason for hiding this comment

Uh oh!

nvchenghaoz commented Nov 6, 2025

Uh oh!

tensorrt-cicd commented Nov 6, 2025

Uh oh!

tensorrt-cicd commented Nov 6, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nvchenghaoz commented Nov 5, 2025 •

edited by coderabbitai bot

Loading