Conversation

@nvchenghaoz nvchenghaoz (Collaborator) commented Nov 5, 2025

Summary by CodeRabbit

  • New Features
    • Added latent space routing support to NemotronH Mixture of Experts architectures, enabling enhanced flexibility for expert selection while maintaining compatibility with existing direct routing approaches.

Signed-off-by: Chenghao Zhang <[email protected]>
@coderabbitai coderabbitai bot (Contributor) commented Nov 5, 2025

📝 Walkthrough

Walkthrough

Modified the NemotronH MOE forward pass to support latent MOE routing. The implementation now detects the presence of latent projection layers and automatically routes through latent space when available, projecting inputs to latent space before routing and back to original space afterward.

Changes

Cohort / File(s): Latent MOE Support (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py)
Summary: Added latent MOE support detection and routing logic to _nemotron_h_moe_forward. The implementation checks for fc1_latent_proj and fc2_latent_proj layers, projects the input to latent space, routes through the MOE, and projects back to the original space. The docstring was updated to document the dual-architecture support. A conceptual sketch of this flow is shown below.
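
A small, self-contained sketch of the flow described above (this is not the TensorRT-LLM patch itself: apart from the fc1_latent_proj/fc2_latent_proj names, the relu2 activation, and the top-k router outputs, every class, parameter, and shape choice below is an assumption for illustration, and the explicit expert loop stands in for the real dispatch through torch.ops.auto_deploy.torch_moe):

from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class NemotronHMoESketch(nn.Module):
    """Illustrative dual-path MOE: latent routing when latent_dim is given, direct otherwise."""

    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2,
                 latent_dim: Optional[int] = None):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, num_experts)
        if latent_dim is not None:
            # The presence of these two layers is what the patch keys off of.
            self.fc1_latent_proj = nn.Linear(hidden_dim, latent_dim)
            self.fc2_latent_proj = nn.Linear(latent_dim, hidden_dim)
        expert_dim = latent_dim if latent_dim is not None else hidden_dim
        self.experts = nn.ModuleList(
            [nn.Linear(expert_dim, expert_dim) for _ in range(num_experts)]
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        orig_shape = hidden_states.shape
        flat = hidden_states.reshape(-1, orig_shape[-1])

        # Per-token top-k routing; the real NemotronH gate already returns
        # indices and weights in the shape torch_moe expects.
        topk_weights, topk_indices = torch.topk(
            F.softmax(self.router(flat), dim=-1), self.top_k, dim=-1
        )

        # Latent MOE detection mirrors the patch: take the latent path only
        # when the projection layers exist on the module.
        use_latent = getattr(self, "fc1_latent_proj", None) is not None
        x = self.fc1_latent_proj(flat) if use_latent else flat

        # Dense-for-clarity expert mix with relu2 (squared ReLU) activation;
        # the real patch dispatches through torch.ops.auto_deploy.torch_moe.
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_indices[:, k] == e
                if mask.any():
                    out[mask] += topk_weights[mask, k, None] * F.relu(expert(x[mask])).pow(2)

        if use_latent:
            # Project back to the original hidden size after expert processing.
            out = self.fc2_latent_proj(out)
        return out.reshape(orig_shape)

Both variants accept and return a [batch, seq_len, hidden_dim] tensor; only the latent variant wraps the experts in the fc1_latent_proj/fc2_latent_proj round trip.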

Sequence Diagram

sequenceDiagram
    participant Forward as _nemotron_h_moe_forward
    participant Detect as Detect Latent Projections
    participant Route as MOE Routing
    
    Forward->>Detect: Check for fc1_latent_proj/fc2_latent_proj
    alt Latent MOE Present
        Detect-->>Forward: Latent projections found
        Forward->>Forward: Project input to latent space
        Forward->>Route: Route in latent space via torch_moe
        Route-->>Forward: MOE output (latent space)
        Forward->>Forward: Project back to original space
    else Direct MOE
        Detect-->>Forward: No latent projections
        Forward->>Route: Route directly via torch_moe
        Route-->>Forward: MOE output (original space)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus on the conditional logic that detects and handles latent projection layers
  • Verify the projection operations (forward and backward) maintain correct tensor shapes
  • Review docstring accuracy for both latent and direct MOE pathways
  • Confirm compatibility with existing MOE routing behavior when latent projections are absent

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Description check: ⚠️ Warning. The PR description is incomplete, containing only the '@coderabbitai summary' placeholder text without an actual description, test coverage details, or checklist completion. Resolution: provide a complete description following the template: explain what the Latent MOE support does, list relevant tests, and complete the PR checklist items.
✅ Passed checks (1 passed)
  • Title check: ✅ Passed. The title clearly and specifically describes the main change: adding Latent MOE support for Nemotron in AutoDeploy, which aligns with the code changes.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (1)

93-96: Consider enhancing the docstring for clarity.

The docstring accurately describes the new capability but could benefit from additional detail about the latent MOE flow (projection to latent space, routing, projection back) to help future maintainers understand the implementation.

Example enhancement:

 def _nemotron_h_moe_forward(self, hidden_states: torch.Tensor):
     """
     Uses NemotronH router (returns indices, weights) and dispatches through auto_deploy::torch_moe
-    with act_fn='relu2'. Handles both latent MOE and direct MOE architectures.
+    with act_fn='relu2'. Handles both latent MOE and direct MOE architectures:
+    - Latent MOE: Projects input to latent space via fc1_latent_proj before routing,
+      then projects back via fc2_latent_proj after expert processing.
+    - Direct MOE: Routes directly through experts without latent space projection.
     """
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb7f983 and 08efa7c.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (1)

103-125: Code logic is correct; dimension concerns apply to model instantiation, not this patch.

The patch file doesn't define NemotronHMOE—it only patches its forward method at runtime. The original model's initialization determines whether fc1_latent_proj, fc2_latent_proj, and experts exist and what dimensions they have. The patch correctly detects and conditionally applies latent projections. If dimensions are incompatible, PyTorch will raise a shape error during execution, making manual verification redundant. The code handles both latent and non-latent MOE paths correctly and requires no changes.
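
A minimal illustration of the runtime detection this comment describes, assuming only the projection-layer names from the walkthrough (the rest of the patched forward body is not shown):

# Inside the patched forward: take the latent path only when both projection
# layers were created by the model's own initialization.
has_latent = (
    getattr(self, "fc1_latent_proj", None) is not None
    and getattr(self, "fc2_latent_proj", None) is not None
)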

@suyoggupta suyoggupta requested a review from Naveassaf November 5, 2025 21:45
@suyoggupta suyoggupta (Collaborator) left a comment

That was fairly quick! 👏

@github-project-automation github-project-automation bot moved this from Backlog to In review in AutoDeploy Board Nov 5, 2025
@nvchenghaoz nvchenghaoz enabled auto-merge (squash) November 6, 2025 17:57
@nvchenghaoz nvchenghaoz (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #23761 [ run ] triggered by Bot. Commit: 08efa7c

@tensorrt-cicd (Collaborator) commented:

PR_Github #23761 [ run ] completed with state SUCCESS. Commit: 08efa7c
/LLM/main/L0_MergeRequest_PR pipeline #17884 completed with status: 'SUCCESS'

@nvchenghaoz nvchenghaoz merged commit 1a78e7a into NVIDIA:main Nov 6, 2025
7 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in AutoDeploy Board Nov 6, 2025
@nvchenghaoz nvchenghaoz self-assigned this Nov 6, 2025