-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[None][feat] AutoDeploy: Support Latent MOE for Nemotron #8955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[None][feat] AutoDeploy: Support Latent MOE for Nemotron #8955
Conversation
Signed-off-by: Chenghao Zhang <[email protected]>
📝 WalkthroughWalkthroughModified the NemotronH MOE forward pass to support latent MOE routing. The implementation now detects the presence of latent projection layers and automatically routes through latent space when available, projecting inputs to latent space before routing and back to original space afterward. Changes
Sequence DiagramsequenceDiagram
participant Forward as _nemotron_h_moe_forward
participant Detect as Detect Latent Projections
participant Route as MOE Routing
Forward->>Detect: Check for fc1_latent_proj/fc2_latent_proj
alt Latent MOE Present
Detect-->>Forward: Latent projections found
Forward->>Forward: Project input to latent space
Forward->>Route: Route in latent space via torch_moe
Route-->>Forward: MOE output (latent space)
Forward->>Forward: Project back to original space
else Direct MOE
Detect-->>Forward: No latent projections
Forward->>Route: Route directly via torch_moe
Route-->>Forward: MOE output (original space)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (1)
93-96: Consider enhancing the docstring for clarity.The docstring accurately describes the new capability but could benefit from additional detail about the latent MOE flow (projection to latent space, routing, projection back) to help future maintainers understand the implementation.
Example enhancement:
def _nemotron_h_moe_forward(self, hidden_states: torch.Tensor): """ Uses NemotronH router (returns indices, weights) and dispatches through auto_deploy::torch_moe - with act_fn='relu2'. Handles both latent MOE and direct MOE architectures. + with act_fn='relu2'. Handles both latent MOE and direct MOE architectures: + - Latent MOE: Projects input to latent space via fc1_latent_proj before routing, + then projects back via fc2_latent_proj after expert processing. + - Direct MOE: Routes directly through experts without latent space projection. """
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py (1)
103-125: Code logic is correct; dimension concerns apply to model instantiation, not this patch.The patch file doesn't define NemotronHMOE—it only patches its forward method at runtime. The original model's initialization determines whether
fc1_latent_proj,fc2_latent_proj, and experts exist and what dimensions they have. The patch correctly detects and conditionally applies latent projections. If dimensions are incompatible, PyTorch will raise a shape error during execution, making manual verification redundant. The code handles both latent and non-latent MOE paths correctly and requires no changes.
suyoggupta
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was fairly quick! 👏
|
/bot run |
|
PR_Github #23761 [ run ] triggered by Bot. Commit: |
|
PR_Github #23761 [ run ] completed with state |
Summary by CodeRabbit