
Conversation

@nvchenghaoz (Collaborator) commented Oct 31, 2025

Summary by CodeRabbit

  • Refactor
    • Optimized FP8 quantization implementation to enhance inference performance
    • Improved operator selection logic for better computational efficiency during tensor processing

@coderabbitai bot (Contributor) commented Oct 31, 2025

📝 Walkthrough

The fp8_linear function in the quantization module has been refactored to replace its legacy FP8 implementation with a TensorRT-LLM-based approach. The change introduces per-tensor quantization, dynamic M-based operator selection between CUDA and cuBLAS scaled matrix multiplication kernels, updated shape handling, and removes the @torch.compile decorator.
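
For readers unfamiliar with per-tensor FP8 quantization, here is a minimal plain-PyTorch sketch of the idea. This is illustrative only and is not the TRT-LLM static_quantize_e4m3_per_tensor API; the 448.0 constant is simply the largest finite E4M3 value.

```python
from typing import Tuple

import torch

E4M3_MAX = 448.0  # largest finite magnitude representable by torch.float8_e4m3fn


def per_tensor_quantize_e4m3(x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """Quantize x to FP8 with a single scale so that x ≈ x_fp8.float() * scale."""
    amax = x.abs().max().float().clamp(min=1e-12)
    scale = amax / E4M3_MAX
    x_fp8 = (x.float() / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale


x = torch.randn(4, 64, dtype=torch.bfloat16)
x_fp8, scale = per_tensor_quantize_e4m3(x)
x_dequant = (x_fp8.float() * scale).to(torch.bfloat16)  # round-trip for a sanity check
```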

Changes

Cohort: FP8 Linear Implementation Refactor
File(s): tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
Change Summary: Replaces addmm_float8_unwrapped with the TensorRT-LLM FP8 quantization path; introduces per-tensor quantization via static_quantize_e4m3_per_tensor; adds M-dimension-based operator selection between cuda_scaled_mm and cublas_scaled_mm; updates input/output shape handling with a 2D reshape and restoration; removes the @torch.compile(dynamic=True) decorator; updates the docstring.
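
As a rough illustration of the control flow described in this summary: flatten the input to 2D, pick a kernel based on M, add the bias, and restore the leading dimensions. The stub below stands in for both cuda_scaled_mm and cublas_scaled_mm; the real kernel signatures and the actual M threshold live in quant.py and are not reproduced here.

```python
import torch

SMALL_M_THRESHOLD = 16  # assumed cut-over point between the two kernel paths


def _scaled_mm_reference(x_fp8, w_fp8, x_scale, w_scale):
    # Stand-in for either TRT-LLM kernel: dequantize and matmul in plain PyTorch.
    return (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).t()


def fp8_linear_sketch(x_fp8, w_fp8, x_scale, w_scale, bias=None):
    """Flatten to 2D, pick a kernel based on M, add bias, restore leading dims."""
    out_features = w_fp8.shape[0]
    x_2d = x_fp8.reshape(-1, x_fp8.shape[-1])  # (..., K) -> (M, K)
    m = x_2d.shape[0]

    if m <= SMALL_M_THRESHOLD:
        out_2d = _scaled_mm_reference(x_2d, w_fp8, x_scale, w_scale)  # "cuda_scaled_mm" path
    else:
        out_2d = _scaled_mm_reference(x_2d, w_fp8, x_scale, w_scale)  # "cublas_scaled_mm" path

    if bias is not None:
        out_2d = out_2d + bias  # bias is kept in high precision and added after the matmul

    return out_2d.reshape(*x_fp8.shape[:-1], out_features)  # restore original leading dims
```

Presumably the small-M branch favors a hand-written CUDA kernel for skinny GEMMs while larger M amortizes better on cuBLAS, but the exact cut-over is an implementation detail of quant.py.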

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15–25 minutes

  • The refactor involves multiple logic changes (quantization method, operator selection, shape manipulation) localized to a single function
  • Reviewers should verify correct TensorRT-LLM quantization API usage and the M-based operator selection logic
  • Shape handling changes require careful validation to ensure dimensional correctness through reshaping and restoration
  • Consider testing the dynamic operator selection across different input M dimensions (a parametrized test sketch along these lines is shown below)
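
A hypothetical test sketch for that last point; the op handle, M values, and tolerances are all placeholders, and the real custom-op name should be taken from custom_ops/quant.py.

```python
import pytest
import torch

E4M3_MAX = 448.0


def fp8_linear_under_test(x, w_fp8, w_scale):
    # Placeholder so the sketch is self-contained; swap in the real custom op here.
    return x @ (w_fp8.to(x.dtype) * w_scale.to(x.dtype)).t()


@pytest.mark.parametrize("m", [1, 8, 16, 17, 64, 1024])
def test_fp8_linear_across_m(m):
    torch.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    k, n = 128, 256
    x = torch.randn(m, k, device=device, dtype=torch.bfloat16)
    w = torch.randn(n, k, device=device, dtype=torch.bfloat16)

    w_scale = w.abs().max().float() / E4M3_MAX
    w_fp8 = (w.float() / w_scale).to(torch.float8_e4m3fn)

    # Reference: dequantize the weight and run a plain bf16 matmul.
    ref = x @ (w_fp8.to(torch.bfloat16) * w_scale.to(torch.bfloat16)).t()

    out = fp8_linear_under_test(x, w_fp8, w_scale)
    torch.testing.assert_close(out, ref, rtol=5e-2, atol=5e-2)
```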

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Description Check (⚠️ Warning)
Explanation: The PR description provided by the author contains only "@coderabbitai summary" and is missing all required sections from the description template. The template requires a Description section explaining the issue and solution, a Test Coverage section listing relevant tests, and a PR Checklist with confirmations. While the raw summary indicates substantial changes to the FP8 linear implementation, none of this information is present in the actual PR description, leaving it largely incomplete and unable to provide the necessary context for reviewers.
Resolution: The author should complete the PR description by filling in the Description section with an explanation of the changes and rationale, the Test Coverage section with relevant test cases that validate the FP8 linear kernel changes, and the PR Checklist to confirm compliance with coding guidelines and other requirements. The raw summary should be incorporated into the Description section to give reviewers adequate context for evaluating the changes.
✅ Passed checks (2 passed)
Title Check (✅ Passed): The title "[TRTLLM-8814][feat] AutoDeploy: Use TRTLLM kernels for FP8 linear" follows the required template with a valid JIRA ticket ID, a type indicator, and a clear summary. It accurately describes the main change identified in the raw summary: replacing the FP8 linear implementation with TensorRT-LLM kernels instead of the previous addmm_float8_unwrapped path. The title is concise, specific, and directly related to the primary objective of the changeset.
Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe670af and 2901029.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 7227
File: tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py:269-275
Timestamp: 2025-08-27T16:59:12.325Z
Learning: In FP8 quantized linear layers, bias should be kept in high precision (typically float32) rather than being quantized to FP8 or cast to half precision, as bias is added after the matrix multiplication and high precision bias helps maintain numerical accuracy.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-08-27T16:59:12.325Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 7227
File: tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py:269-275
Timestamp: 2025-08-27T16:59:12.325Z
Learning: In FP8 quantized linear layers, bias should be kept in high precision (typically float32) rather than being quantized to FP8 or cast to half precision, as bias is added after the matrix multiplication and high precision bias helps maintain numerical accuracy.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
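
A tiny plain-PyTorch illustration of that bias-precision point (illustrative only, not code from this PR): quantize the weight to FP8, but keep the bias in float32 and add it after the matmul.

```python
import torch

E4M3_MAX = 448.0

x = torch.randn(8, 64, dtype=torch.bfloat16)
w = torch.randn(32, 64, dtype=torch.bfloat16)
bias_fp32 = torch.randn(32, dtype=torch.float32)  # bias stays in high precision

w_scale = w.abs().max().float() / E4M3_MAX
w_fp8 = (w.float() / w_scale).to(torch.float8_e4m3fn)  # only the weight is quantized

y = x.float() @ (w_fp8.float() * w_scale).t()  # matmul with the dequantized weight
y = (y + bias_fp32).to(torch.bfloat16)         # add the fp32 bias, then downcast the result
```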
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check


@suyoggupta (Collaborator):

Looks like it also fixes #8814. Can we confirm?

github-project-automation bot moved this from Backlog to In review in AutoDeploy Board, Oct 31, 2025
lucaslie previously approved these changes Oct 31, 2025
@lucaslie (Member) commented Oct 31, 2025

> Looks like it also fixes #8814. Can we confirm?

@suyoggupta I have already developed a unit test; let me add it to the PR.

Edit: never mind, I mistakenly thought your comment was about #8811.

@lucaslie force-pushed the chenghao/ad-fp8-linear branch from 42d293a to 8c487c9 on October 31, 2025 04:39
Signed-off-by: Lucas Liebenwein <[email protected]>
@lucaslie (Member) left a comment


galagam added a commit to galagam/TensorRT-LLM that referenced this pull request Nov 2, 2025
@nvchenghaoz (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23427 [ run ] triggered by Bot. Commit: be455fb

@tensorrt-cicd (Collaborator)

PR_Github #23427 [ run ] completed with state FAILURE. Commit: be455fb
/LLM/main/L0_MergeRequest_PR pipeline #17642 completed with status: 'FAILURE'

@nvchenghaoz (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23529 [ run ] triggered by Bot. Commit: be455fb

@nvchenghaoz self-assigned this on Nov 4, 2025
@tensorrt-cicd (Collaborator)

PR_Github #23529 [ run ] completed with state FAILURE. Commit: be455fb
/LLM/main/L0_MergeRequest_PR pipeline #17709 completed with status: 'FAILURE'

@nvchenghaoz (Collaborator, Author)

/bot run

1 similar comment
@nvchenghaoz (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23678 [ run ] triggered by Bot. Commit: b532f66

@tensorrt-cicd (Collaborator)

PR_Github #23678 [ run ] completed with state SUCCESS. Commit: b532f66
/LLM/main/L0_MergeRequest_PR pipeline #17814 completed with status: 'SUCCESS'

@nvchenghaoz merged commit ddf2d01 into NVIDIA:main on Nov 6, 2025
5 checks passed
github-project-automation bot moved this from In review to Done in AutoDeploy Board, Nov 6, 2025


Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: investigate fp8 gemms that do not have pad to 16 requirement

4 participants