
Conversation

@nvchenghaoz (Collaborator) commented Oct 31, 2025

Summary by CodeRabbit

  • Refactor
    • Optimized FP8 quantization implementation to enhance inference performance
    • Improved operator selection logic for better computational efficiency during tensor processing

@coderabbitai bot (Contributor) commented Oct 31, 2025

📝 Walkthrough

The fp8_linear function in the quantization module has been refactored to replace its legacy FP8 implementation with a TensorRT-LLM-based approach. The change introduces per-tensor quantization, dynamic M-based operator selection between CUDA and cuBLAS scaled matrix multiplication kernels, updated shape handling, and removes the @torch.compile decorator.
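
For readers unfamiliar with per-tensor FP8 quantization, here is a minimal plain-PyTorch sketch of the idea. This is illustrative only and is not the TRT-LLM static_quantize_e4m3_per_tensor API; the 448.0 constant is simply the largest finite E4M3 value.

```python
from typing import Tuple

import torch

E4M3_MAX = 448.0  # largest finite magnitude representable by torch.float8_e4m3fn


def per_tensor_quantize_e4m3(x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """Quantize x to FP8 with a single scale so that x ≈ x_fp8.float() * scale."""
    amax = x.abs().max().float().clamp(min=1e-12)
    scale = amax / E4M3_MAX
    x_fp8 = (x.float() / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale


x = torch.randn(4, 64, dtype=torch.bfloat16)
x_fp8, scale = per_tensor_quantize_e4m3(x)
x_dequant = (x_fp8.float() * scale).to(torch.bfloat16)  # round-trip for a sanity check
```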

Changes

Cohort: FP8 Linear Implementation Refactor
File(s): tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
Change Summary: Replaces addmm_float8_unwrapped with the TensorRT-LLM FP8 quantization path; introduces per-tensor quantization via static_quantize_e4m3_per_tensor; adds M-dimension-based operator selection between cuda_scaled_mm and cublas_scaled_mm; updates input/output shape handling with a 2D reshape and restoration; removes the @torch.compile(dynamic=True) decorator; updates the docstring.
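
As a rough illustration of the control flow described in this summary: flatten the input to 2D, pick a kernel based on M, add the bias, and restore the leading dimensions. The stub below stands in for both cuda_scaled_mm and cublas_scaled_mm; the real kernel signatures and the actual M threshold live in quant.py and are not reproduced here.

```python
import torch

SMALL_M_THRESHOLD = 16  # assumed cut-over point between the two kernel paths


def _scaled_mm_reference(x_fp8, w_fp8, x_scale, w_scale):
    # Stand-in for either TRT-LLM kernel: dequantize and matmul in plain PyTorch.
    return (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).t()


def fp8_linear_sketch(x_fp8, w_fp8, x_scale, w_scale, bias=None):
    """Flatten to 2D, pick a kernel based on M, add bias, restore leading dims."""
    out_features = w_fp8.shape[0]
    x_2d = x_fp8.reshape(-1, x_fp8.shape[-1])  # (..., K) -> (M, K)
    m = x_2d.shape[0]

    if m <= SMALL_M_THRESHOLD:
        out_2d = _scaled_mm_reference(x_2d, w_fp8, x_scale, w_scale)  # "cuda_scaled_mm" path
    else:
        out_2d = _scaled_mm_reference(x_2d, w_fp8, x_scale, w_scale)  # "cublas_scaled_mm" path

    if bias is not None:
        out_2d = out_2d + bias  # bias is kept in high precision and added after the matmul

    return out_2d.reshape(*x_fp8.shape[:-1], out_features)  # restore original leading dims
```

Presumably the small-M branch favors a hand-written CUDA kernel for skinny GEMMs while larger M amortizes better on cuBLAS, but the exact cut-over is an implementation detail of quant.py.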

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15–25 minutes

  • The refactor involves multiple logic changes (quantization method, operator selection, shape manipulation) localized to a single function
  • Reviewers should verify correct TensorRT-LLM quantization API usage and the M-based operator selection logic
  • Shape handling changes require careful validation to ensure dimensional correctness through reshaping and restoration
  • Consider testing the dynamic operator selection across different input M dimensions (a parametrized test sketch along these lines is shown below)
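
A hypothetical test sketch for that last point; the op handle, M values, and tolerances are all placeholders, and the real custom-op name should be taken from custom_ops/quant.py.

```python
import pytest
import torch

E4M3_MAX = 448.0


def fp8_linear_under_test(x, w_fp8, w_scale):
    # Placeholder so the sketch is self-contained; swap in the real custom op here.
    return x @ (w_fp8.to(x.dtype) * w_scale.to(x.dtype)).t()


@pytest.mark.parametrize("m", [1, 8, 16, 17, 64, 1024])
def test_fp8_linear_across_m(m):
    torch.manual_seed(0)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    k, n = 128, 256
    x = torch.randn(m, k, device=device, dtype=torch.bfloat16)
    w = torch.randn(n, k, device=device, dtype=torch.bfloat16)

    w_scale = w.abs().max().float() / E4M3_MAX
    w_fp8 = (w.float() / w_scale).to(torch.float8_e4m3fn)

    # Reference: dequantize the weight and run a plain bf16 matmul.
    ref = x @ (w_fp8.to(torch.bfloat16) * w_scale.to(torch.bfloat16)).t()

    out = fp8_linear_under_test(x, w_fp8, w_scale)
    torch.testing.assert_close(out, ref, rtol=5e-2, atol=5e-2)
```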

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Description Check (⚠️ Warning)
Explanation: The PR description provided by the author contains only "@coderabbitai summary" and is missing all required sections from the description template. The template requires a Description section explaining the issue and solution, a Test Coverage section listing relevant tests, and a PR Checklist with confirmations. While the raw summary indicates substantial changes to the FP8 linear implementation, none of this information is present in the actual PR description, leaving it largely incomplete and unable to provide the necessary context for reviewers.
Resolution: The author should complete the PR description by filling in the Description section with an explanation of the changes and rationale, the Test Coverage section with relevant test cases that validate the FP8 linear kernel changes, and the PR Checklist to confirm compliance with coding guidelines and other requirements. The raw summary should be incorporated into the Description section to give reviewers adequate context for evaluating the changes.
✅ Passed checks (2 passed)
Title Check (✅ Passed): The title "[TRTLLM-8814][feat] AutoDeploy: Use TRTLLM kernels for FP8 linear" follows the required template with a valid JIRA ticket ID, a type indicator, and a clear summary. It accurately describes the main change identified in the raw summary: replacing the FP8 linear implementation with TensorRT-LLM kernels instead of the previous addmm_float8_unwrapped path. The title is concise, specific, and directly related to the primary objective of the changeset.
Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe670af and 2901029.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 7227
File: tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py:269-275
Timestamp: 2025-08-27T16:59:12.325Z
Learning: In FP8 quantized linear layers, bias should be kept in high precision (typically float32) rather than being quantized to FP8 or cast to half precision, as bias is added after the matrix multiplication and high precision bias helps maintain numerical accuracy.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
📚 Learning: 2025-08-27T16:59:12.325Z
Learnt from: Fridah-nv
Repo: NVIDIA/TensorRT-LLM PR: 7227
File: tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py:269-275
Timestamp: 2025-08-27T16:59:12.325Z
Learning: In FP8 quantized linear layers, bias should be kept in high precision (typically float32) rather than being quantized to FP8 or cast to half precision, as bias is added after the matrix multiplication and high precision bias helps maintain numerical accuracy.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py
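
A tiny plain-PyTorch illustration of that bias-precision point (illustrative only, not code from this PR): quantize the weight to FP8, but keep the bias in float32 and add it after the matmul.

```python
import torch

E4M3_MAX = 448.0

x = torch.randn(8, 64, dtype=torch.bfloat16)
w = torch.randn(32, 64, dtype=torch.bfloat16)
bias_fp32 = torch.randn(32, dtype=torch.float32)  # bias stays in high precision

w_scale = w.abs().max().float() / E4M3_MAX
w_fp8 = (w.float() / w_scale).to(torch.float8_e4m3fn)  # only the weight is quantized

y = x.float() @ (w_fp8.float() * w_scale).t()  # matmul with the dequantized weight
y = (y + bias_fp32).to(torch.bfloat16)         # add the fp32 bias, then downcast the result
```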
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check


@suyoggupta (Collaborator):

Looks like it also fixes #8814. Can we confirm?

github-project-automation bot moved this from Backlog to In review in AutoDeploy Board, Oct 31, 2025
lucaslie previously approved these changes Oct 31, 2025
@lucaslie (Member) commented Oct 31, 2025

> Looks like it also fixes #8814. Can we confirm?

@suyoggupta I have already developed a unit test; let me add it to the PR.

Edit: never mind, I mistakenly thought your comment was about #8811.

@lucaslie force-pushed the chenghao/ad-fp8-linear branch from 42d293a to 8c487c9 on October 31, 2025 04:39
Signed-off-by: Lucas Liebenwein <[email protected]>
@lucaslie (Member) left a comment


galagam added a commit to galagam/TensorRT-LLM that referenced this pull request Nov 2, 2025
@nvchenghaoz (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23427 [ run ] triggered by Bot. Commit: be455fb

@tensorrt-cicd (Collaborator)

PR_Github #23427 [ run ] completed with state FAILURE. Commit: be455fb
/LLM/main/L0_MergeRequest_PR pipeline #17642 completed with status: 'FAILURE'

@nvchenghaoz (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23529 [ run ] triggered by Bot. Commit: be455fb

@nvchenghaoz self-assigned this on Nov 4, 2025
@tensorrt-cicd (Collaborator)

PR_Github #23529 [ run ] completed with state FAILURE. Commit: be455fb
/LLM/main/L0_MergeRequest_PR pipeline #17709 completed with status: 'FAILURE'

@nvchenghaoz (Collaborator, Author)

/bot run

1 similar comment
@nvchenghaoz (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23678 [ run ] triggered by Bot. Commit: b532f66

@tensorrt-cicd (Collaborator)

PR_Github #23678 [ run ] completed with state SUCCESS. Commit: b532f66
/LLM/main/L0_MergeRequest_PR pipeline #17814 completed with status: 'SUCCESS'

@nvchenghaoz merged commit ddf2d01 into NVIDIA:main on Nov 6, 2025
5 checks passed
github-project-automation bot moved this from In review to Done in AutoDeploy Board, Nov 6, 2025


Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: investigate fp8 gemms that do not have pad to 16 requirement

4 participants