[None][feat] Add FP4 residual quantization kernel without channel reo…#13117

Merged
Tracin merged 6 commits into NVIDIA:main from Tracin:fp4_quant_residual
Apr 24, 2026


Tracin commented Apr 16, 2026

Add nvfp4_quantize_residual_with_block_size kernel that performs ARCQuant-style FP4 quantization with residual, using the quantize_with_block_size loop pattern (grid-stride rows, block-stride columns). Unlike the existing reorder kernel, this version reads directly from global memory without channel reordering, eliminating shared memory and __syncthreads() overhead.
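The loop pattern named above can be illustrated with a small host-side Python simulation. This is only a sketch under stated assumptions: `simulate_launch`, `grid_dim`, and `block_dim` are hypothetical names that do not appear in the PR, and the real kernel additionally uses vectorized loads for each 16-element group. The point is that striding rows by the grid size and column groups by the block size visits every (row, group) pair exactly once, whatever the launch shape.

```python
from collections import Counter


def simulate_launch(num_rows: int, num_col_groups: int, grid_dim: int, block_dim: int) -> Counter:
    """Count how often each (row, group) is visited by a grid-stride/block-stride loop.

    Hypothetical host-side model, not the kernel source: block `b` handles rows
    b, b + grid_dim, ...; thread `t` handles column groups t, t + block_dim, ...
    """
    visits: Counter = Counter()
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Grid-stride loop over rows.
            for row in range(block_idx, num_rows, grid_dim):
                # Block-stride loop over the column groups of that row.
                for group in range(thread_idx, num_col_groups, block_dim):
                    visits[(row, group)] += 1
    return visits


visits = simulate_launch(num_rows=10, num_col_groups=37, grid_dim=4, block_dim=8)
all_visited_once = len(visits) == 10 * 37 and set(visits.values()) == {1}
print("every (row, group) visited exactly once:", all_visited_once)
```

Because each thread only touches its own groups, a loop of this shape needs no cross-thread communication, which is consistent with the description's claim that shared memory and `__syncthreads()` can be dropped.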

Key components:

  • cvt_group_to_fp4_residual: reusable device helper (analogous to cvt_warp_fp16_to_fp4) that quantizes a 16-element group to FP4 with optional residual quantization
  • Vectorized 128-bit loads, float32 precision throughout, PDL support
  • Reuses get_sf_out_offset_128x4, fp32_vec_to_e2m1, reciprocal_approximate_ftz from quantization.cuh
  • Torch op fp4_quantize_with_residual exposed for Python access
  • Correctness test (GEMM cosine similarity) and perf benchmarks
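To make the residual idea concrete, here is a minimal pure-Python reference sketch of quantizing one 16-element group to FP4 (E2M1) and then quantizing the leftover error as a second FP4 group. It is not the kernel's code: `quantize_group` and `quantize_group_with_residual` are hypothetical names, the scale is kept as an ordinary Python float (the kernel's global scale and its scale-factor storage format are not modeled), and arithmetic runs in float64 rather than float32.

```python
from typing import List, Tuple

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def quantize_group(values: List[float]) -> Tuple[List[float], float]:
    """Return (dequantized values, scale) for one group, using a max-based scale."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude onto the largest E2M1 value (6.0).
    scale = amax / E2M1_VALUES[-1] if amax > 0.0 else 1.0
    dequant = []
    for v in values:
        scaled = abs(v) / scale
        nearest = min(E2M1_VALUES, key=lambda q: abs(scaled - q))
        dequant.append((nearest if v >= 0.0 else -nearest) * scale)
    return dequant, scale


def quantize_group_with_residual(values: List[float]) -> Tuple[List[float], List[float]]:
    """Quantize a group, then quantize the error left over by the first pass."""
    main, _ = quantize_group(values)
    residual = [v - m for v, m in zip(values, main)]
    residual_q, _ = quantize_group(residual)
    return main, residual_q


group = [(i - 7.5) * 0.37 for i in range(GROUP_SIZE)]
main, residual_q = quantize_group_with_residual(group)
err_main = sum((v - m) ** 2 for v, m in zip(group, main))
err_both = sum((v - (m + r)) ** 2 for v, m, r in zip(group, main, residual_q))
print(f"squared error without residual: {err_main:.6f}, with residual: {err_both:.6f}")
```

Since 0 is itself representable in E2M1, rounding the residual to the nearest grid point can never be worse than ignoring it, so the combined reconstruction is at least as accurate as the main pass alone.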

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced fp4_quantize_with_residual operation for optimized FP4 quantization without channel reordering
    • Supports bfloat16 and fp8 input types with flexible activation and weight quantization modes
  • Tests

    • Added comprehensive correctness and performance validation tests for the new quantization operation
    • Includes benchmarking utilities to measure CUDA kernel performance

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Tracin commented Apr 16, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43721 [ run ] triggered by Bot. Commit: aa78617 Link to invocation

@coderabbitai
Contributor

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

This PR adds a new ARCQuant FP4 quantization kernel nvfp4_quantize_residual_with_block_size with block-based residual computation support. The implementation includes device kernel logic, host launcher, PyTorch bindings, and comprehensive correctness and performance tests.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Kernel implementation | cpp/tensorrt_llm/kernels/arcquantFP4.cu | Added PTX-based FP4 dequantization helper e2m1_uint32_to_float8, new device function cvt_group_to_fp4_residual for quantizing and dequantizing FP4 groups with residual computation, global kernel nvfp4_quantize_residual_with_block_size using grid-stride row/block-stride group loops with vectorized loads and conditional residual quantization, host launcher wrapper, and 3 explicit template instantiations for (bf16/fp8_e4m3, GROUP_SIZE=16, ACT/WEIGHT). Removed local reciprocal_approximate_ftz helper (moved to quantization.cuh). |
| Kernel interface | cpp/tensorrt_llm/kernels/arcquantFP4.h | Added templated function declaration run_nvfp4_quantize_residual_with_block_size exposing the new quantization kernel interface to host code. |
| PyTorch binding | cpp/tensorrt_llm/thop/fp4Quantize.h, cpp/tensorrt_llm/thop/fp4Quantize.cpp | Added public C++ API fp4_quantize_with_residual with dtype validation (bf16/fp8 inputs, float32 scales), shape constraints (KE/KQ/total alignment), output tensor allocation via computeSwizzledLayoutSFSize, dtype-specific kernel dispatch with ArcQuantType::ACT/WEIGHT selection, and PyTorch operator registration with CUDA implementation mapping. |
| Tests | tests/unittest/_torch/thop/parallel/test_arcquant_fp4.py | Added benchmarking helpers (_benchmark, _effective_bytes, _gb_per_sec), correctness test test_arcquant_fp4_with_residual validating end-to-end GEMM cosine similarity, and two performance comparison tests (test_arcquant_fp4_perf_no_residual, test_arcquant_fp4_perf_with_residual) comparing old vs new quantization methods. |

Sequence Diagram

sequenceDiagram
    participant PyTorch as PyTorch Operator
    participant HostLauncher as Host Launcher
    participant DeviceKernel as Device Kernel
    participant Memory as GPU Memory

    PyTorch->>HostLauncher: fp4_quantize_with_residual(X, input_scale, KE, is_act)
    HostLauncher->>HostLauncher: Validate dtypes & shapes
    HostLauncher->>Memory: Allocate output buffers (q_out, q_scale)
    HostLauncher->>DeviceKernel: Launch nvfp4_quantize_residual_with_block_size
    
    DeviceKernel->>Memory: Load GROUP_SIZE elements (float4 vectorized)
    DeviceKernel->>DeviceKernel: Convert to float32 with global_scale
    DeviceKernel->>DeviceKernel: Compute max-based scale factor
    DeviceKernel->>DeviceKernel: Quantize to FP4 e2m1
    DeviceKernel->>DeviceKernel: Dequantize back (e2m1_uint32_to_float8)
    DeviceKernel->>DeviceKernel: Compute residuals
    alt if residual output enabled
        DeviceKernel->>DeviceKernel: Conditionally quantize residuals
    end
    
    DeviceKernel->>Memory: Write packed outputs (float2 or float4)
    HostLauncher->>PyTorch: Return (QX, SFX) tensors

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • binghanc
  • nv-guomingz
  • yuxianq
  • liji-nv
  • QiJune
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 55.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title is partially related to the changeset: it mentions FP4 residual quantization kernel without channel reorder, which is a real component, but is incomplete (truncated with '…') and does not fully convey the scope of changes including the new Torch op and test coverage. | Complete the truncated title to clearly summarize the main change, e.g., '[None][feat] Add FP4 residual quantization kernel without channel reorder and Torch op' or similar. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description provides detailed explanation of the solution (kernel implementation, key components, test coverage) and includes a checked checklist item, but the template sections (Description, Test Coverage) are not explicitly filled out; the content is in the code block above the template. |


Contributor

coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/unittest/_torch/thop/parallel/test_arcquant_fp4.py (1)

9-40: Add return type annotations to the new helpers.

These utilities are missing explicit return types, which is out of step with the repo's Python typing rules.

✍️ Suggested cleanup
-def _benchmark(fn, warmup=10, iters=100):
+def _benchmark(fn, warmup: int = 10, iters: int = 100) -> float:
@@
-def _effective_bytes(M, K, KE, input_dtype=torch.bfloat16):
+def _effective_bytes(M: int, K: int, KE: int, input_dtype: torch.dtype = torch.bfloat16) -> int:
@@
-def _gb_per_sec(total_bytes, time_ms):
+def _gb_per_sec(total_bytes: int, time_ms: float) -> float:

As per coding guidelines, "Always annotate Python function return types; use None if the function does not return anything".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/thop/parallel/test_arcquant_fp4.py` around lines 9 -
40, Add explicit return type annotations to the three helper functions: annotate
_benchmark(...) -> float, _effective_bytes(M, K, KE, input_dtype: torch.dtype =
torch.bfloat16) -> int, and _gb_per_sec(total_bytes, time_ms) -> float; update
the function signatures (not bodies) to include these types so they comply with
the project's typing rules.
cpp/tensorrt_llm/kernels/arcquantFP4.cu (1)

463-503: Label the get_sf_out_offset_128x4 arguments inline.

These call sites are hard to audit with five positional arguments and two std::optional values. Inline parameter names would make the swizzled-layout indexing much easier to review.

🧭 Suggested cleanup
-                auto sfOffset
-                    = get_sf_out_offset_128x4(std::nullopt, rowIdx, sfIdx, std::optional<int>(numRows), numSFCols);
+                auto sfOffset = get_sf_out_offset_128x4(
+                    /*batchIdx=*/std::nullopt, /*mIdx=*/rowIdx, /*kIdx=*/sfIdx,
+                    /*numRows=*/std::optional<int>(numRows), /*numColVecs=*/numSFCols);
@@
-                auto sfOffset
-                    = get_sf_out_offset_128x4(std::nullopt, rowIdx, pos, std::optional<int>(numRows), numSFCols);
+                auto sfOffset = get_sf_out_offset_128x4(
+                    /*batchIdx=*/std::nullopt, /*mIdx=*/rowIdx, /*kIdx=*/pos,
+                    /*numRows=*/std::optional<int>(numRows), /*numColVecs=*/numSFCols);
@@
-                    auto sfOffsetRes = get_sf_out_offset_128x4(
-                        std::nullopt, rowIdx, pos + 1, std::optional<int>(numRows), numSFCols);
+                    auto sfOffsetRes = get_sf_out_offset_128x4(
+                        /*batchIdx=*/std::nullopt, /*mIdx=*/rowIdx, /*kIdx=*/pos + 1,
+                        /*numRows=*/std::optional<int>(numRows), /*numColVecs=*/numSFCols);

As per coding guidelines, "In C++ function calls with non-obvious parameters, use inline C comments with the format /*paramName=*/ to document parameters".
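For readers auditing these call sites, a small Python model of a 128x4 swizzled scale-factor layout may help show what the `mIdx` and `kIdx` arguments control. Everything here is an assumption to be checked against `quantization.cuh`: `sf_offset_128x4` is a hypothetical name, the batch argument is omitted, and the tile shape `[numMTiles, numKTiles, 32, 4, 4]` is my reading of the layout rather than something stated in this PR.

```python
def sf_offset_128x4(m_idx: int, k_idx: int, num_col_vecs: int) -> int:
    """Hypothetical Python model of a 128x4 swizzled scale-factor offset (no batch dim).

    Assumed layout: [numMTiles, numKTiles, 32, 4, 4], indexed as
    [m // 128, k // 4, m % 32, (m % 128) // 32, k % 4].
    """
    num_k_tiles = (num_col_vecs + 3) // 4  # columns padded up to a multiple of 4
    inner_k = k_idx % 4
    inner_m = (m_idx % 128) // 32
    outer_m = m_idx % 32
    k_tile = k_idx // 4
    m_tile = m_idx // 128
    k_tile_stride = 32 * 4 * 4  # 512 scale factors per 128x4 tile
    m_tile_stride = num_k_tiles * k_tile_stride
    return (m_tile * m_tile_stride + k_tile * k_tile_stride
            + outer_m * 16 + inner_m * 4 + inner_k)


num_rows, num_sf_cols = 256, 8
offsets = {sf_offset_128x4(m, k, num_sf_cols) for m in range(num_rows) for k in range(num_sf_cols)}
is_permutation = offsets == set(range(num_rows * num_sf_cols))
print("offsets form a permutation of the buffer:", is_permutation)
```

Under this model, neighbouring `kIdx` values inside the same tile map to adjacent offsets, which would place a scale factor written at `pos` next to the one written at `pos + 1` whenever both fall in the same group of four columns.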

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/arcquantFP4.cu` around lines 463 - 503, Call sites
of get_sf_out_offset_128x4 are hard to read because they pass multiple
positional args and std::optional values; update each call (e.g., the ones using
std::nullopt, rowIdx, pos, std::optional<int>(numRows), numSFCols and the pos+1
variant) to use inline parameter-name comments like /*argName=*/ before each
argument (for example /*batchIdx=*/std::nullopt, /*mIdx=*/rowIdx, /*kIdx=*/pos,
/*numRows=*/std::optional<int>(numRows), /*numColVecs=*/numSFCols) so the meaning
of each parameter is explicit and easier to audit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tensorrt_llm/thop/fp4Quantize.cpp`:
- Around line 313-328: Add input validation at the top of
fp4_quantize_with_residual: check that M > 0 and KQ > 0 to avoid zero-dimension
launches, verify input_scale.numel() > 0 (and that input_scale.device() ==
X.device()) to avoid invalid device pointer dereferences, and keep the existing
dtype requirement for input_scale; if any check fails, TORCH_CHECK with a clear
message. Reference fp4_quantize_with_residual (and the downstream arcquantFP4.cu
kernel behavior) so the checks guard against blockSize==0 and invalid device
pointers.
- Around line 313-314: The file containing the function
fp4_quantize_with_residual has an outdated NVIDIA copyright header ending at
2023; update the header year range to include 2026 (e.g., change "2023" to
"2023-2026" or similar per project convention) so the modified file reflects
2026 changes and matches the project's copyright header policy.

In `@cpp/tensorrt_llm/thop/fp4Quantize.h`:
- Around line 39-40: Update the copyright header in
cpp/tensorrt_llm/thop/fp4Quantize.h to include 2026 (since the file was modified
in 2026); locate the top-of-file NVIDIA copyright block (near the declaration of
fp4_quantize_with_residual) and change the year range or add 2026 so the header
reflects the modification year.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9d1bac2b-7524-4ada-a743-01f63a4e5f47

📥 Commits

Reviewing files that changed from the base of the PR and between ac9ea3c and aa78617.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/kernels/arcquantFP4.cu
  • cpp/tensorrt_llm/kernels/arcquantFP4.h
  • cpp/tensorrt_llm/thop/fp4Quantize.cpp
  • cpp/tensorrt_llm/thop/fp4Quantize.h
  • tests/unittest/_torch/thop/parallel/test_arcquant_fp4.py

Comment thread cpp/tensorrt_llm/thop/fp4Quantize.cpp
Comment thread cpp/tensorrt_llm/thop/fp4Quantize.cpp
Comment thread cpp/tensorrt_llm/thop/fp4Quantize.h
@tensorrt-cicd
Collaborator

PR_Github #43721 [ run ] completed with state FAILURE. Commit: aa78617
/LLM/main/L0_MergeRequest_PR pipeline #34204 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Tracin
Collaborator Author

Tracin commented Apr 17, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #43886 [ run ] triggered by Bot. Commit: 5f971f0 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43886 [ run ] completed with state FAILURE. Commit: 5f971f0
/LLM/main/L0_MergeRequest_PR pipeline #34338 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Tracin
Collaborator Author

Tracin commented Apr 20, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44265 [ run ] triggered by Bot. Commit: f430438 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44265 [ run ] completed with state FAILURE. Commit: f430438
/LLM/main/L0_MergeRequest_PR pipeline #34688 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Tracin
Collaborator Author

Tracin commented Apr 20, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44323 [ run ] triggered by Bot. Commit: f430438 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44323 [ run ] completed with state FAILURE. Commit: f430438
/LLM/main/L0_MergeRequest_PR pipeline #34746 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Tracin
Collaborator Author

Tracin commented Apr 20, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44397 [ run ] triggered by Bot. Commit: f430438 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44397 [ run ] completed with state FAILURE. Commit: f430438
/LLM/main/L0_MergeRequest_PR pipeline #34814 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Add torch.library.register_fake for trtllm::fp4_quantize_with_residual
so the op participates correctly in torch.compile / fake-tensor tracing.

Signed-off-by: Tracin <[email protected]>
@Tracin Tracin requested a review from a team as a code owner April 21, 2026 06:49
@Tracin Tracin requested a review from hyukn April 21, 2026 06:49
@Tracin
Collaborator Author

Tracin commented Apr 21, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44660 [ run ] triggered by Bot. Commit: e454c56 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44660 [ run ] completed with state SUCCESS. Commit: e454c56
/LLM/main/L0_MergeRequest_PR pipeline #35034 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Tracin
Collaborator Author

Tracin commented Apr 22, 2026

/bot run

1 similar comment
@Tracin
Collaborator Author

Tracin commented Apr 22, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #44899 [ run ] triggered by Bot. Commit: e454c56 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44899 [ run ] completed with state SUCCESS. Commit: e454c56
/LLM/main/L0_MergeRequest_PR pipeline #35234 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Tracin
Collaborator Author

Tracin commented Apr 23, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45067 [ run ] triggered by Bot. Commit: e454c56 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45067 [ run ] completed with state SUCCESS. Commit: e454c56
/LLM/main/L0_MergeRequest_PR pipeline #35370 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Tracin
Collaborator Author

Tracin commented Apr 23, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45136 [ run ] triggered by Bot. Commit: f65767d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45136 [ run ] completed with state SUCCESS. Commit: f65767d
/LLM/main/L0_MergeRequest_PR pipeline #35422 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Tracin
Collaborator Author

Tracin commented Apr 24, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #45295 [ run ] triggered by Bot. Commit: f65767d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #45295 [ run ] completed with state SUCCESS. Commit: f65767d
/LLM/main/L0_MergeRequest_PR pipeline #35549 completed with status: 'SUCCESS'

CI Report

Link to invocation

Collaborator

hyukn left a comment

LGTM.

Tracin merged commit c4b8e8e into NVIDIA:main on Apr 24, 2026
5 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
3 participants