[None][feat] Add FP4 residual quantization kernel without channel reo…#13117
…rder

Add nvfp4_quantize_residual_with_block_size kernel that performs ARCQuant-style FP4 quantization with residual, using the quantize_with_block_size loop pattern (grid-stride rows, block-stride columns). Unlike the existing reorder kernel, this version reads directly from global memory without channel reordering, eliminating shared memory and __syncthreads() overhead.

Key components:
- cvt_group_to_fp4_residual: reusable device helper (analogous to cvt_warp_fp16_to_fp4) that quantizes a 16-element group to FP4 with optional residual quantization
- Vectorized 128-bit loads, float32 precision throughout, PDL support
- Reuses get_sf_out_offset_128x4, fp32_vec_to_e2m1, reciprocal_approximate_ftz from quantization.cuh
- Torch op fp4_quantize_with_residual exposed for Python access
- Correctness test (GEMM cosine similarity) and perf benchmarks

Signed-off-by: Tracin <[email protected]>
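The loop pattern named in the commit message (grid-stride over rows, block-stride over column groups) can be simulated on the host to check that each 16-element group is visited exactly once. This is a minimal Python sketch, not the kernel itself: `simulate_coverage` and its launch parameters are hypothetical names, and it assumes one thread handles one group per loop iteration.

```python
from collections import Counter

GROUP_SIZE = 16  # elements quantized together, per the commit message


def simulate_coverage(num_rows: int, num_cols: int, grid_dim: int, block_dim: int) -> Counter:
    """Count how often each (row, group) pair is visited by a simulated launch."""
    groups_per_row = num_cols // GROUP_SIZE
    visits: Counter = Counter()
    for block_idx in range(grid_dim):        # stands in for blockIdx.x
        for thread_idx in range(block_dim):  # stands in for threadIdx.x
            # Grid-stride loop: a block handles rows blockIdx, blockIdx + gridDim, ...
            for row in range(block_idx, num_rows, grid_dim):
                # Block-stride loop: a thread handles groups threadIdx, threadIdx + blockDim, ...
                for group in range(thread_idx, groups_per_row, block_dim):
                    visits[(row, group)] += 1
    return visits


# 7 rows and 10 groups per row, deliberately not multiples of the launch shape.
visits = simulate_coverage(num_rows=7, num_cols=160, grid_dim=3, block_dim=4)
```

Because both loops stride by the launch dimensions, the launch shape does not have to divide the problem size; any leftover rows or groups are picked up by later iterations.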
/bot run
PR_Github #43721 [ run ] triggered by Bot. Commit:
📝 Walkthrough
This PR adds a new ARCQuant FP4 quantization kernel
Changes
Sequence Diagram
sequenceDiagram
participant PyTorch as PyTorch Operator
participant HostLauncher as Host Launcher
participant DeviceKernel as Device Kernel
participant Memory as GPU Memory
PyTorch->>HostLauncher: fp4_quantize_with_residual(X, input_scale, KE, is_act)
HostLauncher->>HostLauncher: Validate dtypes & shapes
HostLauncher->>Memory: Allocate output buffers (q_out, q_scale)
HostLauncher->>DeviceKernel: Launch nvfp4_quantize_residual_with_block_size
DeviceKernel->>Memory: Load GROUP_SIZE elements (float4 vectorized)
DeviceKernel->>DeviceKernel: Convert to float32 with global_scale
DeviceKernel->>DeviceKernel: Compute max-based scale factor
DeviceKernel->>DeviceKernel: Quantize to FP4 e2m1
DeviceKernel->>DeviceKernel: Dequantize back (e2m1_uint32_to_float8)
DeviceKernel->>DeviceKernel: Compute residuals
alt if residual output enabled
DeviceKernel->>DeviceKernel: Conditionally quantize residuals
end
DeviceKernel->>Memory: Write packed outputs (float2 or float4)
HostLauncher->>PyTorch: Return (QX, SFX) tensors
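The device-side steps in the diagram (scale from the group maximum, quantize to e2m1, dequantize, quantize the residual) can be sketched for a single 16-element group in plain Python. This is an illustrative reference under stated assumptions, not the kernel's code: the function names are hypothetical, the global scale and any scale-factor encoding are omitted, scales stay as Python floats, and ties round toward the smaller magnitude, which may differ from the hardware conversion.

```python
import math

# Magnitudes representable in FP4 e2m1 (a sign bit covers the negatives).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_MAX = 6.0


def round_to_e2m1(value: float) -> float:
    """Round to the nearest representable e2m1 value (ties go to the smaller magnitude)."""
    magnitude = min(E2M1_MAGNITUDES, key=lambda c: abs(abs(value) - c))
    return math.copysign(magnitude, value)


def quantize_group(values: list[float]) -> tuple[list[float], float]:
    """Max-based scaling: map the largest magnitude in the group onto E2M1_MAX."""
    amax = max(abs(v) for v in values)
    scale = amax / E2M1_MAX if amax > 0 else 1.0
    return [round_to_e2m1(v / scale) for v in values], scale


def quantize_with_residual(values: list[float]):
    """Quantize a group, then quantize what the first pass failed to represent."""
    codes, scale = quantize_group(values)
    residual = [v - c * scale for v, c in zip(values, codes)]
    res_codes, res_scale = quantize_group(residual)
    return (codes, scale), (res_codes, res_scale)


group = [0.37 * i - 2.5 for i in range(16)]
(codes, scale), (res_codes, res_scale) = quantize_with_residual(group)
base = [c * scale for c in codes]
refined = [b + rc * res_scale for b, rc in zip(base, res_codes)]
err_base = max(abs(v - b) for v, b in zip(group, base))
err_refined = max(abs(v - r) for v, r in zip(group, refined))
```

Since zero is always representable, rounding a residual to the nearest e2m1 value never moves it further from the truth than dropping it would, so `err_refined` cannot exceed `err_base` beyond floating-point noise.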
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/unittest/_torch/thop/parallel/test_arcquant_fp4.py (1)
9-40: Add return type annotations to the new helpers.
These utilities are missing explicit return types, which is out of step with the repo's Python typing rules.
✍️ Suggested cleanup
-def _benchmark(fn, warmup=10, iters=100):
+def _benchmark(fn, warmup: int = 10, iters: int = 100) -> float:
@@
-def _effective_bytes(M, K, KE, input_dtype=torch.bfloat16):
+def _effective_bytes(M: int, K: int, KE: int, input_dtype: torch.dtype = torch.bfloat16) -> int:
@@
-def _gb_per_sec(total_bytes, time_ms):
+def _gb_per_sec(total_bytes: int, time_ms: float) -> float:
As per coding guidelines, "Always annotate Python function return types; use `None` if the function does not return anything".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/thop/parallel/test_arcquant_fp4.py` around lines 9 - 40, Add explicit return type annotations to the three helper functions: annotate _benchmark(...) -> float, _effective_bytes(M, K, KE, input_dtype: torch.dtype = torch.bfloat16) -> int, and _gb_per_sec(total_bytes, time_ms) -> float; update the function signatures (not bodies) to include these types so they comply with the project's typing rules.
cpp/tensorrt_llm/kernels/arcquantFP4.cu (1)
463-503: Label the `get_sf_out_offset_128x4` arguments inline.
These call sites are hard to audit with five positional arguments and two `std::optional` values. Inline parameter names would make the swizzled-layout indexing much easier to review.
🧭 Suggested cleanup
-    auto sfOffset
-        = get_sf_out_offset_128x4(std::nullopt, rowIdx, sfIdx, std::optional<int>(numRows), numSFCols);
+    auto sfOffset = get_sf_out_offset_128x4(
+        /*batchIdx=*/std::nullopt, /*mIdx=*/rowIdx, /*kIdx=*/sfIdx,
+        /*numRows=*/std::optional<int>(numRows), /*numColVecs=*/numSFCols);
@@
-    auto sfOffset
-        = get_sf_out_offset_128x4(std::nullopt, rowIdx, pos, std::optional<int>(numRows), numSFCols);
+    auto sfOffset = get_sf_out_offset_128x4(
+        /*batchIdx=*/std::nullopt, /*mIdx=*/rowIdx, /*kIdx=*/pos,
+        /*numRows=*/std::optional<int>(numRows), /*numColVecs=*/numSFCols);
@@
-    auto sfOffsetRes = get_sf_out_offset_128x4(
-        std::nullopt, rowIdx, pos + 1, std::optional<int>(numRows), numSFCols);
+    auto sfOffsetRes = get_sf_out_offset_128x4(
+        /*batchIdx=*/std::nullopt, /*mIdx=*/rowIdx, /*kIdx=*/pos + 1,
+        /*numRows=*/std::optional<int>(numRows), /*numColVecs=*/numSFCols);
As per coding guidelines, "In C++ function calls with non-obvious parameters, use inline C comments with the format `/*paramName=*/` to document parameters".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/arcquantFP4.cu` around lines 463 - 503, Call sites of get_sf_out_offset_128x4 are hard to read because they pass multiple positional args and std::optional values; update each call (e.g., the ones using std::nullopt, rowIdx, pos, std::optional<int>(numRows), numSFCols and the pos+1 variant) to use inline parameter-name comments like /*argName=*/ before each argument (for example /*swizzle=*/std::nullopt, /*row=*/rowIdx, /*col=*/pos, /*numRows=*/std::optional<int>(numRows), /*numSFCols=*/numSFCols) so the meaning of each parameter is explicit and easier to audit.
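To make the "swizzled-layout indexing" concern concrete, here is a small Python model of a 128x4 tiled scale-factor offset. The layout is an assumption made for illustration (tiles ordered as [mTile, kTile, 32 rows, 4 row slabs, 4 columns]); the PR text does not spell it out, and the real `get_sf_out_offset_128x4` in quantization.cuh is the authority. `sf_offset_128x4` is a hypothetical name and the batch index is left out.

```python
def sf_offset_128x4(m_idx: int, k_idx: int, num_col_vecs: int) -> int:
    """Linear offset of scale factor (m_idx, k_idx) under the ASSUMED tiled layout."""
    num_k_tiles = (num_col_vecs + 3) // 4  # scale-factor columns are grouped in fours
    m_tile = m_idx // 128                  # which 128-row tile
    k_tile = k_idx // 4                    # which 4-column tile
    outer_m = m_idx % 32                   # row inside a 32-row slab
    inner_m = (m_idx % 128) // 32          # which of the 4 slabs in the 128-row tile
    inner_k = k_idx % 4                    # column inside the 4-column tile
    tile_base = (m_tile * num_k_tiles + k_tile) * 512  # 128 * 4 entries per tile
    return tile_base + outer_m * 16 + inner_m * 4 + inner_k


# Every (row, column) pair of a 256 x 8 scale-factor matrix gets a distinct slot.
offsets = sorted(sf_offset_128x4(m, k, 8) for m in range(256) for k in range(8))
```

With five positional arguments of similar types, swapping the row and column index would still compile and still produce in-range offsets, which is why the review asks for `/*paramName=*/` labels at each call site.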
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/thop/fp4Quantize.cpp`:
- Around line 313-328: Add input validation at the top of
fp4_quantize_with_residual: check that M > 0 and KQ > 0 to avoid zero-dimension
launches, verify input_scale.numel() > 0 (and that input_scale.device() ==
X.device()) to avoid invalid device pointer dereferences, and keep the existing
dtype requirement for input_scale; if any check fails, TORCH_CHECK with a clear
message. Reference fp4_quantize_with_residual (and the downstream arcquantFP4.cu
kernel behavior) so the checks guard against blockSize==0 and invalid device
pointers.
- Around line 313-314: The file containing the function
fp4_quantize_with_residual has an outdated NVIDIA copyright header ending at
2023; update the header year range to include 2026 (e.g., change "2023" to
"2023-2026" or similar per project convention) so the modified file reflects
2026 changes and matches the project's copyright header policy.
In `@cpp/tensorrt_llm/thop/fp4Quantize.h`:
- Around line 39-40: Update the copyright header in
cpp/tensorrt_llm/thop/fp4Quantize.h to include 2026 (since the file was modified
in 2026); locate the top-of-file NVIDIA copyright block (near the declaration of
fp4_quantize_with_residual) and change the year range or add 2026 so the header
reflects the modification year.
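The input checks requested for fp4_quantize_with_residual in the first inline comment can be written out as a small guard. The real fix would use TORCH_CHECK in C++; this Python sketch only illustrates the conditions, and `check_quantize_inputs` and its parameters are hypothetical stand-ins for the tensor properties the op would inspect.

```python
def check_quantize_inputs(m: int, kq: int, scale_numel: int,
                          x_device: str, scale_device: str) -> None:
    """Reject inputs that would lead to an empty launch or an invalid device pointer."""
    if m <= 0 or kq <= 0:
        # A zero-sized dimension is what the review ties to blockSize == 0.
        raise ValueError(f"expected M > 0 and KQ > 0, got M={m}, KQ={kq}")
    if scale_numel <= 0:
        raise ValueError("input_scale must contain at least one element")
    if scale_device != x_device:
        raise ValueError(
            f"input_scale is on {scale_device} but X is on {x_device}")


# A well-formed call passes silently.
check_quantize_inputs(m=128, kq=4096, scale_numel=1,
                      x_device="cuda:0", scale_device="cuda:0")
```

Running the checks before any allocation keeps the failure message close to the caller's mistake instead of surfacing later as a launch error.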
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9d1bac2b-7524-4ada-a743-01f63a4e5f47
📒 Files selected for processing (5)
cpp/tensorrt_llm/kernels/arcquantFP4.cu
cpp/tensorrt_llm/kernels/arcquantFP4.h
cpp/tensorrt_llm/thop/fp4Quantize.cpp
cpp/tensorrt_llm/thop/fp4Quantize.h
tests/unittest/_torch/thop/parallel/test_arcquant_fp4.py
PR_Github #43721 [ run ] completed with state
/bot run
PR_Github #43886 [ run ] triggered by Bot. Commit:
PR_Github #43886 [ run ] completed with state
/bot run
PR_Github #44265 [ run ] triggered by Bot. Commit:
PR_Github #44265 [ run ] completed with state
/bot run
PR_Github #44323 [ run ] triggered by Bot. Commit:
PR_Github #44323 [ run ] completed with state
/bot run
PR_Github #44397 [ run ] triggered by Bot. Commit:
PR_Github #44397 [ run ] completed with state
Add torch.library.register_fake for trtllm::fp4_quantize_with_residual so the op participates correctly in torch.compile / fake-tensor tracing. Signed-off-by: Tracin <[email protected]>
/bot run
PR_Github #44660 [ run ] triggered by Bot. Commit:
PR_Github #44660 [ run ] completed with state
/bot run
/bot run
PR_Github #44899 [ run ] triggered by Bot. Commit:
PR_Github #44899 [ run ] completed with state
/bot run
PR_Github #45067 [ run ] triggered by Bot. Commit:
PR_Github #45067 [ run ] completed with state
/bot run
PR_Github #45136 [ run ] triggered by Bot. Commit:
PR_Github #45136 [ run ] completed with state
/bot run
PR_Github #45295 [ run ] triggered by Bot. Commit:
PR_Github #45295 [ run ] completed with state
…rder (NVIDIA#13117) Signed-off-by: Tracin <[email protected]>
Summary by CodeRabbit
Release Notes
New Features
fp4_quantize_with_residual operation for optimized FP4 quantization without channel reordering
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.