
Conversation

@JadoTu commented Nov 13, 2025

Summary by CodeRabbit

  • Bug Fixes
    • Improved tensor memory layout handling in NVFP4-accelerated linear operations for enhanced performance and stability during inference.

[None][fix] Make the sliced nvfp4 output contiguous

  1. Add a .contiguous() call (torch.Tensor.contiguous()) to the linear output in the NVFP4 path, since a downstream function may require contiguous memory (a short demonstration follows this list).
  2. A minor cleanup.
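
A minimal demonstration of why the slice needs this, using illustrative shapes (a 72-column padded output trimmed back to 64 logical features):

```python
import torch

x = torch.randn(4, 72)      # stand-in for a GEMM output padded to 72 columns
y = x[..., :64]             # trim back to the logical out_features

print(y.is_contiguous())    # False: the slice is a strided view into x
print(y.stride())           # (72, 1) - each row still skips 72 elements

y = y.contiguous()          # materializes a dense copy
print(y.is_contiguous())    # True
print(y.stride())           # (64, 1)
```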

@JadoTu JadoTu requested a review from a team as a code owner November 13, 2025 06:07
@JadoTu JadoTu requested a review from mikeiovine November 13, 2025 06:07
@JadoTu JadoTu changed the title Make the sliced nvfp4 output contiguous [None][fix] Make the sliced nvfp4 output contiguous Nov 13, 2025
@JadoTu commented Nov 13, 2025

/bot run

@coderabbitai bot commented Nov 13, 2025

📝 Walkthrough

Added a .contiguous() call after slicing output in the NVFP4-based Linear path to guarantee tensor contiguity when the output's last dimension exceeds the module's configured output features.
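
A minimal sketch of the pattern described above; the helper name and signature here are illustrative, not the actual code in tensorrt_llm/_torch/modules/linear.py:

```python
import torch

def trim_padded_output(output: torch.Tensor, out_features: int) -> torch.Tensor:
    """Hypothetical helper mirroring the fix: trim a padded GEMM output."""
    if output.shape[-1] > out_features:
        # The slice produces a non-contiguous view; .contiguous() materializes
        # a dense copy so downstream ops that assume a standard memory layout
        # keep working.
        output = output[..., :out_features].contiguous()
    return output
```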

Changes

  • NVFP4 Linear contiguity fix — tensorrt_llm/_torch/modules/linear.py: appends .contiguous() after the slice in NVFP4LinearMethod.apply when trimming output dimensions, ensuring the output tensor is contiguous.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify the .contiguous() call is placed correctly post-slice and pre-bias-addition in the GEMM output path
  • Confirm this change addresses any memory layout issues without introducing unintended performance regressions
  • Ensure the contiguity requirement is appropriate for downstream operations

Pre-merge checks

❌ Failed checks (2 warnings)
  • Description check — ⚠️ Warning. The description is incomplete: it lacks details on the issue being fixed and on test coverage, and the PR checklist is not filled out as required by the template. Resolution: expand the Description section to explain the issue and why contiguity is needed, add a Test Coverage section listing the relevant tests, and complete the PR Checklist.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed. The title clearly describes the main change: adding a contiguity guarantee to the sliced NVFP4 output in the NVFP4LinearMethod.apply path.

📜 Recent review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR, between commits cde18c1 and 3293289.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/linear.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/modules/linear.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic (see the short illustration after the file list below).

Files:

  • tensorrt_llm/_torch/modules/linear.py
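
As a side note on the duck-typing guideline quoted above, a small illustration with made-up names (not code from this PR):

```python
def as_list(batch):
    # Keep the try body minimal: only the call that may raise lives here.
    try:
        iterator = iter(batch)
    except TypeError:
        # Not iterable - fall back to wrapping the single item.
        return [batch]
    else:
        # Main logic goes in else, so unrelated TypeErrors raised inside it
        # are not swallowed by the duck-typing check.
        return list(iterator)
```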
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/modules/linear.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/linear.py (1)

827-829: LGTM! Correct fix for ensuring tensor contiguity after slicing.

The addition of .contiguous() after slicing the padded output is necessary and correct. When post_load_weights applies padding for alignment requirements (lines 972-981), slicing the output can create a non-contiguous tensor view. This ensures downstream operations that require contiguous memory layout will work correctly.
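
A rough sketch of the padding side for context; the alignment value and helper name are assumptions, not the actual post_load_weights code:

```python
import torch
import torch.nn.functional as F

def pad_out_features(weight: torch.Tensor, alignment: int = 16) -> torch.Tensor:
    """Pad dim 0 (out_features) up to a multiple of `alignment` (illustrative)."""
    out_features = weight.shape[0]
    padded = -(-out_features // alignment) * alignment  # ceil to a multiple
    if padded != out_features:
        # Pad spec is (last-dim left/right, first-dim top/bottom) for 2-D input.
        weight = F.pad(weight, (0, 0, 0, padded - out_features))
    return weight

# A GEMM against the padded weight yields `padded` output columns, so the
# apply path must slice back to out_features - producing the non-contiguous
# view that this PR's .contiguous() call fixes.
```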



@tensorrt-cicd

PR_Github #24399 [ run ] triggered by Bot. Commit: 3293289

@tensorrt-cicd

PR_Github #24399 [ run ] completed with state SUCCESS. Commit: 3293289
/LLM/main/L0_MergeRequest_PR pipeline #18410 completed with status: 'FAILURE'

@JadoTu commented Nov 14, 2025

/bot run

@tensorrt-cicd

PR_Github #24541 [ run ] triggered by Bot. Commit: 3293289

@tensorrt-cicd

PR_Github #24541 [ run ] completed with state SUCCESS. Commit: 3293289
/LLM/main/L0_MergeRequest_PR pipeline #18522 completed with status: 'SUCCESS'

@juney-nvidia juney-nvidia merged commit 3cde845 into NVIDIA:main Nov 15, 2025
8 of 9 checks passed
zheyuf pushed a commit to zheyuf/TensorRT-LLM that referenced this pull request Nov 19, 2025
greg-kwasniewski1 pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Nov 20, 2025