
Conversation

@JadoTu commented Nov 13, 2025

Summary by CodeRabbit

  • Bug Fixes
    • Improved tensor memory layout handling in NVFP4-accelerated linear operations for enhanced performance and stability during inference.

[None][fix] Make the sliced nvfp4 output contiguous

  1. Add a .contiguous() call (torch.Tensor.contiguous()) to the linear output in the NVFP4 path, since a downstream function may require contiguous memory (a short demonstration follows this list).
  2. A minor cleanup.
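
A minimal demonstration of why the slice needs this, using illustrative shapes (a 72-column padded output trimmed back to 64 logical features):

```python
import torch

x = torch.randn(4, 72)      # stand-in for a GEMM output padded to 72 columns
y = x[..., :64]             # trim back to the logical out_features

print(y.is_contiguous())    # False: the slice is a strided view into x
print(y.stride())           # (72, 1) - each row still skips 72 elements

y = y.contiguous()          # materializes a dense copy
print(y.is_contiguous())    # True
print(y.stride())           # (64, 1)
```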

@JadoTu JadoTu requested a review from a team as a code owner November 13, 2025 06:07
@JadoTu JadoTu requested a review from mikeiovine November 13, 2025 06:07
@JadoTu JadoTu changed the title Make the sliced nvfp4 output contiguous [None][fix] Make the sliced nvfp4 output contiguous Nov 13, 2025
@JadoTu commented Nov 13, 2025

/bot run

@coderabbitai bot commented Nov 13, 2025

📝 Walkthrough

Added a .contiguous() call after slicing output in the NVFP4-based Linear path to guarantee tensor contiguity when the output's last dimension exceeds the module's configured output features.
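
A minimal sketch of the pattern described above; the helper name and signature here are illustrative, not the actual code in tensorrt_llm/_torch/modules/linear.py:

```python
import torch

def trim_padded_output(output: torch.Tensor, out_features: int) -> torch.Tensor:
    """Hypothetical helper mirroring the fix: trim a padded GEMM output."""
    if output.shape[-1] > out_features:
        # The slice produces a non-contiguous view; .contiguous() materializes
        # a dense copy so downstream ops that assume a standard memory layout
        # keep working.
        output = output[..., :out_features].contiguous()
    return output
```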

Changes

  • NVFP4 Linear contiguity fix — tensorrt_llm/_torch/modules/linear.py: appends .contiguous() after the slice in NVFP4LinearMethod.apply when trimming output dimensions, ensuring the output tensor is contiguous.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify the .contiguous() call is placed correctly post-slice and pre-bias-addition in the GEMM output path
  • Confirm this change addresses any memory layout issues without introducing unintended performance regressions
  • Ensure the contiguity requirement is appropriate for downstream operations

Pre-merge checks

❌ Failed checks (2 warnings)
  • Description check — ⚠️ Warning. The description is incomplete: it lacks details on the issue being fixed and on test coverage, and the PR checklist is not filled out as required by the template. Resolution: expand the Description section to explain the issue and why contiguity is needed, add a Test Coverage section listing the relevant tests, and complete the PR Checklist.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed. The title clearly describes the main change: adding a contiguity guarantee to the sliced NVFP4 output in the NVFP4LinearMethod.apply path.

📜 Recent review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR, between commits cde18c1 and 3293289.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/modules/linear.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/modules/linear.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic (see the short illustration after the file list below).

Files:

  • tensorrt_llm/_torch/modules/linear.py
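
As a side note on the duck-typing guideline quoted above, a small illustration with made-up names (not code from this PR):

```python
def as_list(batch):
    # Keep the try body minimal: only the call that may raise lives here.
    try:
        iterator = iter(batch)
    except TypeError:
        # Not iterable - fall back to wrapping the single item.
        return [batch]
    else:
        # Main logic goes in else, so unrelated TypeErrors raised inside it
        # are not swallowed by the duck-typing check.
        return list(iterator)
```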
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/modules/linear.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/linear.py (1)

827-829: LGTM! Correct fix for ensuring tensor contiguity after slicing.

The addition of .contiguous() after slicing the padded output is necessary and correct. When post_load_weights applies padding for alignment requirements (lines 972-981), slicing the output can create a non-contiguous tensor view. This ensures downstream operations that require contiguous memory layout will work correctly.
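
A rough sketch of the padding side for context; the alignment value and helper name are assumptions, not the actual post_load_weights code:

```python
import torch
import torch.nn.functional as F

def pad_out_features(weight: torch.Tensor, alignment: int = 16) -> torch.Tensor:
    """Pad dim 0 (out_features) up to a multiple of `alignment` (illustrative)."""
    out_features = weight.shape[0]
    padded = -(-out_features // alignment) * alignment  # ceil to a multiple
    if padded != out_features:
        # Pad spec is (last-dim left/right, first-dim top/bottom) for 2-D input.
        weight = F.pad(weight, (0, 0, 0, padded - out_features))
    return weight

# A GEMM against the padded weight yields `padded` output columns, so the
# apply path must slice back to out_features - producing the non-contiguous
# view that this PR's .contiguous() call fixes.
```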



@tensorrt-cicd

PR_Github #24399 [ run ] triggered by Bot. Commit: 3293289

@tensorrt-cicd

PR_Github #24399 [ run ] completed with state SUCCESS. Commit: 3293289
/LLM/main/L0_MergeRequest_PR pipeline #18410 completed with status: 'FAILURE'

@JadoTu commented Nov 14, 2025

/bot run

@tensorrt-cicd

PR_Github #24541 [ run ] triggered by Bot. Commit: 3293289

@tensorrt-cicd

PR_Github #24541 [ run ] completed with state SUCCESS. Commit: 3293289
/LLM/main/L0_MergeRequest_PR pipeline #18522 completed with status: 'SUCCESS'

@juney-nvidia juney-nvidia merged commit 3cde845 into NVIDIA:main Nov 15, 2025
8 of 9 checks passed
zheyuf pushed a commit to zheyuf/TensorRT-LLM that referenced this pull request Nov 19, 2025
greg-kwasniewski1 pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Nov 20, 2025