Conversation

@yunruis
Contributor

@yunruis yunruis commented Oct 20, 2025

Introduce dsv3RopeOp as a standalone operation and decouple it from attentionOp during the MLA generation stage. This improves modularity and enables overlap with bgemm for performance optimization. In detail:

  • the BF16 KV cache flow overlaps rope with bgemm (see the sketch below the performance table),
  • the FP8 KV cache flow has a data dependency and is currently not overlapped.

Workload: ISL8K/OSL1K, local_batch=32, mtp=3(seqlen=4), no quant
Machine: B200

|                    | bgemm (us) | rope (us) | total (us) | speedup |
|--------------------|------------|-----------|------------|---------|
| no_overlap         | 8.00       | 3.58      | 12.29      | 1.00x   |
| bgemm-rope-overlap | 8.16       | 5.98      | 11.34      | 1.07x   |
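The overlap is orchestrated from the Python generation path. Below is a minimal sketch of the idea, assuming a dedicated side CUDA stream with event-based synchronization; the callables mla_rope_generation and bmm_out stand in for the real ops (torch.ops.trtllm.mla_rope_generation and the uk-bgemm path), and the argument lists are abbreviated, so this is not the exact wiring used in forward_generation.

```python
import torch

def generation_with_rope_overlap(fused_q, q_pe, latent_cache, q_compressed, w_uk,
                                 mla_rope_generation, bmm_out):
    """Sketch: run MLA rope generation concurrently with the uk-bgemm.

    Assumes the BF16 KV cache flow, where the two ops have no data dependency.
    """
    side_stream = torch.cuda.Stream()
    ready = torch.cuda.Event()
    ready.record()  # producers of q_pe / latent_cache have been enqueued

    with torch.cuda.stream(side_stream):
        ready.wait()  # side stream waits until the inputs are ready
        # Apply rope and append to the KV cache for this generation step.
        mla_rope_generation(fused_q, q_pe, latent_cache)

    # Main stream keeps the uk-bgemm busy while rope runs on the side stream.
    q_nope = bmm_out(q_compressed, w_uk)

    # Join before the FMHA/XQA kernel consumes both results.
    torch.cuda.current_stream().wait_stream(side_stream)
    return fused_q, q_nope
```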

Summary by CodeRabbit

  • New Features

    • Exposes MLA rope-generation as a callable operation and enables it during generation.
    • Attention APIs now accept optional per-step sequence and quantization tensors for sequence-aware attention.
  • Improvements

    • Generation wiring now sources sequence/scale/quant parameters from public call arguments, reducing ephemeral workspace use.
    • Python bindings updated to pass new optional tensors through attention and generation paths.
  • Tests

    • Unit tests updated to exercise the new per-step tensors and FP8/quant handling.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
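
For example, to run only the A10 PyTorch stage without stopping at the first failure, comment: /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast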

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without careful validation can break the top of tree.

@yunruis yunruis requested a review from a team as a code owner October 20, 2025 08:00
@yunruis yunruis requested a review from brb-nv October 20, 2025 08:00
@yunruis yunruis changed the title from "accuracy on unnittest with 1 layer, bf16 + fp8" to "[None][feat] Add rope and uk-bgemm overlap for mla generation" on Oct 20, 2025
@yunruis
Contributor Author

yunruis commented Oct 20, 2025

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Oct 20, 2025

📝 Walkthrough

Adds end-to-end MLA rope-generation support: new CUDA MLA kernel and TorchScript binding, threads six optional per-attention tensors through Python and C++ bindings, replaces runtime workspace allocations with generation-time parameter wiring, and updates attention codepaths and tests to use the public generation parameters.

Changes

Cohort / File(s) Summary
Core attention runtime
cpp/tensorrt_llm/common/attentionOp.cpp
Removed several runtime workspace allocations for MLA, added runtime validity checks for params (seqQOffset, cache_seq_lens, fmha_tile_counter) and optional FP8 buffers, and switched fmha/XQA parameter pointers to source from public generation params instead of ephemeral workspace pointers.
Thop attention API & implementation
cpp/tensorrt_llm/thop/attentionOp.h, cpp/tensorrt_llm/thop/attentionOp.cpp
Appended six new optional tensor parameters to public/internal attention signatures and Runner/prepare overrides: cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter, mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer. Wired those parameters through context and generation branches into MLA/fmha structures.
New MLA rope kernel & binding
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
New file implementing MLARopeGeneration: argument aggregation, validation, KV-cache setup, MlaParams population, dtype-specialized dispatch helper, and TorchScript registration (torch.ops.trtllm.mla_rope_generation).
Bindings (nanobind / pybind11)
cpp/tensorrt_llm/nanobind/thop/bindings.cpp, cpp/tensorrt_llm/pybind/thop/bindings.cpp
Extended the Python-facing attention bindings with six new optional keyword args (default std::nullopt): cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter, mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer.
Build config
cpp/tensorrt_llm/thop/CMakeLists.txt
Added dsv3RopeOp.cpp to the th_common shared target source list and adjusted target link libraries in the affected target context.
Python attention backend
tensorrt_llm/_torch/attention_backend/trtllm.py
Plumbed six new optional tensors through TrtllmAttentionWrapper.run and TrtllmAttention.forward; added mla_rope_generation methods that invoke torch.ops.trtllm.mla_rope_generation with the extended argument list.
Generation internals
tensorrt_llm/_torch/modules/attention.py
In forward_generation, create per-step cu_q_seqlens, cu_kv_seqlens, and fmha_scheduler_counter; conditionally allocate FP8-scale buffers (mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer) when an FP8 KV cache is present; run mla_rope_generation in parallel with the existing BMM paths (a buffer-allocation sketch follows this list).
Custom ops (fake op for shape inference)
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
Added a fake-op registration trtllm::mla_rope_generation (no-op placeholder) for shape inference with the full multi-argument signature.
Tests
tests/unittest/_torch/attention/test_attention_mla.py
Updated generation test to prepare and pass cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter; conditional FP8 buffer allocation and calls to mla_rope_generation; extended forward/test paths to accept the new params.
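
To make the per-step wiring concrete, here is a rough sketch of the buffer preparation described in the Generation internals entry above. Tensor shapes and dtypes are illustrative assumptions rather than the exact allocations in forward_generation.

```python
import torch

def prepare_mla_generation_buffers(num_seqs, num_tokens, num_heads, head_dim,
                                   has_fp8_kv_cache, device="cuda"):
    """Sketch: per-step tensors threaded into attention / mla_rope_generation.

    Shapes and dtypes are assumptions for illustration; the real allocation
    lives in forward_generation (tensorrt_llm/_torch/modules/attention.py).
    """
    # Cumulative Q/KV sequence lengths, one entry per sequence plus a leading 0.
    cu_q_seqlens = torch.zeros(num_seqs + 1, dtype=torch.int32, device=device)
    cu_kv_seqlens = torch.zeros(num_seqs + 1, dtype=torch.int32, device=device)
    # Tile counter consumed by the FMHA scheduler.
    fmha_scheduler_counter = torch.zeros(1, dtype=torch.int32, device=device)

    mla_bmm1_scale = mla_bmm2_scale = quant_q_buffer = None
    if has_fp8_kv_cache:
        # FP8 path: extra scale buffers plus a pre-quantized Q buffer.
        mla_bmm1_scale = torch.empty(2, dtype=torch.float32, device=device)
        mla_bmm2_scale = torch.empty(1, dtype=torch.float32, device=device)
        quant_q_buffer = torch.empty(num_tokens, num_heads, head_dim,
                                     dtype=torch.float8_e4m3fn, device=device)

    return (cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter,
            mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer)
```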

Sequence Diagram(s)

sequenceDiagram
    actor Python
    participant TrtWrapper as TrtllmAttentionWrapper
    participant ThopAttention as thop.attention
    participant TorchOp as torch.ops.trtllm.mla_rope_generation
    participant CUDA as CUDA Kernel

    Python->>TrtWrapper: forward(fused_q,..., cu_q_seqlens?, cu_kv_seqlens?, fmha_scheduler_counter?, mla_bmm1_scale?, mla_bmm2_scale?, quant_q_buffer?)
    TrtWrapper->>ThopAttention: call attention(..., cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter, mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer)

    alt Generation path
        ThopAttention->>ThopAttention: prepare generation buffers / conditionally set FP8 scales
        par Main BMM
            ThopAttention->>CUDA: bmm_out()/fp8_block_scaling_bmm_out()
        and MLA rope generation
            ThopAttention->>TorchOp: mla_rope_generation(fused_q, q_pe, latent_cache, cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter, mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer, ...)
            TorchOp->>CUDA: validate & setup -> dispatch MLA kernel
            CUDA-->>TorchOp: complete
            TorchOp-->>ThopAttention: completion
        end
        ThopAttention->>CUDA: mqa.forward(generation=true, params from public args)
        CUDA-->>ThopAttention: attn_out
    end

    ThopAttention-->>TrtWrapper: output
    TrtWrapper-->>Python: result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 21.05%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description check (❓ Inconclusive): The PR description includes objectives and performance data but lacks the required template sections (Description, Test Coverage). Complete the Description section explaining what changed and why, and specify which test cases validate the new dsv3RopeOp functionality and the rope-bgemm overlap.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The PR title "[TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation" aligns with the primary change: introducing a standalone dsv3RopeOp decoupled from attentionOp during MLA generation to enable overlap-based performance optimization. The title is concise and follows the repository convention of ticket and type prefixes.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/thop/attentionOp.h (1)

37-68: Header hygiene: add include guard and document new params.

Guidelines require include guards and Doxygen on public prototypes. Replace pragma once with TRTLLM_ATTENTIONOP_H guards and add brief //! docs for the six new optional tensors.

Apply this diff header/footer:

-#pragma once
+#ifndef TRTLLM_ATTENTIONOP_H
+#define TRTLLM_ATTENTIONOP_H
@@
-void attention( /* existing params ... */,
+//! Multi-head attention entrypoint. New optional tensors are used to pass precomputed
+//! sequence lengths, FMHA scheduler state, MLA FP8 scales, and an FP8 Q buffer.
+void attention( /* existing params ... */,
     std::vector<std::optional<torch::Tensor>> spec_decoding_tensor_params,
     std::vector<std::optional<torch::Tensor>> sparse_attention_params,
     std::optional<torch::Tensor> cu_q_seqlens,            //!< [B+1] cumulative Q seq lens
     std::optional<torch::Tensor> cu_kv_seqlens,           //!< [B+1] cumulative KV seq lens
     std::optional<torch::Tensor> fmha_scheduler_counter,  //!< [1]  tile counter buffer
     std::optional<torch::Tensor> mla_bmm1_scale,          //!< [2]  bmm1 scales (with log2)
     std::optional<torch::Tensor> mla_bmm2_scale,          //!< [1]  bmm2 scale
     std::optional<torch::Tensor> quant_q_buffer           //!< [..] pre-quantized FP8 Q
 );
+
+#endif // TRTLLM_ATTENTIONOP_H
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1)

31-62: Remove duplicate docstring in pybind11 and nanobind bindings—docstring must appear only once, after all parameters and before call_guard.

Parameter order and names match the torch_ext::attention signature correctly. However, both the pybind11 and nanobind bindings contain a critical error: the docstring "Multi-head attention operation" appears twice—once incorrectly placed between sparse_attention_params and cu_q_seqlens, and once correctly placed before the call_guard.

The pybind11/nanobind binding syntax requires the docstring to appear exactly once, after all parameter definitions. The intermediate docstring will cause binding errors.

Fix locations:

  • cpp/tensorrt_llm/pybind/thop/bindings.cpp: Remove the "Multi-head attention operation" string appearing after sparse_attention_params (line 57 in the binding)
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp: Remove the "Multi-head attention operation" string appearing after sparse_attention_params (same line position)

Keep only the docstring that appears immediately before call_guard.

🧹 Nitpick comments (13)
cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1)

59-62: Place docstring last; keep kwargs contiguous.

Minor binding hygiene: move the docstring literal after the new kwargs and before call_guard to match typical nanobind usage and avoid confusion.

Apply this diff:

-        nb::arg("mla_tensor_params"), nb::arg("attention_chunk_size") = std::nullopt,
-        nb::arg("softmax_stats_tensor") = std::nullopt, nb::arg("spec_decoding_bool_params"),
-        nb::arg("spec_decoding_tensor_params"), nb::arg("sparse_attention_params"), "Multi-head attention operation",
+        nb::arg("mla_tensor_params"), nb::arg("attention_chunk_size") = std::nullopt,
+        nb::arg("softmax_stats_tensor") = std::nullopt, nb::arg("spec_decoding_bool_params"),
+        nb::arg("spec_decoding_tensor_params"), nb::arg("sparse_attention_params"),
         nb::arg("cu_q_seqlens") = std::nullopt, nb::arg("cu_kv_seqlens") = std::nullopt,
         nb::arg("fmha_scheduler_counter") = std::nullopt, nb::arg("mla_bmm1_scale") = std::nullopt,
-        nb::arg("mla_bmm2_scale") = std::nullopt, nb::arg("quant_q_buffer") = std::nullopt,
-        nb::call_guard<nb::gil_scoped_release>());
+        nb::arg("mla_bmm2_scale") = std::nullopt, nb::arg("quant_q_buffer") = std::nullopt,
+        "Multi-head attention operation", nb::call_guard<nb::gil_scoped_release>());
cpp/tensorrt_llm/thop/attentionOp.h (1)

64-68: Signature size is getting risky; consider a params struct.

This function now has 60+ parameters. Pack related options into a struct (e.g., AttentionArgs) to reduce call-site error risk and ease future extension.

cpp/tensorrt_llm/kernels/mlaKernels.cu (2)

935-961: Guard debug kernels; avoid unconditional device printf symbols.

Define these only under a build-time flag to keep release binaries lean and prevent accidental device I/O.

Apply this diff:

+#if defined(TLLM_ENABLE_DEBUG_KERNEL_PRINTS)
 __global__ void printCudaVectorInt32(int32_t const* vec, int32_t size)
 {
     for (int i = 0; i < size; i++)
     {
         printf("%d, ", vec[i]);
     }
     printf("\n");
 }
 
 __global__ void printCudaVectorUint32(uint32_t const* vec, int32_t size)
 {
     for (int i = 0; i < size; i++)
     {
         printf("%u, ", vec[i]);
     }
     printf("\n");
 }
 
 __global__ void printCudaVectorFloat(float const* vec, int32_t size)
 {
     for (int i = 0; i < size; i++)
     {
         printf("%f, ", vec[i]);
     }
     printf("\n");
 }
+#endif

1019-1096: Remove commented-out debugging or wrap with a macro.

Large commented blocks add noise and violate “no dead code”. Either delete or guard with a macro that’s off by default.

Example:

-    // printf("=================invokeMLARopeGeneration============\n");
-    // ...
+#if defined(TLLM_ENABLE_DEBUG_KERNEL_PRINTS)
+    printf("=================invokeMLARopeGeneration============\n");
+    // optional debug prints...
+#endif
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1)

59-62: Avoid duplicate docstrings; keep a single doc literal at the end.

Two "Multi-head attention operation" strings are passed; retain only one at the end.

Apply this diff:

-        py::arg("spec_decoding_tensor_params"), py::arg("sparse_attention_params"), "Multi-head attention operation",
+        py::arg("spec_decoding_tensor_params"), py::arg("sparse_attention_params"),
         py::arg("cu_q_seqlens") = std::nullopt, py::arg("cu_kv_seqlens") = std::nullopt,
         py::arg("fmha_scheduler_counter") = std::nullopt, py::arg("mla_bmm1_scale") = std::nullopt,
         py::arg("mla_bmm2_scale") = std::nullopt, py::arg("quant_q_buffer") = std::nullopt,
         "Multi-head attention operation", py::call_guard<py::gil_scoped_release>());
tests/unittest/_torch/attention/test_attention_mla.py (2)

356-362: Type hint mismatch for num_generation_steps.

pytest parametrize feeds an int (10, 2, 3), but the signature annotates List[int]. Adjust to int to avoid confusion and IDE/type-checker noise.

Apply this diff:

-def test_attention_mla(scenario: Scenario, context_sequence_lengths: List[int],
-                       generation_seq_len_q: int,
-                       num_generation_steps: List[int]):
+def test_attention_mla(scenario: Scenario, context_sequence_lengths: List[int],
+                       generation_seq_len_q: int,
+                       num_generation_steps: int):

394-399: Remove verbose prints to meet PR goal (“drop print info”).

These prints spam test output. Either remove them or gate behind an env flag (e.g., if os.getenv("TRTLLM_TEST_VERBOSE") == "1":).

Also applies to: 524-529, 599-599, 652-653, 684-686, 697-699, 737-740, 815-835
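
A small helper along these lines (the env var name is taken from the suggestion above) keeps the prints available locally without spamming CI logs:

```python
import os

VERBOSE = os.getenv("TRTLLM_TEST_VERBOSE") == "1"  # opt-in debug output

def vprint(*args, **kwargs):
    """Print only when TRTLLM_TEST_VERBOSE=1 is set in the environment."""
    if VERBOSE:
        print(*args, **kwargs)
```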

tensorrt_llm/_torch/attention_backend/trtllm.py (2)

1894-1895: Normalize non-ASCII punctuation in comments.

Replace the fullwidth “？” with the ASCII “?” to satisfy linters and keep consistency.


1433-1433: Remove unused kwargs or thread them through intentionally.

kwargs is unused (Ruff ARG002). Drop it from the signature or pass to lower layers if intended.

cpp/tensorrt_llm/thop/attentionOp.cpp (1)

598-603: Standardize comment language and tone.

Non-English comments (“数据类型检测…”, “缓存检查…”, “请求类型…”, “工作空间管理…”) reduce readability. Please convert to concise English to match project style.

Also applies to: 654-656, 740-742, 758-760, 791-799

tensorrt_llm/_torch/modules/attention.py (1)

1461-1468: Remove redundant guard.

if True: is unnecessary noise; call mla_rope_generation unconditionally or gate with a real condition.

Apply this diff:

-        if True:
-            # print("[forward_generation] apply_rope_generation")
-            self.mqa.mla_rope_generation(fused_q, q_pe, latent_cache,
-                                         attn_metadata, cu_q_seqlens,
-                                         cu_kv_seqlens, fmha_scheduler_counter,
-                                         mla_bmm1_scale, mla_bmm2_scale,
-                                         quant_q_buffer)
+        self.mqa.mla_rope_generation(
+            fused_q, q_pe, latent_cache, attn_metadata,
+            cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter,
+            mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer
+        )
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (2)

96-115: Normalize comment language and clarity.

Replace mixed Chinese/English with concise English to meet project standards and improve maintainability.

Apply this diff (example for the large block; repeat similarly for smaller lines):

-/*
-此处需要处理的:
-input:
-    q_pe_ld, q_pe_stride,
-    cache_type
-output:
-    workspace:
-        scale, ...
-    fused_q: [q_len, 128 * 576] gen only
-    kv_cache
-
-验证coverage
-    模型
-        dsv3, dsv3_lite
-    runtime:
-        ctx only, gen only, continuous batchign
-    gen_tokens
-        1, mtp, others...
-*/
+/*
+TODO/Notes:
+- Inputs: q_pe_ld/q_pe_stride, cache_type.
+- Outputs: fused_q [q_len, heads*(kv_lora_rank+qk_rope_head_dim)], KV cache updates.
+- Coverage targets:
+  * Models: dsv3, dsv3_lite
+  * Runtime: context-only, generation-only, continuous batching
+  * Gen tokens: 1, MTP, others
+*/

Also applies to: 154-160, 274-281, 598-603


210-218: Guard KV-cache pool pointer shape assumptions.

Add TORCH_CHECK messages that mention expected shapes/dims in English; current checks are fine—this is a readability nit only.

Also applies to: 246-258

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d05079b and 9e95278.

📒 Files selected for processing (11)
  • cpp/tensorrt_llm/common/attentionOp.cpp (4 hunks)
  • cpp/tensorrt_llm/kernels/mlaKernels.cu (2 hunks)
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/thop/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
  • cpp/tensorrt_llm/thop/attentionOp.cpp (11 hunks)
  • cpp/tensorrt_llm/thop/attentionOp.h (1 hunks)
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1 hunks)
  • tensorrt_llm/_torch/attention_backend/trtllm.py (7 hunks)
  • tensorrt_llm/_torch/modules/attention.py (6 hunks)
  • tests/unittest/_torch/attention/test_attention_mla.py (13 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...

Files:

  • cpp/tensorrt_llm/pybind/thop/bindings.cpp
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp
  • cpp/tensorrt_llm/thop/attentionOp.h
  • cpp/tensorrt_llm/common/attentionOp.cpp
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • cpp/tensorrt_llm/kernels/mlaKernels.cu
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.

Files:

  • cpp/tensorrt_llm/pybind/thop/bindings.cpp
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp
  • cpp/tensorrt_llm/thop/attentionOp.h
  • cpp/tensorrt_llm/common/attentionOp.cpp
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • cpp/tensorrt_llm/kernels/mlaKernels.cu
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • cpp/tensorrt_llm/pybind/thop/bindings.cpp
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp
  • cpp/tensorrt_llm/thop/attentionOp.h
  • cpp/tensorrt_llm/common/attentionOp.cpp
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • cpp/tensorrt_llm/kernels/mlaKernels.cu
  • tests/unittest/_torch/attention/test_attention_mla.py
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
  • tensorrt_llm/_torch/modules/attention.py
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.

Files:

  • cpp/tensorrt_llm/pybind/thop/bindings.cpp
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp
  • cpp/tensorrt_llm/thop/attentionOp.h
  • cpp/tensorrt_llm/common/attentionOp.cpp
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • cpp/tensorrt_llm/pybind/thop/bindings.cpp
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp
  • cpp/tensorrt_llm/thop/attentionOp.h
  • cpp/tensorrt_llm/common/attentionOp.cpp
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • cpp/tensorrt_llm/kernels/mlaKernels.cu
  • tests/unittest/_torch/attention/test_attention_mla.py
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
  • tensorrt_llm/_torch/modules/attention.py
**/*.{h,hpp,hh,hxx}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.

Files:

  • cpp/tensorrt_llm/thop/attentionOp.h
**/*.{h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).

Files:

  • cpp/tensorrt_llm/thop/attentionOp.h
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tests/unittest/_torch/attention/test_attention_mla.py
  • tensorrt_llm/_torch/modules/attention.py
🧬 Code graph analysis (6)
cpp/tensorrt_llm/common/attentionOp.cpp (1)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplPrecompiled.cpp (4)
  • xqaParams (106-135)
  • xqaParams (106-106)
  • xqaParams (137-154)
  • xqaParams (137-137)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
cpp/tests/unit_tests/kernels/ropeTest.cu (1)
  • attention_mask (620-662)
cpp/tensorrt_llm/kernels/mlaKernels.cu (2)
cpp/tensorrt_llm/kernels/mlaChunkedPrefill.cu (4)
  • void (75-104)
  • void (107-136)
  • void (141-221)
  • void (226-289)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (1)
  • kv_cache_buffer (117-117)
tests/unittest/_torch/attention/test_attention_mla.py (3)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
  • attn_metadata (121-122)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
  • prepare (960-1047)
  • mla_rope_generation (1771-1920)
tensorrt_llm/quantization/mode.py (1)
  • has_fp8_kv_cache (166-167)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (3)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (2)
  • kv_cache_buffer (117-117)
  • cache_type (190-190)
cpp/tensorrt_llm/kernels/mlaKernels.cu (2)
  • invokeMLARopeGeneration (999-1097)
  • invokeMLARopeGeneration (999-999)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplCommon.h (1)
  • KVBlockArray (170-187)
tensorrt_llm/_torch/modules/attention.py (3)
tensorrt_llm/_torch/attention_backend/interface.py (2)
  • num_seqs (246-250)
  • num_tokens (268-269)
tensorrt_llm/quantization/mode.py (1)
  • has_fp8_kv_cache (166-167)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
  • mla_rope_generation (1771-1920)
🪛 Clang (14.0.6)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp

[error] 1-1: 'tensorrt_llm/common/attentionOp.h' file not found

(clang-diagnostic-error)

🪛 Ruff (0.14.0)
tensorrt_llm/_torch/attention_backend/trtllm.py

1433-1433: Unused method argument: kwargs

(ARG002)


1791-1791: Comment contains ambiguous ， (FULLWIDTH COMMA). Did you mean , (COMMA)?

(RUF003)


1894-1894: Comment contains ambiguous ？ (FULLWIDTH QUESTION MARK). Did you mean ? (QUESTION MARK)?

(RUF003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/thop/CMakeLists.txt (1)

38-99: No issues found. Addition of dsv3RopeOp.cpp to the shared library is correct.

The source file exists, contains valid C++ code, and follows standard CMake practices—add_library() is used to create a shared library from source files. The change properly integrates the new MLA rope generation kernel implementation into the build.

cpp/tensorrt_llm/common/attentionOp.cpp (1)

1221-1226: Good: plumbs quant_q buffer into XQA.

Setting xqaParams.quant_q_buffer_ptr enables FP8 Q consumption in XQA. No issues.

@yunruis
Contributor Author

yunruis commented Oct 20, 2025

/bot run

@yunruis yunruis force-pushed the user/yunruis/rope_overlap branch from 9e95278 to bd29ac1 on October 20, 2025 08:36
@yunruis
Contributor Author

yunruis commented Oct 20, 2025

/bot run

1 similar comment
@yunruis
Contributor Author

yunruis commented Oct 20, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21876 [ run ] triggered by Bot. Commit: bd29ac1

@tensorrt-cicd
Collaborator

PR_Github #21876 [ run ] completed with state FAILURE. Commit: bd29ac1
/LLM/main/L0_MergeRequest_PR pipeline #16490 completed with status: 'FAILURE'

@yunruis
Contributor Author

yunruis commented Oct 20, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21892 [ run ] triggered by Bot. Commit: bd29ac1

@tensorrt-cicd
Collaborator

PR_Github #21892 [ run ] completed with state FAILURE. Commit: bd29ac1
/LLM/main/L0_MergeRequest_PR pipeline #16503 completed with status: 'FAILURE'

@yunruis
Contributor Author

yunruis commented Oct 20, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #21917 [ run ] triggered by Bot. Commit: bd29ac1

@tensorrt-cicd
Collaborator

PR_Github #21917 [ run ] completed with state FAILURE. Commit: bd29ac1
/LLM/main/L0_MergeRequest_PR pipeline #16522 completed with status: 'FAILURE'

@yunruis yunruis force-pushed the user/yunruis/rope_overlap branch from bd29ac1 to 5e4f5de on October 20, 2025 16:16
@yunruis
Contributor Author

yunruis commented Oct 20, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #21925 [ run ] triggered by Bot. Commit: 5e4f5de

@tensorrt-cicd
Collaborator

PR_Github #21925 [ run ] completed with state SUCCESS. Commit: 5e4f5de
/LLM/main/L0_MergeRequest_PR pipeline #16527 completed with status: 'FAILURE'

@yunruis yunruis force-pushed the user/yunruis/rope_overlap branch 2 times, most recently from a497cbe to 51cfc70, on October 21, 2025 09:29
@yunruis
Contributor Author

yunruis commented Oct 21, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #22040 [ run ] triggered by Bot. Commit: 51cfc70

@kaiyux kaiyux changed the title from "[None][feat] Add rope and uk-bgemm overlap for mla generation" to "[TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation" on Oct 21, 2025
Collaborator

@PerkzZheng PerkzZheng left a comment


Others LGTM, thanks for the work. Do you have performance numbers to show the gains?

@tensorrt-cicd
Collaborator

PR_Github #22040 [ run ] completed with state SUCCESS. Commit: 51cfc70
/LLM/main/L0_MergeRequest_PR pipeline #16618 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #23489 [ run ] triggered by Bot. Commit: fea53a7

@tensorrt-cicd
Collaborator

PR_Github #23457 [ run ] completed with state ABORTED. Commit: e52cfc3
LLM/main/L0_MergeRequest_PR #17666 (Blue Ocean) completed with status: ABORTED

@yunruis yunruis force-pushed the user/yunruis/rope_overlap branch 2 times, most recently from 76d35b3 to 954a36e, on November 4, 2025 10:12
@yunruis
Contributor Author

yunruis commented Nov 4, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #23502 [ run ] triggered by Bot. Commit: 954a36e

@tensorrt-cicd
Collaborator

PR_Github #23489 [ run ] completed with state ABORTED. Commit: fea53a7
LLM/main/L0_MergeRequest_PR #17681 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #23502 [ run ] completed with state SUCCESS. Commit: 954a36e
/LLM/main/L0_MergeRequest_PR pipeline #17689 completed with status: 'FAILURE'

@yunruis
Contributor Author

yunruis commented Nov 4, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #23523 [ run ] triggered by Bot. Commit: 954a36e

@tensorrt-cicd
Collaborator

PR_Github #23523 [ run ] completed with state SUCCESS. Commit: 954a36e
/LLM/main/L0_MergeRequest_PR pipeline #17702 completed with status: 'FAILURE'

@yunruis yunruis force-pushed the user/yunruis/rope_overlap branch from 954a36e to 218278f on November 5, 2025 03:24
@yunruis
Contributor Author

yunruis commented Nov 5, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #23570 [ run ] triggered by Bot. Commit: 218278f

@yunruis
Contributor Author

yunruis commented Nov 5, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #23603 [ run ] triggered by Bot. Commit: 218278f

@tensorrt-cicd
Collaborator

PR_Github #23570 [ run ] completed with state ABORTED. Commit: 218278f
LLM/main/L0_MergeRequest_PR #17737 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #23603 [ run ] completed with state SUCCESS. Commit: 218278f
/LLM/main/L0_MergeRequest_PR pipeline #17760 completed with status: 'FAILURE'

Introduce dsv3RopeOp as a standalone operation and decouple it from attentionOp during the MLA generation stage. This improves modularity and enables overlap with bgemm for perf optimization

Signed-off-by: yunruis <[email protected]>
@yunruis yunruis force-pushed the user/yunruis/rope_overlap branch from 218278f to 4c2e3eb on November 5, 2025 11:19
@yunruis
Contributor Author

yunruis commented Nov 5, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #23642 [ run ] triggered by Bot. Commit: 4c2e3eb

@tensorrt-cicd
Collaborator

PR_Github #23642 [ run ] completed with state SUCCESS. Commit: 4c2e3eb
/LLM/main/L0_MergeRequest_PR pipeline #17787 completed with status: 'FAILURE'

@yunruis
Contributor Author

yunruis commented Nov 5, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #23657 [ run ] triggered by Bot. Commit: 4c2e3eb

@tensorrt-cicd
Collaborator

PR_Github #23657 [ run ] completed with state SUCCESS. Commit: 4c2e3eb
/LLM/main/L0_MergeRequest_PR pipeline #17798 completed with status: 'FAILURE'

@yunruis
Contributor Author

yunruis commented Nov 6, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #23685 [ run ] triggered by Bot. Commit: 4c2e3eb

@tensorrt-cicd
Collaborator

PR_Github #23685 [ run ] completed with state SUCCESS. Commit: 4c2e3eb
/LLM/main/L0_MergeRequest_PR pipeline #17820 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@yuxianq yuxianq merged commit 5154556 into NVIDIA:main Nov 6, 2025
9 checks passed