[TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation #8495
Conversation
/bot run
📝 Walkthrough
Adds end-to-end MLA rope-generation support: a new CUDA MLA kernel and TorchScript binding, threads six optional per-attention tensors through the Python and C++ bindings, replaces runtime workspace allocations with generation-time parameter wiring, and updates attention codepaths and tests to use the public generation parameters.
Changes
Sequence Diagram(s)
sequenceDiagram
actor Python
participant TrtWrapper as TrtllmAttentionWrapper
participant ThopAttention as thop.attention
participant TorchOp as torch.ops.trtllm.mla_rope_generation
participant CUDA as CUDA Kernel
Python->>TrtWrapper: forward(fused_q,..., cu_q_seqlens?, cu_kv_seqlens?, fmha_scheduler_counter?, mla_bmm1_scale?, mla_bmm2_scale?, quant_q_buffer?)
TrtWrapper->>ThopAttention: call attention(..., cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter, mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer)
alt Generation path
ThopAttention->>ThopAttention: prepare generation buffers / conditionally set FP8 scales
par Main BMM
ThopAttention->>CUDA: bmm_out()/fp8_block_scaling_bmm_out()
and MLA rope generation
ThopAttention->>TorchOp: mla_rope_generation(fused_q, q_pe, latent_cache, cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter, mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer, ...)
TorchOp->>CUDA: validate & setup -> dispatch MLA kernel
CUDA-->>TorchOp: complete
TorchOp-->>ThopAttention: completion
end
ThopAttention->>CUDA: mqa.forward(generation=true, params from public args)
CUDA-->>ThopAttention: attn_out
end
ThopAttention-->>TrtWrapper: output
TrtWrapper-->>Python: result
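The `par` block above is the heart of the change: the uk-bgemm and the MLA rope generation are issued so that they can overlap instead of running back-to-back. As a rough illustration of that pattern only (the `launchUkBgemm`, `launchMlaRopeGeneration`, and `launchFmhaGeneration` calls below are hypothetical placeholders, and the PR actually coordinates this through `thop.attention` and `torch.ops.trtllm.mla_rope_generation` as shown in the diagram), a two-stream host-side sketch could look like this:

```cpp
#include <cuda_runtime.h>

// Illustrative sketch of overlapping the uk-bgemm with MLA rope generation.
// The launch* calls are hypothetical stand-ins for the real kernels; error
// checking is omitted for brevity.
void generationStep(cudaStream_t mainStream, cudaStream_t ropeStream, cudaEvent_t ropeDone)
{
    // uk-bgemm runs on the main stream.
    // launchUkBgemm(mainStream, /* q_nope, uk weights, ... */);

    // RoPE on q_pe plus latent-cache append runs on a second stream, so it
    // overlaps with the bgemm.
    // launchMlaRopeGeneration(ropeStream, /* fused_q, q_pe, latent_cache, ... */);
    cudaEventRecord(ropeDone, ropeStream);

    // The fused generation MQA kernel needs both results, so the main stream
    // waits for the rope work before launching it.
    cudaStreamWaitEvent(mainStream, ropeDone, 0);
    // launchFmhaGeneration(mainStream, /* fused_q, kv cache, scales, ... */);
}
```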
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/thop/attentionOp.h (1)
37-68: Header hygiene: add include guard and document new params.
Guidelines require include guards and Doxygen on public prototypes. Replace `#pragma once` with `TRTLLM_ATTENTIONOP_H` guards and add brief `//!` docs for the six new optional tensors.
Apply this diff header/footer:
-#pragma once
+#ifndef TRTLLM_ATTENTIONOP_H
+#define TRTLLM_ATTENTIONOP_H
@@
-void attention( /* existing params ... */,
+//! Multi-head attention entrypoint. New optional tensors are used to pass precomputed
+//! sequence lengths, FMHA scheduler state, MLA FP8 scales, and an FP8 Q buffer.
+void attention( /* existing params ... */,
     std::vector<std::optional<torch::Tensor>> spec_decoding_tensor_params,
     std::vector<std::optional<torch::Tensor>> sparse_attention_params,
     std::optional<torch::Tensor> cu_q_seqlens,           //!< [B+1] cumulative Q seq lens
     std::optional<torch::Tensor> cu_kv_seqlens,          //!< [B+1] cumulative KV seq lens
     std::optional<torch::Tensor> fmha_scheduler_counter, //!< [1] tile counter buffer
     std::optional<torch::Tensor> mla_bmm1_scale,         //!< [2] bmm1 scales (with log2)
     std::optional<torch::Tensor> mla_bmm2_scale,         //!< [1] bmm2 scale
     std::optional<torch::Tensor> quant_q_buffer          //!< [..] pre-quantized FP8 Q
 );
+
+#endif // TRTLLM_ATTENTIONOP_H
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1)
31-62: Remove duplicate docstring in pybind11 and nanobind bindings: the docstring must appear only once, after all parameters and before `call_guard`.
Parameter order and names match the `torch_ext::attention` signature correctly. However, both the pybind11 and nanobind bindings contain a critical error: the docstring `"Multi-head attention operation"` appears twice: once incorrectly placed between `sparse_attention_params` and `cu_q_seqlens`, and once correctly placed before the `call_guard`.
The pybind11/nanobind binding syntax requires the docstring to appear exactly once, after all parameter definitions. The intermediate docstring will cause binding errors.
Fix locations:
`cpp/tensorrt_llm/pybind/thop/bindings.cpp`: Remove the `"Multi-head attention operation"` string appearing after `sparse_attention_params` (line 57 in the binding).
`cpp/tensorrt_llm/nanobind/thop/bindings.cpp`: Remove the `"Multi-head attention operation"` string appearing after `sparse_attention_params` (same line position).
Keep only the docstring that appears immediately before `call_guard`.
🧹 Nitpick comments (13)
cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1)
59-62: Place docstring last; keep kwargs contiguous.
Minor binding hygiene: move the docstring literal after the new kwargs and before `call_guard` to match typical nanobind usage and avoid confusion.
Apply this diff:
- nb::arg("mla_tensor_params"), nb::arg("attention_chunk_size") = std::nullopt, - nb::arg("softmax_stats_tensor") = std::nullopt, nb::arg("spec_decoding_bool_params"), - nb::arg("spec_decoding_tensor_params"), nb::arg("sparse_attention_params"), "Multi-head attention operation", + nb::arg("mla_tensor_params"), nb::arg("attention_chunk_size") = std::nullopt, + nb::arg("softmax_stats_tensor") = std::nullopt, nb::arg("spec_decoding_bool_params"), + nb::arg("spec_decoding_tensor_params"), nb::arg("sparse_attention_params"), nb::arg("cu_q_seqlens") = std::nullopt, nb::arg("cu_kv_seqlens") = std::nullopt, nb::arg("fmha_scheduler_counter") = std::nullopt, nb::arg("mla_bmm1_scale") = std::nullopt, - nb::arg("mla_bmm2_scale") = std::nullopt, nb::arg("quant_q_buffer") = std::nullopt, - nb::call_guard<nb::gil_scoped_release>()); + nb::arg("mla_bmm2_scale") = std::nullopt, nb::arg("quant_q_buffer") = std::nullopt, + "Multi-head attention operation", nb::call_guard<nb::gil_scoped_release>());cpp/tensorrt_llm/thop/attentionOp.h (1)
64-68: Signature size is getting risky; consider a params struct.
This function now has 60+ parameters. Pack related options into a struct (e.g., `AttentionArgs`) to reduce call-site error risk and ease future extension.
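For illustration only, the grouping suggested above could look roughly like the following; the struct name and member names are hypothetical and not part of the current API:

```cpp
#include <optional>
#include <torch/torch.h>

//! Hypothetical grouping of the six new optional MLA generation tensors.
struct MlaGenerationArgs
{
    std::optional<torch::Tensor> cuQSeqlens;           //!< [B+1] cumulative Q seq lens
    std::optional<torch::Tensor> cuKvSeqlens;          //!< [B+1] cumulative KV seq lens
    std::optional<torch::Tensor> fmhaSchedulerCounter; //!< [1] tile counter buffer
    std::optional<torch::Tensor> mlaBmm1Scale;         //!< [2] bmm1 scales (with log2)
    std::optional<torch::Tensor> mlaBmm2Scale;         //!< [1] bmm2 scale
    std::optional<torch::Tensor> quantQBuffer;         //!< pre-quantized FP8 Q buffer
};

// attention(...) could then take a single MlaGenerationArgs const& instead of six
// trailing std::optional<torch::Tensor> parameters.
```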
cpp/tensorrt_llm/kernels/mlaKernels.cu (2)
935-961: Guard debug kernels; avoid unconditional device printf symbols.
Define these only under a build-time flag to keep release binaries lean and prevent accidental device I/O.
Apply this diff:
+#if defined(TLLM_ENABLE_DEBUG_KERNEL_PRINTS)
 __global__ void printCudaVectorInt32(int32_t const* vec, int32_t size)
 {
     for (int i = 0; i < size; i++) { printf("%d, ", vec[i]); }
     printf("\n");
 }
 __global__ void printCudaVectorUint32(uint32_t const* vec, int32_t size)
 {
     for (int i = 0; i < size; i++) { printf("%u, ", vec[i]); }
     printf("\n");
 }
 __global__ void printCudaVectorFloat(float const* vec, int32_t size)
 {
     for (int i = 0; i < size; i++) { printf("%f, ", vec[i]); }
     printf("\n");
 }
+#endif
1019-1096: Remove commented-out debugging or wrap with a macro.
Large commented blocks add noise and violate "no dead code". Either delete or guard with a macro that's off by default.
Example:
-    // printf("=================invokeMLARopeGeneration============\n");
-    // ...
+#if defined(TLLM_ENABLE_DEBUG_KERNEL_PRINTS)
+    printf("=================invokeMLARopeGeneration============\n");
+    // optional debug prints...
+#endif
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1)
59-62: Avoid duplicate docstrings; keep a single doc literal at the end.
Two "Multi-head attention operation" strings are passed; retain only one at the end.
Apply this diff:
- py::arg("spec_decoding_tensor_params"), py::arg("sparse_attention_params"), "Multi-head attention operation", + py::arg("spec_decoding_tensor_params"), py::arg("sparse_attention_params"), py::arg("cu_q_seqlens") = std::nullopt, py::arg("cu_kv_seqlens") = std::nullopt, py::arg("fmha_scheduler_counter") = std::nullopt, py::arg("mla_bmm1_scale") = std::nullopt, py::arg("mla_bmm2_scale") = std::nullopt, py::arg("quant_q_buffer") = std::nullopt, "Multi-head attention operation", py::call_guard<py::gil_scoped_release>());tests/unittest/_torch/attention/test_attention_mla.py (2)
356-362: Type hint mismatch for num_generation_steps.
pytest parametrize feeds an int (10, 2, 3), but the signature annotates List[int]. Adjust to int to avoid confusion and IDE/type-checker noise.
Apply this diff:
-def test_attention_mla(scenario: Scenario, context_sequence_lengths: List[int],
-                       generation_seq_len_q: int,
-                       num_generation_steps: List[int]):
+def test_attention_mla(scenario: Scenario, context_sequence_lengths: List[int],
+                       generation_seq_len_q: int,
+                       num_generation_steps: int):
394-399: Remove verbose prints to meet PR goal ("drop print info").
These prints spam test output. Either remove them or gate behind an env flag (e.g., if os.getenv("TRTLLM_TEST_VERBOSE") == "1":).
Also applies to: 524-529, 599-599, 652-653, 684-686, 697-699, 737-740, 815-835
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
1894-1895: Normalize non-ASCII punctuation in comments.
Replace the fullwidth "？" with the ASCII "?" to satisfy linters and keep consistency.
1433-1433: Remove unused kwargs or thread them through intentionally.
`kwargs` is unused (Ruff ARG002). Drop it from the signature or pass it to lower layers if intended.
cpp/tensorrt_llm/thop/attentionOp.cpp (1)
598-603: Standardize comment language and tone.
Non-English comments ("数据类型检测…" (data type detection), "缓存检查…" (cache check), "请求类型…" (request type), "工作空间管理…" (workspace management)) reduce readability. Please convert to concise English to match project style.
Also applies to: 654-656, 740-742, 758-760, 791-799
tensorrt_llm/_torch/modules/attention.py (1)
1461-1468: Remove redundant guard.
`if True:` is unnecessary noise; call `mla_rope_generation` unconditionally or gate with a real condition.
Apply this diff:
-        if True:
-            # print("[forward_generation] apply_rope_generation")
-            self.mqa.mla_rope_generation(fused_q, q_pe, latent_cache,
-                                         attn_metadata, cu_q_seqlens,
-                                         cu_kv_seqlens, fmha_scheduler_counter,
-                                         mla_bmm1_scale, mla_bmm2_scale,
-                                         quant_q_buffer)
+        self.mqa.mla_rope_generation(
+            fused_q, q_pe, latent_cache, attn_metadata,
+            cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter,
+            mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer
+        )
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (2)
96-115: Normalize comment language and clarity.
Replace mixed Chinese/English with concise English to meet project standards and improve maintainability.
Apply this diff (example for the large block; repeat similarly for smaller lines):
-/*
-此处需要处理的:
-input:
-    q_pe_ld, q_pe_stride,
-    cache_type
-output:
-    workspace:
-        scale, ...
-    fused_q: [q_len, 128 * 576] gen only
-    kv_cache
-
-验证coverage
-    模型
-        dsv3, dsv3_lite
-    runtime:
-        ctx only, gen only, continuous batchign
-    gen_tokens
-        1, mtp, others...
-*/
+/*
+TODO/Notes:
+- Inputs: q_pe_ld/q_pe_stride, cache_type.
+- Outputs: fused_q [q_len, heads*(kv_lora_rank+qk_rope_head_dim)], KV cache updates.
+- Coverage targets:
+  * Models: dsv3, dsv3_lite
+  * Runtime: context-only, generation-only, continuous batching
+  * Gen tokens: 1, MTP, others
+*/
Also applies to: 154-160, 274-281, 598-603
210-218: Guard KV-cache pool pointer shape assumptions.
Add TORCH_CHECK messages that mention expected shapes/dims in English; the current checks are fine, this is a readability nit only.
Also applies to: 246-258
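If the messages are expanded, here is a sketch of the style being asked for; the tensor names and expected ranks below are illustrative only, not the op's actual contract:

```cpp
#include <torch/torch.h>

// Illustrative only: the real expected shapes live in dsv3RopeOp.cpp.
void checkKvCachePoolShapes(
    torch::Tensor const& kvCachePool, torch::Tensor const& cuKvSeqlens, int64_t batchSize)
{
    // Message spells out the expected layout so failures are self-explanatory.
    TORCH_CHECK(kvCachePool.dim() == 4,
        "kv_cache_pool must be 4-D [numBlocks, 2, tokensPerBlock, hiddenPerToken], got ",
        kvCachePool.dim(), " dims");
    TORCH_CHECK(cuKvSeqlens.size(0) == batchSize + 1,
        "cu_kv_seqlens must have batch_size + 1 entries, got ", cuKvSeqlens.size(0));
}
```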
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
cpp/tensorrt_llm/common/attentionOp.cpp (4 hunks)
cpp/tensorrt_llm/kernels/mlaKernels.cu (2 hunks)
cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1 hunks)
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1 hunks)
cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (11 hunks)
cpp/tensorrt_llm/thop/attentionOp.h (1 hunks)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1 hunks)
tensorrt_llm/_torch/attention_backend/trtllm.py (7 hunks)
tensorrt_llm/_torch/modules/attention.py (6 hunks)
tests/unittest/_torch/attention/test_attention_mla.py (13 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tests/unittest/_torch/attention/test_attention_mla.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/modules/attention.py
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tests/unittest/_torch/attention/test_attention_mla.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/modules/attention.py
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/tensorrt_llm/thop/attentionOp.h
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/tensorrt_llm/thop/attentionOp.h
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/attention_backend/trtllm.py
tests/unittest/_torch/attention/test_attention_mla.py
tensorrt_llm/_torch/modules/attention.py
🧬 Code graph analysis (6)
cpp/tensorrt_llm/common/attentionOp.cpp (1)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplPrecompiled.cpp (4)
xqaParams (106-135)
xqaParams (106-106)
xqaParams (137-154)
xqaParams (137-137)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
cpp/tests/unit_tests/kernels/ropeTest.cu (1)
attention_mask(620-662)
cpp/tensorrt_llm/kernels/mlaKernels.cu (2)
cpp/tensorrt_llm/kernels/mlaChunkedPrefill.cu (4)
void (75-104)
void (107-136)
void (141-221)
void (226-289)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (1)
kv_cache_buffer(117-117)
tests/unittest/_torch/attention/test_attention_mla.py (3)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
attn_metadata (121-122)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
prepare (960-1047)
mla_rope_generation (1771-1920)
tensorrt_llm/quantization/mode.py (1)
has_fp8_kv_cache(166-167)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (3)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (2)
kv_cache_buffer (117-117)
cache_type (190-190)
cpp/tensorrt_llm/kernels/mlaKernels.cu (2)
invokeMLARopeGeneration (999-1097)
invokeMLARopeGeneration (999-999)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplCommon.h (1)
KVBlockArray(170-187)
tensorrt_llm/_torch/modules/attention.py (3)
tensorrt_llm/_torch/attention_backend/interface.py (2)
num_seqs (246-250)
num_tokens (268-269)
tensorrt_llm/quantization/mode.py (1)
has_fp8_kv_cache (166-167)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
mla_rope_generation(1771-1920)
🪛 Clang (14.0.6)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
[error] 1-1: 'tensorrt_llm/common/attentionOp.h' file not found
(clang-diagnostic-error)
🪛 Ruff (0.14.0)
tensorrt_llm/_torch/attention_backend/trtllm.py
1433-1433: Unused method argument: kwargs
(ARG002)
1791-1791: Comment contains ambiguous ， (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF003)
1894-1894: Comment contains ambiguous ？ (FULLWIDTH QUESTION MARK). Did you mean ? (QUESTION MARK)?
(RUF003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/thop/CMakeLists.txt (1)
38-99: No issues found. Addition of dsv3RopeOp.cpp to the shared library is correct.
The source file exists, contains valid C++ code, and follows standard CMake practices; add_library() is used to create a shared library from source files. The change properly integrates the new MLA rope generation kernel implementation into the build.
cpp/tensorrt_llm/common/attentionOp.cpp (1)
1221-1226: Good: plumbs quant_q buffer into XQA.
Setting xqaParams.quant_q_buffer_ptr enables FP8 Q consumption in XQA. No issues.
/bot run
9e95278 to
bd29ac1
Compare
/bot run
1 similar comment
/bot run
PR_Github #21876 [ run ] triggered by Bot. Commit:
PR_Github #21876 [ run ] completed with state
/bot run
PR_Github #21892 [ run ] triggered by Bot. Commit:
PR_Github #21892 [ run ] completed with state
/bot run
PR_Github #21917 [ run ] triggered by Bot. Commit:
PR_Github #21917 [ run ] completed with state
bd29ac1 to
5e4f5de
Compare
/bot run --disable-fail-fast
PR_Github #21925 [ run ] triggered by Bot. Commit:
PR_Github #21925 [ run ] completed with state
a497cbe to
51cfc70
Compare
/bot run --disable-fail-fast
PR_Github #22040 [ run ] triggered by Bot. Commit:
PerkzZheng left a comment
others LGTM. thanks for the work. Do you have performance numbers to show the gains ?
PR_Github #22040 [ run ] completed with state
PR_Github #23489 [ run ] triggered by Bot. Commit:
PR_Github #23457 [ run ] completed with state
76d35b3 to
954a36e
Compare
/bot run --disable-fail-fast
PR_Github #23502 [ run ] triggered by Bot. Commit:
PR_Github #23489 [ run ] completed with state
PR_Github #23502 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #23523 [ run ] triggered by Bot. Commit:
PR_Github #23523 [ run ] completed with state
954a36e to
218278f
Compare
/bot run --disable-fail-fast
PR_Github #23570 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #23603 [ run ] triggered by Bot. Commit:
PR_Github #23570 [ run ] completed with state
PR_Github #23603 [ run ] completed with state
Introduce dsv3RopeOp as a standalone operation and decouple it from attentionOp during the MLA generation stage. This improves modularity and enables overlap with bgemm for perf optimization.
Signed-off-by: yunruis <[email protected]>
218278f to
4c2e3eb
Compare
/bot run --disable-fail-fast
PR_Github #23642 [ run ] triggered by Bot. Commit:
PR_Github #23642 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #23657 [ run ] triggered by Bot. Commit:
PR_Github #23657 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #23685 [ run ] triggered by Bot. Commit:
PR_Github #23685 [ run ] completed with state
Introduce `dsv3RopeOp` as a standalone operation and decouple it from `attentionOp` during the MLA generation stage. This improves modularity and enables overlap with bgemm for perf optimization. In detail:
Workload: ISL8K/OSL1K, local_batch=32, mtp=3 (seqlen=4), no quant
Machine: B200
Summary by CodeRabbit
New Features
Improvements
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.