[TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation #8495
Conversation
/bot run
📝 Walkthrough
Adds end-to-end MLA rope-generation support: a new CUDA MLA kernel and TorchScript binding, threads six optional per-attention tensors through the Python and C++ bindings, replaces runtime workspace allocations with generation-time parameter wiring, and updates attention codepaths and tests to use the public generation parameters.
Changes
Sequence Diagram(s)
sequenceDiagram
actor Python
participant TrtWrapper as TrtllmAttentionWrapper
participant ThopAttention as thop.attention
participant TorchOp as torch.ops.trtllm.mla_rope_generation
participant CUDA as CUDA Kernel
Python->>TrtWrapper: forward(fused_q,..., cu_q_seqlens?, cu_kv_seqlens?, fmha_scheduler_counter?, mla_bmm1_scale?, mla_bmm2_scale?, quant_q_buffer?)
TrtWrapper->>ThopAttention: call attention(..., cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter, mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer)
alt Generation path
ThopAttention->>ThopAttention: prepare generation buffers / conditionally set FP8 scales
par Main BMM
ThopAttention->>CUDA: bmm_out()/fp8_block_scaling_bmm_out()
and MLA rope generation
ThopAttention->>TorchOp: mla_rope_generation(fused_q, q_pe, latent_cache, cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter, mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer, ...)
TorchOp->>CUDA: validate & setup -> dispatch MLA kernel
CUDA-->>TorchOp: complete
TorchOp-->>ThopAttention: completion
end
ThopAttention->>CUDA: mqa.forward(generation=true, params from public args)
CUDA-->>ThopAttention: attn_out
end
ThopAttention-->>TrtWrapper: output
TrtWrapper-->>Python: result
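The `par` block above is the heart of the change: the uk-bgemm and the MLA rope generation are issued so that they can overlap instead of running back-to-back. As a rough illustration of that pattern only (the `launchUkBgemm`, `launchMlaRopeGeneration`, and `launchFmhaGeneration` calls below are hypothetical placeholders, and the PR actually coordinates this through `thop.attention` and `torch.ops.trtllm.mla_rope_generation` as shown in the diagram), a two-stream host-side sketch could look like this:

```cpp
#include <cuda_runtime.h>

// Illustrative sketch of overlapping the uk-bgemm with MLA rope generation.
// The launch* calls are hypothetical stand-ins for the real kernels; error
// checking is omitted for brevity.
void generationStep(cudaStream_t mainStream, cudaStream_t ropeStream, cudaEvent_t ropeDone)
{
    // uk-bgemm runs on the main stream.
    // launchUkBgemm(mainStream, /* q_nope, uk weights, ... */);

    // RoPE on q_pe plus latent-cache append runs on a second stream, so it
    // overlaps with the bgemm.
    // launchMlaRopeGeneration(ropeStream, /* fused_q, q_pe, latent_cache, ... */);
    cudaEventRecord(ropeDone, ropeStream);

    // The fused generation MQA kernel needs both results, so the main stream
    // waits for the rope work before launching it.
    cudaStreamWaitEvent(mainStream, ropeDone, 0);
    // launchFmhaGeneration(mainStream, /* fused_q, kv cache, scales, ... */);
}
```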
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/thop/attentionOp.h (1)
37-68: Header hygiene: add include guard and document new params.
Guidelines require include guards and Doxygen on public prototypes. Replace `#pragma once` with `TRTLLM_ATTENTIONOP_H` guards and add brief `//!` docs for the six new optional tensors.
Apply this diff header/footer:
-#pragma once
+#ifndef TRTLLM_ATTENTIONOP_H
+#define TRTLLM_ATTENTIONOP_H
@@
-void attention( /* existing params ... */,
+//! Multi-head attention entrypoint. New optional tensors are used to pass precomputed
+//! sequence lengths, FMHA scheduler state, MLA FP8 scales, and an FP8 Q buffer.
+void attention( /* existing params ... */,
     std::vector<std::optional<torch::Tensor>> spec_decoding_tensor_params,
     std::vector<std::optional<torch::Tensor>> sparse_attention_params,
     std::optional<torch::Tensor> cu_q_seqlens,           //!< [B+1] cumulative Q seq lens
     std::optional<torch::Tensor> cu_kv_seqlens,          //!< [B+1] cumulative KV seq lens
     std::optional<torch::Tensor> fmha_scheduler_counter, //!< [1] tile counter buffer
     std::optional<torch::Tensor> mla_bmm1_scale,         //!< [2] bmm1 scales (with log2)
     std::optional<torch::Tensor> mla_bmm2_scale,         //!< [1] bmm2 scale
     std::optional<torch::Tensor> quant_q_buffer          //!< [..] pre-quantized FP8 Q
 );
+
+#endif // TRTLLM_ATTENTIONOP_H
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1)
31-62: Remove duplicate docstring in pybind11 and nanobind bindings: the docstring must appear only once, after all parameters and before `call_guard`.
Parameter order and names match the `torch_ext::attention` signature correctly. However, both the pybind11 and nanobind bindings contain a critical error: the docstring `"Multi-head attention operation"` appears twice: once incorrectly placed between `sparse_attention_params` and `cu_q_seqlens`, and once correctly placed before the `call_guard`.
The pybind11/nanobind binding syntax requires the docstring to appear exactly once, after all parameter definitions. The intermediate docstring will cause binding errors.
Fix locations:
`cpp/tensorrt_llm/pybind/thop/bindings.cpp`: Remove the `"Multi-head attention operation"` string appearing after `sparse_attention_params` (line 57 in the binding).
`cpp/tensorrt_llm/nanobind/thop/bindings.cpp`: Remove the `"Multi-head attention operation"` string appearing after `sparse_attention_params` (same line position).
Keep only the docstring that appears immediately before `call_guard`.
🧹 Nitpick comments (13)
cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1)
59-62: Place docstring last; keep kwargs contiguous.
Minor binding hygiene: move the docstring literal after the new kwargs and before `call_guard` to match typical nanobind usage and avoid confusion.
Apply this diff:
- nb::arg("mla_tensor_params"), nb::arg("attention_chunk_size") = std::nullopt, - nb::arg("softmax_stats_tensor") = std::nullopt, nb::arg("spec_decoding_bool_params"), - nb::arg("spec_decoding_tensor_params"), nb::arg("sparse_attention_params"), "Multi-head attention operation", + nb::arg("mla_tensor_params"), nb::arg("attention_chunk_size") = std::nullopt, + nb::arg("softmax_stats_tensor") = std::nullopt, nb::arg("spec_decoding_bool_params"), + nb::arg("spec_decoding_tensor_params"), nb::arg("sparse_attention_params"), nb::arg("cu_q_seqlens") = std::nullopt, nb::arg("cu_kv_seqlens") = std::nullopt, nb::arg("fmha_scheduler_counter") = std::nullopt, nb::arg("mla_bmm1_scale") = std::nullopt, - nb::arg("mla_bmm2_scale") = std::nullopt, nb::arg("quant_q_buffer") = std::nullopt, - nb::call_guard<nb::gil_scoped_release>()); + nb::arg("mla_bmm2_scale") = std::nullopt, nb::arg("quant_q_buffer") = std::nullopt, + "Multi-head attention operation", nb::call_guard<nb::gil_scoped_release>());cpp/tensorrt_llm/thop/attentionOp.h (1)
64-68: Signature size is getting risky; consider a params struct.
This function now has 60+ parameters. Pack related options into a struct (e.g., `AttentionArgs`) to reduce call-site error risk and ease future extension.
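For illustration only, the grouping suggested above could look roughly like the following; the struct name and member names are hypothetical and not part of the current API:

```cpp
#include <optional>
#include <torch/torch.h>

//! Hypothetical grouping of the six new optional MLA generation tensors.
struct MlaGenerationArgs
{
    std::optional<torch::Tensor> cuQSeqlens;           //!< [B+1] cumulative Q seq lens
    std::optional<torch::Tensor> cuKvSeqlens;          //!< [B+1] cumulative KV seq lens
    std::optional<torch::Tensor> fmhaSchedulerCounter; //!< [1] tile counter buffer
    std::optional<torch::Tensor> mlaBmm1Scale;         //!< [2] bmm1 scales (with log2)
    std::optional<torch::Tensor> mlaBmm2Scale;         //!< [1] bmm2 scale
    std::optional<torch::Tensor> quantQBuffer;         //!< pre-quantized FP8 Q buffer
};

// attention(...) could then take a single MlaGenerationArgs const& instead of six
// trailing std::optional<torch::Tensor> parameters.
```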
cpp/tensorrt_llm/kernels/mlaKernels.cu (2)
935-961: Guard debug kernels; avoid unconditional device printf symbols.
Define these only under a build-time flag to keep release binaries lean and prevent accidental device I/O.
Apply this diff:
+#if defined(TLLM_ENABLE_DEBUG_KERNEL_PRINTS)
 __global__ void printCudaVectorInt32(int32_t const* vec, int32_t size)
 {
     for (int i = 0; i < size; i++) { printf("%d, ", vec[i]); }
     printf("\n");
 }
 __global__ void printCudaVectorUint32(uint32_t const* vec, int32_t size)
 {
     for (int i = 0; i < size; i++) { printf("%u, ", vec[i]); }
     printf("\n");
 }
 __global__ void printCudaVectorFloat(float const* vec, int32_t size)
 {
     for (int i = 0; i < size; i++) { printf("%f, ", vec[i]); }
     printf("\n");
 }
+#endif
1019-1096: Remove commented-out debugging or wrap with a macro.
Large commented blocks add noise and violate "no dead code". Either delete or guard with a macro that's off by default.
Example:
-    // printf("=================invokeMLARopeGeneration============\n");
-    // ...
+#if defined(TLLM_ENABLE_DEBUG_KERNEL_PRINTS)
+    printf("=================invokeMLARopeGeneration============\n");
+    // optional debug prints...
+#endif
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1)
59-62: Avoid duplicate docstrings; keep a single doc literal at the end.
Two "Multi-head attention operation" strings are passed; retain only one at the end.
Apply this diff:
- py::arg("spec_decoding_tensor_params"), py::arg("sparse_attention_params"), "Multi-head attention operation", + py::arg("spec_decoding_tensor_params"), py::arg("sparse_attention_params"), py::arg("cu_q_seqlens") = std::nullopt, py::arg("cu_kv_seqlens") = std::nullopt, py::arg("fmha_scheduler_counter") = std::nullopt, py::arg("mla_bmm1_scale") = std::nullopt, py::arg("mla_bmm2_scale") = std::nullopt, py::arg("quant_q_buffer") = std::nullopt, "Multi-head attention operation", py::call_guard<py::gil_scoped_release>());tests/unittest/_torch/attention/test_attention_mla.py (2)
356-362: Type hint mismatch for num_generation_steps.
pytest parametrize feeds an int (10, 2, 3), but the signature annotates List[int]. Adjust to int to avoid confusion and IDE/type-checker noise.
Apply this diff:
-def test_attention_mla(scenario: Scenario, context_sequence_lengths: List[int],
-                       generation_seq_len_q: int,
-                       num_generation_steps: List[int]):
+def test_attention_mla(scenario: Scenario, context_sequence_lengths: List[int],
+                       generation_seq_len_q: int,
+                       num_generation_steps: int):
394-399: Remove verbose prints to meet PR goal ("drop print info").
These prints spam test output. Either remove them or gate behind an env flag (e.g., if os.getenv("TRTLLM_TEST_VERBOSE") == "1":).
Also applies to: 524-529, 599-599, 652-653, 684-686, 697-699, 737-740, 815-835
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
1894-1895: Normalize non-ASCII punctuation in comments.
Replace the fullwidth "？" with the ASCII "?" to satisfy linters and keep consistency.
1433-1433: Remove unused kwargs or thread them through intentionally.
`kwargs` is unused (Ruff ARG002). Drop it from the signature or pass it to lower layers if intended.
cpp/tensorrt_llm/thop/attentionOp.cpp (1)
598-603: Standardize comment language and tone.
Non-English comments ("数据类型检测…" (data type detection), "缓存检查…" (cache check), "请求类型…" (request type), "工作空间管理…" (workspace management)) reduce readability. Please convert to concise English to match project style.
Also applies to: 654-656, 740-742, 758-760, 791-799
tensorrt_llm/_torch/modules/attention.py (1)
1461-1468: Remove redundant guard.
`if True:` is unnecessary noise; call `mla_rope_generation` unconditionally or gate with a real condition.
Apply this diff:
-        if True:
-            # print("[forward_generation] apply_rope_generation")
-            self.mqa.mla_rope_generation(fused_q, q_pe, latent_cache,
-                                         attn_metadata, cu_q_seqlens,
-                                         cu_kv_seqlens, fmha_scheduler_counter,
-                                         mla_bmm1_scale, mla_bmm2_scale,
-                                         quant_q_buffer)
+        self.mqa.mla_rope_generation(
+            fused_q, q_pe, latent_cache, attn_metadata,
+            cu_q_seqlens, cu_kv_seqlens, fmha_scheduler_counter,
+            mla_bmm1_scale, mla_bmm2_scale, quant_q_buffer
+        )
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (2)
96-115: Normalize comment language and clarity.
Replace mixed Chinese/English with concise English to meet project standards and improve maintainability.
Apply this diff (example for the large block; repeat similarly for smaller lines):
-/*
-此处需要处理的:
-input:
-    q_pe_ld, q_pe_stride,
-    cache_type
-output:
-    workspace:
-        scale, ...
-    fused_q: [q_len, 128 * 576] gen only
-    kv_cache
-
-验证coverage
-    模型
-        dsv3, dsv3_lite
-    runtime:
-        ctx only, gen only, continuous batchign
-    gen_tokens
-        1, mtp, others...
-*/
+/*
+TODO/Notes:
+- Inputs: q_pe_ld/q_pe_stride, cache_type.
+- Outputs: fused_q [q_len, heads*(kv_lora_rank+qk_rope_head_dim)], KV cache updates.
+- Coverage targets:
+  * Models: dsv3, dsv3_lite
+  * Runtime: context-only, generation-only, continuous batching
+  * Gen tokens: 1, MTP, others
+*/
Also applies to: 154-160, 274-281, 598-603
210-218: Guard KV-cache pool pointer shape assumptions.
Add TORCH_CHECK messages that mention expected shapes/dims in English; the current checks are fine, this is a readability nit only.
Also applies to: 246-258
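If the messages are expanded, here is a sketch of the style being asked for; the tensor names and expected ranks below are illustrative only, not the op's actual contract:

```cpp
#include <torch/torch.h>

// Illustrative only: the real expected shapes live in dsv3RopeOp.cpp.
void checkKvCachePoolShapes(
    torch::Tensor const& kvCachePool, torch::Tensor const& cuKvSeqlens, int64_t batchSize)
{
    // Message spells out the expected layout so failures are self-explanatory.
    TORCH_CHECK(kvCachePool.dim() == 4,
        "kv_cache_pool must be 4-D [numBlocks, 2, tokensPerBlock, hiddenPerToken], got ",
        kvCachePool.dim(), " dims");
    TORCH_CHECK(cuKvSeqlens.size(0) == batchSize + 1,
        "cu_kv_seqlens must have batch_size + 1 entries, got ", cuKvSeqlens.size(0));
}
```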
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
cpp/tensorrt_llm/common/attentionOp.cpp (4 hunks)
cpp/tensorrt_llm/kernels/mlaKernels.cu (2 hunks)
cpp/tensorrt_llm/nanobind/thop/bindings.cpp (1 hunks)
cpp/tensorrt_llm/pybind/thop/bindings.cpp (1 hunks)
cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (11 hunks)
cpp/tensorrt_llm/thop/attentionOp.h (1 hunks)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1 hunks)
tensorrt_llm/_torch/attention_backend/trtllm.py (7 hunks)
tensorrt_llm/_torch/modules/attention.py (6 hunks)
tests/unittest/_torch/attention/test_attention_mla.py (13 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tests/unittest/_torch/attention/test_attention_mla.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/modules/attention.py
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/pybind/thop/bindings.cpp
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/common/attentionOp.cpp
tensorrt_llm/_torch/attention_backend/trtllm.py
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/kernels/mlaKernels.cu
tests/unittest/_torch/attention/test_attention_mla.py
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
tensorrt_llm/_torch/modules/attention.py
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/tensorrt_llm/thop/attentionOp.h
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/tensorrt_llm/thop/attentionOp.h
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/attention_backend/trtllm.py
tests/unittest/_torch/attention/test_attention_mla.py
tensorrt_llm/_torch/modules/attention.py
🧬 Code graph analysis (6)
cpp/tensorrt_llm/common/attentionOp.cpp (1)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplPrecompiled.cpp (4)
xqaParams (106-135)
xqaParams (106-106)
xqaParams (137-154)
xqaParams (137-137)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
cpp/tests/unit_tests/kernels/ropeTest.cu (1)
attention_mask(620-662)
cpp/tensorrt_llm/kernels/mlaKernels.cu (2)
cpp/tensorrt_llm/kernels/mlaChunkedPrefill.cu (4)
void (75-104)
void (107-136)
void (141-221)
void (226-289)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (1)
kv_cache_buffer(117-117)
tests/unittest/_torch/attention/test_attention_mla.py (3)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
attn_metadata (121-122)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
prepare (960-1047)
mla_rope_generation (1771-1920)
tensorrt_llm/quantization/mode.py (1)
has_fp8_kv_cache(166-167)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (3)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (2)
kv_cache_buffer (117-117)
cache_type (190-190)
cpp/tensorrt_llm/kernels/mlaKernels.cu (2)
invokeMLARopeGeneration (999-1097)
invokeMLARopeGeneration (999-999)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplCommon.h (1)
KVBlockArray(170-187)
tensorrt_llm/_torch/modules/attention.py (3)
tensorrt_llm/_torch/attention_backend/interface.py (2)
num_seqs (246-250)
num_tokens (268-269)
tensorrt_llm/quantization/mode.py (1)
has_fp8_kv_cache (166-167)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
mla_rope_generation(1771-1920)
🪛 Clang (14.0.6)
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp
[error] 1-1: 'tensorrt_llm/common/attentionOp.h' file not found
(clang-diagnostic-error)
🪛 Ruff (0.14.0)
tensorrt_llm/_torch/attention_backend/trtllm.py
1433-1433: Unused method argument: kwargs
(ARG002)
1791-1791: Comment contains ambiguous ， (FULLWIDTH COMMA). Did you mean , (COMMA)?
(RUF003)
1894-1894: Comment contains ambiguous ？ (FULLWIDTH QUESTION MARK). Did you mean ? (QUESTION MARK)?
(RUF003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
cpp/tensorrt_llm/thop/CMakeLists.txt (1)
38-99: No issues found. Addition of dsv3RopeOp.cpp to the shared library is correct.
The source file exists, contains valid C++ code, and follows standard CMake practices; add_library() is used to create a shared library from source files. The change properly integrates the new MLA rope generation kernel implementation into the build.
cpp/tensorrt_llm/common/attentionOp.cpp (1)
1221-1226: Good: plumbs quant_q buffer into XQA.
Setting xqaParams.quant_q_buffer_ptr enables FP8 Q consumption in XQA. No issues.
/bot run
9e95278 to
bd29ac1
Compare
/bot run
1 similar comment
/bot run
PR_Github #21876 [ run ] triggered by Bot. Commit:
PR_Github #21876 [ run ] completed with state
/bot run
PR_Github #21892 [ run ] triggered by Bot. Commit:
PR_Github #21892 [ run ] completed with state
/bot run
PR_Github #21917 [ run ] triggered by Bot. Commit:
PR_Github #21917 [ run ] completed with state
bd29ac1 to
5e4f5de
Compare
/bot run --disable-fail-fast
PR_Github #21925 [ run ] triggered by Bot. Commit:
PR_Github #21925 [ run ] completed with state
a497cbe to
51cfc70
Compare
/bot run --disable-fail-fast
PR_Github #22040 [ run ] triggered by Bot. Commit:
PerkzZheng left a comment
others LGTM. thanks for the work. Do you have performance numbers to show the gains ?
PR_Github #22040 [ run ] completed with state
PR_Github #23489 [ run ] triggered by Bot. Commit:
PR_Github #23457 [ run ] completed with state
76d35b3 to
954a36e
Compare
/bot run --disable-fail-fast
PR_Github #23502 [ run ] triggered by Bot. Commit:
PR_Github #23489 [ run ] completed with state
PR_Github #23502 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #23523 [ run ] triggered by Bot. Commit:
PR_Github #23523 [ run ] completed with state
954a36e to
218278f
Compare
/bot run --disable-fail-fast
PR_Github #23570 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #23603 [ run ] triggered by Bot. Commit:
PR_Github #23570 [ run ] completed with state
PR_Github #23603 [ run ] completed with state
Introduce dsv3RopeOp as a standalone operation and decouple it from attentionOp during the MLA generation stage. This improves modularity and enables overlap with bgemm for perf optimization.
Signed-off-by: yunruis <[email protected]>
218278f to
4c2e3eb
Compare
/bot run --disable-fail-fast
PR_Github #23642 [ run ] triggered by Bot. Commit:
PR_Github #23642 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #23657 [ run ] triggered by Bot. Commit:
PR_Github #23657 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #23685 [ run ] triggered by Bot. Commit:
PR_Github #23685 [ run ] completed with state
Introduce `dsv3RopeOp` as a standalone operation and decouple it from `attentionOp` during the MLA generation stage. This improves modularity and enables overlap with bgemm for perf optimization. In detail:
Workload: ISL8K/OSL1K, local_batch=32, mtp=3 (seqlen=4), no quant
Machine: B200
Summary by CodeRabbit
New Features
Improvements
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.