[None][perf] Add custom indexer k cache scatter op #8960
Conversation
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
📝 Walkthrough
A CUDA kernel for scattering FP8 k-cache data and per-token scales into a non-contiguous pooled cache is introduced. The kernel is exposed via PyTorch bindings and integrated into the DSA attention backend, replacing Python-based scatter logic with device-side execution.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Python as Python DSA Code
    participant PyTorchOp as PyTorch Extension
    participant Kernel as CUDA Kernel
    participant GPU as GPU Memory

    Python->>PyTorchOp: _update_k_cache calls indexer_k_cache_scatter_op<br/>(k_fp8_bytes, k_scale_bytes, k_cache, slot_mappings)
    PyTorchOp->>PyTorchOp: Validate tensors (CUDA, shapes, dtypes, contiguity)
    PyTorchOp->>PyTorchOp: Extract strides & dims<br/>Verify constraints (head_dim=128, scale_size=4)
    PyTorchOp->>Kernel: invokeIndexerKCacheScatter<br/>(pointers, dimensions, strides, stream)
    Kernel->>Kernel: Launch indexerKCacheScatterUnifiedKernel<br/>Grid=[num_tokens], Block=[32 threads]
    Kernel->>GPU: Each thread: flatIndexToMemoryOffset<br/>→ read FP8 data & scale<br/>→ scatter to k_cache via slot_mapping
    GPU->>GPU: Populate k_cache buffer
    Kernel-->>PyTorchOp: Kernel complete
    PyTorchOp-->>Python: Return
```
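To make the flow above concrete, here is a minimal sketch of how the backend might invoke the op. The torch.ops.trtllm namespace and the uint8 byte views are assumptions for illustration; the diagram only fixes the op name and its four arguments.

```python
# Minimal sketch of the call flow above (not the actual dsa.py code).
# Assumption: the op is registered under torch.ops.trtllm and takes raw byte views.
import torch


def update_k_cache_sketch(k_fp8: torch.Tensor, k_scale: torch.Tensor,
                          k_cache: torch.Tensor,
                          slot_mappings: torch.Tensor) -> None:
    # Per token: 128 FP8 data bytes plus one float32 scale (4 bytes), matching
    # the head_dim=128 / scale_size=4 constraints the op validates.
    k_fp8_bytes = k_fp8.view(torch.uint8)      # [num_tokens, 128]
    k_scale_bytes = k_scale.view(torch.uint8)  # [num_tokens, 4]

    # A single kernel launch scatters both data and scales into the (possibly
    # non-contiguous) pooled cache via slot_mappings.
    torch.ops.trtllm.indexer_k_cache_scatter_op(k_fp8_bytes, k_scale_bytes,
                                                k_cache, slot_mappings)
```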
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (1 passed)
Actionable comments posted: 3
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (1)
131-139: Use 'k' prefix for magic number constants. Per coding guidelines, constants that are magic numbers should use the kUPPER_SNAKE_CASE naming convention. Apply this diff:

```diff
-constexpr int32_t QUANT_BLOCK_SIZE = 128;
+constexpr int32_t kQUANT_BLOCK_SIZE = 128;
 TLLM_CHECK_WITH_INFO(
-    head_dim == QUANT_BLOCK_SIZE, "head_dim must equal 128 for DeepSeek-V3 indexer cache (got %d)", head_dim);
+    head_dim == kQUANT_BLOCK_SIZE, "head_dim must equal 128 for DeepSeek-V3 indexer cache (got %d)", head_dim);
 TLLM_CHECK_WITH_INFO(
     scale_size == 4, "scale_size must equal 4 bytes (1 float32 scale per token, got %d)", scale_size);
 // For head_dim=128, we use 32 threads to handle 128 bytes per token and extra 4 bytes for scale
-constexpr int32_t THREADS_PER_BLOCK = 32;
+constexpr int32_t kTHREADS_PER_BLOCK = 32;
-dim3 block(THREADS_PER_BLOCK);
+dim3 block(kTHREADS_PER_BLOCK);
```

tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (2)
499-504: Remove unused unpacked variable. The sparse_attn_config variable is unpacked from create_dsa_cache_manager() but never used in the test. Either use it or remove it from the unpacking. Apply this diff if the variable is not needed:

```diff
-cache_manager, sparse_attn_config = create_dsa_cache_manager(
+cache_manager, _ = create_dsa_cache_manager(
     batch_size=batch_size,
     head_dim=head_dim,
     tokens_per_block=block_size,
     max_seq_len=max_seq_len,
     num_layers=3)  # Multi-layer pool for non-contiguous test
```
564-580: Consider using proper test logging. The test contains several print statements for debugging. Consider either:
- Removing them for cleaner test output
- Using pytest's capsys fixture for captured output
- Keeping them but gated behind a verbose flag
Also, remove the unnecessary f prefix from f-strings without placeholders (lines 564, 575, 580). Example for line 564:

```diff
-print(f"\n=== Cache Properties ===")
+print("\n=== Cache Properties ===")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h (1 hunks)
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (1 hunks)
- cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (1 hunks)
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1 hunks)
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
- cpp/tensorrt_llm/thop/CMakeLists.txt
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
📚 Learning: 2025-08-28T10:22:02.288Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1191-1197
Timestamp: 2025-08-28T10:22:02.288Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the object identity comparison `softmax_req_indices is not group_req_indices_cuda` on line ~1191 is intentional and used as an optimization to determine whether to reuse an existing indexer or create a new one, based on which code path was taken during tensor assignment.
Applied to files:
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
Applied to files:
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
🧬 Code graph analysis (4)
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h (1)
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (2): invokeIndexerKCacheScatter (121-150), invokeIndexerKCacheScatter (121-124)

cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (1)
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (2): invokeIndexerKCacheScatter (121-150), invokeIndexerKCacheScatter (121-124)

tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1)
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (2): indexer_k_cache_scatter_op (29-92), indexer_k_cache_scatter_op (29-30)

tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (4)
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h (1): tensorrt_llm (21-30)
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py (5): Indexer (575-1176), prepare (435-538), prepare (734-838), get_indexer_k_cache_buffers (1378-1384), _unravel_indices (49-64)
- tensorrt_llm/quantization/utils/fp8_utils.py (1): fp8_quantize_1x128_sf_transpose (523-533)
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (2): indexer_k_cache_scatter_op (29-92), indexer_k_cache_scatter_op (29-30)
🪛 Clang (14.0.6)
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
[error] 19-19: 'tensorrt_llm/common/cudaUtils.h' file not found
(clang-diagnostic-error)
cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
[error] 17-17: 'tensorrt_llm/common/opUtils.h' file not found
(clang-diagnostic-error)
🪛 Ruff (0.14.3)
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
491-491: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF003)
499-499: Unpacked variable sparse_attn_config is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
564-564: f-string without any placeholders
Remove extraneous f prefix
(F541)
575-575: f-string without any placeholders
Remove extraneous f prefix
(F541)
580-580: f-string without any placeholders
Remove extraneous f prefix
(F541)
583-583: f-string without any placeholders
Remove extraneous f prefix
(F541)
615-615: f-string without any placeholders
Remove extraneous f prefix
(F541)
621-621: f-string without any placeholders
Remove extraneous f prefix
(F541)
644-644: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
644-644: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (3)
1-19: LGTM! Clean header and includes. The copyright header is correctly formatted with the current year range, and the includes are appropriate for the CUDA kernel implementation.
78-119: Verify alignment assumptions and consider bounds validation. The kernel uses reinterpret_cast<uint32_t*> for 4-byte vectorized writes (lines 106, 116-117), which assumes 4-byte alignment. Additionally, there's no validation that dst_offset and dst_offset_scale are within bounds of the k_cache buffer. While the slot mappings are computed in Python/host code and should be correct, consider:
- Alignment: Document the alignment requirement or add runtime assertions in the launcher
- Bounds checking: Add assertions in debug builds to catch slot mapping errors early
Can you confirm that:
- The k_cache buffer and all computed offsets are guaranteed to be 4-byte aligned?
- The slot mapping computation in Python ensures dst_offset stays within k_cache bounds?
If there are any scenarios where these assumptions might not hold, defensive checks could prevent silent corruption.
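For illustration only, a host-side guard along the following lines could catch bad mappings before launch. The helper is hypothetical and not part of the PR; it assumes slot mappings are byte offsets into the flattened pool and that each token writes 128 data bytes plus 4 scale bytes.

```python
# Hypothetical pre-launch sanity check (not in the PR). Assumes slot_mappings
# holds byte offsets into the flattened k_cache pool.
import torch


def check_scatter_bounds(k_cache: torch.Tensor, slot_mappings: torch.Tensor,
                         head_dim: int = 128, scale_size: int = 4) -> None:
    cache_bytes = k_cache.numel() * k_cache.element_size()

    # Every destination must fit entirely inside the pooled cache.
    max_offset = int(slot_mappings.max().item())
    assert max_offset + head_dim + scale_size <= cache_bytes, \
        "slot mapping would write past the end of k_cache"
    assert int(slot_mappings.min().item()) >= 0, "negative slot offset"

    # The kernel's uint32_t vectorized stores assume 4-byte alignment.
    assert k_cache.data_ptr() % 4 == 0, "k_cache base pointer must be 4-byte aligned"
    assert bool((slot_mappings % 4 == 0).all()), \
        "slot byte offsets must be 4-byte aligned"
```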
152-152: LGTM! Correct namespace closing comment. The namespace closing brace properly includes the namespace name as required by coding guidelines.
cpp/tensorrt_llm/thop/CMakeLists.txt (1)
86-86: LGTM! Clean build system integration. The new source file IndexerKCacheScatterOp.cpp is correctly added to the th_common library, maintaining alphabetical ordering with other op implementations.
tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1)
875-880: LGTM! Clean delegation to CUDA kernel. The Python scatter logic has been successfully replaced with the CUDA kernel call, which should provide the performance improvements reported in the PR (10.5% throughput increase).
One minor consideration: should there be error handling around the CUDA op call, or does PyTorch automatically propagate exceptions from custom ops? Verify that kernel launch failures will be properly caught and reported.
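If explicit handling is desired, a hypothetical guard could look like the sketch below. C++-side check failures in custom ops typically surface in Python as RuntimeError; asynchronous kernel-launch failures may only be reported at a later CUDA synchronization point.

```python
# Hypothetical wrapper (not in the PR): adds call-site context to failures
# raised by the C++ op's validation checks.
import torch


def scatter_with_context(k_fp8_bytes, k_scale_bytes, k_cache, slot_mappings):
    try:
        torch.ops.trtllm.indexer_k_cache_scatter_op(k_fp8_bytes, k_scale_bytes,
                                                    k_cache, slot_mappings)
    except RuntimeError as err:
        raise RuntimeError(
            "indexer k-cache scatter failed in _update_k_cache") from err
```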
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h (1)
1-30: LGTM! Well-structured header file. The header correctly declares the kernel launcher with appropriate includes, namespace usage, and default parameters. The #pragma once directive is acceptable for this codebase. Note: The static analysis warning about 'cudaUtils.h' not found is a false positive; the build system correctly resolves this path.
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (1)
476-485: Well-designed test for CUDA kernel validation. This test provides comprehensive coverage (see the sketch after this list) by:
- Comparing CUDA kernel output against a Python reference implementation
- Testing non-contiguous memory handling using different layer indices
- Validating byte-for-byte correctness of the scatter operation
- Including proper skip guards for SM architecture and DeepGEMM availability
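As a rough sketch of that comparison pattern (the reference helper, the byte layout, and the op namespace below are assumptions, not the test's actual code):

```python
# Sketch of a byte-for-byte comparison between the CUDA op and a Python
# reference scatter. Assumes the 4 scale bytes land right after the 128 data
# bytes of each token; the real pool layout may differ.
import torch


def reference_scatter(k_fp8_bytes, k_scale_bytes, k_cache_ref, slot_mappings):
    flat = k_cache_ref.view(torch.uint8).reshape(-1)  # contiguous clone assumed
    for tok in range(slot_mappings.numel()):
        dst = int(slot_mappings[tok].item())
        flat[dst:dst + 128] = k_fp8_bytes[tok]
        flat[dst + 128:dst + 132] = k_scale_bytes[tok]


def compare_kernel_vs_reference(k_fp8_bytes, k_scale_bytes, k_cache, slot_mappings):
    k_cache_ref = k_cache.clone()
    reference_scatter(k_fp8_bytes, k_scale_bytes, k_cache_ref, slot_mappings)
    torch.ops.trtllm.indexer_k_cache_scatter_op(k_fp8_bytes, k_scale_bytes,
                                                k_cache, slot_mappings)
    # The kernel copies raw bytes, so exact equality is required.
    assert torch.equal(k_cache.view(torch.uint8), k_cache_ref.view(torch.uint8))
```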
cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (5)
1-24: LGTM! The copyright header, includes, and namespace aliases are properly structured and follow coding guidelines.
26-31: LGTM! The namespace declaration and function signature are correct, with appropriate const-correctness for input and output parameters.
81-92: LGTM! The stride extraction and kernel invocation are correct, with all parameters properly extracted and passed to the CUDA launcher in the correct order.
96-106: LGTM! The PyTorch library bindings correctly define the operator schema and register the CUDA implementation.
94-94: Add namespace closing comment. Per coding guidelines, namespace closing braces must include a trailing comment with the namespace name.
As per coding guidelines.
Apply this diff:
```diff
-} // namespace torch_ext
+} // namespace torch_ext
```

⛔ Skipped due to learnings

Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-04T17:00:29.500Z
Learning: Applies to **/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh} : Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').

Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
lfr-0531 left a comment:
LGTM~
/bot run
PR_Github #23747 [ run ] triggered by Bot. Commit:
PR_Github #23747 [ run ] completed with state
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
/bot reuse-pipeline
PR_Github #23869 [ reuse-pipeline ] triggered by Bot. Commit:
PR_Github #23869 [ reuse-pipeline ] completed with state
Before this PR (together w/ #8882):
concurrency=64; ISL/OSL=1K/2K; DEP=8:
With this PR (together with #8882):
concurrency=64; ISL/OSL=1K/2K; DEP=8:
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.