[None][perf] Add custom indexer k cache scatter op #8960
Conversation
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
📝 Walkthrough
A CUDA kernel for scattering FP8 k-cache data and per-token scales into a non-contiguous pooled cache is introduced. The kernel is exposed via PyTorch bindings and integrated into the DSA attention backend, replacing Python-based scatter logic with device-side execution.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Python as Python DSA Code
    participant PyTorchOp as PyTorch Extension
    participant Kernel as CUDA Kernel
    participant GPU as GPU Memory

    Python->>PyTorchOp: _update_k_cache calls indexer_k_cache_scatter_op<br/>(k_fp8_bytes, k_scale_bytes, k_cache, slot_mappings)
    PyTorchOp->>PyTorchOp: Validate tensors (CUDA, shapes, dtypes, contiguity)
    PyTorchOp->>PyTorchOp: Extract strides & dims<br/>Verify constraints (head_dim=128, scale_size=4)
    PyTorchOp->>Kernel: invokeIndexerKCacheScatter<br/>(pointers, dimensions, strides, stream)
    Kernel->>Kernel: Launch indexerKCacheScatterUnifiedKernel<br/>Grid=[num_tokens], Block=[32 threads]
    Kernel->>GPU: Each thread: flatIndexToMemoryOffset<br/>→ read FP8 data & scale<br/>→ scatter to k_cache via slot_mapping
    GPU->>GPU: Populate k_cache buffer
    Kernel-->>PyTorchOp: Kernel complete
    PyTorchOp-->>Python: Return
```
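To make the flow above concrete, here is a minimal sketch of how the backend might invoke the op. The torch.ops.trtllm namespace and the uint8 byte views are assumptions for illustration; the diagram only fixes the op name and its four arguments.

```python
# Minimal sketch of the call flow above (not the actual dsa.py code).
# Assumption: the op is registered under torch.ops.trtllm and takes raw byte views.
import torch


def update_k_cache_sketch(k_fp8: torch.Tensor, k_scale: torch.Tensor,
                          k_cache: torch.Tensor,
                          slot_mappings: torch.Tensor) -> None:
    # Per token: 128 FP8 data bytes plus one float32 scale (4 bytes), matching
    # the head_dim=128 / scale_size=4 constraints the op validates.
    k_fp8_bytes = k_fp8.view(torch.uint8)      # [num_tokens, 128]
    k_scale_bytes = k_scale.view(torch.uint8)  # [num_tokens, 4]

    # A single kernel launch scatters both data and scales into the (possibly
    # non-contiguous) pooled cache via slot_mappings.
    torch.ops.trtllm.indexer_k_cache_scatter_op(k_fp8_bytes, k_scale_bytes,
                                                k_cache, slot_mappings)
```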
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (1 passed)
Actionable comments posted: 3
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (1)
131-139: Use 'k' prefix for magic number constants. Per coding guidelines, constants that are magic numbers should use the kUPPER_SNAKE_CASE naming convention. Apply this diff:

```diff
-constexpr int32_t QUANT_BLOCK_SIZE = 128;
+constexpr int32_t kQUANT_BLOCK_SIZE = 128;
 TLLM_CHECK_WITH_INFO(
-    head_dim == QUANT_BLOCK_SIZE, "head_dim must equal 128 for DeepSeek-V3 indexer cache (got %d)", head_dim);
+    head_dim == kQUANT_BLOCK_SIZE, "head_dim must equal 128 for DeepSeek-V3 indexer cache (got %d)", head_dim);
 TLLM_CHECK_WITH_INFO(
     scale_size == 4, "scale_size must equal 4 bytes (1 float32 scale per token, got %d)", scale_size);
 // For head_dim=128, we use 32 threads to handle 128 bytes per token and extra 4 bytes for scale
-constexpr int32_t THREADS_PER_BLOCK = 32;
+constexpr int32_t kTHREADS_PER_BLOCK = 32;
-dim3 block(THREADS_PER_BLOCK);
+dim3 block(kTHREADS_PER_BLOCK);
```

tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (2)
499-504: Remove unused unpacked variable. The sparse_attn_config variable is unpacked from create_dsa_cache_manager() but never used in the test. Either use it or remove it from the unpacking. Apply this diff if the variable is not needed:

```diff
-cache_manager, sparse_attn_config = create_dsa_cache_manager(
+cache_manager, _ = create_dsa_cache_manager(
     batch_size=batch_size,
     head_dim=head_dim,
     tokens_per_block=block_size,
     max_seq_len=max_seq_len,
     num_layers=3)  # Multi-layer pool for non-contiguous test
```
564-580: Consider using proper test logging. The test contains several print statements for debugging. Consider either:
- Removing them for cleaner test output
- Using pytest's capsys fixture for captured output
- Keeping them but gated behind a verbose flag
Also, remove the unnecessary f prefix from f-strings without placeholders (lines 564, 575, 580). Example for line 564:

```diff
-print(f"\n=== Cache Properties ===")
+print("\n=== Cache Properties ===")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h (1 hunks)
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (1 hunks)
- cpp/tensorrt_llm/thop/CMakeLists.txt (1 hunks)
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (1 hunks)
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1 hunks)
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
- cpp/tensorrt_llm/thop/CMakeLists.txt
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
cpp/tensorrt_llm/thop/CMakeLists.txt
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py
- tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
📚 Learning: 2025-08-28T10:22:02.288Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1191-1197
Timestamp: 2025-08-28T10:22:02.288Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, the object identity comparison `softmax_req_indices is not group_req_indices_cuda` on line ~1191 is intentional and used as an optimization to determine whether to reuse an existing indexer or create a new one, based on which code path was taken during tensor assignment.
Applied to files:
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
Applied to files:
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
🧬 Code graph analysis (4)
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h (1)
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (2): invokeIndexerKCacheScatter (121-150), invokeIndexerKCacheScatter (121-124)

cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (1)
- cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (2): invokeIndexerKCacheScatter (121-150), invokeIndexerKCacheScatter (121-124)

tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1)
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (2): indexer_k_cache_scatter_op (29-92), indexer_k_cache_scatter_op (29-30)

tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (4)
- cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h (1): tensorrt_llm (21-30)
- tensorrt_llm/_torch/attention_backend/sparse/dsa.py (5): Indexer (575-1176), prepare (435-538), prepare (734-838), get_indexer_k_cache_buffers (1378-1384), _unravel_indices (49-64)
- tensorrt_llm/quantization/utils/fp8_utils.py (1): fp8_quantize_1x128_sf_transpose (523-533)
- cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (2): indexer_k_cache_scatter_op (29-92), indexer_k_cache_scatter_op (29-30)
🪛 Clang (14.0.6)
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h
[error] 19-19: 'tensorrt_llm/common/cudaUtils.h' file not found
(clang-diagnostic-error)
cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp
[error] 17-17: 'tensorrt_llm/common/opUtils.h' file not found
(clang-diagnostic-error)
🪛 Ruff (0.14.3)
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py
491-491: Comment contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?
(RUF003)
499-499: Unpacked variable sparse_attn_config is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
564-564: f-string without any placeholders
Remove extraneous f prefix
(F541)
575-575: f-string without any placeholders
Remove extraneous f prefix
(F541)
580-580: f-string without any placeholders
Remove extraneous f prefix
(F541)
583-583: f-string without any placeholders
Remove extraneous f prefix
(F541)
615-615: f-string without any placeholders
Remove extraneous f prefix
(F541)
621-621: f-string without any placeholders
Remove extraneous f prefix
(F541)
644-644: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
644-644: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/kernels/indexerKCacheScatter.cu (3)
1-19: LGTM! Clean header and includes. The copyright header is correctly formatted with the current year range, and the includes are appropriate for the CUDA kernel implementation.
78-119: Verify alignment assumptions and consider bounds validation. The kernel uses reinterpret_cast<uint32_t*> for 4-byte vectorized writes (lines 106, 116-117), which assumes 4-byte alignment. Additionally, there's no validation that dst_offset and dst_offset_scale are within bounds of the k_cache buffer. While the slot mappings are computed in Python/host code and should be correct, consider:
- Alignment: Document the alignment requirement or add runtime assertions in the launcher
- Bounds checking: Add assertions in debug builds to catch slot mapping errors early
Can you confirm that:
- The k_cache buffer and all computed offsets are guaranteed to be 4-byte aligned?
- The slot mapping computation in Python ensures dst_offset stays within k_cache bounds?
If there are any scenarios where these assumptions might not hold, defensive checks could prevent silent corruption.
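For illustration only, a host-side guard along the following lines could catch bad mappings before launch. The helper is hypothetical and not part of the PR; it assumes slot mappings are byte offsets into the flattened pool and that each token writes 128 data bytes plus 4 scale bytes.

```python
# Hypothetical pre-launch sanity check (not in the PR). Assumes slot_mappings
# holds byte offsets into the flattened k_cache pool.
import torch


def check_scatter_bounds(k_cache: torch.Tensor, slot_mappings: torch.Tensor,
                         head_dim: int = 128, scale_size: int = 4) -> None:
    cache_bytes = k_cache.numel() * k_cache.element_size()

    # Every destination must fit entirely inside the pooled cache.
    max_offset = int(slot_mappings.max().item())
    assert max_offset + head_dim + scale_size <= cache_bytes, \
        "slot mapping would write past the end of k_cache"
    assert int(slot_mappings.min().item()) >= 0, "negative slot offset"

    # The kernel's uint32_t vectorized stores assume 4-byte alignment.
    assert k_cache.data_ptr() % 4 == 0, "k_cache base pointer must be 4-byte aligned"
    assert bool((slot_mappings % 4 == 0).all()), \
        "slot byte offsets must be 4-byte aligned"
```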
152-152: LGTM! Correct namespace closing comment. The namespace closing brace properly includes the namespace name as required by coding guidelines.
cpp/tensorrt_llm/thop/CMakeLists.txt (1)
86-86: LGTM! Clean build system integration. The new source file IndexerKCacheScatterOp.cpp is correctly added to the th_common library, maintaining alphabetical ordering with other op implementations.
tensorrt_llm/_torch/attention_backend/sparse/dsa.py (1)
875-880: LGTM! Clean delegation to CUDA kernel. The Python scatter logic has been successfully replaced with the CUDA kernel call, which should provide the performance improvements reported in the PR (10.5% throughput increase).
One minor consideration: should there be error handling around the CUDA op call, or does PyTorch automatically propagate exceptions from custom ops? Verify that kernel launch failures will be properly caught and reported.
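If explicit handling is desired, a hypothetical guard could look like the sketch below. C++-side check failures in custom ops typically surface in Python as RuntimeError; asynchronous kernel-launch failures may only be reported at a later CUDA synchronization point.

```python
# Hypothetical wrapper (not in the PR): adds call-site context to failures
# raised by the C++ op's validation checks.
import torch


def scatter_with_context(k_fp8_bytes, k_scale_bytes, k_cache, slot_mappings):
    try:
        torch.ops.trtllm.indexer_k_cache_scatter_op(k_fp8_bytes, k_scale_bytes,
                                                    k_cache, slot_mappings)
    except RuntimeError as err:
        raise RuntimeError(
            "indexer k-cache scatter failed in _update_k_cache") from err
```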
cpp/tensorrt_llm/kernels/IndexerKCacheScatter.h (1)
1-30: LGTM! Well-structured header file. The header correctly declares the kernel launcher with appropriate includes, namespace usage, and default parameters. The #pragma once directive is acceptable for this codebase. Note: The static analysis warning about 'cudaUtils.h' not found is a false positive; the build system correctly resolves this path.
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (1)
476-485: Well-designed test for CUDA kernel validation. This test provides comprehensive coverage (see the sketch after this list) by:
- Comparing CUDA kernel output against a Python reference implementation
- Testing non-contiguous memory handling using different layer indices
- Validating byte-for-byte correctness of the scatter operation
- Including proper skip guards for SM architecture and DeepGEMM availability
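As a rough sketch of that comparison pattern (the reference helper, the byte layout, and the op namespace below are assumptions, not the test's actual code):

```python
# Sketch of a byte-for-byte comparison between the CUDA op and a Python
# reference scatter. Assumes the 4 scale bytes land right after the 128 data
# bytes of each token; the real pool layout may differ.
import torch


def reference_scatter(k_fp8_bytes, k_scale_bytes, k_cache_ref, slot_mappings):
    flat = k_cache_ref.view(torch.uint8).reshape(-1)  # contiguous clone assumed
    for tok in range(slot_mappings.numel()):
        dst = int(slot_mappings[tok].item())
        flat[dst:dst + 128] = k_fp8_bytes[tok]
        flat[dst + 128:dst + 132] = k_scale_bytes[tok]


def compare_kernel_vs_reference(k_fp8_bytes, k_scale_bytes, k_cache, slot_mappings):
    k_cache_ref = k_cache.clone()
    reference_scatter(k_fp8_bytes, k_scale_bytes, k_cache_ref, slot_mappings)
    torch.ops.trtllm.indexer_k_cache_scatter_op(k_fp8_bytes, k_scale_bytes,
                                                k_cache, slot_mappings)
    # The kernel copies raw bytes, so exact equality is required.
    assert torch.equal(k_cache.view(torch.uint8), k_cache_ref.view(torch.uint8))
```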
cpp/tensorrt_llm/thop/IndexerKCacheScatterOp.cpp (5)
1-24: LGTM! The copyright header, includes, and namespace aliases are properly structured and follow coding guidelines.
26-31: LGTM! The namespace declaration and function signature are correct, with appropriate const-correctness for input and output parameters.
81-92: LGTM! The stride extraction and kernel invocation are correct, with all parameters properly extracted and passed to the CUDA launcher in the correct order.
96-106: LGTM! The PyTorch library bindings correctly define the operator schema and register the CUDA implementation.
94-94: Add namespace closing comment. Per coding guidelines, namespace closing braces must include a trailing comment with the namespace name.
As per coding guidelines.
Apply this diff:
```diff
-} // namespace torch_ext
+} // namespace torch_ext
```

⛔ Skipped due to learnings

Learnt from: CR
Repo: NVIDIA/TensorRT-LLM PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-09-04T17:00:29.500Z
Learning: Applies to **/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh} : Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').

Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
lfr-0531 left a comment:
LGTM~
/bot run
PR_Github #23747 [ run ] triggered by Bot. Commit:
PR_Github #23747 [ run ] completed with state
Signed-off-by: Chang Liu (Enterprise Products) <[email protected]>
/bot reuse-pipeline
PR_Github #23869 [ reuse-pipeline ] triggered by Bot. Commit:
PR_Github #23869 [ reuse-pipeline ] completed with state
Before this PR (together w/ #8882):
concurrency=64; ISL/OSL=1K/2K; DEP=8:
With this PR (together with #8882):
concurrency=64; ISL/OSL=1K/2K; DEP=8:
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.