Skip to content

[None][feat] Support Gemma4 multi-head_dim pools and host-side slicing to provide local view to Triton kernels for SWA#13745

Merged
eopXD merged 21 commits into
NVIDIA:mainfrom
eopXD:sg/swa-take2
May 28, 2026
Merged

[None][feat] Support Gemma4 multi-head_dim pools and host-side slicing to provide local view to Triton kernels for SWA#13745
eopXD merged 21 commits into
NVIDIA:mainfrom
eopXD:sg/swa-take2

Conversation

@eopXD
Copy link
Copy Markdown
Collaborator

@eopXD eopXD commented May 5, 2026

Summary

  • For the V1 C++ KV cache manager: Previously multi-pool has the implicit assumption that they all operate on same head_dim. To run Gemma4 correctly, this needs to be relaxed. (e.g. Gemma-4 MoE: SWA layers with head_dim=256 + global-attention layers with head_dim=512)
  • Further more for the V1 C++ KV cache manager, it originally operates on the assumption of a window mapping to a single pool. The window -> pool mapping should not be the correct set-up of memory pool since we should provide flexibility to all three attributes - (window_size, dtype, head_dim). This MR abstracts out a "PoolConfiguration" concept for the v1 manager in order to further provide flexibility in pool management.
  • For the auto deploy, attention grouping and kernel call plumbing originally operates on a single pool assumption. This does not match the multi-pool set-up we hope to achieve for modern LLM models. This MR plumbs to support multi-pool attenion grouping of the kv cache manager.
  • For triton kernel, sliding-window attention will result in front-evicting pages. The kernels will require to have a local view of the pool as the KV cache block writes forwards, leaving the front blocks out. Such local view has to be supported within this PR because previously gemma3 model running MMLU (including long sequence requests) have been added to the AutoDeploy support.

Test coverage

  • Existing VSWA model test cases should pass (gemma3-e2n) under the changes done in this merge request
  • Unit tests for triton kernel local view
    • For prefill, the full sequence is added.
    • For decode, front blocks are evicted.
  • test_cached_sequence_interface.py updated, multi-head_dim pool configuration is now managed.
  • Currently gemma4 model run is still waived out for other existing problems

@eopXD eopXD requested review from a team as code owners May 5, 2026 02:21
@eopXD eopXD requested a review from tcherckez-nvidia May 5, 2026 02:21
@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 5, 2026

/bot run --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 5, 2026

📝 Walkthrough

Walkthrough

This PR extends TensorRT-LLM's KV cache manager to support heterogeneous per-window configurations for Vector Sliding Window Attention (VSWA). Key changes include per-window tracking of front-block evictions in GenerationRequest, per-window overrides for head dimension and dtype in BlockManager and KVCacheManager, corresponding Python binding updates, and comprehensive wiring through AutoDeploy custom ops, transforms, and executor layers to propagate these settings end-to-end. Unit and integration tests validate per-window block sizing, dtype handling, front-eviction isolation, and unified multi-pool KV manager behavior.

Changes

Vector Sliding Window Attention (VSWA) KV Cache Support

Layer / File(s) Summary
Data Shape & Type Definitions
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
GenerationRequest switches from single mNumFrontBlocksRemoved counter to per-window map mNumFrontBlocksRemovedPerWindow. BlockManager and KVCacheManager constructors accept new optional maps: sizePerHeadPerWindow and dtypePerWindow. WindowBlockManager, BlockManager, and KVCacheManager gain corresponding accessor methods for per-window head dimensions and dtypes.
Core KV Cache Logic
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
BlockManager constructor threads per-window override maps through window-manager initialization, selecting per-window sizePerHead/dtype with fallback to scalar defaults. GenerationRequest front-block accounting is made window-aware: getNumFrontBlocksRemoved(windowSize) and per-window counter increments in removeFrontBlock(windowSize). BaseKVCacheManager::calculateMaxNumBlocks computes cache-size budgets per window using overridden head dimensions and dtypes. KV cache memory sizing in allocatePools queries dtype from each pool tensor instead of manager-level scalar.
Python Bindings
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
New binding BaseKVCacheManager.get_num_front_blocks_removed(request_id, window_size=None) exposes per-window or aggregated front-block counts. Updated bindings for calculate_max_num_blocks and KVCacheManager constructor to accept size_per_head_per_window and dtype_per_window maps. New read-only properties expose size_per_head_per_window and dtype_per_window from the block manager.
Custom Ops: Triton Paged Attention
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py
Added kv_page_offset support for windowed cache layouts. Update and context kernels load per-sequence page offsets and adjust computed page indices by subtracting the offset. triton_paged_mha_with_cache caps cached-token portion to available pages before context kernel invocation. TritonPagedAttention.get_standard_metadata_args includes "kv_page_offset" and get_cache_initializers propagates sliding_window into KVPagedResourceHandler.
Custom Ops: FlashInfer & TRTLLm
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
FlashInferAttention.get_cache_initializers and TrtllmAttention.get_cache_initializers extract sliding_window from source nodes and propagate it into KVPagedResourceHandler for per-window pool grouping. trtllm_mha_with_cache and its fake variant now accept runtime sink_token_length parameter instead of hardcoded value.
Attention Interface & Input Buffering
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
InputBuffer gains add_truncatable_tensor to register additional tensors post-construction. SequenceInfo adds kv_page_offset tensor and VSWA window-group support via register_window_groups(...) and accessors. nest_sequences extended with kv_page_offset and per-group cache metadata arguments for VSWA multi-pool dispatch. offset_pos_and_cache_ adjusted for per-group ragged assignments. KVPagedResourceHandler incorporates sliding_window into constructor and equality check, grouping layers by window configuration.
Executor & Shim Integration
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
_prepare_inputs builds per-window cache metadata (kv_page_offset, per-group indices, eviction counts) using unified KV manager's get_cache_indices(..., window_size=...) and get_num_front_blocks_removed(...) for VSWA routing. Non-VSWA path populates kv_page_offset=0 for consistency. create_autodeploy_executor caps max_num_requests to manager's max_concurrent_sequences when available.
Unified KV Cache Manager Interface
tensorrt_llm/_torch/auto_deploy/shim/interface.py
CachedSequenceInterface refactored to build and manage a single unified C++ KVCacheManager hosting all KV pools (full-attention and SWA). Per-window routing maps (size_per_head_per_window, dtype_per_window, window_per_layer) are computed during resource identification. Uniform-cache validation and pool/window ordering APIs (kv_group_windows property, set_kv_groups method) enable downstream dispatch. KV manager construction computes a unified token budget and passes per-window overrides into manager kwargs.
KV Cache Transform
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
Two-pass insertion logic: Pass 1 registers per-layer KV cache resources, derives group_idx via KVPagedResourceHandler equality (including sliding_window), and computes max_attention_window from per-layer sliding windows when unset. For multi-group models, registers window groups and creates per-group graph placeholder inputs for standard metadata args. Pass 2 inserts cached attention ops per layer, selecting group-specific metadata placeholders when applicable.
PyExecutor Resource Manager
tensorrt_llm/_torch/pyexecutor/resource_manager.py
KVCacheManager.__init__ accepts optional size_per_head_per_window and dtype_per_window parameters, forwarding them to C++ constructor. New get_num_front_blocks_removed(request_id, window_size=None) exposes per-window eviction counts. get_buffers resolves per-layer head dimensions via the manager's size_per_head_per_window map. VSWA window sizing no longer requires model_config; uses Python helper with per-manager head dimension.
Unit Tests
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp, tests/unittest/auto_deploy/singlegpu/shim/test_cached_sequence_interface.py, tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py
New tests validate per-window sizePerHead/dtype overrides on BlockManager and resulting per-pool block sizes, per-window fallback behavior, BaseKVCacheManager block divergence with per-window overrides, SWA front-eviction tracking per window, GenerationRequest per-window counter isolation, unified KV manager creation with multiple pools, VSWA transformation logic (sliding window extraction, window-group registration, per-group placeholders), and VSWA wiring in executor/interface layers.
Integration Test & References
tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/defs/accuracy/references/gsm8k.yaml, tests/integration/defs/accuracy/references/mmlu.yaml, examples/auto_deploy/model_registry/configs/gemma4_moe.yaml
TestGemma4MoE refactored to exercise fixed-config AutoDeploy Gemma4 MoE with dual-pool KV manager (attn_backend="triton_paged", customized KV cache). Accuracy reference entries added for google/gemma-4-26B-A4B-it. YAML config includes cuda_graph_config.max_batch_size: 512.
Stream-Level CUDA Synchronization
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
After graph splitting in PiecewiseCapturedGraph.forward, synchronization changed from device-wide torch.cuda.synchronize() to stream-level torch.cuda.current_stream().synchronize() for reduced overhead.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • Fridah-nv
  • nvchenghaoz
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The PR description covers the main objectives and implementation details but is missing required sections like Title format, Test Coverage list, and PR Checklist confirmation. Add PR title in required format [ticket][type] format, expand Test Coverage section with specific test names/files, and confirm PR Checklist items with checkboxes.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly summarizes the main change: implementing support for Gemma4 multi-head_dim pools with host-side slicing for SWA (Sliding Window Attention) through a unified KVCacheManager.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)

653-669: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Single-manager VSWA breaks the current pin/unpin contract.

Now that one BlockManager can host several WindowBlockManagers, block IDs are only unique within a window. The existing pin path still treats them as global: BlockManager::storeBlocksForReuse() overwrites earlier windows' pinned IDs, and unpinBlocksById() always dispatches the flat ID list to the first window manager. With pinBlocks=true, unpinning can decrement the refcount of an unrelated block in the wrong window or leak pins in the others.

This needs a window-qualified pin handle, e.g. {windowSize, blockId} (or a per-window map) end-to-end before the single-manager path is safe.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp` around lines 653 - 669,
Block IDs are currently treated as global but WindowBlockManager instances
hosted under BlockManager (mWindowBlockManagers) make IDs unique only per
window; update the pin/unpin contract so pinned handles are window-qualified
(e.g., pair<windowSize, blockId> or a per-window map) and propagate this
everywhere: change BlockManager::storeBlocksForReuse() to record
window-qualified IDs when pinBlocks==true, change
BlockManager::unpinBlocksById() to group incoming handles by window and call the
appropriate WindowBlockManager unpin path (dispatch into each WindowBlockManager
in mWindowBlockManagers rather than always using the first), and update any
callers and types that assume flat IDs to accept and pass the new
window-qualified pin handle.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)

554-561: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Carry window clamping through to the per-window override dicts.

_validate_and_adjust_attention_windows() can rewrite self.max_attention_window_vec, but the ctor still forwards the original size_per_head_per_window / dtype_per_window keys unchanged. If a window is clamped, lookups by the adjusted key miss and both the C++ ctor and get_buffers() fall back to the scalar defaults for that pool.

Also applies to: 607-608

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 554 - 561,
The adjusted attention-window sizes returned by
_validate_and_adjust_attention_windows may rewrite
self.max_attention_window_vec, but the per-window override dicts
(size_per_head_per_window and dtype_per_window) are still using the original
keys; update those dicts after calling _validate_and_adjust_attention_windows to
remap or rebuild their entries to the adjusted window sizes (i.e., iterate the
original override entries and replace keys with the corresponding values from
the new self.max_attention_window_vec or reconstruct the dicts from the adjusted
vector) so lookups in the C++ ctor and get_buffers() use the clamped keys rather
than falling back to scalar defaults.

1412-1429: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the per-window head_dim/dtype maps when sizing mixed-shape VSWA pools.

Both sizing paths still multiply by self.head_dim and dtype/self.dtype, but in the unified manager those are only fallback scalars. For Gemma-4-style mixed windows, that budgets every pool as if it had the fallback shape, which under-allocates smaller pools and skews blocks_per_window away from the actual C++ layout.

Also applies to: 1584-1639

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 1412 - 1429,
The sizing loop currently uses the fallback self.head_dim and dtype for all
windows; update it to use the per-window/per-layer head-dim and dtype maps when
computing cache_size_per_token and cache_size_bytes_per_token so mixed-shape
VSWA pools are sized correctly: inside calculate_cache_size_per_token (and where
you call get_size_in_bytes and the NVFP4 scaling branch) look up the actual
head_dim and dtype for the current window/layers (e.g., use self.head_dim_map or
self.head_dim_per_layer and self.dtype_map/self.dtype_per_layer) instead of
self.head_dim/self.dtype, compute per-layer/per-window contributions
accordingly, and then sum them to produce required_mem_bytes_per_seq and to feed
KVCacheManager.calculate_scaling_factor_size_bytes so blocks_per_window reflects
the real C++ layout.
tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py (1)

1-2: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Missing NVIDIA copyright header and legacy typing import.

Per coding guidelines, all Python source files should contain an NVIDIA copyright header with the year of latest modification and Apache 2.0 license notice. Additionally, prefer built-in list over typing.List for Python 3.10+.

Suggested fix
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 from types import SimpleNamespace
-from typing import List, Optional
+from typing import Optional

And at line 49, update the return type:

-    def get_export_infos(self, model: nn.Module) -> List[SubModuleExportInfo]:
+    def get_export_infos(self, model: nn.Module) -> list[SubModuleExportInfo]:

Based on learnings: In TensorRT-LLM (Python requires >=3.10), you can use Python 3.10+ features (e.g., PEP 585 generics like list[str]) throughout the codebase.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py`
around lines 1 - 2, Add the NVIDIA copyright + Apache-2.0 header at the top of
the file, remove the legacy typing imports, and replace usages of typing.List
and typing.Optional with Python 3.10+ builtins/generics: delete "from typing
import List, Optional", keep "from types import SimpleNamespace", change any
"List[T]" to "list[T]" and any "Optional[T]" to "T | None", and update the
function/method return annotation referenced around line 49 to use the new
"list[...]" or "| None" form as appropriate (adjust imports accordingly so no
typing.List/Optional remain).
🧹 Nitpick comments (4)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

8698-8720: ⚡ Quick win

Avoid GPU pool allocation for a pure GenerationRequest bookkeeping test

This test targets GenerationRequest front-eviction counters, but it currently builds a full BlockManager and calls allocatePools(). That adds unnecessary CUDA dependency and potential resource flakiness unrelated to what is being asserted. Prefer constructing the window metadata directly (or via a lightweight helper) and instantiate GenerationRequest without allocating pools.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` around lines 8698
- 8720, The test unnecessarily constructs a full BlockManager and calls
BlockManager::allocatePools() for a pure GenerationRequest bookkeeping test;
instead create the needed windowSizeToMetadata/BlocksPerWindow data directly (or
via a small helper) and construct a GenerationRequest with that metadata so no
GPU pools are allocated. Replace the BlockManager instantiation and
blockManager.allocatePools() calls with construction of the BlocksPerWindow /
maxAttentionWindowVec used in the BlockManager call, then pass that metadata
into GenerationRequest (or a test helper) so only
removeFrontBlock/clearCacheBlocks/getNumFrontBlocksRemoved logic on
GenerationRequest is exercised without invoking BlockManager::allocatePools or
creating a CudaStream.
tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py (1)

897-898: 💤 Low value

Consider moving import math to module level.

The math module is imported inside two test functions (lines 897 and 974). Moving it to the top of the file with other imports improves consistency and avoids repeated import overhead.

Also applies to: 974-975

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py`
around lines 897 - 898, Move the two local "import math" statements out of the
test functions and place a single "import math" at the module level with the
other imports; locate the inline imports by searching for "import math" inside
the test functions in this test_kv_cache module and remove them there so all
tests use the top-level module import instead.
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (2)

1104-1155: Consider a perf-list follow-up for this KV-cache path.

This gives useful functional coverage, and QA list updates look unnecessary because it stays within the existing accuracy suite. Separately, the PR changes KV-cache management / CUDA-graph behavior, so only correctness is protected here; if you want pre-merge and scheduled regression detection, add or confirm a perf entry in tests/integration/test_lists/test-db/l0_perf.yml and the matching tests/integration/test_lists/qa/llm_perf_*.yml.

As per coding guidelines: "If the PR touches performance-sensitive paths (attention kernels, MoE routing/dispatch, KV cache management, scheduler, batching logic, CUDA graph capture, speculative decoding, or quantization kernels), check whether a perf test entry is present or updated in: (a) tests/integration/test_lists/test-db/l0_perf.yml ... and (b) tests/integration/test_lists/qa/llm_perf_*.yml."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py` around lines 1104
- 1155, Add a perf-test entry for this KV-cache/CUDA-graph path: create or
update an l0_perf.yml entry and a matching qa llm_perf_*.yml entry that
reference the new/changed scenario exercised by test_bf16 (the AutoDeployLLM
invocation using model_factory="Gemma4ForConditionalGeneration",
compile_backend="torch-cudagraph", attn_backend="triton_paged", kv_cache_config
and cuda_graph_config), so scheduled and pre-merge perf regressions are caught;
ensure the perf test captures relevant config tags (kv-cache, cuda-graph,
compile_backend) and points to the appropriate benchmark task(s) (e.g.,
MMLU/GSM8K) used in the test.

1100-1102: ⚡ Quick win

Resolve Gemma through hf_id_to_local_model_dir() instead of hard-coding the cache layout.

This is the only test here that bypasses the shared resolver. Hard-coding .../gemma/gemma-4-26B-A4B-it makes the test depend on one on-disk layout and skips any alias/symlink handling the helper already centralizes.

♻️ Proposed change
-    MODEL_PATH = f"{llm_models_root()}/gemma/gemma-4-26B-A4B-it"
+    MODEL_PATH = hf_id_to_local_model_dir(MODEL_NAME)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py` around lines 1100
- 1102, The test hard-codes MODEL_PATH to llm_models_root()/gemma/... which
bypasses the shared resolver; replace the manual path with a call to
hf_id_to_local_model_dir(MODEL_NAME) so the test uses the central resolver (keep
MODEL_NAME and EXTRA_EVALUATOR_KWARGS unchanged), i.e. set MODEL_PATH =
hf_id_to_local_model_dir(MODEL_NAME) and import or reference
hf_id_to_local_model_dir where used.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:
- Around line 940-951: getSizePerHead() can return -1 for recurrent-state pools
created by WindowBlockManager::WindowBlockManager(), which then propagates
through getSizePerHeadPerWindow()/getSizePerHeadForWindow() and the nanobind
API; change getSizePerHead() to guard against negative sizePerHead by checking
mPools.front().sizePerHead and returning 0 when it is less than 0 (keep the
existing behavior of returning 0 when mPools.empty()), so callers never observe
a negative head dimension.

In `@examples/auto_deploy/model_registry/configs/gemma4_moe.yaml`:
- Line 13: Remove the duplicate max_batch_size entry nested inside the
cuda_graph_config block so the file only defines max_batch_size once at the top
level; specifically edit the gemma4_moe.yaml to delete the nested
"max_batch_size" key under "cuda_graph_config" and leave only the top-level
"max_batch_size: 512", keeping "cuda_graph_config" containing just the
"batch_sizes" array to match the other configs.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 1326-1356: When cache_loc_per_group is provided but may be
partial, missing groups silently reuse prior staged values; fix by normalizing
the per-group maps before staging: for each group_idx in
1..self.num_window_groups-1, ensure cache_loc_per_group, cu_num_pages_per_group,
extra_page_per_seq_per_group, and kv_page_offset_per_group contain an entry (use
cache_loc, cu_num_pages, extra_page_per_seq, and kv_page_offset or a
per-sequence [0]*self.num_sequences for kv_page_offset as the group-0 defaults),
then proceed to call self._stage_arg for each suffix as currently done;
alternatively raise a clear error if any map is missing an entry — implement
this normalization where the current "if cache_loc_per_group is not None:" loop
runs so subsequent calls to self._stage_arg always see explicit values for every
group.

In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py`:
- Around line 1163-1174: The current max_cached = page_counts *
kv_cache.shape[3] overcounts tokens by treating every page as full; instead
compute the actual addressable tokens per sample as full_pages * page_size +
last_page_len (and zero when page_counts==0). Replace the max_cached line with
logic using page_counts, kv_cache.shape[3] (page_size) and the per-sample
last-page lengths (e.g. cu_last_page_lens): full_pages = torch.clamp(page_counts
- 1, min=0); max_cached = full_pages * page_size + torch.where(page_counts > 0,
cu_last_page_lens, torch.zeros_like(cu_last_page_lens)); then continue to use
cache_len_capped and seq_len_with_cache as before.

In `@tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py`:
- Around line 887-923: The loop currently sets num_active =
len(all_indices[front_removed:]) so the "extra" slot is never extracted; instead
compute the intended active block count first (e.g., active_count = determine
the number of active pages for this window — often window_size or a configured
per-group active size), then slice all_indices into active_pages =
all_indices[front_removed : front_removed + active_count] and extra_page =
all_indices[front_removed + active_count] if that index exists else -1; replace
uses of active_indices/num_active with active_pages/len(active_pages) and set
extra_page_per_seq (and extra_page_per_seq_per_group) from extra_page, updating
cache_loc, cache_loc_per_group, cu_num_pages, cu_num_pages_per_group,
kv_page_offset and kv_page_offset_per_group accordingly (refer to
variables/all_indices, front_removed, cache_loc, extra_page_per_seq,
cache_loc_per_group, extra_page_per_seq_per_group, cu_num_pages,
cu_num_pages_per_group in ad_executor.py).

In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py`:
- Around line 121-131: This file is missing the required NVIDIA/Apache header;
add the repository-standard copyright/license header (with the latest
modification year) to the top of
tensorrt_llm/_torch/auto_deploy/shim/interface.py, before any imports or code,
following the same format used in other source files (NVIDIA copyright line +
Apache 2.0 license block and SPDX identifier), so the module (containing symbols
like KVCacheManager, MambaHybridCacheManager, self._kv_cache_manager,
self._kv_group_windows) complies with the project's header policy.
- Around line 916-924: The transform-time _kv_group_windows can become stale
because KVCacheManager may clamp windows during construction; after calling
self._assign_kv_cache_views(...) you must refresh the cached group windows from
the manager so downstream callers (e.g., ad_executor.py using get_cache_indices
and get_num_front_blocks_removed) see the finalized per-window sizes—update
self._kv_group_windows from the KVCacheManager's authoritative windows (via
self._kv_cache_manager.windows or the manager API that exposes per-window sizes)
immediately after _assign_kv_cache_views returns, keeping _kv_cache_config_tuned
and block_offset_multiplier logic unchanged.
- Around line 814-823: The branch that returns self.info.max_seq_len when
total_max_tokens is None and SWA is present incorrectly overrides an explicit
VSWA budget stored on self._kv_cache_config_original; update the logic in the
function (around the total_max_tokens /
self._has_swa_window(size_per_head_per_window) check) to first detect and
preserve an existing user-set max_tokens on self._kv_cache_config_original
(e.g., check getattr(self._kv_cache_config_original, "max_tokens", None) or
similar) and return that value when present instead of collapsing to
self.info.max_seq_len, so that _prepare_kv_cache_config and config.max_tokens
keep the intended budget.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py`:
- Line 343: The loop in kvcache.py uses enumerate but never uses the index
variable (for idx, (attn_node, qkv, cache_in_nodes, constants, group_idx) in
enumerate(layer_infos)), triggering Ruff B007; fix by either removing enumerate
and iterating directly over layer_infos (for attn_node, qkv, cache_in_nodes,
constants, group_idx in layer_infos) or rename the index to _idx if you want to
keep enumerate, updating the loop header in the same location.
- Around line 319-350: Pass 2 only replaces per-layer standard metadata
(layer_meta_nodes_std) but leaves group-sensitive inputs in meta_nodes_extra
untouched, so multi-group backends like TrtllmAttention (which populate
meta_nodes_extra via prepare_trtllm_metadata) still reference group-0 nodes;
update the loop over layer_infos to also build a per-group copy of
meta_nodes_extra (e.g., layer_meta_nodes_extra = list(meta_nodes_extra)) for
group_idx > 0 and, using the same std_arg_names / vswa_group_nodes mapping you
use for layer_meta_nodes_std, replace any entries in layer_meta_nodes_extra
whose argument names match keys in vswa_group_nodes[group_idx] with the
corresponding vswa_group_nodes[group_idx][arg_name]; ensure downstream uses the
per-group layer_meta_nodes_extra where prepare_trtllm_metadata or
TrtllmAttention expects group-specific inputs so VSWA routing is correct.

In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py`:
- Around line 764-769: CachedSequenceInterface is being instantiated without the
required max_num_tokens argument; update each creation of
CachedSequenceInterface (the instances inside this test file) to pass
max_num_tokens=default_max_num_tokens(max_seq_len, batch_size) so the
constructor matches its signature—specifically add that parameter to the calls
that currently set max_seq_len, max_batch_size, device, and kv_cache_config.

---

Outside diff comments:
In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Around line 653-669: Block IDs are currently treated as global but
WindowBlockManager instances hosted under BlockManager (mWindowBlockManagers)
make IDs unique only per window; update the pin/unpin contract so pinned handles
are window-qualified (e.g., pair<windowSize, blockId> or a per-window map) and
propagate this everywhere: change BlockManager::storeBlocksForReuse() to record
window-qualified IDs when pinBlocks==true, change
BlockManager::unpinBlocksById() to group incoming handles by window and call the
appropriate WindowBlockManager unpin path (dispatch into each WindowBlockManager
in mWindowBlockManagers rather than always using the first), and update any
callers and types that assume flat IDs to accept and pass the new
window-qualified pin handle.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 554-561: The adjusted attention-window sizes returned by
_validate_and_adjust_attention_windows may rewrite
self.max_attention_window_vec, but the per-window override dicts
(size_per_head_per_window and dtype_per_window) are still using the original
keys; update those dicts after calling _validate_and_adjust_attention_windows to
remap or rebuild their entries to the adjusted window sizes (i.e., iterate the
original override entries and replace keys with the corresponding values from
the new self.max_attention_window_vec or reconstruct the dicts from the adjusted
vector) so lookups in the C++ ctor and get_buffers() use the clamped keys rather
than falling back to scalar defaults.
- Around line 1412-1429: The sizing loop currently uses the fallback
self.head_dim and dtype for all windows; update it to use the
per-window/per-layer head-dim and dtype maps when computing cache_size_per_token
and cache_size_bytes_per_token so mixed-shape VSWA pools are sized correctly:
inside calculate_cache_size_per_token (and where you call get_size_in_bytes and
the NVFP4 scaling branch) look up the actual head_dim and dtype for the current
window/layers (e.g., use self.head_dim_map or self.head_dim_per_layer and
self.dtype_map/self.dtype_per_layer) instead of self.head_dim/self.dtype,
compute per-layer/per-window contributions accordingly, and then sum them to
produce required_mem_bytes_per_seq and to feed
KVCacheManager.calculate_scaling_factor_size_bytes so blocks_per_window reflects
the real C++ layout.

In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py`:
- Around line 1-2: Add the NVIDIA copyright + Apache-2.0 header at the top of
the file, remove the legacy typing imports, and replace usages of typing.List
and typing.Optional with Python 3.10+ builtins/generics: delete "from typing
import List, Optional", keep "from types import SimpleNamespace", change any
"List[T]" to "list[T]" and any "Optional[T]" to "T | None", and update the
function/method return annotation referenced around line 49 to use the new
"list[...]" or "| None" form as appropriate (adjust imports accordingly so no
typing.List/Optional remain).

---

Nitpick comments:
In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp`:
- Around line 8698-8720: The test unnecessarily constructs a full BlockManager
and calls BlockManager::allocatePools() for a pure GenerationRequest bookkeeping
test; instead create the needed windowSizeToMetadata/BlocksPerWindow data
directly (or via a small helper) and construct a GenerationRequest with that
metadata so no GPU pools are allocated. Replace the BlockManager instantiation
and blockManager.allocatePools() calls with construction of the BlocksPerWindow
/ maxAttentionWindowVec used in the BlockManager call, then pass that metadata
into GenerationRequest (or a test helper) so only
removeFrontBlock/clearCacheBlocks/getNumFrontBlocksRemoved logic on
GenerationRequest is exercised without invoking BlockManager::allocatePools or
creating a CudaStream.

In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py`:
- Around line 1104-1155: Add a perf-test entry for this KV-cache/CUDA-graph
path: create or update an l0_perf.yml entry and a matching qa llm_perf_*.yml
entry that reference the new/changed scenario exercised by test_bf16 (the
AutoDeployLLM invocation using model_factory="Gemma4ForConditionalGeneration",
compile_backend="torch-cudagraph", attn_backend="triton_paged", kv_cache_config
and cuda_graph_config), so scheduled and pre-merge perf regressions are caught;
ensure the perf test captures relevant config tags (kv-cache, cuda-graph,
compile_backend) and points to the appropriate benchmark task(s) (e.g.,
MMLU/GSM8K) used in the test.
- Around line 1100-1102: The test hard-codes MODEL_PATH to
llm_models_root()/gemma/... which bypasses the shared resolver; replace the
manual path with a call to hf_id_to_local_model_dir(MODEL_NAME) so the test uses
the central resolver (keep MODEL_NAME and EXTRA_EVALUATOR_KWARGS unchanged),
i.e. set MODEL_PATH = hf_id_to_local_model_dir(MODEL_NAME) and import or
reference hf_id_to_local_model_dir where used.

In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py`:
- Around line 897-898: Move the two local "import math" statements out of the
test functions and place a single "import math" at the module level with the
other imports; locate the inline imports by searching for "import math" inside
the test functions in this test_kv_cache module and remove them there so all
tests use the top-level module import instead.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7d9a15af-c921-4ad0-9553-71fb1e604d70

📥 Commits

Reviewing files that changed from the base of the PR and between 45d15a1 and 685d53e.

📒 Files selected for processing (19)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • examples/auto_deploy/model_registry/configs/gemma4_moe.yaml
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/unittest/auto_deploy/singlegpu/shim/test_cached_sequence_interface.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py

Comment thread cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
Comment thread examples/auto_deploy/model_registry/configs/gemma4_moe.yaml Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/shim/interface.py
Comment thread tensorrt_llm/_torch/auto_deploy/shim/interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py Outdated
Comment thread tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py Outdated
@eopXD eopXD changed the title [None][feat] AutoDeploy: route VSWA pools through one C++ KVCacheManager [None][feat] AutoDeploy: Support gemma4 head configurations and VSWA pools in KVCacheManager May 5, 2026
@eopXD eopXD changed the title [None][feat] AutoDeploy: Support gemma4 head configurations and VSWA pools in KVCacheManager [None][feat] AutoDeploy: Support Gemma4 mixed-shape pools in KVCacheManager May 5, 2026
@eopXD eopXD force-pushed the sg/swa-take2 branch 4 times, most recently from e32e499 to 9fb2e92 Compare May 5, 2026 04:01
@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 5, 2026

/bot run --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 5, 2026

/bot run --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46738 [ run ] triggered by Bot. Commit: cf08c0f Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46739 [ run ] triggered by Bot. Commit: cf08c0f Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46739 [ run ] completed with state FAILURE. Commit: cf08c0f
/LLM/main/L0_MergeRequest_PR pipeline #36769 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 5, 2026

/bot run --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46795 [ run ] triggered by Bot. Commit: fa2df03 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46795 [ run ] completed with state FAILURE. Commit: fa2df03
/LLM/main/L0_MergeRequest_PR pipeline #36818 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 5, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46803 [ run ] triggered by Bot. Commit: fa2df03 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46803 [ run ] completed with state FAILURE. Commit: fa2df03
/LLM/main/L0_MergeRequest_PR pipeline #36824 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 6, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1"

1 similar comment
@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 6, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46896 [ run ] triggered by Bot. Commit: f9529b5 Link to invocation

The fixture set max_seq_len=128 while the eviction and long-prefill cases
constructed 129- and 192-token requests; kvCacheManager caps allocation at
max(windowSize, maxSequenceLength) so the helper saw fewer live pages than
expected. Bump FULL_WINDOW to 256 so the SWA pool can hold every scenario.

Signed-off-by: Yueh-Ting Chen <[email protected]>
@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 26, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50400 [ run ] triggered by Bot. Commit: eaf7c2f Link to invocation

Copy link
Copy Markdown
Collaborator

@SimengLiu-nv SimengLiu-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50400 [ run ] completed with state SUCCESS. Commit: eaf7c2f
/LLM/main/L0_MergeRequest_PR pipeline #39924 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 27, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50507 [ run ] triggered by Bot. Commit: eaf7c2f Link to invocation

eopXD added 5 commits May 27, 2026 17:07
Add diagnostic ad_logger.debug statements to help isolate why
test_gemma3n_e2b_it fails on GSM8K (256-token decode) after MMLU
(2-token decode) passes.  Two traces:

- ad_executor.py: per (request, pool) trace after
  _compute_window_local_view, surfacing front_removed, num_active,
  extra_page, active_token_count, last_page_len, len(all_indices).

- attention_interface.py: in offset_pos_and_cache_, conditional on
  any delta != 0 (page-boundary rollover), surface delta, extra_idx,
  and a smoking-gun bool indicating whether extra_page=-1 is about to
  be rotated into cache_loc by adjust_ragged_triton.

Enable with AUTO_DEPLOY_LOG_LEVEL=debug.  The conditional in
offset_pos_and_cache_ keeps the log quiet on non-rollover steps.

Signed-off-by: Yueh-Ting Chen <[email protected]>
Standalone repro for the GSM8K-style failure in TestGemmaE2B::test_gemma3n_e2b_it.
Loads the model via AutoDeployLLM with the same gemma3n_e2b_it.yaml config the
failing test uses (FlashInfer backend, multi-pool VSWA path), builds a
~prompt-tokens long synthetic prompt, and generates max-new-tokens greedily.

Defaults reproduce the failing-CI conditions in <1 minute:
  prompt-tokens=1024 (> 512 SWA window, fires eviction on first decode)
  max-new-tokens=256 (~8 SWA page boundaries crossed during decode)
  max-seq-len=4352 (matches the test's MAX_SEQ_LEN)

Intended use with the swa-trace debug logging:

  AUTO_DEPLOY_LOG_LEVEL=debug \
    python examples/auto_deploy/repro_swa_gemma3n.py \
      --model-path /path/to/google/gemma-3n-E2B-it \
    2>&1 | tee /tmp/swa-trace.log

  grep "[swa-trace]" /tmp/swa-trace.log

Signed-off-by: Yueh-Ting Chen <[email protected]>
The PR-added clamp on the reverse iterator in
WindowBlockManager::releaseBlocks:

    for (auto it = allocatedBlocks.rbegin();
         it != allocatedBlocks.rend() - sequence.getNumFrontBlocksRemoved(mWindowSize);
         ++it)

is unnecessary and a potential source of off-by-one corruption under
SWA front-eviction.  The loop body already handles placeholder sentinels
safely (placeholders have mRefCount==0, so decRefCount is skipped and
EvictionPolicy::releaseBlock is "silently ignored" per the existing
comment).  Restore the original full-range reverse iteration.

This removes one suspect in the GSM8K accuracy regression on
TestGemmaE2B::test_gemma3n_e2b_it.  Tracking the radix-tree /
storeBlocksForReuse interaction with SWA eviction (the larger suspect)
remains as a separate follow-up.

Signed-off-by: Yueh-Ting Chen <[email protected]>
prepare_flashinfer_metadata_host -> plan_generate_only iterates ALL
cached_cuda_graph_decode_wrappers unconditionally, refreshing each
wrapper's pre-allocated indptr/indices/last_page_len buffers with the
same single-pool (unsuffixed) cache_loc + cu_num_pages + last_page_len.

In single-pool deployments (main: no sliding_window on the KV handler,
no pool partitioning) this was harmless -- one wrapper, one pool.

In multi-pool VSWA (this PR: FlashInfer declares sliding_window on its
KVPagedResourceHandler so the kvcache transform splits SWA + full-attn
layers into separate pools), pool 0's cache_loc was being written into
EVERY cached wrapper's buffers, including the full-attention pool's
wrapper.  Block IDs are pool-local (each WindowBlockManager has its
own mAllBlocksById), so the full-attention kernel then reads from its
own KV memory at the SWA pool's block-ID coordinates -- unrelated
physical pages.  Visible as: MMLU passes (2-token decode tolerates
the stomp), GSM8K collapses (256-token decode compounds the error
into off-by-one arithmetic / prompt regurgitation).

Fix:
- flashinfer_attention.py: plan_generate_only takes an optional
  pool_window_left filter; when set, only refreshes wrappers whose
  plan_params.window_left matches.  prepare_flashinfer_metadata_host
  accepts and forwards pool_window_left.
- transform/library/kvcache.py: when the backend's host-prep declares
  pool_window_left AND we have 2+ KV pools, register the host-prep
  once per pool with per-pool arg names (pool 0 unsuffixed, pools
  1..N-1 with _g{i} / _g{i}_host) and a closure that remaps those
  suffixed kwargs back to the function's canonical parameter names
  before delegating, plus a bound pool_window_left = sliding_window-1
  (SWA) or -1 (full).  Single-pool and non-opt-in backends retain
  legacy single registration.

The host-prep call site (run_host_prepare_for_attention_forward) invokes
each registered function as host_function(**{arg: get_arg(arg)}); the
closure is required because the suffixed arg names become the kwarg
keys, but the underlying host-prep function only declares the
unsuffixed parameter names.

_process_metadata_host is moved from the pre-pass-1 location to after
register_window_groups in each of the three group-setup branches, so
handler_groups is available at registration time.

Signed-off-by: Yueh-Ting Chen <[email protected]>
The fix for the GSM8K accuracy regression (0b76919
"per-pool host-prep for FlashInfer VSWA") landed, so the diagnostic
instrumentation added for the investigation can come out.

Reverts:
- 20c0ff6 (ad_logger.debug swa-trace lines in ad_executor.py and
  attention_interface.py)
- c5551ce (examples/auto_deploy/repro_swa_gemma3n.py)

Signed-off-by: Yueh-Ting Chen <[email protected]>
@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 27, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1"

1 similar comment
@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 27, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50542 [ run ] triggered by Bot. Commit: 6a739d1 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50507 [ run ] completed with state ABORTED. Commit: eaf7c2f

Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50542 [ run ] completed with state FAILURE. Commit: 6a739d1
/LLM/main/L0_MergeRequest_PR pipeline #40046 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 28, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50646 [ run ] triggered by Bot. Commit: 6a739d1 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50646 [ run ] completed with state SUCCESS. Commit: 6a739d1
/LLM/main/L0_MergeRequest_PR pipeline #40140 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@eopXD
Copy link
Copy Markdown
Collaborator Author

eopXD commented May 28, 2026

/bot skip --comment "The only failing pipeline H100_PCIe-PyTorch-Perf-1 is not related to the scope of change of this merge request. The pipeline has passed before too in the CI attempts of this merge request. To avoid from running the long pipeline again, we have AutoDeploy-related pipeline passing now, which proves that the MR should be good now. Given the above, I have enough confidence to use skip here and merge the MR."

@eopXD eopXD enabled auto-merge (squash) May 28, 2026 07:50
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50741 [ skip ] triggered by Bot. Commit: 6a739d1 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #50741 [ skip ] completed with state SUCCESS. Commit: 6a739d1
Skipping testing for commit 6a739d1

Link to invocation

@eopXD eopXD merged commit 13ca44a into NVIDIA:main May 28, 2026
8 checks passed
taylor-yb-lee added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Jun 1, 2026
…-window models

- PR NVIDIA#13745 made trtllm `get_cache_initializers` propagate `sliding_window`, so models with non-uniform windows (e.g. gpt-oss-120b: 128/4096) form >1 KV pool.
- trtllm enforces a single uniform pool (`requires_uniform_kv_caches`) → crash at cache_init: "KV resources are not uniform".
- Temporary fix: trtllm-only revert of the pool-alloc change (handler `sliding_window=0` → single full-seq pool, SWA layers over-allocate KV); the sliding-window mask is still applied via the op's own `sliding_window` arg. Validated gsm8k[120b]=90.4.
- Re-enabling multi-pool needs trtllm-native VSWA: per-pool `block_offsets` tables + real `pool_mapping` routing — the degenerate per-layer-pointer path can't address multiple pools, and flashinfer-style host-sliced views corrupt the cyclic-window trtllm kernel (naive attempt → gsm8k 6%).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Taylor Yeonbok Lee <[email protected]>
taylor-yb-lee added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Jun 2, 2026
…-window models

- PR NVIDIA#13745 made trtllm `get_cache_initializers` propagate `sliding_window`, so models with non-uniform windows (e.g. gpt-oss-120b: 128/4096) form >1 KV pool.
- trtllm enforces a single uniform pool (`requires_uniform_kv_caches`) → crash at cache_init: "KV resources are not uniform".
- Temporary fix: trtllm-only revert of the pool-alloc change (handler `sliding_window=0` → single full-seq pool, SWA layers over-allocate KV); the sliding-window mask is still applied via the op's own `sliding_window` arg. Validated gsm8k[120b]=90.4.
- Re-enabling multi-pool needs trtllm-native VSWA: per-pool `block_offsets` tables + real `pool_mapping` routing — the degenerate per-layer-pointer path can't address multiple pools, and flashinfer-style host-sliced views corrupt the cyclic-window trtllm kernel (naive attempt → gsm8k 6%).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Taylor Yeonbok Lee <[email protected]>
taylor-yb-lee added a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Jun 4, 2026
…-window models

- PR NVIDIA#13745 made trtllm `get_cache_initializers` propagate `sliding_window`, so models with non-uniform windows (e.g. gpt-oss-120b: 128/4096) form >1 KV pool.
- trtllm enforces a single uniform pool (`requires_uniform_kv_caches`) → crash at cache_init: "KV resources are not uniform".
- Temporary fix: trtllm-only revert of the pool-alloc change (handler `sliding_window=0` → single full-seq pool, SWA layers over-allocate KV); the sliding-window mask is still applied via the op's own `sliding_window` arg. Validated gsm8k[120b]=90.4.
- Re-enabling multi-pool needs trtllm-native VSWA: per-pool `block_offsets` tables + real `pool_mapping` routing — the degenerate per-layer-pointer path can't address multiple pools, and flashinfer-style host-sliced views corrupt the cyclic-window trtllm kernel (naive attempt → gsm8k 6%).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Signed-off-by: Taylor Yeonbok Lee <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.