[None][feat] Support Gemma4 multi-head_dim pools and host-side slicing to provide local view to Triton kernels for SWA#13745
Conversation
|
/bot run --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" |
📝 WalkthroughWalkthroughThis PR extends TensorRT-LLM's KV cache manager to support heterogeneous per-window configurations for Vector Sliding Window Attention (VSWA). Key changes include per-window tracking of front-block evictions in ChangesVector Sliding Window Attention (VSWA) KV Cache Support
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (1)
653-669:⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy liftSingle-manager VSWA breaks the current pin/unpin contract.
Now that one
BlockManagercan host severalWindowBlockManagers, block IDs are only unique within a window. The existing pin path still treats them as global:BlockManager::storeBlocksForReuse()overwrites earlier windows' pinned IDs, andunpinBlocksById()always dispatches the flat ID list to the first window manager. WithpinBlocks=true, unpinning can decrement the refcount of an unrelated block in the wrong window or leak pins in the others.This needs a window-qualified pin handle, e.g.
{windowSize, blockId}(or a per-window map) end-to-end before the single-manager path is safe.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp` around lines 653 - 669, Block IDs are currently treated as global but WindowBlockManager instances hosted under BlockManager (mWindowBlockManagers) make IDs unique only per window; update the pin/unpin contract so pinned handles are window-qualified (e.g., pair<windowSize, blockId> or a per-window map) and propagate this everywhere: change BlockManager::storeBlocksForReuse() to record window-qualified IDs when pinBlocks==true, change BlockManager::unpinBlocksById() to group incoming handles by window and call the appropriate WindowBlockManager unpin path (dispatch into each WindowBlockManager in mWindowBlockManagers rather than always using the first), and update any callers and types that assume flat IDs to accept and pass the new window-qualified pin handle.tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
554-561:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCarry window clamping through to the per-window override dicts.
_validate_and_adjust_attention_windows()can rewriteself.max_attention_window_vec, but the ctor still forwards the originalsize_per_head_per_window/dtype_per_windowkeys unchanged. If a window is clamped, lookups by the adjusted key miss and both the C++ ctor andget_buffers()fall back to the scalar defaults for that pool.Also applies to: 607-608
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 554 - 561, The adjusted attention-window sizes returned by _validate_and_adjust_attention_windows may rewrite self.max_attention_window_vec, but the per-window override dicts (size_per_head_per_window and dtype_per_window) are still using the original keys; update those dicts after calling _validate_and_adjust_attention_windows to remap or rebuild their entries to the adjusted window sizes (i.e., iterate the original override entries and replace keys with the corresponding values from the new self.max_attention_window_vec or reconstruct the dicts from the adjusted vector) so lookups in the C++ ctor and get_buffers() use the clamped keys rather than falling back to scalar defaults.
1412-1429:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse the per-window head_dim/dtype maps when sizing mixed-shape VSWA pools.
Both sizing paths still multiply by
self.head_dimanddtype/self.dtype, but in the unified manager those are only fallback scalars. For Gemma-4-style mixed windows, that budgets every pool as if it had the fallback shape, which under-allocates smaller pools and skewsblocks_per_windowaway from the actual C++ layout.Also applies to: 1584-1639
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 1412 - 1429, The sizing loop currently uses the fallback self.head_dim and dtype for all windows; update it to use the per-window/per-layer head-dim and dtype maps when computing cache_size_per_token and cache_size_bytes_per_token so mixed-shape VSWA pools are sized correctly: inside calculate_cache_size_per_token (and where you call get_size_in_bytes and the NVFP4 scaling branch) look up the actual head_dim and dtype for the current window/layers (e.g., use self.head_dim_map or self.head_dim_per_layer and self.dtype_map/self.dtype_per_layer) instead of self.head_dim/self.dtype, compute per-layer/per-window contributions accordingly, and then sum them to produce required_mem_bytes_per_seq and to feed KVCacheManager.calculate_scaling_factor_size_bytes so blocks_per_window reflects the real C++ layout.tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py (1)
1-2:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMissing NVIDIA copyright header and legacy typing import.
Per coding guidelines, all Python source files should contain an NVIDIA copyright header with the year of latest modification and Apache 2.0 license notice. Additionally, prefer built-in
listovertyping.Listfor Python 3.10+.Suggested fix
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + from types import SimpleNamespace -from typing import List, Optional +from typing import OptionalAnd at line 49, update the return type:
- def get_export_infos(self, model: nn.Module) -> List[SubModuleExportInfo]: + def get_export_infos(self, model: nn.Module) -> list[SubModuleExportInfo]:Based on learnings: In TensorRT-LLM (Python requires >=3.10), you can use Python 3.10+ features (e.g., PEP 585 generics like
list[str]) throughout the codebase.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py` around lines 1 - 2, Add the NVIDIA copyright + Apache-2.0 header at the top of the file, remove the legacy typing imports, and replace usages of typing.List and typing.Optional with Python 3.10+ builtins/generics: delete "from typing import List, Optional", keep "from types import SimpleNamespace", change any "List[T]" to "list[T]" and any "Optional[T]" to "T | None", and update the function/method return annotation referenced around line 49 to use the new "list[...]" or "| None" form as appropriate (adjust imports accordingly so no typing.List/Optional remain).
🧹 Nitpick comments (4)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
8698-8720: ⚡ Quick winAvoid GPU pool allocation for a pure
GenerationRequestbookkeeping testThis test targets
GenerationRequestfront-eviction counters, but it currently builds a fullBlockManagerand callsallocatePools(). That adds unnecessary CUDA dependency and potential resource flakiness unrelated to what is being asserted. Prefer constructing the window metadata directly (or via a lightweight helper) and instantiateGenerationRequestwithout allocating pools.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` around lines 8698 - 8720, The test unnecessarily constructs a full BlockManager and calls BlockManager::allocatePools() for a pure GenerationRequest bookkeeping test; instead create the needed windowSizeToMetadata/BlocksPerWindow data directly (or via a small helper) and construct a GenerationRequest with that metadata so no GPU pools are allocated. Replace the BlockManager instantiation and blockManager.allocatePools() calls with construction of the BlocksPerWindow / maxAttentionWindowVec used in the BlockManager call, then pass that metadata into GenerationRequest (or a test helper) so only removeFrontBlock/clearCacheBlocks/getNumFrontBlocksRemoved logic on GenerationRequest is exercised without invoking BlockManager::allocatePools or creating a CudaStream.tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py (1)
897-898: 💤 Low valueConsider moving
import mathto module level.The
mathmodule is imported inside two test functions (lines 897 and 974). Moving it to the top of the file with other imports improves consistency and avoids repeated import overhead.Also applies to: 974-975
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py` around lines 897 - 898, Move the two local "import math" statements out of the test functions and place a single "import math" at the module level with the other imports; locate the inline imports by searching for "import math" inside the test functions in this test_kv_cache module and remove them there so all tests use the top-level module import instead.tests/integration/defs/accuracy/test_llm_api_autodeploy.py (2)
1104-1155: Consider a perf-list follow-up for this KV-cache path.This gives useful functional coverage, and QA list updates look unnecessary because it stays within the existing accuracy suite. Separately, the PR changes KV-cache management / CUDA-graph behavior, so only correctness is protected here; if you want pre-merge and scheduled regression detection, add or confirm a perf entry in
tests/integration/test_lists/test-db/l0_perf.ymland the matchingtests/integration/test_lists/qa/llm_perf_*.yml.As per coding guidelines: "If the PR touches performance-sensitive paths (attention kernels, MoE routing/dispatch, KV cache management, scheduler, batching logic, CUDA graph capture, speculative decoding, or quantization kernels), check whether a perf test entry is present or updated in: (a) tests/integration/test_lists/test-db/l0_perf.yml ... and (b) tests/integration/test_lists/qa/llm_perf_*.yml."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py` around lines 1104 - 1155, Add a perf-test entry for this KV-cache/CUDA-graph path: create or update an l0_perf.yml entry and a matching qa llm_perf_*.yml entry that reference the new/changed scenario exercised by test_bf16 (the AutoDeployLLM invocation using model_factory="Gemma4ForConditionalGeneration", compile_backend="torch-cudagraph", attn_backend="triton_paged", kv_cache_config and cuda_graph_config), so scheduled and pre-merge perf regressions are caught; ensure the perf test captures relevant config tags (kv-cache, cuda-graph, compile_backend) and points to the appropriate benchmark task(s) (e.g., MMLU/GSM8K) used in the test.
1100-1102: ⚡ Quick winResolve Gemma through
hf_id_to_local_model_dir()instead of hard-coding the cache layout.This is the only test here that bypasses the shared resolver. Hard-coding
.../gemma/gemma-4-26B-A4B-itmakes the test depend on one on-disk layout and skips any alias/symlink handling the helper already centralizes.♻️ Proposed change
- MODEL_PATH = f"{llm_models_root()}/gemma/gemma-4-26B-A4B-it" + MODEL_PATH = hf_id_to_local_model_dir(MODEL_NAME)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py` around lines 1100 - 1102, The test hard-codes MODEL_PATH to llm_models_root()/gemma/... which bypasses the shared resolver; replace the manual path with a call to hf_id_to_local_model_dir(MODEL_NAME) so the test uses the central resolver (keep MODEL_NAME and EXTRA_EVALUATOR_KWARGS unchanged), i.e. set MODEL_PATH = hf_id_to_local_model_dir(MODEL_NAME) and import or reference hf_id_to_local_model_dir where used.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:
- Around line 940-951: getSizePerHead() can return -1 for recurrent-state pools
created by WindowBlockManager::WindowBlockManager(), which then propagates
through getSizePerHeadPerWindow()/getSizePerHeadForWindow() and the nanobind
API; change getSizePerHead() to guard against negative sizePerHead by checking
mPools.front().sizePerHead and returning 0 when it is less than 0 (keep the
existing behavior of returning 0 when mPools.empty()), so callers never observe
a negative head dimension.
In `@examples/auto_deploy/model_registry/configs/gemma4_moe.yaml`:
- Line 13: Remove the duplicate max_batch_size entry nested inside the
cuda_graph_config block so the file only defines max_batch_size once at the top
level; specifically edit the gemma4_moe.yaml to delete the nested
"max_batch_size" key under "cuda_graph_config" and leave only the top-level
"max_batch_size: 512", keeping "cuda_graph_config" containing just the
"batch_sizes" array to match the other configs.
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 1326-1356: When cache_loc_per_group is provided but may be
partial, missing groups silently reuse prior staged values; fix by normalizing
the per-group maps before staging: for each group_idx in
1..self.num_window_groups-1, ensure cache_loc_per_group, cu_num_pages_per_group,
extra_page_per_seq_per_group, and kv_page_offset_per_group contain an entry (use
cache_loc, cu_num_pages, extra_page_per_seq, and kv_page_offset or a
per-sequence [0]*self.num_sequences for kv_page_offset as the group-0 defaults),
then proceed to call self._stage_arg for each suffix as currently done;
alternatively raise a clear error if any map is missing an entry — implement
this normalization where the current "if cache_loc_per_group is not None:" loop
runs so subsequent calls to self._stage_arg always see explicit values for every
group.
In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.py`:
- Around line 1163-1174: The current max_cached = page_counts *
kv_cache.shape[3] overcounts tokens by treating every page as full; instead
compute the actual addressable tokens per sample as full_pages * page_size +
last_page_len (and zero when page_counts==0). Replace the max_cached line with
logic using page_counts, kv_cache.shape[3] (page_size) and the per-sample
last-page lengths (e.g. cu_last_page_lens): full_pages = torch.clamp(page_counts
- 1, min=0); max_cached = full_pages * page_size + torch.where(page_counts > 0,
cu_last_page_lens, torch.zeros_like(cu_last_page_lens)); then continue to use
cache_len_capped and seq_len_with_cache as before.
In `@tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py`:
- Around line 887-923: The loop currently sets num_active =
len(all_indices[front_removed:]) so the "extra" slot is never extracted; instead
compute the intended active block count first (e.g., active_count = determine
the number of active pages for this window — often window_size or a configured
per-group active size), then slice all_indices into active_pages =
all_indices[front_removed : front_removed + active_count] and extra_page =
all_indices[front_removed + active_count] if that index exists else -1; replace
uses of active_indices/num_active with active_pages/len(active_pages) and set
extra_page_per_seq (and extra_page_per_seq_per_group) from extra_page, updating
cache_loc, cache_loc_per_group, cu_num_pages, cu_num_pages_per_group,
kv_page_offset and kv_page_offset_per_group accordingly (refer to
variables/all_indices, front_removed, cache_loc, extra_page_per_seq,
cache_loc_per_group, extra_page_per_seq_per_group, cu_num_pages,
cu_num_pages_per_group in ad_executor.py).
In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py`:
- Around line 121-131: This file is missing the required NVIDIA/Apache header;
add the repository-standard copyright/license header (with the latest
modification year) to the top of
tensorrt_llm/_torch/auto_deploy/shim/interface.py, before any imports or code,
following the same format used in other source files (NVIDIA copyright line +
Apache 2.0 license block and SPDX identifier), so the module (containing symbols
like KVCacheManager, MambaHybridCacheManager, self._kv_cache_manager,
self._kv_group_windows) complies with the project's header policy.
- Around line 916-924: The transform-time _kv_group_windows can become stale
because KVCacheManager may clamp windows during construction; after calling
self._assign_kv_cache_views(...) you must refresh the cached group windows from
the manager so downstream callers (e.g., ad_executor.py using get_cache_indices
and get_num_front_blocks_removed) see the finalized per-window sizes—update
self._kv_group_windows from the KVCacheManager's authoritative windows (via
self._kv_cache_manager.windows or the manager API that exposes per-window sizes)
immediately after _assign_kv_cache_views returns, keeping _kv_cache_config_tuned
and block_offset_multiplier logic unchanged.
- Around line 814-823: The branch that returns self.info.max_seq_len when
total_max_tokens is None and SWA is present incorrectly overrides an explicit
VSWA budget stored on self._kv_cache_config_original; update the logic in the
function (around the total_max_tokens /
self._has_swa_window(size_per_head_per_window) check) to first detect and
preserve an existing user-set max_tokens on self._kv_cache_config_original
(e.g., check getattr(self._kv_cache_config_original, "max_tokens", None) or
similar) and return that value when present instead of collapsing to
self.info.max_seq_len, so that _prepare_kv_cache_config and config.max_tokens
keep the intended budget.
In `@tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py`:
- Line 343: The loop in kvcache.py uses enumerate but never uses the index
variable (for idx, (attn_node, qkv, cache_in_nodes, constants, group_idx) in
enumerate(layer_infos)), triggering Ruff B007; fix by either removing enumerate
and iterating directly over layer_infos (for attn_node, qkv, cache_in_nodes,
constants, group_idx in layer_infos) or rename the index to _idx if you want to
keep enumerate, updating the loop header in the same location.
- Around line 319-350: Pass 2 only replaces per-layer standard metadata
(layer_meta_nodes_std) but leaves group-sensitive inputs in meta_nodes_extra
untouched, so multi-group backends like TrtllmAttention (which populate
meta_nodes_extra via prepare_trtllm_metadata) still reference group-0 nodes;
update the loop over layer_infos to also build a per-group copy of
meta_nodes_extra (e.g., layer_meta_nodes_extra = list(meta_nodes_extra)) for
group_idx > 0 and, using the same std_arg_names / vswa_group_nodes mapping you
use for layer_meta_nodes_std, replace any entries in layer_meta_nodes_extra
whose argument names match keys in vswa_group_nodes[group_idx] with the
corresponding vswa_group_nodes[group_idx][arg_name]; ensure downstream uses the
per-group layer_meta_nodes_extra where prepare_trtllm_metadata or
TrtllmAttention expects group-specific inputs so VSWA routing is correct.
In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py`:
- Around line 764-769: CachedSequenceInterface is being instantiated without the
required max_num_tokens argument; update each creation of
CachedSequenceInterface (the instances inside this test file) to pass
max_num_tokens=default_max_num_tokens(max_seq_len, batch_size) so the
constructor matches its signature—specifically add that parameter to the calls
that currently set max_seq_len, max_batch_size, device, and kv_cache_config.
---
Outside diff comments:
In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Around line 653-669: Block IDs are currently treated as global but
WindowBlockManager instances hosted under BlockManager (mWindowBlockManagers)
make IDs unique only per window; update the pin/unpin contract so pinned handles
are window-qualified (e.g., pair<windowSize, blockId> or a per-window map) and
propagate this everywhere: change BlockManager::storeBlocksForReuse() to record
window-qualified IDs when pinBlocks==true, change
BlockManager::unpinBlocksById() to group incoming handles by window and call the
appropriate WindowBlockManager unpin path (dispatch into each WindowBlockManager
in mWindowBlockManagers rather than always using the first), and update any
callers and types that assume flat IDs to accept and pass the new
window-qualified pin handle.
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 554-561: The adjusted attention-window sizes returned by
_validate_and_adjust_attention_windows may rewrite
self.max_attention_window_vec, but the per-window override dicts
(size_per_head_per_window and dtype_per_window) are still using the original
keys; update those dicts after calling _validate_and_adjust_attention_windows to
remap or rebuild their entries to the adjusted window sizes (i.e., iterate the
original override entries and replace keys with the corresponding values from
the new self.max_attention_window_vec or reconstruct the dicts from the adjusted
vector) so lookups in the C++ ctor and get_buffers() use the clamped keys rather
than falling back to scalar defaults.
- Around line 1412-1429: The sizing loop currently uses the fallback
self.head_dim and dtype for all windows; update it to use the
per-window/per-layer head-dim and dtype maps when computing cache_size_per_token
and cache_size_bytes_per_token so mixed-shape VSWA pools are sized correctly:
inside calculate_cache_size_per_token (and where you call get_size_in_bytes and
the NVFP4 scaling branch) look up the actual head_dim and dtype for the current
window/layers (e.g., use self.head_dim_map or self.head_dim_per_layer and
self.dtype_map/self.dtype_per_layer) instead of self.head_dim/self.dtype,
compute per-layer/per-window contributions accordingly, and then sum them to
produce required_mem_bytes_per_seq and to feed
KVCacheManager.calculate_scaling_factor_size_bytes so blocks_per_window reflects
the real C++ layout.
In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py`:
- Around line 1-2: Add the NVIDIA copyright + Apache-2.0 header at the top of
the file, remove the legacy typing imports, and replace usages of typing.List
and typing.Optional with Python 3.10+ builtins/generics: delete "from typing
import List, Optional", keep "from types import SimpleNamespace", change any
"List[T]" to "list[T]" and any "Optional[T]" to "T | None", and update the
function/method return annotation referenced around line 49 to use the new
"list[...]" or "| None" form as appropriate (adjust imports accordingly so no
typing.List/Optional remain).
---
Nitpick comments:
In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp`:
- Around line 8698-8720: The test unnecessarily constructs a full BlockManager
and calls BlockManager::allocatePools() for a pure GenerationRequest bookkeeping
test; instead create the needed windowSizeToMetadata/BlocksPerWindow data
directly (or via a small helper) and construct a GenerationRequest with that
metadata so no GPU pools are allocated. Replace the BlockManager instantiation
and blockManager.allocatePools() calls with construction of the BlocksPerWindow
/ maxAttentionWindowVec used in the BlockManager call, then pass that metadata
into GenerationRequest (or a test helper) so only
removeFrontBlock/clearCacheBlocks/getNumFrontBlocksRemoved logic on
GenerationRequest is exercised without invoking BlockManager::allocatePools or
creating a CudaStream.
In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py`:
- Around line 1104-1155: Add a perf-test entry for this KV-cache/CUDA-graph
path: create or update an l0_perf.yml entry and a matching qa llm_perf_*.yml
entry that reference the new/changed scenario exercised by test_bf16 (the
AutoDeployLLM invocation using model_factory="Gemma4ForConditionalGeneration",
compile_backend="torch-cudagraph", attn_backend="triton_paged", kv_cache_config
and cuda_graph_config), so scheduled and pre-merge perf regressions are caught;
ensure the perf test captures relevant config tags (kv-cache, cuda-graph,
compile_backend) and points to the appropriate benchmark task(s) (e.g.,
MMLU/GSM8K) used in the test.
- Around line 1100-1102: The test hard-codes MODEL_PATH to
llm_models_root()/gemma/... which bypasses the shared resolver; replace the
manual path with a call to hf_id_to_local_model_dir(MODEL_NAME) so the test uses
the central resolver (keep MODEL_NAME and EXTRA_EVALUATOR_KWARGS unchanged),
i.e. set MODEL_PATH = hf_id_to_local_model_dir(MODEL_NAME) and import or
reference hf_id_to_local_model_dir where used.
In
`@tests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py`:
- Around line 897-898: Move the two local "import math" statements out of the
test functions and place a single "import math" at the module level with the
other imports; locate the inline imports by searching for "import math" inside
the test functions in this test_kv_cache module and remove them there so all
tests use the top-level module import instead.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7d9a15af-c921-4ad0-9553-71fb1e604d70
📒 Files selected for processing (19)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.hcpp/tensorrt_llm/batch_manager/kvCacheManager.cppcpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cppcpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cppexamples/auto_deploy/model_registry/configs/gemma4_moe.yamltensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.pytensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.pytensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_paged_attention.pytensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.pytensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.pytensorrt_llm/_torch/auto_deploy/shim/ad_executor.pytensorrt_llm/_torch/auto_deploy/shim/interface.pytensorrt_llm/_torch/auto_deploy/transform/library/kvcache.pytensorrt_llm/_torch/pyexecutor/resource_manager.pytests/integration/defs/accuracy/references/gsm8k.yamltests/integration/defs/accuracy/references/mmlu.yamltests/integration/defs/accuracy/test_llm_api_autodeploy.pytests/unittest/auto_deploy/singlegpu/shim/test_cached_sequence_interface.pytests/unittest/auto_deploy/singlegpu/transformations/library/test_kv_cache.py
e32e499 to
9fb2e92
Compare
|
/bot run --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" |
|
/bot run --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" |
|
PR_Github #46738 [ run ] triggered by Bot. Commit: |
|
PR_Github #46739 [ run ] triggered by Bot. Commit: |
|
PR_Github #46739 [ run ] completed with state
|
|
/bot run --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" |
|
PR_Github #46795 [ run ] triggered by Bot. Commit: |
|
PR_Github #46795 [ run ] completed with state
|
|
/bot run --disable-fail-fast --extra-stage "DGX_H100-1_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1" |
|
PR_Github #46803 [ run ] triggered by Bot. Commit: |
|
PR_Github #46803 [ run ] completed with state
|
|
/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1" |
1 similar comment
|
/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1" |
|
PR_Github #46896 [ run ] triggered by Bot. Commit: |
The fixture set max_seq_len=128 while the eviction and long-prefill cases constructed 129- and 192-token requests; kvCacheManager caps allocation at max(windowSize, maxSequenceLength) so the helper saw fewer live pages than expected. Bump FULL_WINDOW to 256 so the SWA pool can hold every scenario. Signed-off-by: Yueh-Ting Chen <[email protected]>
|
/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1" |
|
PR_Github #50400 [ run ] triggered by Bot. Commit: |
|
PR_Github #50400 [ run ] completed with state
|
|
/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1" |
|
PR_Github #50507 [ run ] triggered by Bot. Commit: |
Add diagnostic ad_logger.debug statements to help isolate why test_gemma3n_e2b_it fails on GSM8K (256-token decode) after MMLU (2-token decode) passes. Two traces: - ad_executor.py: per (request, pool) trace after _compute_window_local_view, surfacing front_removed, num_active, extra_page, active_token_count, last_page_len, len(all_indices). - attention_interface.py: in offset_pos_and_cache_, conditional on any delta != 0 (page-boundary rollover), surface delta, extra_idx, and a smoking-gun bool indicating whether extra_page=-1 is about to be rotated into cache_loc by adjust_ragged_triton. Enable with AUTO_DEPLOY_LOG_LEVEL=debug. The conditional in offset_pos_and_cache_ keeps the log quiet on non-rollover steps. Signed-off-by: Yueh-Ting Chen <[email protected]>
Standalone repro for the GSM8K-style failure in TestGemmaE2B::test_gemma3n_e2b_it.
Loads the model via AutoDeployLLM with the same gemma3n_e2b_it.yaml config the
failing test uses (FlashInfer backend, multi-pool VSWA path), builds a
~prompt-tokens long synthetic prompt, and generates max-new-tokens greedily.
Defaults reproduce the failing-CI conditions in <1 minute:
prompt-tokens=1024 (> 512 SWA window, fires eviction on first decode)
max-new-tokens=256 (~8 SWA page boundaries crossed during decode)
max-seq-len=4352 (matches the test's MAX_SEQ_LEN)
Intended use with the swa-trace debug logging:
AUTO_DEPLOY_LOG_LEVEL=debug \
python examples/auto_deploy/repro_swa_gemma3n.py \
--model-path /path/to/google/gemma-3n-E2B-it \
2>&1 | tee /tmp/swa-trace.log
grep "[swa-trace]" /tmp/swa-trace.log
Signed-off-by: Yueh-Ting Chen <[email protected]>
The PR-added clamp on the reverse iterator in
WindowBlockManager::releaseBlocks:
for (auto it = allocatedBlocks.rbegin();
it != allocatedBlocks.rend() - sequence.getNumFrontBlocksRemoved(mWindowSize);
++it)
is unnecessary and a potential source of off-by-one corruption under
SWA front-eviction. The loop body already handles placeholder sentinels
safely (placeholders have mRefCount==0, so decRefCount is skipped and
EvictionPolicy::releaseBlock is "silently ignored" per the existing
comment). Restore the original full-range reverse iteration.
This removes one suspect in the GSM8K accuracy regression on
TestGemmaE2B::test_gemma3n_e2b_it. Tracking the radix-tree /
storeBlocksForReuse interaction with SWA eviction (the larger suspect)
remains as a separate follow-up.
Signed-off-by: Yueh-Ting Chen <[email protected]>
prepare_flashinfer_metadata_host -> plan_generate_only iterates ALL
cached_cuda_graph_decode_wrappers unconditionally, refreshing each
wrapper's pre-allocated indptr/indices/last_page_len buffers with the
same single-pool (unsuffixed) cache_loc + cu_num_pages + last_page_len.
In single-pool deployments (main: no sliding_window on the KV handler,
no pool partitioning) this was harmless -- one wrapper, one pool.
In multi-pool VSWA (this PR: FlashInfer declares sliding_window on its
KVPagedResourceHandler so the kvcache transform splits SWA + full-attn
layers into separate pools), pool 0's cache_loc was being written into
EVERY cached wrapper's buffers, including the full-attention pool's
wrapper. Block IDs are pool-local (each WindowBlockManager has its
own mAllBlocksById), so the full-attention kernel then reads from its
own KV memory at the SWA pool's block-ID coordinates -- unrelated
physical pages. Visible as: MMLU passes (2-token decode tolerates
the stomp), GSM8K collapses (256-token decode compounds the error
into off-by-one arithmetic / prompt regurgitation).
Fix:
- flashinfer_attention.py: plan_generate_only takes an optional
pool_window_left filter; when set, only refreshes wrappers whose
plan_params.window_left matches. prepare_flashinfer_metadata_host
accepts and forwards pool_window_left.
- transform/library/kvcache.py: when the backend's host-prep declares
pool_window_left AND we have 2+ KV pools, register the host-prep
once per pool with per-pool arg names (pool 0 unsuffixed, pools
1..N-1 with _g{i} / _g{i}_host) and a closure that remaps those
suffixed kwargs back to the function's canonical parameter names
before delegating, plus a bound pool_window_left = sliding_window-1
(SWA) or -1 (full). Single-pool and non-opt-in backends retain
legacy single registration.
The host-prep call site (run_host_prepare_for_attention_forward) invokes
each registered function as host_function(**{arg: get_arg(arg)}); the
closure is required because the suffixed arg names become the kwarg
keys, but the underlying host-prep function only declares the
unsuffixed parameter names.
_process_metadata_host is moved from the pre-pass-1 location to after
register_window_groups in each of the three group-setup branches, so
handler_groups is available at registration time.
Signed-off-by: Yueh-Ting Chen <[email protected]>
The fix for the GSM8K accuracy regression (0b76919 "per-pool host-prep for FlashInfer VSWA") landed, so the diagnostic instrumentation added for the investigation can come out. Reverts: - 20c0ff6 (ad_logger.debug swa-trace lines in ad_executor.py and attention_interface.py) - c5551ce (examples/auto_deploy/repro_swa_gemma3n.py) Signed-off-by: Yueh-Ting Chen <[email protected]>
|
/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1" |
1 similar comment
|
/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1" |
|
PR_Github #50542 [ run ] triggered by Bot. Commit: |
|
PR_Github #50507 [ run ] completed with state |
|
PR_Github #50542 [ run ] completed with state
|
|
/bot run --disable-fail-fast --extra-stage "DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, H100_PCIe-AutoDeploy-1" |
|
PR_Github #50646 [ run ] triggered by Bot. Commit: |
|
PR_Github #50646 [ run ] completed with state
|
|
/bot skip --comment "The only failing pipeline |
|
PR_Github #50741 [ skip ] triggered by Bot. Commit: |
|
PR_Github #50741 [ skip ] completed with state |
…-window models - PR NVIDIA#13745 made trtllm `get_cache_initializers` propagate `sliding_window`, so models with non-uniform windows (e.g. gpt-oss-120b: 128/4096) form >1 KV pool. - trtllm enforces a single uniform pool (`requires_uniform_kv_caches`) → crash at cache_init: "KV resources are not uniform". - Temporary fix: trtllm-only revert of the pool-alloc change (handler `sliding_window=0` → single full-seq pool, SWA layers over-allocate KV); the sliding-window mask is still applied via the op's own `sliding_window` arg. Validated gsm8k[120b]=90.4. - Re-enabling multi-pool needs trtllm-native VSWA: per-pool `block_offsets` tables + real `pool_mapping` routing — the degenerate per-layer-pointer path can't address multiple pools, and flashinfer-style host-sliced views corrupt the cyclic-window trtllm kernel (naive attempt → gsm8k 6%). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Signed-off-by: Taylor Yeonbok Lee <[email protected]>
…-window models - PR NVIDIA#13745 made trtllm `get_cache_initializers` propagate `sliding_window`, so models with non-uniform windows (e.g. gpt-oss-120b: 128/4096) form >1 KV pool. - trtllm enforces a single uniform pool (`requires_uniform_kv_caches`) → crash at cache_init: "KV resources are not uniform". - Temporary fix: trtllm-only revert of the pool-alloc change (handler `sliding_window=0` → single full-seq pool, SWA layers over-allocate KV); the sliding-window mask is still applied via the op's own `sliding_window` arg. Validated gsm8k[120b]=90.4. - Re-enabling multi-pool needs trtllm-native VSWA: per-pool `block_offsets` tables + real `pool_mapping` routing — the degenerate per-layer-pointer path can't address multiple pools, and flashinfer-style host-sliced views corrupt the cyclic-window trtllm kernel (naive attempt → gsm8k 6%). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Signed-off-by: Taylor Yeonbok Lee <[email protected]>
…-window models - PR NVIDIA#13745 made trtllm `get_cache_initializers` propagate `sliding_window`, so models with non-uniform windows (e.g. gpt-oss-120b: 128/4096) form >1 KV pool. - trtllm enforces a single uniform pool (`requires_uniform_kv_caches`) → crash at cache_init: "KV resources are not uniform". - Temporary fix: trtllm-only revert of the pool-alloc change (handler `sliding_window=0` → single full-seq pool, SWA layers over-allocate KV); the sliding-window mask is still applied via the op's own `sliding_window` arg. Validated gsm8k[120b]=90.4. - Re-enabling multi-pool needs trtllm-native VSWA: per-pool `block_offsets` tables + real `pool_mapping` routing — the degenerate per-layer-pointer path can't address multiple pools, and flashinfer-style host-sliced views corrupt the cyclic-window trtllm kernel (naive attempt → gsm8k 6%). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]> Signed-off-by: Taylor Yeonbok Lee <[email protected]>
Summary
head_dim=256+ global-attention layers withhead_dim=512)Test coverage