[None][feat] KV cache-aware ADP router for prefix-affinity request routing#12315
Expose countReusableBlocks via nanobind and implement cache-aware ADP routing for the C++ (v1) KV cache manager path. Requests are routed to the DP rank with the most prefix cache hits, reducing redundant prefill computation.

Changes:
- C++ nanobind: expose countReusableBlocks on BaseKVCacheManager
- Python: add KVCacheAwareADPRouter to adp_router.py
- Python: add probe_prefix_match_length to v1 KVCacheManager
- Python: wire up router creation in _util.py and py_executor.py
- Tests: 20 unit tests covering router logic and v1 probe guards

Signed-off-by: Lance Liao <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
The original scoring formula used raw active_tokens as the load term. At tens of thousands of tokens, that term overwhelmed the cache benefit term (hundreds to thousands of tokens), causing a negative feedback loop in which the router avoided cached ranks because of their higher load.

Fix: normalize the load term by the total active tokens across eligible ranks so that both terms stay on the same [0, req_tokens] scale. Also add prefix-affinity sorting to group same-conversation requests together before routing.

Benchmark (DeepSeek-R1 FP4, B200x8, 256 concurrency, multi-turn):
- B (cache router) vs C (default router): +5.4% throughput, -6.8% TTFT
- Cache hit rate: B=74.0% vs C=72.6%

Signed-off-by: Larry Liao <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
…onDpConfig

Move ADP router selection from implicit capability detection in _util.py to an explicit `enable_kv_cache_aware_routing` field in AttentionDpConfig. Add ADPRouter.create() factory method that owns the selection logic. Remove the TRTLLM_FORCE_DEFAULT_ADP_ROUTER env var hack.

Signed-off-by: Lanyu Liao <[email protected]>
Made-with: Cursor
Signed-off-by: Lanyu Liao <[email protected]>
📝 Walkthrough
This pull request introduces KV cache-aware routing for distributed attention processing. It exposes a …
Changes
Sequence Diagram(s)
sequenceDiagram
participant PyExecutor
participant ADPRouter
participant KVCacheManager
participant Distributed
PyExecutor->>ADPRouter: create(dist, kv_cache_manager, config)
ADPRouter->>ADPRouter: Instantiate KVCacheAwareADPRouter
ADPRouter-->>PyExecutor: Return configured router
PyExecutor->>PyExecutor: _fetch_new_requests()
alt needs_prefix_matches
PyExecutor->>ADPRouter: gather_prefix_matches(new_requests)
ADPRouter->>KVCacheManager: probe_prefix_match_length() per request
KVCacheManager-->>ADPRouter: cached token counts
ADPRouter->>Distributed: allgather() prefix matches across ranks
ADPRouter->>ADPRouter: Store _all_ranks_prefix_matches
end
PyExecutor->>ADPRouter: route_requests(all_rank_states, new_requests, max_active)
ADPRouter->>ADPRouter: Score requests by cache affinity + load
ADPRouter->>ADPRouter: Assign requests to best-ranked destinations
ADPRouter-->>PyExecutor: Return routed requests dict
sequenceDiagram
participant Request
participant KVCacheAwareADPRouter
participant RankState
participant ScoringLogic
Request->>KVCacheAwareADPRouter: route_requests(rank_states, requests)
KVCacheAwareADPRouter->>KVCacheAwareADPRouter: Compute expected active count
loop For each Request
KVCacheAwareADPRouter->>ScoringLogic: Has explicit dp_rank hint?
alt Explicit dp_rank + capacity available
ScoringLogic-->>KVCacheAwareADPRouter: Assign to hinted rank
else No hint
KVCacheAwareADPRouter->>ScoringLogic: Score each rank (prefix_match, load_balance)
Note over ScoringLogic: Score = effective_tokens + beta * normalized_load
ScoringLogic-->>KVCacheAwareADPRouter: Best rank
KVCacheAwareADPRouter->>RankState: Update active counts & tokens
end
end
KVCacheAwareADPRouter-->>Request: Return assignments per rank
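The routing loop in the second diagram can be sketched roughly as follows. This is a toy model, not the PR's implementation: `ToyRequest`, `route`, and the argument layout are hypothetical names, and the load normalization (rank share of total active tokens, scaled by request length) is an assumption based on the commit message. A hinted rank without free capacity falls through to scoring here, which is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyRequest:
    # Hypothetical stand-in for a queued request.
    request_id: int
    num_tokens: int
    dp_rank_hint: Optional[int] = None


def route(requests: List[ToyRequest],
          prefix_matches: List[Dict[int, int]],
          active_requests: List[int],
          active_tokens: List[int],
          max_active: int,
          beta: float = 1.0) -> Dict[int, List[int]]:
    """Greedy routing; prefix_matches[rank][request_id] is the cached token count."""
    active_requests = list(active_requests)  # work on copies
    active_tokens = list(active_tokens)
    assignments: Dict[int, List[int]] = {r: [] for r in range(len(active_requests))}
    for req in requests:
        eligible = [r for r in range(len(active_requests))
                    if active_requests[r] < max_active]
        if not eligible:
            break  # no capacity left; remaining requests stay unrouted in this sketch
        if req.dp_rank_hint is not None and req.dp_rank_hint in eligible:
            best = req.dp_rank_hint
        else:
            total = sum(active_tokens[r] for r in eligible)

            def score(rank: int) -> float:
                # Tokens that still need prefill on this rank.
                effective = req.num_tokens - prefix_matches[rank].get(req.request_id, 0)
                # Load share rescaled to the [0, num_tokens] range.
                load = req.num_tokens * active_tokens[rank] / total if total else 0.0
                return effective + beta * load

            best = min(eligible, key=score)  # lower score is better
        assignments[best].append(req.request_id)
        # Update both counters, including for hinted placements.
        active_requests[best] += 1
        active_tokens[best] += req.num_tokens
    return assignments
```

Updating the token counter on the hinted branch as well keeps later scores from treating the hinted rank as lightly loaded.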
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
1-16: ⚠️ Potential issue | 🟡 Minor
Update the NVIDIA header year to 2026.
This file is modified in this PR, but the header still ends at 2025.
As per coding guidelines, "Add NVIDIA copyright header on ALL new files, and update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp` around lines 1 - 16, Update the NVIDIA copyright header at the top of kvCacheManager.cpp by changing the year range from "2022-2025" to "2022-2026"; modify the SPDX header block (the comment spanning the top of the file including the "SPDX-FileCopyrightText" line) to reflect 2026 so the file header matches the repository guideline for modified files.
tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor
Update copyright year to include 2026.
The file is being modified in 2026, but the copyright header only covers 2022-2025. As per coding guidelines, the year should reflect the latest meaningful modification.
Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py` at line 1, Update the SPDX copyright header at the top of the module (the SPDX line / file header in tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py) to include 2026 (e.g., change "2022-2025" to "2022-2026") so the copyright years reflect the current modification year.
🧹 Nitpick comments (4)
tensorrt_llm/llmapi/llm_args.py (1)
538-543: Consider clarifying the dependency on `enable_block_reuse` in the description.
The description could mention that this feature requires `kv_cache_config.enable_block_reuse=True` to be effective. Based on `ADPRouter.create` in `adp_router.py` (lines 85-93), when block reuse is disabled, the router silently falls back to `DefaultADPRouter`. Users might enable this flag and expect cache-aware routing without realizing it's not active.
Also, there's a trailing space at the end of the description string.
Proposed improvement
 enable_kv_cache_aware_routing: bool = Field(
     default=False,
     description=
     "Use KV cache-aware routing for attention DP request distribution. "
     "When enabled, routes requests to ranks that already have matching "
-    "prefix KV cache, reducing redundant prefill computation. ")
+    "prefix KV cache, reducing redundant prefill computation. "
+    "Requires kv_cache_config.enable_block_reuse=True to take effect.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/llmapi/llm_args.py` around lines 538 - 543, Update the Field description for enable_kv_cache_aware_routing to explicitly state it only takes effect when kv_cache_config.enable_block_reuse is True (otherwise ADPRouter.create falls back to DefaultADPRouter), and remove the trailing space at the end of the description string; reference enable_kv_cache_aware_routing, kv_cache_config.enable_block_reuse, ADPRouter.create and DefaultADPRouter so users understand the dependency and silent fallback.tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
577-580: Keep the binding imports namespaced here.
Line 577 shadows the module-level `SamplingConfig` imported at Line 50. Using module aliases also makes it clearer that these are binding types rather than the Python request wrappers.
♻️ Suggested rewrite
-from tensorrt_llm.bindings import SamplingConfig
-from tensorrt_llm.bindings.internal.batch_manager import BlockKey
-from tensorrt_llm.bindings.internal.batch_manager import \
-    LlmRequest as CppLlmRequest
-block_key = BlockKey(tokens=input_tokens, lora_task_id=lora_task_id)
+import tensorrt_llm.bindings as bindings
+import tensorrt_llm.bindings.internal.batch_manager as batch_manager
+
+block_key = batch_manager.BlockKey(
+    tokens=input_tokens, lora_task_id=lora_task_id
+)
 unique_tokens = block_key.unique_tokens
-dummy_req = CppLlmRequest(request_id=0,
-                          max_new_tokens=0,
-                          input_tokens=input_tokens,
-                          sampling_config=SamplingConfig(),
-                          is_streaming=False,
-                          lora_task_id=lora_task_id)
+dummy_req = batch_manager.LlmRequest(
+    request_id=0,
+    max_new_tokens=0,
+    input_tokens=input_tokens,
+    sampling_config=bindings.SamplingConfig(),
+    is_streaming=False,
+    lora_task_id=lora_task_id,
+)
As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions" and "Avoid shadowing variables declared in an outer scope in Python".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 577 - 580, The local imports in resource_manager.py shadow the module-level SamplingConfig and break the "import module, not names" guideline; replace the three "from tensorrt_llm.bindings ..." style imports with a namespaced import (e.g., import tensorrt_llm.bindings as bindings or import tensorrt_llm.bindings.internal.batch_manager as batch_manager) and update usages to bindings.SamplingConfig, batch_manager.BlockKey and batch_manager.LlmRequest (aliasing LlmRequest to CppLlmRequest if you need that name) so the binding types remain clearly namespaced and do not shadow the existing SamplingConfig.tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
376-380: Log which router the factory selected.
Now that router selection moved inside `PyExecutor`, a one-time info log of `type(self.adp_router).__name__` would make it obvious whether KV-cache-aware routing was actually enabled or whether the factory fell back to the default router because a prerequisite was missing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 376 - 380, After ADPRouter.create is called in PyExecutor (the block that assigns self.adp_router via ADPRouter.create with dist, kv_cache_manager, and attention_dp_config), add a one-time info log that records the concrete router class selected by the factory: log type(self.adp_router).__name__ (or the equivalent) using the existing logger so it's clear whether the KV-cache-aware router was used or a fallback; ensure the log runs immediately after the self.adp_router assignment in the PyExecutor initialization flow.
70-70: Keep the router import namespaced.
Please import the module and reference `adp_router.ADPRouter` instead of importing `ADPRouter` directly. This file already pulls in a large symbol surface, and the repo rule is to keep Python imports explicit.
As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions (e.g., use `from package.subpackage import foo` then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` at line 70, The import of ADPRouter should be namespaced: replace the direct class import (ADPRouter) with a module import (adp_router) and update all usages in this file to reference adp_router.ADPRouter; specifically change the import line that currently brings in ADPRouter and then update any instantiation or type references of ADPRouter to use adp_router.ADPRouter so the module namespace is preserved and the symbol surface remains explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2526-2527: The PyExecutor currently calls
self.adp_router.gather_prefix_matches(new_requests) whenever
needs_prefix_matches is true, which triggers a tp_allgather even when
new_requests is empty; change the call site so the probe only runs when there
are actual requests (guard with if new_requests) and remove subclass-specific
setup from PyExecutor by adding an ADPRouter hook (e.g., a method on ADPRouter
like maybe_gather_prefix_matches(new_requests) or move the needs_prefix_matches
check into ADPRouter) so PyExecutor simply calls a single router method and the
router itself decides whether to call gather_prefix_matches/local tp_allgather
based on new_requests and needs_prefix_matches.
In `@tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py`:
- Around line 88-96: The factory currently returns KVCacheAwareADPRouter when
kv_cache_manager.enable_block_reuse is true but some managers (e.g.,
KVCacheManagerV2) lack the probe_prefix_match_length() API; update the selection
logic in the block that constructs KVCacheAwareADPRouter to also verify that
kv_cache_manager implements the required probing API (e.g.,
hasattr(kv_cache_manager, "probe_prefix_match_length") or an isinstance check
against the KVCacheManager type) before returning KVCacheAwareADPRouter, and
otherwise fall back to DefaultADPRouter so gather_prefix_matches() won't raise
an AttributeError.
- Around line 429-437: When you assign a request to a target data-parallel rank
in the block that checks scheduling_params.attention_dp_rank, also record
explicit per-request placement in the token-load tracker so future score
calculations see this placement; update the same branch that increments
all_ranks_num_active_requests and appends to all_ranks_new_requests to also
register the placement for req_item with the token-load tracker (the component
that tracks token load), using the scheduling_params.attention_dp_rank as the
placement rank so later scoring no longer treats that rank as lightly loaded.
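
The factory guard suggested in the first `adp_router.py` comment above might look roughly like the sketch below. The classes here are toy stand-ins rather than the real `ADPRouter` hierarchy, and `create_router` is a hypothetical name; only the `probe_prefix_match_length` and `enable_block_reuse` attribute names come from the review text.

```python
class ToyDefaultRouter:
    """Stand-in for the load-balanced default router."""


class ToyCacheAwareRouter:
    """Stand-in for the cache-aware router; needs a manager it can probe."""

    def __init__(self, kv_cache_manager):
        self.kv_cache_manager = kv_cache_manager


def create_router(kv_cache_manager, enable_kv_cache_aware_routing: bool):
    # Cache-aware routing is only usable when the manager reuses blocks
    # AND exposes the probing API; otherwise fall back to the default.
    block_reuse = getattr(kv_cache_manager, "enable_block_reuse", False)
    can_probe = callable(
        getattr(kv_cache_manager, "probe_prefix_match_length", None))
    if enable_kv_cache_aware_routing and block_reuse and can_probe:
        return ToyCacheAwareRouter(kv_cache_manager)
    return ToyDefaultRouter()
```

Checking for the method up front moves the failure from a later `AttributeError` during probing to a silent fallback at construction time, which is why logging the selected router class is a useful companion.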
ℹ️ Review info
📒 Files selected for processing (8)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py
tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/executor/test_kvcache_aware_router.py
Signed-off-by: Lanyu Liao <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
/bot run --disable-fail-fast
/bot run --disable-fail-fast
PR_Github #39447 [ run ] triggered by Bot. Commit:
Signed-off-by: Lanyu Liao <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
/bot run --disable-fail-fast
/bot run --disable-fail-fast
PR_Github #40343 [ run ] triggered by Bot. Commit:
PR_Github #40343 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40403 [ run ] triggered by Bot. Commit:
PR_Github #40403 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40424 [ run ] triggered by Bot. Commit:
PR_Github #40424 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40457 [ run ] triggered by Bot. Commit:
PR_Github #40457 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40474 [ run ] triggered by Bot. Commit:
PR_Github #40474 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40486 [ run ] triggered by Bot. Commit:
PR_Github #40486 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40493 [ run ] triggered by Bot. Commit:
PR_Github #40493 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40495 [ run ] triggered by Bot. Commit:
PR_Github #40495 [ run ] completed with state
/bot skip --comment "Flaky multi-GPU nemotron test"
PR_Github #40525 [ skip ] triggered by Bot. Commit:
PR_Github #40525 [ skip ] completed with state
Summary
Add a KV cache-aware request router (`KVCacheAwareADPRouter`) for attention data
parallelism (ADP). When enabled, new requests are routed to the DP rank that
already holds the longest matching prefix in its radix tree, reducing redundant
prefill computation in multi-turn conversation workloads.
Motivation
With the default load-balanced ADP router, requests from the same conversation
may land on different DP ranks across turns, causing each rank to recompute the
full prefix. By probing each rank's KV cache radix tree before routing, we can
steer requests to ranks that already cache their prefix, significantly improving
KV cache hit rate and throughput.
Changes
- adp_router.py: Add `KVCacheAwareADPRouter` with:
  - `probe_prefix_match_length` + allgather
  - `score = effective_tokens + β * normalized_load` (lower = better)
  - `ADPRouter.create()` factory that selects the router based on config
- llm_args.py: Add `enable_kv_cache_aware_routing` field to `AttentionDpConfig`
- py_executor.py: Use `ADPRouter.create()` factory; call `gather_prefix_matches` before `route_requests` when the router needs prefix data
- scheduler/__init__.py: Export `KVCacheAwareADPRouter`
- … edge cases, and factory method selection
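
To make the scoring term concrete, here is a minimal sketch comparing a raw load term with a normalized one. The exact definitions are assumptions, not taken from the PR's code: `effective_tokens` is treated as request tokens minus the matched prefix, and `normalized_load` as the rank's share of total active tokens scaled by the request length. `score_rank` and `raw_score` are hypothetical helper names.

```python
def score_rank(req_tokens: int, prefix_match: int, active_tokens: int,
               total_active_tokens: int, beta: float = 1.0) -> float:
    """Lower is better. Both terms lie in [0, req_tokens]."""
    effective_tokens = req_tokens - prefix_match
    if total_active_tokens > 0:
        normalized_load = req_tokens * active_tokens / total_active_tokens
    else:
        normalized_load = 0.0
    return effective_tokens + beta * normalized_load


def raw_score(req_tokens: int, prefix_match: int, active_tokens: int,
              beta: float = 1.0) -> float:
    """The earlier, unnormalized variant: load dominates at large token counts."""
    return (req_tokens - prefix_match) + beta * active_tokens


# A 1000-token request: rank 0 caches 800 tokens but is busier than rank 1.
ranks = [
    {"prefix_match": 800, "active_tokens": 60_000},
    {"prefix_match": 0, "active_tokens": 40_000},
]
total = sum(r["active_tokens"] for r in ranks)

normalized = [score_rank(1000, r["prefix_match"], r["active_tokens"], total)
              for r in ranks]
raw = [raw_score(1000, r["prefix_match"], r["active_tokens"]) for r in ranks]

best_normalized = normalized.index(min(normalized))  # picks the cached rank
best_raw = raw.index(min(raw))                       # picks the idle rank
```

In this toy case the normalized score sends the request to the rank holding the cached prefix, while the raw score sends it to the less loaded rank and recomputes the full prefill.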
Benchmark
DeepSeek-V3.2 FP4, 8×B200, EP8+DP8, 256 concurrency, multi-turn scenario:
Usage
attention_dp_config:
enable_kv_cache_aware_routing: true
Limitations / Future Work
The scoring formula (`effective_tokens + β * normalized_load`) is a
simple baseline. The load normalization and the weight β (default 1.0, tunable
via `TRTLLM_CACHE_ROUTER_BETA`) were chosen empirically and may not be optimal
for all workload patterns (e.g., skewed conversation lengths, bursty arrivals).
More sophisticated approaches, such as adaptive β based on system utilization
or incorporating predicted generation length, are left for future iterations.
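
As an illustration of the β override, the sketch below reads `TRTLLM_CACHE_ROUTER_BETA` with a default of 1.0. The helper name `read_beta` and the fallback behavior for malformed or negative values are assumptions; the PR description does not say how invalid values are handled.

```python
import os
from typing import Mapping, Optional


def read_beta(default: float = 1.0,
              env: Optional[Mapping[str, str]] = None) -> float:
    """Return the load weight, falling back to the default on bad input."""
    env = os.environ if env is None else env
    raw = env.get("TRTLLM_CACHE_ROUTER_BETA")
    if raw is None:
        return default
    try:
        beta = float(raw)
    except ValueError:
        return default  # assumption: ignore unparsable values
    # Assumption: a negative (or NaN) weight would reward load, so reject it.
    return beta if beta >= 0 else default
```

A larger β shifts the router toward load balancing, and a smaller β shifts it toward cache affinity; setting it to 0 makes the score depend on prefix matches alone.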