[None][feat] KV cache-aware ADP router for prefix-affinity request routing by lancelly · Pull Request #12315 · NVIDIA/TensorRT-LLM

lancelly · 2026-03-18T08:47:04Z

Summary

Add a KV cache-aware request router (KVCacheAwareADPRouter) for attention data
parallelism (ADP). When enabled, new requests are routed to the DP rank that
already holds the longest matching prefix in its radix tree, reducing redundant
prefill computation in multi-turn conversation workloads.

Motivation

With the default load-balanced ADP router, requests from the same conversation
may land on different DP ranks across turns, causing each rank to recompute the
full prefix. By probing each rank's KV cache radix tree before routing, we can
steer requests to ranks that already cache their prefix, significantly improving
KV cache hit rate and throughput.

Changes

- adp_router.py: Add KVCacheAwareADPRouter with:
  - Per-request prefix probing via probe_prefix_match_length + allgather
  - Scoring formula: score = effective_tokens + β * normalized_load (lower = better)
  - Prefix-affinity sorting to group same-conversation requests before routing
  - ADPRouter.create() factory that selects the router based on config
- llm_args.py: Add enable_kv_cache_aware_routing field to AttentionDpConfig
- py_executor.py: Use ADPRouter.create() factory; call gather_prefix_matches
  before route_requests when the router needs prefix data
- scheduler/__init__.py: Export KVCacheAwareADPRouter
- Tests: 20 unit tests covering scoring, load balancing, prefix affinity,
  edge cases, and factory method selection
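
The prefix-affinity sorting step listed above could look roughly like the sketch below. This is an illustration only, not the code in adp_router.py: the function name `sort_by_prefix_affinity`, the `prefix_len` parameter, and the choice of grouping by the first few tokens are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def sort_by_prefix_affinity(
    requests: Sequence[Sequence[int]], prefix_len: int = 4
) -> List[Sequence[int]]:
    """Reorder requests so that those sharing a leading prefix are adjacent.

    Hypothetical helper: groups are keyed by the first `prefix_len` tokens and
    ordered by the arrival index of the first request in each group.
    """
    first_seen: Dict[Tuple[int, ...], int] = {}
    for idx, tokens in enumerate(requests):
        first_seen.setdefault(tuple(tokens[:prefix_len]), idx)
    # sorted() is stable, so arrival order is preserved inside each group.
    return sorted(requests, key=lambda t: first_seen[tuple(t[:prefix_len])])


if __name__ == "__main__":
    batch = [[1, 2, 3, 4, 5], [9, 9, 9, 9], [1, 2, 3, 4, 6], [9, 9, 9, 9, 1]]
    for tokens in sort_by_prefix_affinity(batch):
        print(tokens)
```

Grouping before routing means that consecutive requests from one conversation are scored against the same rank state, so they tend to land on the same rank.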

Benchmark

DeepSeek-V3.2 FP4, 8×B200, EP8+DP8, 256 concurrency, multi-turn scenario:

Metric	Default Router	KV Cache-Aware Router	Delta
Weighted cache hit rate	82.1%	93.1%	+11pp
TTFT (mean)	1.13s	0.70s	-38%
ITL (mean)	51.3ms	32.6ms	-37%

Usage

attention_dp_config:
  enable_kv_cache_aware_routing: true

Limitations / Future Work

The current scoring formula (effective_tokens + β * normalized_load) is a
simple baseline. The load normalization and the weight β (default 1.0, tunable
via TRTLLM_CACHE_ROUTER_BETA) were chosen empirically and may not be optimal
for all workload patterns (e.g., skewed conversation lengths, bursty arrivals).
More sophisticated approaches — such as adaptive β based on system utilization,
or incorporating predicted generation length — are left for future iterations.

Expose countReusableBlocks via nanobind and implement cache-aware ADP routing for the C++ (v1) KV cache manager path. Requests are routed to the DP rank with the most prefix cache hits, reducing redundant prefill computation.

Changes:
- C++ nanobind: expose countReusableBlocks on BaseKVCacheManager
- Python: add KVCacheAwareADPRouter to adp_router.py
- Python: add probe_prefix_match_length to v1 KVCacheManager
- Python: wire up router creation in _util.py and py_executor.py
- Tests: 20 unit tests covering router logic and v1 probe guards

Signed-off-by: Lance Liao <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>

The original scoring formula used raw active_tokens as the load term. At tens of thousands of tokens, that term overwhelmed the cache benefit term (hundreds to thousands of tokens), causing a negative feedback loop in which the router avoided cached ranks because of their higher load.

Fix: normalize the load term by the total active tokens across eligible ranks so that both terms stay on the same [0, req_tokens] scale. Also add prefix-affinity sorting to group same-conversation requests together before routing.

Benchmark (DeepSeek-R1 FP4, B200x8, 256 concurrency, multi-turn):
- B (cache router) vs C (default router): +5.4% throughput, -6.8% TTFT
- Cache hit rate: B=74.0% vs C=72.6%

Signed-off-by: Larry Liao <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
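
To make the normalization concrete, here is a minimal sketch of the scoring rule, written under stated assumptions rather than copied from adp_router.py. It assumes effective_tokens = req_tokens - prefix_match and normalized_load = req_tokens * rank_active_tokens / total_active_tokens; the function names `read_beta`, `score_rank`, and `pick_rank` are hypothetical. Only the formula shape, the default β of 1.0, and the TRTLLM_CACHE_ROUTER_BETA variable name come from the PR text.

```python
import os
from typing import List, Sequence


def read_beta(default: float = 1.0) -> float:
    # The PR states that beta is tunable via TRTLLM_CACHE_ROUTER_BETA (default 1.0).
    return float(os.environ.get("TRTLLM_CACHE_ROUTER_BETA", default))


def score_rank(
    req_tokens: int,
    prefix_match: int,
    rank_active_tokens: int,
    total_active_tokens: int,
    beta: float,
) -> float:
    """Lower is better. Both terms lie in [0, req_tokens] (assumed definitions)."""
    effective_tokens = max(req_tokens - prefix_match, 0)
    if total_active_tokens > 0:
        normalized_load = req_tokens * rank_active_tokens / total_active_tokens
    else:
        normalized_load = 0.0
    return effective_tokens + beta * normalized_load


def pick_rank(
    req_tokens: int,
    prefix_matches: Sequence[int],
    active_tokens: Sequence[int],
    beta: float = 1.0,
) -> int:
    total = sum(active_tokens)
    scores: List[float] = [
        score_rank(req_tokens, m, a, total, beta)
        for m, a in zip(prefix_matches, active_tokens)
    ]
    return min(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Rank 0 caches 800 of 1000 tokens but carries five times the load of rank 1.
    print(pick_rank(1000, [800, 0], [50_000, 10_000]))
```

With these numbers the normalized rule still prefers rank 0 (about 1033 vs. 1167), whereas a raw-load rule would compare 200 + 50000 against 1000 + 10000 and steer the request away from its cached prefix, which is the feedback loop described above.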

…onDpConfig

Move ADP router selection from implicit capability detection in _util.py to an explicit `enable_kv_cache_aware_routing` field in AttentionDpConfig. Add ADPRouter.create() factory method that owns the selection logic. Remove the TRTLLM_FORCE_DEFAULT_ADP_ROUTER env var hack.

Signed-off-by: Lanyu Liao <[email protected]>
Made-with: Cursor

Signed-off-by: Lanyu Liao <[email protected]>

…ensorRT-LLM into kv_cache_aware_router

coderabbitai · 2026-03-18T09:26:37Z

📝 Walkthrough

This pull request introduces KV cache-aware routing for attention data parallelism (ADP). It exposes a countReusableBlocks binding in the C++ layer, adds a factory method that selects the routing strategy, introduces the KVCacheAwareADPRouter class for cache-aware request distribution, and refactors PyExecutor to create the router internally and gather prefix matches before routing. It also adds a configuration field and a unit test suite.

Changes

Cohort / File(s)	Summary
C++ Binding Exposure `cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp`	Increased trampoline arity from 30 to 36, added `countReusableBlocks` public override method, and exposed Python binding `count_reusable_blocks` with optional `only_allocated` parameter and GIL release guard.
PyExecutor Refactoring `tensorrt_llm/_torch/pyexecutor/py_executor.py`, `tensorrt_llm/_torch/pyexecutor/_util.py`	Removed `adp_router` parameter from `PyExecutor.__init__`, replaced with internal `ADPRouter.create()` factory call; added prefix-match gathering step in `_fetch_new_requests` when needed; updated routing call to include `max_num_active_requests` argument; removed `DefaultADPRouter` import.
KV Cache Manager Extension `tensorrt_llm/_torch/pyexecutor/resource_manager.py`	Added `probe_prefix_match_length()` method to compute cached prefix token count via reusable block enumeration, with short-circuiting for disabled block reuse or variable window configurations.
Router Factory & Strategy Selection `tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py`, `tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py`	Added `ADPRouter.create()` classmethod to instantiate either `KVCacheAwareADPRouter` or `DefaultADPRouter` based on config; introduced `KVCacheAwareADPRouter` class with prefix-match gathering, cache-aware rank state creation, and scoring-based request routing; refactored `DefaultADPRouter` to use internal `_balance_requests_across_ranks()` helper; added `needs_prefix_matches` flag to `ADPRouter`; exported new router class.
Configuration `tensorrt_llm/llmapi/llm_args.py`	Added `enable_kv_cache_aware_routing` boolean field to `AttentionDpConfig` with default `False` to control KV cache-aware routing activation.
Test Suite `tests/unittest/_torch/executor/test_kvcache_aware_router.py`	New comprehensive test file covering `KVCacheAwareADPRouter` rank state creation, prefix-match gathering (single/multi-rank and LORA variants), route decision logic (cache preference, load balancing, explicit hints), edge cases, and KV cache probe behavior on block reuse and variable window scenarios.
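
The prefix probe summarized in the table can be pictured with a toy model. The sketch below is not the resource_manager.py implementation and does not call countReusableBlocks. It assumes that reuse happens at whole-block granularity and represents the cache as a set of token-prefix tuples; `toy_prefix_match_length`, `cached_prefixes`, and `tokens_per_block` are names invented for the example.

```python
from typing import Sequence, Set, Tuple


def toy_prefix_match_length(
    input_tokens: Sequence[int],
    cached_prefixes: Set[Tuple[int, ...]],
    tokens_per_block: int,
    enable_block_reuse: bool = True,
) -> int:
    """Return how many leading tokens are covered by cached full blocks.

    Toy model: a block counts as cached only if the whole prefix ending at
    that block is present, mimicking a walk down a radix tree.
    """
    if not enable_block_reuse:
        return 0  # short-circuit, as the summary describes for disabled reuse
    matched = 0
    num_full_blocks = len(input_tokens) // tokens_per_block
    for i in range(num_full_blocks):
        prefix = tuple(input_tokens[: (i + 1) * tokens_per_block])
        if prefix not in cached_prefixes:
            break
        matched += tokens_per_block
    return matched


if __name__ == "__main__":
    cache = {(1, 2), (1, 2, 3, 4)}
    print(toy_prefix_match_length([1, 2, 3, 4, 5, 6, 7], cache, tokens_per_block=2))
```

Each rank would run a probe of this kind locally, and the per-rank results are then exchanged with an allgather so that every rank sees the same match lengths before routing.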

Sequence Diagram(s)

sequenceDiagram
    participant PyExecutor
    participant ADPRouter
    participant KVCacheManager
    participant Distributed
    
    PyExecutor->>ADPRouter: create(dist, kv_cache_manager, config)
    ADPRouter->>ADPRouter: Instantiate KVCacheAwareADPRouter
    ADPRouter-->>PyExecutor: Return configured router
    
    PyExecutor->>PyExecutor: _fetch_new_requests()
    
    alt needs_prefix_matches
        PyExecutor->>ADPRouter: gather_prefix_matches(new_requests)
        ADPRouter->>KVCacheManager: probe_prefix_match_length() per request
        KVCacheManager-->>ADPRouter: cached token counts
        ADPRouter->>Distributed: allgather() prefix matches across ranks
        ADPRouter->>ADPRouter: Store _all_ranks_prefix_matches
    end
    
    PyExecutor->>ADPRouter: route_requests(all_rank_states, new_requests, max_active)
    ADPRouter->>ADPRouter: Score requests by cache affinity + load
    ADPRouter->>ADPRouter: Assign requests to best-ranked destinations
    ADPRouter-->>PyExecutor: Return routed requests dict

sequenceDiagram
    participant Request
    participant KVCacheAwareADPRouter
    participant RankState
    participant ScoringLogic
    
    Request->>KVCacheAwareADPRouter: route_requests(rank_states, requests)
    
    KVCacheAwareADPRouter->>KVCacheAwareADPRouter: Compute expected active count
    
    loop For each Request
        KVCacheAwareADPRouter->>ScoringLogic: Has explicit dp_rank hint?
        alt Explicit dp_rank + capacity available
            ScoringLogic-->>KVCacheAwareADPRouter: Assign to hinted rank
        else No hint
            KVCacheAwareADPRouter->>ScoringLogic: Score each rank (prefix_match, load_balance)
            Note over ScoringLogic: Score = effective_tokens + beta * normalized_load
            ScoringLogic-->>KVCacheAwareADPRouter: Best rank
            KVCacheAwareADPRouter->>RankState: Update active counts & tokens
        end
    end
    
    KVCacheAwareADPRouter-->>Request: Return assignments per rank
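
The routing loop in the second diagram can be sketched as follows. This is a simplified model under stated assumptions, not the merged implementation: `ToyRequest`, `ToyRankState`, and `toy_route_requests` are hypothetical names, a rank is treated as eligible while its active request count is below `max_active`, and a routed request is assumed to add its full token count to the rank's load.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ToyRequest:
    req_id: int
    num_tokens: int
    prefix_matches: List[int]  # cached prefix tokens on each rank
    dp_rank_hint: Optional[int] = None


@dataclass
class ToyRankState:
    active_requests: int = 0
    active_tokens: int = 0


def toy_route_requests(
    rank_states: List[ToyRankState],
    requests: List[ToyRequest],
    max_active: int,
    beta: float = 1.0,
) -> Dict[int, List[int]]:
    assignments: Dict[int, List[int]] = {r: [] for r in range(len(rank_states))}
    for req in requests:
        eligible = [
            r for r, s in enumerate(rank_states) if s.active_requests < max_active
        ]
        if not eligible:
            break  # no capacity left; the real router's behavior here is not modeled
        if req.dp_rank_hint is not None and req.dp_rank_hint in eligible:
            target = req.dp_rank_hint  # explicit hint wins when capacity allows
        else:
            total = sum(rank_states[r].active_tokens for r in eligible)

            def score(r: int) -> float:
                effective = req.num_tokens - req.prefix_matches[r]
                load = (
                    req.num_tokens * rank_states[r].active_tokens / total
                    if total
                    else 0.0
                )
                return effective + beta * load

            target = min(eligible, key=score)
        assignments[target].append(req.req_id)
        # Update counts and token load on every path so later scores see it.
        rank_states[target].active_requests += 1
        rank_states[target].active_tokens += req.num_tokens
    return assignments


if __name__ == "__main__":
    states = [ToyRankState(), ToyRankState()]
    reqs = [
        ToyRequest(0, 100, [80, 0]),
        ToyRequest(1, 100, [0, 0]),
        ToyRequest(2, 100, [0, 0], dp_rank_hint=0),
    ]
    print(toy_route_requests(states, reqs, max_active=2))
```

Unlike the diagram, this sketch updates the token load for hinted requests as well, so a rank that receives explicit placements is not later scored as lightly loaded.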

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 45.65% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: introducing a KV cache-aware ADP router for prefix-affinity request routing, which directly reflects the changeset's primary purpose.
Description check	✅ Passed	The PR description follows the template structure with clear sections: Summary, Motivation, Changes, Benchmark, Usage, and Limitations. All required elements are present and well-articulated.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
1-16: ⚠️ Potential issue | 🟡 Minor

Update the NVIDIA header year to 2026.

This file is modified in this PR, but the header still ends at 2025.

As per coding guidelines, "Add NVIDIA copyright header on ALL new files, and update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp` around lines 1 -
16, Update the NVIDIA copyright header at the top of kvCacheManager.cpp by
changing the year range from "2022-2025" to "2022-2026"; modify the SPDX header
block (the comment spanning the top of the file including the
"SPDX-FileCopyrightText" line) to reflect 2026 so the file header matches the
repository guideline for modified files.
tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Update copyright year to include 2026.

The file is being modified in 2026, but the copyright header only covers 2022-2025. As per coding guidelines, the year should reflect the latest meaningful modification.
Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py` at line 1, Update the
SPDX copyright header at the top of the module (the SPDX line / file header in
tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py) to include 2026 (e.g.,
change "2022-2025" to "2022-2026") so the copyright years reflect the current
modification year.

🧹 Nitpick comments (4)

tensorrt_llm/llmapi/llm_args.py (1)

538-543: Consider clarifying the dependency on enable_block_reuse in the description.

The description could mention that this feature requires kv_cache_config.enable_block_reuse=True to be effective. Based on ADPRouter.create in adp_router.py (lines 85-93), when block reuse is disabled, the router silently falls back to DefaultADPRouter. Users might enable this flag and expect cache-aware routing without realizing it's not active.

Also, there's a trailing space at the end of the description string.
Proposed improvement
     enable_kv_cache_aware_routing: bool = Field(
         default=False,
         description=
         "Use KV cache-aware routing for attention DP request distribution. "
         "When enabled, routes requests to ranks that already have matching "
-        "prefix KV cache, reducing redundant prefill computation. ")
+        "prefix KV cache, reducing redundant prefill computation. "
+        "Requires kv_cache_config.enable_block_reuse=True to take effect.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/llmapi/llm_args.py` around lines 538 - 543, Update the Field
description for enable_kv_cache_aware_routing to explicitly state it only takes
effect when kv_cache_config.enable_block_reuse is True (otherwise
ADPRouter.create falls back to DefaultADPRouter), and remove the trailing space
at the end of the description string; reference enable_kv_cache_aware_routing,
kv_cache_config.enable_block_reuse, ADPRouter.create and DefaultADPRouter so
users understand the dependency and silent fallback.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

577-580: Keep the binding imports namespaced here.

Line 577 shadows the module-level SamplingConfig imported at Line 50. Using module aliases also makes it clearer that these are binding types rather than the Python request wrappers.

♻️ Suggested rewrite

-        from tensorrt_llm.bindings import SamplingConfig
-        from tensorrt_llm.bindings.internal.batch_manager import BlockKey
-        from tensorrt_llm.bindings.internal.batch_manager import \
-            LlmRequest as CppLlmRequest
-        block_key = BlockKey(tokens=input_tokens, lora_task_id=lora_task_id)
+        import tensorrt_llm.bindings as bindings
+        import tensorrt_llm.bindings.internal.batch_manager as batch_manager
+
+        block_key = batch_manager.BlockKey(
+            tokens=input_tokens, lora_task_id=lora_task_id
+        )
         unique_tokens = block_key.unique_tokens
-        dummy_req = CppLlmRequest(request_id=0,
-                                  max_new_tokens=0,
-                                  input_tokens=input_tokens,
-                                  sampling_config=SamplingConfig(),
-                                  is_streaming=False,
-                                  lora_task_id=lora_task_id)
+        dummy_req = batch_manager.LlmRequest(
+            request_id=0,
+            max_new_tokens=0,
+            input_tokens=input_tokens,
+            sampling_config=bindings.SamplingConfig(),
+            is_streaming=False,
+            lora_task_id=lora_task_id,
+        )

As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions" and "Avoid shadowing variables declared in an outer scope in Python".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 577 - 580,
The local imports in resource_manager.py shadow the module-level SamplingConfig
and break the "import module, not names" guideline; replace the three "from
tensorrt_llm.bindings ..." style imports with a namespaced import (e.g., import
tensorrt_llm.bindings as bindings or import
tensorrt_llm.bindings.internal.batch_manager as batch_manager) and update usages
to bindings.SamplingConfig, batch_manager.BlockKey and batch_manager.LlmRequest
(aliasing LlmRequest to CppLlmRequest if you need that name) so the binding
types remain clearly namespaced and do not shadow the existing SamplingConfig.

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

376-380: Log which router the factory selected.

Now that router selection moved inside PyExecutor, a one-time info log of type(self.adp_router).__name__ would make it obvious whether KV-cache-aware routing was actually enabled or whether the factory fell back to the default router because a prerequisite was missing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 376 - 380, After
ADPRouter.create is called in PyExecutor (the block that assigns self.adp_router
via ADPRouter.create with dist, kv_cache_manager, and attention_dp_config), add
a one-time info log that records the concrete router class selected by the
factory: log type(self.adp_router).__name__ (or the equivalent) using the
existing logger so it's clear whether the KV-cache-aware router was used or a
fallback; ensure the log runs immediately after the self.adp_router assignment
in the PyExecutor initialization flow.
70-70: Keep the router import namespaced.

Please import the module and reference adp_router.ADPRouter instead of importing ADPRouter directly. This file already pulls in a large symbol surface, and the repo rule is to keep Python imports explicit.

As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions (e.g., use from package.subpackage import foo then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` at line 70, The import of
ADPRouter should be namespaced: replace the direct class import (ADPRouter) with
a module import (adp_router) and update all usages in this file to reference
adp_router.ADPRouter; specifically change the import line that currently brings
in ADPRouter and then update any instantiation or type references of ADPRouter
to use adp_router.ADPRouter so the module namespace is preserved and the symbol
surface remains explicit.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2526-2527: The PyExecutor currently calls
self.adp_router.gather_prefix_matches(new_requests) whenever
needs_prefix_matches is true, which triggers a tp_allgather even when
new_requests is empty; change the call site so the probe only runs when there
are actual requests (guard with if new_requests) and remove subclass-specific
setup from PyExecutor by adding an ADPRouter hook (e.g., a method on ADPRouter
like maybe_gather_prefix_matches(new_requests) or move the needs_prefix_matches
check into ADPRouter) so PyExecutor simply calls a single router method and the
router itself decides whether to call gather_prefix_matches/local tp_allgather
based on new_requests and needs_prefix_matches.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py`:
- Around line 88-96: The factory currently returns KVCacheAwareADPRouter when
kv_cache_manager.enable_block_reuse is true but some managers (e.g.,
KVCacheManagerV2) lack the probe_prefix_match_length() API; update the selection
logic in the block that constructs KVCacheAwareADPRouter to also verify that
kv_cache_manager implements the required probing API (e.g.,
hasattr(kv_cache_manager, "probe_prefix_match_length") or an isinstance check
against the KVCacheManager type) before returning KVCacheAwareADPRouter, and
otherwise fall back to DefaultADPRouter so gather_prefix_matches() won't raise
an AttributeError.
- Around line 429-437: When you assign a request to a target data-parallel rank
in the block that checks scheduling_params.attention_dp_rank, also record
explicit per-request placement in the token-load tracker so future score
calculations see this placement; update the same branch that increments
all_ranks_num_active_requests and appends to all_ranks_new_requests to also
register the placement for req_item with the token-load tracker (the component
that tracks token load), using the scheduling_params.attention_dp_rank as the
placement rank so later scoring no longer treats that rank as lightly loaded.
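
The factory fallback suggested in the adp_router.py comment above could be sketched like this. The classes are stand-ins, not the real ADPRouter hierarchy: `create_router`, `ToyDefaultRouter`, `ToyCacheAwareRouter`, and the two manager classes are hypothetical, and the signature of `probe_prefix_match_length` shown here is assumed for illustration.

```python
import logging

logger = logging.getLogger(__name__)


class ToyDefaultRouter:
    needs_prefix_matches = False


class ToyCacheAwareRouter:
    needs_prefix_matches = True

    def __init__(self, kv_cache_manager):
        self.kv_cache_manager = kv_cache_manager


def create_router(enable_kv_cache_aware_routing: bool, kv_cache_manager):
    """Pick the cache-aware router only when every prerequisite holds."""
    reuse_on = bool(getattr(kv_cache_manager, "enable_block_reuse", False))
    can_probe = callable(
        getattr(kv_cache_manager, "probe_prefix_match_length", None)
    )
    if enable_kv_cache_aware_routing and reuse_on and can_probe:
        router = ToyCacheAwareRouter(kv_cache_manager)
    else:
        router = ToyDefaultRouter()
    # One-time log so a silent fallback is visible to the user.
    logger.info("ADP router selected: %s", type(router).__name__)
    return router


class ManagerWithProbe:
    enable_block_reuse = True

    def probe_prefix_match_length(self, input_tokens, lora_task_id=None):
        return 0  # placeholder body for the example


class ManagerWithoutProbe:
    enable_block_reuse = True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    create_router(True, ManagerWithProbe())
    create_router(True, ManagerWithoutProbe())
```

Checking for the probe method in the factory keeps the failure mode a logged fallback instead of an AttributeError raised later during prefix gathering.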

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 50b1a799-f697-417e-8ede-dad0dc66c02d

📥 Commits

Reviewing files that changed from the base of the PR and between e71a200 and 8d0852a.

📒 Files selected for processing (8)

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
tensorrt_llm/_torch/pyexecutor/_util.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py
tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
tensorrt_llm/llmapi/llm_args.py
tests/unittest/_torch/executor/test_kvcache_aware_router.py

Signed-off-by: Lanyu Liao <[email protected]>

lancelly · 2026-03-18T10:57:46Z

/bot run --disable-fail-fast

lancelly · 2026-03-18T11:08:45Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-18T11:15:37Z

PR_Github #39447 [ run ] triggered by Bot. Commit: d42ab67 Link to invocation

Signed-off-by: Lanyu Liao <[email protected]>

QiJune

LGTM

Signed-off-by: Lanyu Liao <[email protected]>

lancelly · 2026-03-18T13:47:26Z

/bot run --disable-fail-fast

lancelly · 2026-03-25T14:36:54Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-25T14:43:50Z

PR_Github #40343 [ run ] triggered by Bot. Commit: ec5bdf0 Link to invocation

tensorrt-cicd · 2026-03-25T19:35:55Z

PR_Github #40343 [ run ] completed with state FAILURE. Commit: ec5bdf0
/LLM/main/L0_MergeRequest_PR pipeline #31448 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lancelly · 2026-03-26T02:04:31Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-26T02:10:21Z

PR_Github #40403 [ run ] triggered by Bot. Commit: ec5bdf0 Link to invocation

tensorrt-cicd · 2026-03-26T05:31:59Z

PR_Github #40403 [ run ] completed with state FAILURE. Commit: ec5bdf0
/LLM/main/L0_MergeRequest_PR pipeline #31498 completed with status: 'FAILURE'

lancelly · 2026-03-26T05:38:57Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-26T05:48:40Z

PR_Github #40424 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

tensorrt-cicd · 2026-03-26T14:13:23Z

PR_Github #40424 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31517 completed with status: 'FAILURE'

lancelly · 2026-03-26T14:17:39Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-26T14:23:21Z

PR_Github #40457 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

tensorrt-cicd · 2026-03-26T18:21:46Z

PR_Github #40457 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31546 completed with status: 'FAILURE'

lancelly · 2026-03-27T02:19:24Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-27T02:25:22Z

PR_Github #40474 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

tensorrt-cicd · 2026-03-27T05:20:22Z

PR_Github #40474 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31564 completed with status: 'FAILURE'

lancelly · 2026-03-27T05:30:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-27T05:36:51Z

PR_Github #40486 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

tensorrt-cicd · 2026-03-27T08:03:33Z

PR_Github #40486 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31574 completed with status: 'FAILURE'

lancelly · 2026-03-27T08:08:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-27T08:14:25Z

PR_Github #40493 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

tensorrt-cicd · 2026-03-27T09:58:24Z

PR_Github #40493 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31581 completed with status: 'FAILURE'

lancelly · 2026-03-27T10:26:23Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-03-27T10:31:59Z

PR_Github #40495 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

tensorrt-cicd · 2026-03-27T12:16:48Z

PR_Github #40495 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31584 completed with status: 'FAILURE'

lancelly · 2026-03-28T02:51:28Z

/bot skip --comment "Flaky multi-GPU nemotron test"

tensorrt-cicd · 2026-03-28T02:58:35Z

PR_Github #40525 [ skip ] triggered by Bot. Commit: 1369963 Link to invocation

tensorrt-cicd · 2026-03-28T03:07:56Z

PR_Github #40525 [ skip ] completed with state SUCCESS. Commit: 1369963
Skipping testing for commit 1369963

Link to invocation

lancelly and others added 6 commits March 13, 2026 09:55

Merge branch 'main' into kvcache_aware_router_v1

46d3505

Merge branch 'NVIDIA:main' into kv_cache_aware_router

c684a71

Merge branch 'NVIDIA:main' into kv_cache_aware_router

e50d957

github-actions Bot assigned lancelly Mar 18, 2026

lancelly added 2 commits March 18, 2026 02:02

remove stale logs and fix pre commit

ffbf3a4

Signed-off-by: Lanyu Liao <[email protected]>

Merge branch 'kv_cache_aware_router' of https://github.com/lancelly/T…

8d0852a

…ensorRT-LLM into kv_cache_aware_router

lancelly marked this pull request as ready for review March 18, 2026 09:06

lancelly requested review from a team as code owners March 18, 2026 09:06

lancelly requested a review from syuoni March 18, 2026 09:06

lancelly changed the title ~~[None][feat] Introduce a kv cache awareness router for better kv cache hit rate~~ [None][feat] KV cache-aware ADP router for prefix-affinity request routing Mar 18, 2026

lancelly requested a review from SimengLiu-nv March 18, 2026 09:08

coderabbitai Bot reviewed Mar 18, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py

Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py

Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py

add ITs and make load_weight configurable

87d9e90

Signed-off-by: Lanyu Liao <[email protected]>

lancelly requested a review from a team as a code owner March 18, 2026 09:32

fix pre-commit

d42ab67

Signed-off-by: Lanyu Liao <[email protected]>

QiJune reviewed Mar 18, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py

Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py

QiJune reviewed Mar 18, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py

Comment thread tensorrt_llm/llmapi/llm_args.py Outdated

fix comments

76c73a3

Signed-off-by: Lanyu Liao <[email protected]>

QiJune approved these changes Mar 18, 2026

fix pre-commit

db31a6d

Signed-off-by: Lanyu Liao <[email protected]>

Merge branch 'main' into kv_cache_aware_router

1369963

LarryXFly approved these changes Mar 27, 2026

lancelly merged commit 3318aca into NVIDIA:main Mar 28, 2026
5 checks passed
