
[None][feat] KV cache-aware ADP router for prefix-affinity request routing#12315

Merged
lancelly merged 15 commits into
NVIDIA:mainfrom
lancelly:kv_cache_aware_router
Mar 28, 2026

Conversation

@lancelly
Collaborator

@lancelly lancelly commented Mar 18, 2026

Summary

Add a KV cache-aware request router (KVCacheAwareADPRouter) for attention data
parallelism (ADP). When enabled, new requests are routed to the DP rank that
already holds the longest matching prefix in its radix tree, reducing redundant
prefill computation in multi-turn conversation workloads.

Motivation

With the default load-balanced ADP router, requests from the same conversation
may land on different DP ranks across turns, causing each rank to recompute the
full prefix. By probing each rank's KV cache radix tree before routing, we can
steer requests to ranks that already cache their prefix, significantly improving
KV cache hit rate and throughput.

Changes

  • adp_router.py: Add KVCacheAwareADPRouter with:
    • Per-request prefix probing via probe_prefix_match_length + allgather
    • Scoring formula: score = effective_tokens + β * normalized_load (lower = better)
    • Prefix-affinity sorting to group same-conversation requests before routing
    • ADPRouter.create() factory that selects the router based on config
  • llm_args.py: Add enable_kv_cache_aware_routing field to AttentionDpConfig
  • py_executor.py: Use ADPRouter.create() factory; call gather_prefix_matches
    before route_requests when the router needs prefix data
  • scheduler/__init__.py: Export KVCacheAwareADPRouter
  • Tests: 20 unit tests covering scoring, load balancing, prefix affinity,
    edge cases, and factory method selection
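The prefix-affinity sorting step can be sketched as a stable reorder. Requests that share an opening block of prompt tokens are placed next to each other, so they reach the router back to back. This is a hypothetical illustration, not the PR's implementation. The grouping key (the first `block_size` tokens), the `Req` dataclass and `prefix_affinity_sort` are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Req:
    """Hypothetical request stand-in; only the prompt tokens matter here."""
    request_id: int
    input_tokens: List[int] = field(default_factory=list)


def prefix_affinity_sort(requests: Sequence[Req], block_size: int = 32) -> List[Req]:
    """Group requests that share their first `block_size` prompt tokens.

    Groups are ordered by the arrival of their first member. Python's sort is
    stable, so arrival order is also preserved inside each group.
    """
    first_seen: Dict[Tuple[int, ...], int] = {}
    for index, req in enumerate(requests):
        first_seen.setdefault(tuple(req.input_tokens[:block_size]), index)
    return sorted(requests, key=lambda r: first_seen[tuple(r.input_tokens[:block_size])])


# Requests 0 and 2 share an opening block, so they end up adjacent.
reqs = [Req(0, [1, 2, 3, 4]), Req(1, [9, 9, 9, 9]), Req(2, [1, 2, 3, 4, 5])]
print([r.request_id for r in prefix_affinity_sort(reqs, block_size=4)])  # [0, 2, 1]
```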

Benchmark

DeepSeek-V3.2 FP4, 8×B200, EP8+DP8, 256 concurrency, multi-turn scenario:

| Metric | Default Router | KV Cache-Aware Router | Delta |
|---|---|---|---|
| Weighted cache hit rate | 82.1% | 93.1% | +11pp |
| TTFT (mean) | 1.13s | 0.70s | -38% |
| ITL (mean) | 51.3ms | 32.6ms | -37% |

Usage

```yaml
attention_dp_config:
  enable_kv_cache_aware_routing: true
```

Limitations / Future Work

  • The current scoring formula (effective_tokens + β * normalized_load) is a
    simple baseline. The load normalization and the weight β (default 1.0, tunable
    via TRTLLM_CACHE_ROUTER_BETA) were chosen empirically and may not be optimal
    for all workload patterns (e.g., skewed conversation lengths, bursty arrivals).
    More sophisticated approaches — such as adaptive β based on system utilization,
    or incorporating predicted generation length — are left for future iterations.
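The router picks the rank with the lowest `effective_tokens + β * normalized_load`. So β controls how much extra load a cached rank can carry before a less-loaded rank wins.

Setting `eff_a + β·load_a = eff_b + β·load_b` and solving gives the break-even value `β* = (eff_b - eff_a) / (load_a - load_b)`.

The sketch below computes that value and reads β from `TRTLLM_CACHE_ROUTER_BETA`, using the documented default of 1.0. It is a minimal sketch under stated assumptions: the helper names are hypothetical, and so is falling back to the default on an unparsable value.

```python
import os


def router_beta(default: float = 1.0) -> float:
    """Read beta from TRTLLM_CACHE_ROUTER_BETA.

    Assumption: an unset or unparsable value falls back to `default`.
    """
    raw = os.environ.get("TRTLLM_CACHE_ROUTER_BETA")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default


def break_even_beta(eff_cached: float, load_cached: float,
                    eff_other: float, load_other: float) -> float:
    """Beta above which a less-loaded rank outscores the cached rank.

    Scores are effective_tokens + beta * normalized_load, lower wins. The
    result is meaningful when the cached rank saves prefill work
    (eff_cached < eff_other) but carries more load (load_cached > load_other).
    """
    if load_cached <= load_other:
        # Given eff_cached < eff_other, the cached rank wins for every beta >= 0.
        return float("inf")
    return (eff_other - eff_cached) / (load_cached - load_other)


# Cached rank: 1,000 tokens left to prefill, normalized load 2,500.
# Other rank: 4,000 tokens to prefill, normalized load 1,500.
print(break_even_beta(1000, 2500, 4000, 1500))  # 3.0
```

With the default β of 1.0 the cached rank keeps the request in this example. It would only move once β exceeds 3.0.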

lancelly and others added 6 commits March 13, 2026 09:55
Expose countReusableBlocks via nanobind and implement cache-aware
ADP routing for the C++ (v1) KV cache manager path. Requests are
routed to the DP rank with the most prefix cache hits, reducing
redundant prefill computation.

Changes:
- C++ nanobind: expose countReusableBlocks on BaseKVCacheManager
- Python: add KVCacheAwareADPRouter to adp_router.py
- Python: add probe_prefix_match_length to v1 KVCacheManager
- Python: wire up router creation in _util.py and py_executor.py
- Tests: 20 unit tests covering router logic and v1 probe guards

Signed-off-by: Lance Liao <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
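The probe counts how many leading full KV cache blocks of a prompt are already cached on the local rank. This is a minimal sketch with these assumptions:

- The cache is modeled as a set of block keys.
- Each key covers the entire prefix up to the end of its block, which is the property a radix tree provides.
- The function mirrors the short-circuits the walkthrough describes for disabled block reuse and variable attention windows.

`block_keys` and `cached_prefix_tokens` are hypothetical names. The real v1 path goes through `countReusableBlocks` in C++.

```python
from typing import List, Set, Tuple

BlockKey = Tuple[int, ...]


def block_keys(tokens: List[int], tokens_per_block: int) -> List[BlockKey]:
    """Key each full block by the whole prefix ending at that block.

    A block therefore matches only if every earlier token matches too. A real
    manager would typically hash this chain rather than store raw prefixes.
    """
    num_full_blocks = len(tokens) // tokens_per_block
    return [tuple(tokens[: (i + 1) * tokens_per_block]) for i in range(num_full_blocks)]


def cached_prefix_tokens(tokens: List[int], cached: Set[BlockKey], tokens_per_block: int,
                         enable_block_reuse: bool = True,
                         variable_window: bool = False) -> int:
    """Number of prompt tokens covered by leading cached full blocks."""
    if not enable_block_reuse or variable_window:
        return 0  # short-circuit: report no reusable prefix
    matched_blocks = 0
    for key in block_keys(tokens, tokens_per_block):
        if key not in cached:
            break  # the first miss ends the shared prefix
        matched_blocks += 1
    return matched_blocks * tokens_per_block


# The rank has cached the first two 4-token blocks of tokens 1..8.
cache = set(block_keys(list(range(1, 9)), tokens_per_block=4))
print(cached_prefix_tokens(list(range(1, 11)), cache, 4))        # 8
print(cached_prefix_tokens([1, 2, 3, 4, 0, 0, 0, 0], cache, 4))  # 4
```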
The original scoring formula used raw active_tokens as the load term.
At tens of thousands of tokens, this term overwhelmed the cache-benefit
term (hundreds to thousands of tokens), causing a negative feedback loop
in which the router avoided cached ranks because of their higher load.

Fix: normalize the load term by total active tokens across eligible
ranks so both terms remain on the same [0, req_tokens] scale. Also add
prefix-affinity sorting to group same-conversation requests together
before routing.

Benchmark (DeepSeek-R1 FP4, B200x8, 256 concurrency, multi-turn):
- B (cache router) vs C (default router): +5.4% throughput, -6.8% TTFT
- Cache hit rate: B=74.0% vs C=72.6%

Signed-off-by: Larry Liao <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>
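Here is a small worked example of the negative feedback loop and its fix. The sketch assumes the normalized load is the rank's share of active tokens, scaled by the request length. That is one way to keep both terms on the `[0, req_tokens]` scale described above. `pick_rank` is a hypothetical helper.

Take a 4,000-token request with a 3,000-token prefix hit on a busier rank:

| Load term | Cached rank | Cold rank | Winner |
|---|---|---|---|
| Raw | 1,000 + 40,000 | 4,000 + 30,000 | Cold rank |
| Normalized | about 3,286 | about 5,714 | Cached rank |

```python
from typing import List


def pick_rank(req_tokens: int, prefix_matches: List[int], active_tokens: List[int],
              beta: float = 1.0, normalize: bool = True) -> int:
    """Return the rank with the lowest effective_tokens + beta * load score."""
    total_active = sum(active_tokens)
    best_rank, best_score = -1, float("inf")
    for rank, (match, active) in enumerate(zip(prefix_matches, active_tokens)):
        effective = req_tokens - match  # tokens this rank would still prefill
        if normalize:
            # Assumed normalization: the rank's share of load, scaled to [0, req_tokens].
            load = active / total_active * req_tokens if total_active else 0.0
        else:
            load = float(active)  # raw active tokens, as before the fix
        score = effective + beta * load
        if score < best_score:
            best_rank, best_score = rank, score
    return best_rank


# Rank 0 caches 3,000 of the request's 4,000 tokens but is busier.
args = (4000, [3000, 0], [40000, 30000])
print(pick_rank(*args, normalize=False))  # 1: raw load pushes the request off the cached rank
print(pick_rank(*args, normalize=True))   # 0: normalized load keeps it on the cached rank
```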
…onDpConfig

Move ADP router selection from implicit capability detection in _util.py
to an explicit `enable_kv_cache_aware_routing` field in AttentionDpConfig.
Add ADPRouter.create() factory method that owns the selection logic.
Remove the TRTLLM_FORCE_DEFAULT_ADP_ROUTER env var hack.

Signed-off-by: Lanyu Liao <[email protected]>
Made-with: Cursor
@lancelly lancelly marked this pull request as ready for review March 18, 2026 09:06
@lancelly lancelly requested review from a team as code owners March 18, 2026 09:06
@lancelly lancelly requested a review from syuoni March 18, 2026 09:06
@lancelly lancelly changed the title [None][feat] Introduce a kv cache awareness router for better kv cache hit rate [None][feat] KV cache-aware ADP router for prefix-affinity request routing Mar 18, 2026
@lancelly lancelly requested a review from SimengLiu-nv March 18, 2026 09:08
@coderabbitai
Contributor

coderabbitai Bot commented Mar 18, 2026

📝 Walkthrough

Walkthrough

This pull request introduces KV cache-aware routing for distributed attention processing. It exposes a countReusableBlocks binding in the C++ layer, adds a factory method to select routing strategies, introduces the KVCacheAwareADPRouter class for cache-aware request distribution, and refactors PyExecutor to use internal router creation with prefix-match gathering. Supporting configuration and comprehensive tests are also added.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| C++ Binding Exposure | cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp | Increased trampoline arity from 30 to 36, added countReusableBlocks public override method, and exposed Python binding count_reusable_blocks with optional only_allocated parameter and GIL release guard. |
| PyExecutor Refactoring | tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/_util.py | Removed adp_router parameter from PyExecutor.__init__, replaced with internal ADPRouter.create() factory call; added prefix-match gathering step in _fetch_new_requests when needed; updated routing call to include max_num_active_requests argument; removed DefaultADPRouter import. |
| KV Cache Manager Extension | tensorrt_llm/_torch/pyexecutor/resource_manager.py | Added probe_prefix_match_length() method to compute cached prefix token count via reusable block enumeration, with short-circuiting for disabled block reuse or variable window configurations. |
| Router Factory & Strategy Selection | tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py, tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py | Added ADPRouter.create() classmethod to instantiate either KVCacheAwareADPRouter or DefaultADPRouter based on config; introduced KVCacheAwareADPRouter class with prefix-match gathering, cache-aware rank state creation, and scoring-based request routing; refactored DefaultADPRouter to use internal _balance_requests_across_ranks() helper; added needs_prefix_matches flag to ADPRouter; exported new router class. |
| Configuration | tensorrt_llm/llmapi/llm_args.py | Added enable_kv_cache_aware_routing boolean field to AttentionDpConfig with default False to control KV cache-aware routing activation. |
| Test Suite | tests/unittest/_torch/executor/test_kvcache_aware_router.py | New comprehensive test file covering KVCacheAwareADPRouter rank state creation, prefix-match gathering (single/multi-rank and LoRA variants), route decision logic (cache preference, load balancing, explicit hints), edge cases, and KV cache probe behavior on block reuse and variable window scenarios. |

Sequence Diagram(s)

sequenceDiagram
    participant PyExecutor
    participant ADPRouter
    participant KVCacheManager
    participant Distributed
    
    PyExecutor->>ADPRouter: create(dist, kv_cache_manager, config)
    ADPRouter->>ADPRouter: Instantiate KVCacheAwareADPRouter
    ADPRouter-->>PyExecutor: Return configured router
    
    PyExecutor->>PyExecutor: _fetch_new_requests()
    
    alt needs_prefix_matches
        PyExecutor->>ADPRouter: gather_prefix_matches(new_requests)
        ADPRouter->>KVCacheManager: probe_prefix_match_length() per request
        KVCacheManager-->>ADPRouter: cached token counts
        ADPRouter->>Distributed: allgather() prefix matches across ranks
        ADPRouter->>ADPRouter: Store _all_ranks_prefix_matches
    end
    
    PyExecutor->>ADPRouter: route_requests(all_rank_states, new_requests, max_active)
    ADPRouter->>ADPRouter: Score requests by cache affinity + load
    ADPRouter->>ADPRouter: Assign requests to best-ranked destinations
    ADPRouter-->>PyExecutor: Return routed requests dict
Loading
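The `gather_prefix_matches` step in the diagram above pairs a local probe with an allgather. Every rank ends up with the same `[rank][request]` table of cached-prefix lengths, so all ranks can make identical routing decisions.

The sketch simulates this in one process. Threads stand in for ranks, and a `threading.Barrier` stands in for the real `tp_allgather`. `FakeAllgather`, `longest_cached_prefix` and `run_ranks` are hypothetical names.

```python
import threading
from typing import List, Sequence, Set, Tuple

Prefix = Tuple[int, ...]


class FakeAllgather:
    """In-process stand-in for an allgather across `world_size` ranks."""

    def __init__(self, world_size: int):
        self._slots: List[object] = [None] * world_size
        self._barrier = threading.Barrier(world_size)

    def allgather(self, rank: int, value: object) -> List[object]:
        self._slots[rank] = value
        self._barrier.wait()        # every rank has contributed
        result = list(self._slots)  # identical snapshot on every rank
        self._barrier.wait()        # nobody reuses the slots before all have read
        return result


def longest_cached_prefix(cache: Set[Prefix], tokens: Sequence[int]) -> int:
    """Toy local probe: length of the longest cached prefix of `tokens`."""
    return max((len(p) for p in cache if tuple(tokens[:len(p)]) == p), default=0)


def run_ranks(rank_caches: List[Set[Prefix]], new_requests: List[List[int]]) -> List[List[object]]:
    comm = FakeAllgather(len(rank_caches))
    results: List[List[object]] = [[] for _ in rank_caches]

    def worker(rank: int) -> None:
        local = [longest_cached_prefix(rank_caches[rank], t) for t in new_requests]
        results[rank] = comm.allgather(rank, local)  # table[rank][request]

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(len(rank_caches))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


tables = run_ranks([{(1, 2, 3)}, {(7, 8)}], [[1, 2, 3, 4], [7, 8, 9]])
print(tables[0])  # [[3, 0], [0, 2]], the same table on every rank
```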
```mermaid
sequenceDiagram
    participant Request
    participant KVCacheAwareADPRouter
    participant RankState
    participant ScoringLogic

    Request->>KVCacheAwareADPRouter: route_requests(rank_states, requests)

    KVCacheAwareADPRouter->>KVCacheAwareADPRouter: Compute expected active count

    loop For each Request
        KVCacheAwareADPRouter->>ScoringLogic: Has explicit dp_rank hint?
        alt Explicit dp_rank + capacity available
            ScoringLogic-->>KVCacheAwareADPRouter: Assign to hinted rank
        else No hint
            KVCacheAwareADPRouter->>ScoringLogic: Score each rank (prefix_match, load_balance)
            Note over ScoringLogic: Score = effective_tokens + beta * normalized_load
            ScoringLogic-->>KVCacheAwareADPRouter: Best rank
            KVCacheAwareADPRouter->>RankState: Update active counts & tokens
        end
    end

    KVCacheAwareADPRouter-->>Request: Return assignments per rank
```
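The per-request loop in the diagram above can be sketched as below. It makes three assumptions beyond what the diagram states:

- A hinted request whose rank is full stays pending.
- Load is tracked as each rank's total assigned tokens.
- The hinted path updates that load too, as the later review comment about explicit placements requests.

`RankState`, `NewRequest` and `route` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RankState:
    num_active_requests: int
    active_tokens: int


@dataclass
class NewRequest:
    num_tokens: int
    dp_rank_hint: Optional[int] = None  # explicit attention DP rank, if any


def route(requests: List[NewRequest], prefix_matches: List[List[int]],
          ranks: List[RankState], max_active: int,
          beta: float = 1.0) -> Tuple[Dict[int, List[int]], List[int]]:
    """prefix_matches[rank][i] is the cached prefix length of request i on that rank."""
    assignments: Dict[int, List[int]] = {r: [] for r in range(len(ranks))}
    pending: List[int] = []
    for i, req in enumerate(requests):
        target: Optional[int] = None
        if req.dp_rank_hint is not None:
            # Explicit placement: honored only while the hinted rank has capacity.
            if ranks[req.dp_rank_hint].num_active_requests < max_active:
                target = req.dp_rank_hint
        else:
            eligible = [r for r, s in enumerate(ranks) if s.num_active_requests < max_active]
            total = sum(ranks[r].active_tokens for r in eligible)
            scores = {
                r: (req.num_tokens - prefix_matches[r][i])
                + beta * (ranks[r].active_tokens / total * req.num_tokens if total else 0.0)
                for r in eligible
            }
            target = min(scores, key=scores.get) if scores else None  # ties go to the lowest rank
        if target is None:
            pending.append(i)  # no capacity: leave for a later iteration
            continue
        assignments[target].append(i)
        # Update load on both paths so later requests see this placement.
        ranks[target].num_active_requests += 1
        ranks[target].active_tokens += req.num_tokens
    return assignments, pending


ranks = [RankState(1, 3000), RankState(1, 1000)]
reqs = [NewRequest(4000), NewRequest(500, dp_rank_hint=1)]
print(route(reqs, [[3000, 0], [0, 0]], ranks, max_active=4))  # ({0: [0], 1: [1]}, [])
```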

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.65%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introducing a KV cache-aware ADP router for prefix-affinity request routing, which directly reflects the changeset's primary purpose. |
| Description check | ✅ Passed | The PR description follows the template structure with clear sections: Summary, Motivation, Changes, Benchmark, Usage, and Limitations. All required elements are present and well-articulated. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

1-16: ⚠️ Potential issue | 🟡 Minor

Update the NVIDIA header year to 2026.

This file is modified in this PR, but the header still ends at 2025.

As per coding guidelines, "Add NVIDIA copyright header on ALL new files, and update year on modified files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp` around lines 1 -
16, Update the NVIDIA copyright header at the top of kvCacheManager.cpp by
changing the year range from "2022-2025" to "2022-2026"; modify the SPDX header
block (the comment spanning the top of the file including the
"SPDX-FileCopyrightText" line) to reflect 2026 so the file header matches the
repository guideline for modified files.
tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update copyright year to include 2026.

The file is being modified in 2026, but the copyright header only covers 2022-2025. As per coding guidelines, the year should reflect the latest meaningful modification.

Proposed fix
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py` at line 1, Update the
SPDX copyright header at the top of the module (the SPDX line / file header in
tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py) to include 2026 (e.g.,
change "2022-2025" to "2022-2026") so the copyright years reflect the current
modification year.
🧹 Nitpick comments (4)
tensorrt_llm/llmapi/llm_args.py (1)

538-543: Consider clarifying the dependency on enable_block_reuse in the description.

The description could mention that this feature requires kv_cache_config.enable_block_reuse=True to be effective. Based on ADPRouter.create in adp_router.py (lines 85-93), when block reuse is disabled, the router silently falls back to DefaultADPRouter. Users might enable this flag and expect cache-aware routing without realizing it's not active.

Also, there's a trailing space at the end of the description string.

Proposed improvement
```diff
     enable_kv_cache_aware_routing: bool = Field(
         default=False,
         description=
         "Use KV cache-aware routing for attention DP request distribution. "
         "When enabled, routes requests to ranks that already have matching "
-        "prefix KV cache, reducing redundant prefill computation. ")
+        "prefix KV cache, reducing redundant prefill computation. "
+        "Requires kv_cache_config.enable_block_reuse=True to take effect.")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/llmapi/llm_args.py` around lines 538 - 543, Update the Field
description for enable_kv_cache_aware_routing to explicitly state it only takes
effect when kv_cache_config.enable_block_reuse is True (otherwise
ADPRouter.create falls back to DefaultADPRouter), and remove the trailing space
at the end of the description string; reference enable_kv_cache_aware_routing,
kv_cache_config.enable_block_reuse, ADPRouter.create and DefaultADPRouter so
users understand the dependency and silent fallback.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

577-580: Keep the binding imports namespaced here.

Line 577 shadows the module-level SamplingConfig imported at Line 50. Using module aliases also makes it clearer that these are binding types rather than the Python request wrappers.

♻️ Suggested rewrite
```diff
-        from tensorrt_llm.bindings import SamplingConfig
-        from tensorrt_llm.bindings.internal.batch_manager import BlockKey
-        from tensorrt_llm.bindings.internal.batch_manager import \
-            LlmRequest as CppLlmRequest
-        block_key = BlockKey(tokens=input_tokens, lora_task_id=lora_task_id)
+        import tensorrt_llm.bindings as bindings
+        import tensorrt_llm.bindings.internal.batch_manager as batch_manager
+
+        block_key = batch_manager.BlockKey(
+            tokens=input_tokens, lora_task_id=lora_task_id
+        )
         unique_tokens = block_key.unique_tokens
-        dummy_req = CppLlmRequest(request_id=0,
-                                  max_new_tokens=0,
-                                  input_tokens=input_tokens,
-                                  sampling_config=SamplingConfig(),
-                                  is_streaming=False,
-                                  lora_task_id=lora_task_id)
+        dummy_req = batch_manager.LlmRequest(
+            request_id=0,
+            max_new_tokens=0,
+            input_tokens=input_tokens,
+            sampling_config=bindings.SamplingConfig(),
+            is_streaming=False,
+            lora_task_id=lora_task_id,
+        )
```

As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions" and "Avoid shadowing variables declared in an outer scope in Python".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 577 - 580,
The local imports in resource_manager.py shadow the module-level SamplingConfig
and break the "import module, not names" guideline; replace the three "from
tensorrt_llm.bindings ..." style imports with a namespaced import (e.g., import
tensorrt_llm.bindings as bindings or import
tensorrt_llm.bindings.internal.batch_manager as batch_manager) and update usages
to bindings.SamplingConfig, batch_manager.BlockKey and batch_manager.LlmRequest
(aliasing LlmRequest to CppLlmRequest if you need that name) so the binding
types remain clearly namespaced and do not shadow the existing SamplingConfig.
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

376-380: Log which router the factory selected.

Now that router selection moved inside PyExecutor, a one-time info log of type(self.adp_router).__name__ would make it obvious whether KV-cache-aware routing was actually enabled or whether the factory fell back to the default router because a prerequisite was missing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 376 - 380, After
ADPRouter.create is called in PyExecutor (the block that assigns self.adp_router
via ADPRouter.create with dist, kv_cache_manager, and attention_dp_config), add
a one-time info log that records the concrete router class selected by the
factory: log type(self.adp_router).__name__ (or the equivalent) using the
existing logger so it's clear whether the KV-cache-aware router was used or a
fallback; ensure the log runs immediately after the self.adp_router assignment
in the PyExecutor initialization flow.

70-70: Keep the router import namespaced.

Please import the module and reference adp_router.ADPRouter instead of importing ADPRouter directly. This file already pulls in a large symbol surface, and the repo rule is to keep Python imports explicit.

As per coding guidelines, "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions (e.g., use from package.subpackage import foo then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` at line 70, The import of
ADPRouter should be namespaced: replace the direct class import (ADPRouter) with
a module import (adp_router) and update all usages in this file to reference
adp_router.ADPRouter; specifically change the import line that currently brings
in ADPRouter and then update any instantiation or type references of ADPRouter
to use adp_router.ADPRouter so the module namespace is preserved and the symbol
surface remains explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2526-2527: The PyExecutor currently calls
self.adp_router.gather_prefix_matches(new_requests) whenever
needs_prefix_matches is true, which triggers a tp_allgather even when
new_requests is empty; change the call site so the probe only runs when there
are actual requests (guard with if new_requests) and remove subclass-specific
setup from PyExecutor by adding an ADPRouter hook (e.g., a method on ADPRouter
like maybe_gather_prefix_matches(new_requests) or move the needs_prefix_matches
check into ADPRouter) so PyExecutor simply calls a single router method and the
router itself decides whether to call gather_prefix_matches/local tp_allgather
based on new_requests and needs_prefix_matches.
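The suggestion to move the guard into a router hook could look like the sketch below. Skipping a collective is only safe when every rank skips it together. This sketch therefore assumes that every rank sees the same `new_requests` list. Otherwise a rank-local early return would leave the other ranks blocked in the allgather.

`PrefixMatchGatheringRouter` is hypothetical. `maybe_gather_prefix_matches` is named after the reviewer's example.

```python
from typing import Callable, List, Optional, Sequence


class PrefixMatchGatheringRouter:
    """Hypothetical router that decides for itself whether to run the probe."""

    needs_prefix_matches = True

    def __init__(self, probe: Callable[[Sequence[int]], int],
                 allgather: Callable[[List[int]], List[List[int]]]):
        self._probe = probe
        self._allgather = allgather
        self.all_ranks_prefix_matches: Optional[List[List[int]]] = None

    def maybe_gather_prefix_matches(self, new_requests: Sequence[Sequence[int]]) -> None:
        self.all_ranks_prefix_matches = None
        # Assumes all ranks see the same new_requests, so they all return here together.
        if not self.needs_prefix_matches or not new_requests:
            return
        local = [self._probe(tokens) for tokens in new_requests]
        self.all_ranks_prefix_matches = self._allgather(local)


calls: List[List[int]] = []


def fake_allgather(local: List[int]) -> List[List[int]]:
    """Records each collective call; pretends a second rank has no cached prefixes."""
    calls.append(local)
    return [local, [0] * len(local)]


router = PrefixMatchGatheringRouter(probe=len, allgather=fake_allgather)
router.maybe_gather_prefix_matches([])  # empty batch: no collective is issued
router.maybe_gather_prefix_matches([[1, 2], [3]])
print(len(calls), router.all_ranks_prefix_matches)  # 1 [[2, 1], [0, 0]]
```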

In `@tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py`:
- Around line 88-96: The factory currently returns KVCacheAwareADPRouter when
kv_cache_manager.enable_block_reuse is true but some managers (e.g.,
KVCacheManagerV2) lack the probe_prefix_match_length() API; update the selection
logic in the block that constructs KVCacheAwareADPRouter to also verify that
kv_cache_manager implements the required probing API (e.g.,
hasattr(kv_cache_manager, "probe_prefix_match_length") or an isinstance check
against the KVCacheManager type) before returning KVCacheAwareADPRouter, and
otherwise fall back to DefaultADPRouter so gather_prefix_matches() won't raise
an AttributeError.
- Around line 429-437: When you assign a request to a target data-parallel rank
in the block that checks scheduling_params.attention_dp_rank, also record
explicit per-request placement in the token-load tracker so future score
calculations see this placement; update the same branch that increments
all_ranks_num_active_requests and appends to all_ranks_new_requests to also
register the placement for req_item with the token-load tracker (the component
that tracks token load), using the scheduling_params.attention_dp_rank as the
placement rank so later scoring no longer treats that rank as lightly loaded.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 50b1a799-f697-417e-8ede-dad0dc66c02d

📥 Commits

Reviewing files that changed from the base of the PR and between e71a200 and 8d0852a.

📒 Files selected for processing (8)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/scheduler/__init__.py
  • tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/_torch/executor/test_kvcache_aware_router.py

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
@lancelly lancelly requested a review from a team as a code owner March 18, 2026 09:32
Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

1 similar comment
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39447 [ run ] triggered by Bot. Commit: d42ab67 Link to invocation

Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
Comment thread tensorrt_llm/_torch/pyexecutor/scheduler/adp_router.py
Comment thread tensorrt_llm/llmapi/llm_args.py Outdated
Signed-off-by: Lanyu Liao <[email protected]>
Collaborator

@QiJune QiJune left a comment


LGTM

Signed-off-by: Lanyu Liao <[email protected]>
@lancelly
Collaborator Author

/bot run --disable-fail-fast

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40343 [ run ] triggered by Bot. Commit: ec5bdf0 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40343 [ run ] completed with state FAILURE. Commit: ec5bdf0
/LLM/main/L0_MergeRequest_PR pipeline #31448 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40403 [ run ] triggered by Bot. Commit: ec5bdf0 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40403 [ run ] completed with state FAILURE. Commit: ec5bdf0
/LLM/main/L0_MergeRequest_PR pipeline #31498 completed with status: 'FAILURE'

CI Report

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40424 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40424 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31517 completed with status: 'FAILURE'

CI Report

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40457 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40457 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31546 completed with status: 'FAILURE'

CI Report

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40474 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40474 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31564 completed with status: 'FAILURE'

CI Report

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40486 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40486 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31574 completed with status: 'FAILURE'

CI Report

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40493 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40493 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31581 completed with status: 'FAILURE'

CI Report

Link to invocation

@lancelly
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40495 [ run ] triggered by Bot. Commit: 1369963 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40495 [ run ] completed with state SUCCESS. Commit: 1369963
/LLM/main/L0_MergeRequest_PR pipeline #31584 completed with status: 'FAILURE'

CI Report

Link to invocation

@lancelly
Collaborator Author

/bot skip --comment "Flaky multi-GPU nemotron test"

@tensorrt-cicd
Collaborator

PR_Github #40525 [ skip ] triggered by Bot. Commit: 1369963 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40525 [ skip ] completed with state SUCCESS. Commit: 1369963
Skipping testing for commit 1369963

Link to invocation

@lancelly lancelly merged commit 3318aca into NVIDIA:main Mar 28, 2026
5 checks passed

8 participants