[None][feat] Enable joint optimization of agent applications and TensorRT-LLM with Scaffolding#11173
Conversation
📝 WalkthroughWalkthroughThe changes implement a comprehensive agent-tree-based scheduling framework for batch request prioritization, introduce KV cache truncation capabilities with extended metrics tracking, enhance the scaffolding framework with multi-turn chat support and tool integration, and add benchmarking infrastructure. New agent hierarchy hierarchies enable prioritized batching, while expanded KV cache management and performance metrics provide observability and optimization. Supporting implementations include Deep Research workflows, MCP integration, and example client scripts. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client Request
participant Scheduler as CapacityScheduler
participant Policy as AgentTreePolicy
participant Tree as AgentTreeNode
participant Queue as Request Queue
Client->>Scheduler: Submit Requests
Scheduler->>Policy: resortRequests(activeRequests)
Policy->>Tree: insertRequest() per request
Tree->>Tree: routeByHierarchy(level, nodeType)
Note over Tree: Insert into appropriate node
Tree->>Tree: getRequests() aggregate from tree
Tree->>Tree: mergeNodesSequence() per node type
Note over Tree: Apply node-specific merge logic<br/>(ratio mixing, reordering, etc.)
Policy->>Policy: Partition gen vs non-gen
Policy->>Policy: truncateByCapacity()
Policy->>Queue: Return reordered requests
Scheduler->>Scheduler: Schedule from requestsToSchedule
sequenceDiagram
participant User as User Input
participant Controller as ChatWithMCPController
participant Generator as Generation Controller
participant MCP as MCPWorker
participant Task as Task Output
User->>Controller: generate(prompt)
Controller->>Controller: Create ChatTask from prompt
loop Iteration <= max_iterations
Controller->>Generator: process([ChatTask])
Generator->>Generator: Generate with tools enabled
Generator->>Task: Yield ChatTask with AssistantMessage
alt Tool calls present
Controller->>Controller: Extract tool_calls
Controller->>MCP: MCPCallTask for each tool
MCP->>MCP: Execute via web service
MCP->>Task: Return result_str
Controller->>Controller: Append UserMessage with result
else No tool calls
Controller->>Task: Break loop
end
end
Controller->>Task: Return final GenerationResult
sequenceDiagram
participant Request as LlmRequest
participant KVMgr as KVCacheManager
participant BlockMgr as BlockManager
participant EvictPolicy as Eviction Policy
Request->>KVMgr: truncateBlocks(targetTokens, numKeep)
KVMgr->>KVMgr: Iterate window sizes
KVMgr->>BlockMgr: truncateBlocks per window
BlockMgr->>BlockMgr: Locate blocks exceeding numKeep
BlockMgr->>BlockMgr: Detach excess blocks
BlockMgr->>EvictPolicy: Release to free pool
Note over EvictPolicy: Mark blocks as available
BlockMgr->>Request: Update token counts
KVMgr->>Request: setKvCacheSystemStats()
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 2
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (9)
cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp (1)
1-16:⚠️ Potential issue | 🟡 MinorUpdate copyright year to reflect latest modification (2026).
This file was modified in 2026; the header still lists 2025. Please update the year to the latest meaningful modification.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
tensorrt_llm/inputs/data.py (1)
1-2:⚠️ Potential issue | 🟠 MajorAdd NVIDIA copyright header.
Source files should include the standard NVIDIA header with the latest modification year (2026).
As per coding guidelines, “All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.”✅ Suggested header
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + # Adapt from # https://github.com/vllm-project/vllm/blob/2e33fe419186c65a18da6668972d61d7bbc31564/vllm/inputs/data.pyexamples/scaffolding/contrib/open_deep_research/TavilyMCP/travily.py (2)
1-13:⚠️ Potential issue | 🟡 MinorMissing NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification. As per coding guidelines: "All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification."
📝 Proposed fix to add copyright header
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import asyncio import logging
61-74:⚠️ Potential issue | 🟡 MinorFilename appears to have a typo.
The file is named
travily.pybut likely should betavily.pyto match the Tavily service it integrates with. This could cause confusion when searching for or referencing the file.tensorrt_llm/commands/serve.py (1)
604-633:⚠️ Potential issue | 🟠 MajorWire agent CLI options into LLM configuration and validate inputs.
The
agent_percentageandagent_typesparameters are accepted from the CLI but never used in the serve function, making these flags ineffective. The range validation foragent_percentageshould also be enforced at the CLI layer. However, the exact configuration key and structure for agent scheduling needs to be verified against the actual LLM backend implementation, asresort_policy_configdoes not appear in the codebase and the intended integration point is unclear.tensorrt_llm/executor/worker.py (1)
1-5:⚠️ Potential issue | 🟠 MajorAdd the NVIDIA copyright header.
This file is a TensorRT‑LLM source file and should carry the standard NVIDIA header with the year of latest meaningful modification (2026 here). Please copy the canonical header used elsewhere in the repo.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
tensorrt_llm/llmapi/llm.py (1)
1-1:⚠️ Potential issue | 🟠 MajorAdd the NVIDIA copyright header.
This file is a TensorRT‑LLM source file and should carry the standard NVIDIA header with the year of latest meaningful modification (2026 here). Please copy the canonical header used elsewhere in the repo.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
tensorrt_llm/scaffolding/scaffolding_llm.py (1)
1-1:⚠️ Potential issue | 🟠 MajorAdd the NVIDIA copyright header.
This file is a TensorRT‑LLM source file and should carry the standard NVIDIA header with the year of latest meaningful modification (2026 here). Please copy the canonical header used elsewhere in the repo.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
cpp/include/tensorrt_llm/executor/types.h (1)
1-2:⚠️ Potential issue | 🟡 MinorUpdate copyright year to include 2025.
The copyright header shows "2022-2024" but this file has been modified in 2025. As per coding guidelines, the copyright year should reflect the year of latest meaningful modification.
/* - * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/_util.py`:
- Around line 881-890: The resort policy handling and creation of the
SimpleScheduler reference variables (resort_policy_config, capacity_scheduler,
mb_scheduler) that are only defined inside the else branch, causing a NameError
when Python scheduler path is taken; fix by moving the entire if
resort_policy_config is not None: ...
capacity_scheduler.impl.set_agent_tree_resort_policy(...) and the subsequent
scheduler = SimpleScheduler(capacity_scheduler, mb_scheduler) lines inside the
same else block where capacity_scheduler and mb_scheduler are defined (so that
llm_args.resort_policy_config, capacity_scheduler, mb_scheduler and
SimpleScheduler are all in scope).
In `@tensorrt_llm/serve/harmony_adapter.py`:
- Around line 1719-1739: The function _create_usage_info currently uses a Python
3.10+ union style hint for optional parameters; change the type annotations for
reasoning_text and tokenizer to use typing.Optional to remain Python 3.8+
compatible (import Optional from typing), e.g. annotate reasoning_text as
Optional[str] and tokenizer as Optional[<tokenizer-type>] in the
_create_usage_info signature and any related variable/type uses, and update any
other occurrences in this file that use the `X | None` syntax to Optional[X];
keep the function logic unchanged.
🟠 Major comments (32)
examples/scaffolding/bench_kv_cache_hints.py-1-1 (1)
1-1:⚠️ Potential issue | 🟠 MajorAdd the required NVIDIA copyright header.
This new source file is missing the mandatory NVIDIA copyright header with the latest modification year. Please add it before the module docstring.
📝 Proposed header
+# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# """Benchmark TensorRT-LLM with KV cache hints for multi-turn conversations.As per coding guidelines, all TensorRT-LLM source files (.py) should contain an NVIDIA copyright header with the year of latest meaningful modification.
examples/scaffolding/contrib/open_deep_research/run_deep_research.py-70-72 (1)
70-72:⚠️ Potential issue | 🟠 MajorUse
async_shutdown()forMCPWorkerto properly clean up async resources.The
MCPWorkerclass has anasync_shutdown()method that properly signals background tasks to stop and waits for their completion. Using synchronousshutdown()in an async context may not properly clean up the async resources.🐛 Proposed fix
llm.shutdown() generation_worker.shutdown() - mcp_worker.shutdown() + await mcp_worker.async_shutdown() returnexamples/scaffolding/benchmarks/benchmark_utils.py-1-2 (1)
1-2:⚠️ Potential issue | 🟠 MajorAdd the standard NVIDIA copyright header.
This new source file is missing the required NVIDIA copyright header with the year of latest meaningful modification. Please add the repo-standard header at the top.
As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
examples/scaffolding/benchmarks/multiround_chat_benchmark.py-1-12 (1)
1-12:⚠️ Potential issue | 🟠 MajorAdd NVIDIA copyright header.
The file is missing the required NVIDIA copyright/SPDX header.
📝 Proposed header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + """Multi-turn conversation benchmark for scaffolding.As per coding guidelines, "All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification".
tensorrt_llm/scaffolding/load_generation_strategy.py-1-4 (1)
1-4:⚠️ Potential issue | 🟠 MajorAdd NVIDIA copyright header.
The file is missing the required NVIDIA copyright/SPDX header.
📝 Proposed header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + """Load generation strategies for TensorRT-LLM Scaffolding Benchmark.As per coding guidelines, "All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification".
tensorrt_llm/scaffolding/load_generation_strategy.py-147-205 (1)
147-205:⚠️ Potential issue | 🟠 Majormax_concurrency is ignored in ThroughputStrategy and ConstantRateStrategy.
Both classes define
max_concurrencybut never overrideget_semaphore, so the limit is never enforced.🛠️ Suggested fix
class ThroughputStrategy(LoadGenerationStrategy): @@ max_concurrency: Optional[int] = Field(default=None, description="Maximum concurrency limit") + def get_semaphore(self): + """Return semaphore with specified concurrency limit if set.""" + import asyncio + + if self.max_concurrency is not None: + return asyncio.Semaphore(self.max_concurrency) + return _UnlimitedSemaphore() + async def _generate_request_times(self) -> AsyncGenerator[float, None]: """All requests sent at start time immediately.""" init_time = self.start_time while True: yield init_timeclass ConstantRateStrategy(LoadGenerationStrategy): @@ max_concurrency: Optional[int] = Field(default=None, description="Maximum concurrency limit") + def get_semaphore(self): + """Return semaphore with specified concurrency limit if set.""" + import asyncio + + if self.max_concurrency is not None: + return asyncio.Semaphore(self.max_concurrency) + return _UnlimitedSemaphore() + async def _generate_request_times(self) -> AsyncGenerator[float, None]: """Generate send times at fixed intervals.""" constant_increment = 1.0 / self.ratetensorrt_llm/scaffolding/scaffolding_llm.py-225-226 (1)
225-226:⚠️ Potential issue | 🟠 MajorRespect the
shutdown_workersflag for async shutdown.
async_shutdown()now runs unconditionally, which changes behavior even when callers passshutdown_workers=False. Gate it behind the flag (or rename the parameter).🛠️ Suggested fix
async def stop_task_on_loop(): await self.task_queue.put(None) await self.main_loop_stop_event.wait() - for worker in self.workers.values(): - await worker.async_shutdown() + if shutdown_workers: + for worker in self.workers.values(): + await worker.async_shutdown()examples/serve/kv_cache_hint.sh-38-52 (1)
38-52:⚠️ Potential issue | 🟠 MajorFix KV‑cache hint field names to match the server API.
The server expects
messages_to_retainandmessages(notcontext_to_retain/context_to_remove). The current payload will fail validation or be ignored.🛠️ Suggested fix
-d "{ \"model\": \"Qwen3-8B\", \"action\": \"truncate\", - \"messages\": [], - \"context_to_retain\": [ + \"messages_to_retain\": [ {\"role\": \"system\", \"content\": \"$prompt\"} ], - \"context_to_remove\": [ + \"messages\": [ {\"role\": \"user\", \"content\": \"$prompt\"} ], \"temperature\": 0.8, \"top_p\": 0.95 }"tensorrt_llm/scaffolding/contrib/mcp/chat_handler.py-1-1 (1)
1-1:⚠️ Potential issue | 🟠 MajorAdd the NVIDIA copyright header.
This file is a TensorRT‑LLM source file and should carry the standard NVIDIA header with the year of latest meaningful modification (2026 here). Please copy the canonical header used elsewhere in the repo.
As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
tensorrt_llm/llmapi/llm.py-627-669 (1)
627-669:⚠️ Potential issue | 🟠 MajorValidate
actionand normalizesampling_paramsbefore tokenization.
input_processoroften expectsSamplingParams; passingNonecan lead to inconsistent tokenization or errors. Also, validatingactionhere gives a clearer API contract than deferring to lower layers.🛠️ Suggested fix
def set_kv_cache_hints( self, action: Literal["truncate"], messages_to_retain: Union[str, List[int]], messages: Union[str, List[int]], sampling_params: Optional[SamplingParams] = None) -> None: - '''Set KV cache hints. - ''' + '''Set KV cache hints. + + Args: + action: Only "truncate" is supported. + messages_to_retain: Prompt(s) to retain. + messages: Prompt(s) representing the full context. + sampling_params: Sampling parameters used for tokenization. + ''' + if action != "truncate": + raise ValueError(f"Invalid action: {action}") + sampling_params = self._prepare_sampling_params(sampling_params) if isinstance(messages, str) and messages == "": return if isinstance(messages, list) and len(messages) == 0: returnAs per coding guidelines: Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.
tensorrt_llm/executor/worker.py-334-335 (1)
334-335:⚠️ Potential issue | 🟠 MajorGuard KV-cache hint handling to avoid crashing the worker loop.
If
set_kv_cache_hintsraises, the leader loop will abort and stop serving. Wrap this branch and log errors to keep the worker alive.🛠️ Suggested fix
elif isinstance(req, KVCacheHintRequest): - worker.set_kv_cache_hints(req) + try: + worker.set_kv_cache_hints(req) + except Exception as e: + logger.error(f"set_kv_cache_hints failed: {e}") + logger.error(traceback.format_exc())cpp/include/tensorrt_llm/batch_manager/resortPolicy.h-44-49 (1)
44-49:⚠️ Potential issue | 🟠 MajorRemove the convenience overload or refactor to avoid unnecessary conversions.
The list-based
resortRequests()overload copies requests betweenstd::listandstd::vectorcontainers on every invocation. Since this is called fromCapacityScheduler::operator()(the main batch scheduling routine), the overhead of two container conversions per scheduling decision could be significant for large request batches. Callers should either use the vector-based API directly or the convenience method should be eliminated to avoid hidden performance costs.tensorrt_llm/scaffolding/contrib/open_deep_research/tools.py-1-4 (1)
1-4: 🛠️ Refactor suggestion | 🟠 MajorMissing NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
📝 Proposed fix to add copyright header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + from tensorrt_llm.scaffolding.task import OpenAIToolDescriptionexamples/scaffolding/openai_chat_client_for_agent_tree.py-1-6 (1)
1-6: 🛠️ Refactor suggestion | 🟠 MajorMissing NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.
📝 Proposed fix to add copyright header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + from openai import OpenAItensorrt_llm/_torch/pyexecutor/py_executor.py-1665-1692 (1)
1665-1692:⚠️ Potential issue | 🟠 MajorPotential issues in control queue processing.
Missing null check for
kv_cache_manager: Line 1689 callsself.kv_cache_manager.truncate_blocks()without verifying thatkv_cache_manageris not None. This could raise an AttributeError if KV cache is not configured.Use
TypeErrorfor invalid type: Per the static analysis hint,TypeErroris more appropriate thanValueErrorfor invalid type checks.
qsize()is unreliable for thread safety: Theqsize()method is not reliable in multi-threaded contexts. Consider using a try/except pattern withget_nowait()instead.🐛 Proposed fix
def _sync_and_process_control_queue(self): """ Synchronizes and processes control queue items across all ranks. This method ensures that control queue items (like TruncateKVCacheRequest) are broadcast from rank 0 to all other ranks, so that all ranks execute the same control operations for consistency (e.g., KV cache truncation). """ # Rank 0 collects items from the control queue if self.dist.rank == 0: control_requests = [] - while self.control_queue.qsize() > 0: - request = self.control_queue.get_nowait() - if request is not None: - control_requests.append(request) + while True: + try: + request = self.control_queue.get_nowait() + if request is not None: + control_requests.append(request) + except queue.Empty: + break else: control_requests = None # Broadcast control requests to all ranks control_requests = self.dist.broadcast(control_requests, root=0) # All ranks process the control requests for request in control_requests: if isinstance(request, TruncateKVCacheRequest): + if self.kv_cache_manager is None: + logger.warning("KV cache manager is not configured, skipping truncate request") + continue self.kv_cache_manager.truncate_blocks( request.messages, len(request.messages_to_retain)) else: - raise ValueError(f"Invalid request type: {type(request)}.") + raise TypeError(f"Invalid request type: {type(request)}.")tensorrt_llm/scaffolding/contrib/open_deep_research/researcher.py-1-12 (1)
1-12:⚠️ Potential issue | 🟠 MajorAdd required NVIDIA copyright header.
This new source file is missing the required NVIDIA SPDX header.
As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.🛠️ Suggested fix
+# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + import logging from dataclasses import dataclass, field from typing import Listexamples/scaffolding/benchmarks/__init__.py-1-5 (1)
1-5:⚠️ Potential issue | 🟠 MajorAdd required NVIDIA copyright header.
This new module is missing the required NVIDIA SPDX header.
As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.🛠️ Suggested fix
+# SPDX-License-Identifier: Apache-2.0 +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + """Scaffolding benchmarks package.tensorrt_llm/serve/openai_server.py-493-528 (1)
493-528:⚠️ Potential issue | 🟠 MajorBug:
convert_messageshas incorrect return type handling and error propagation.The inner function
convert_messageshas several issues:
- Type hint mismatch: Declared return type is
List[int]but it returnsstr(prompt) on success- Error handling bug: On exception, it returns a
Responseobject, but callers at lines 519-520 use the result directly without checking for errors- Silent failure: If message conversion fails, the error Response is passed to
self.llm.set_kv_cache_hints()as if it were valid prompt dataAdditionally, line 493 has a style issue: missing space after
=indocuments =request.documents.🐛 Proposed fix for error handling
- documents =request.documents + documents = request.documents chat_template = request.chat_template chat_template_kwargs = request.chat_template_kwargs or {} from openai.types.chat import ChatCompletionMessageParam - def convert_messages(messages: List[ChatCompletionMessageParam]) -> List[int]: + def convert_messages(messages: List[ChatCompletionMessageParam]) -> Union[str, Response]: try: conversation: List[ConversationMessage] = [] conversation, _, __ = parse_chat_messages_coroutines(messages, self.model_config, None) prompt: str = apply_chat_template( model_type=self.model_config.model_type, tokenizer=self.tokenizer, processor=self.processor, conversation=conversation, add_generation_prompt=add_generation_prompt, mm_placeholder_counts=[], tools=tool_dicts, documents=documents, chat_template=chat_template, chat_template_kwargs=chat_template_kwargs, ) return prompt except Exception as e: logger.error(traceback.format_exc()) return self.create_error_response(str(e)) - messages_to_retain = convert_messages(request.messages_to_retain) if request.messages_to_retain else [] - messages = convert_messages(request.messages) if request.messages else [] + messages_to_retain_result = convert_messages(request.messages_to_retain) if request.messages_to_retain else "" + if isinstance(messages_to_retain_result, Response): + return messages_to_retain_result + messages_to_retain = messages_to_retain_result + + messages_result = convert_messages(request.messages) if request.messages else "" + if isinstance(messages_result, Response): + return messages_result + messages = messages_resulttensorrt_llm/scaffolding/contrib/open_deep_research/supervisor.py-52-57 (1)
52-57:⚠️ Potential issue | 🟠 MajorSupervisorTask.create_from_prompt doesn’t populate
input_str.Line 52 sets
user_prompt, butSupervisor.processreadsinput_str(Line 91). This leaves the prompt empty unless some other path sets it.Suggested fix
def create_from_prompt(prompt: str) -> "SupervisorTask": task = SupervisorTask() task.user_prompt = prompt + task.input_str = prompt task.research_brief = None task.final_report = None return taskcpp/tensorrt_llm/batch_manager/kvCacheManager.cpp-1457-1474 (1)
1457-1474:⚠️ Potential issue | 🟠 MajorTruncation threshold is checked before updating matched‑token count.
Line 1457 checks
numMatchedTokensbefore adding the current block’s tokens, which can keep one extra block (and release descendants from the wrong node). Update the counter first and use>=to trigger truncation at the boundary.Suggested fix
- if (numMatchedTokens > numTokensToKeep) - { - matchingBlock->setPriority(executor::KvCacheRetentionConfig::kMinRetentionPriority); - releaseChildren(matchingBlock); - break; - } - - if (mEventManager) - { - mEventManager->enqueueUpdatedEvent(tle::KVCacheUpdatedData(matchingBlock->getHash()) - .priorityUpdated(matchingBlock->getPriority(), - executor::KvCacheRetentionConfig::kMinRetentionPriority), - mWindowSize); - } - - numMatchedTokens += numMatched > 0 ? numMatched : blockKey.uniqueTokens.size(); + numMatchedTokens += numMatched > 0 ? numMatched : blockKey.uniqueTokens.size(); + if (numMatchedTokens >= numTokensToKeep) + { + matchingBlock->setPriority(executor::KvCacheRetentionConfig::kMinRetentionPriority); + releaseChildren(matchingBlock); + break; + } + + if (mEventManager) + { + mEventManager->enqueueUpdatedEvent(tle::KVCacheUpdatedData(matchingBlock->getHash()) + .priorityUpdated(matchingBlock->getPriority(), + executor::KvCacheRetentionConfig::kMinRetentionPriority), + mWindowSize); + }cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp-1429-1474 (1)
1429-1474:⚠️ Potential issue | 🟠 MajorGuard radix‑tree access with
mCachedBlocksRootMutex.Line 1441 reads/modifies
mCachedBlocksRootand then callsfindMatchingBlock/releaseChildrenwithout the mutex used in other radix‑tree operations, which risks races with concurrent cache updates.Suggested fix
void WindowBlockManager::truncateBlocks(LlmRequest::VecTokens const& targetTokens, SizeType32 numTokensToKeep) { + std::lock_guard<std::mutex> lock(mCachedBlocksRootMutex); auto numTokens = static_cast<SizeType32>(targetTokens.size()); auto blockedTokens = chopVectorIntoBlocks<TokenIdType>(targetTokens, numTokens, mTokensPerBlock, true); SizeType32 numMatchedTokens = 0;tensorrt_llm/scaffolding/contrib/open_deep_research/supervisor.py-143-155 (1)
143-155:⚠️ Potential issue | 🟠 Major
complete_researchonly exits the inner loop.Line 152 breaks the tool‑call loop, but the outer
for _ in range(self.max_research_iter)continues, so planning keeps running after completion. Break the outer loop whencomplete_researchis invoked.Suggested fix
- for tool_call in research_planning_task.messages[-1].tool_calls: + completed = False + for tool_call in research_planning_task.messages[-1].tool_calls: tool_name = tool_call.function.name arguments = json.loads(tool_call.function.arguments) @@ elif tool_name == "complete_research": - break + completed = True + break + + if completed: + breaktensorrt_llm/scaffolding/worker.py-87-116 (1)
87-116:⚠️ Potential issue | 🟠 MajorRemove blocking
requests.post()from async function and add timeout.The
send_kv_cache_hintmethod isasyncbut calls synchronousrequests.post()directly, which blocks the event loop indefinitely without a timeout. Offload this call to a thread usingasyncio.to_thread()and set a timeout.Suggested fix
- return requests.post( - url, - json=kv_cache_hint_params, - headers=headers, - ) + return await asyncio.to_thread( + requests.post, + url, + json=kv_cache_hint_params, + headers=headers, + timeout=30, + )tensorrt_llm/scaffolding/task_collection.py-131-132 (1)
131-132:⚠️ Potential issue | 🟠 MajorOverride
get_global_infoas a staticmethod to matchTaskCollection.Without
@staticmethod, callinginstance.get_global_info()raisesTypeErrorbecauseselfis implicitly passed.✅ Fix for all affected subclasses
class ChatTokenCounter(TaskCollection): @@ - def get_global_info() -> Any: + `@staticmethod` + def get_global_info() -> Any: return ChatTokenCounter.statistics @@ class TaskTimer(TaskCollection): @@ - def get_global_info() -> Any: + `@staticmethod` + def get_global_info() -> Any: return TaskTimer.statistics @@ class QueryCollector(TaskCollection): @@ - def get_global_info() -> Any: + `@staticmethod` + def get_global_info() -> Any: with open(QueryCollector.file_name, 'w', encoding='utf-8') as f: json.dump(QueryCollector.query_dict, f, indent=4, ensure_ascii=False) return NoneAlso applies to: 167-168, 186-192
cpp/include/tensorrt_llm/batch_manager/agentTree.h-17-18 (1)
17-18:⚠️ Potential issue | 🟠 MajorReplace
#pragma oncewith the required TRTLLM_ include guard.*The guidelines mandate a preprocessor guard with the
TRTLLM_<FILENAME>_Hpattern.🛡️ Required include guard
-#pragma once +#ifndef TRTLLM_AGENTTREE_H +#define TRTLLM_AGENTTREE_H @@ -} // namespace agent_tree -} // namespace tensorrt_llm::batch_manager +} // namespace agent_tree +} // namespace tensorrt_llm::batch_manager + +#endif // TRTLLM_AGENTTREE_HAs per coding guidelines: Use a preprocessor guard in C++ header files with the format
TRTLLM_<FILENAME>_H, where the filename is in uppercase with no underscores, no prefix underscores, and no trailing underscores.Also applies to: 337-338
tensorrt_llm/scaffolding/task_collection.py-1-1 (1)
1-1:⚠️ Potential issue | 🟠 MajorAdd the required NVIDIA copyright header.
This file currently has no header.
🔧 Suggested header
+# +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# import jsonAs per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
cpp/tensorrt_llm/batch_manager/agentTree.cpp-293-301 (1)
293-301:⚠️ Potential issue | 🟠 MajorHandle
agentPercentage = -1.0before constructingAgentChatbotNodeParams.
AgentTreeConfigdocuments-1.0as a sentinel, butAgentChatbotNodeParamsasserts ratio ∈ [0,1]. With the default, this path fails immediately.✅ One possible fix (map sentinel to a policy ratio)
- auto nodeParams = std::make_shared<AgentChatbotNodeParams>(config.agentPercentage, childNodeTypes); + auto agentRatio = config.agentPercentage; + if (agentRatio < 0.0F) + { + agentRatio = 0.5F; // TODO: pick a policy for "random" scheduling + } + auto nodeParams = std::make_shared<AgentChatbotNodeParams>(agentRatio, childNodeTypes);tensorrt_llm/scaffolding/task.py-1-1 (1)
1-1:⚠️ Potential issue | 🟠 MajorAdd the required NVIDIA copyright header.
This file currently has no header.
🔧 Suggested header
+# +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# import jsonAs per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
cpp/tensorrt_llm/batch_manager/agentTree.cpp-1-15 (1)
1-15:⚠️ Potential issue | 🟠 MajorUpdate the copyright year to include 2026.
The file is newly added/modified in 2026 but the header stops at 2024.
🔧 Suggested update
- * Copyright (c) 2023-2024, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2023-2026, NVIDIA CORPORATION. All rights reserved.As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
tensorrt_llm/scaffolding/__init__.py-1-1 (1)
1-1:⚠️ Potential issue | 🟠 MajorUpdate the copyright year to include 2026.
This file has 2026 changes but the header still lists 2025.
🔧 Suggested update
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2025-2026, NVIDIA CORPORATION. All rights reserved.As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
cpp/tensorrt_llm/batch_manager/agentTree.cpp-158-181 (1)
158-181:⚠️ Potential issue | 🟠 MajorAvoid duplicating remaining requests when one list finishes.
When
activeVectors == 1, you append the remaining requests but keep iterating the priority queue, which re‑adds the first of those remaining entries.🐛 Suggested fix
if (activeVectors == 1 && !pq.empty()) { auto [reqId, listIdx, itemIdx, req] = pq.top(); auto const* lastVec = nonEmptyVectors[listIdx]; result.insert(result.end(), lastVec->begin() + itemIdx, lastVec->end()); + break; }cpp/include/tensorrt_llm/batch_manager/agentTree.h-1-15 (1)
1-15:⚠️ Potential issue | 🟠 MajorUpdate the copyright year to include 2026.
The header is 2023–2024 even though the file is newly added in 2026.
🔧 Suggested update
- * Copyright (c) 2023-2024, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2023-2026, NVIDIA CORPORATION. All rights reserved.As per coding guidelines: All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
🟡 Minor comments (24)
tests/unittest/scaffolding/test_mcp_worker.py-337-365 (1)
337-365:⚠️ Potential issue | 🟡 MinorMissing try/finally for worker cleanup.
Unlike other tests,
test_multiple_calls_to_same_toolcallsworker.shutdown()outside a try/finally block. If any assertion fails, the worker won't be properly cleaned up, potentially causing resource leaks or affecting subsequent tests.Proposed fix
`@pytest.mark.asyncio` async def test_multiple_calls_to_same_tool(mcp_server): sse_url = mcp_server.get_sse_url() worker = MCPWorker(urls=[sse_url]) await worker.init_in_asyncio_event_loop() - task1 = MCPCallTask() - task1.tool_name = "add_numbers" - task1.args = json.dumps({"a": 10, "b": 20}) - - task2 = MCPCallTask() - task2.tool_name = "add_numbers" - task2.args = json.dumps({"a": 100, "b": 200}) - - status1 = await worker.run_task(task1) - status2 = await worker.run_task(task2) - assert status1 == TaskStatus.SUCCESS - assert status2 == TaskStatus.SUCCESS - result1 = int(task1.result_str) - result2 = int(task2.result_str) - assert result1 == 30, f"Expected 30, but got {result1}" - assert result2 == 300, f"Expected 300, but got {result2}" - - print("✓ Test passed: Multiple calls to same tool successful") - print(f" - Call 1: add_numbers(10, 20) = {result1}") - print(f" - Call 2: add_numbers(100, 200) = {result2}") - - worker.shutdown() + try: + task1 = MCPCallTask() + task1.tool_name = "add_numbers" + task1.args = json.dumps({"a": 10, "b": 20}) + + task2 = MCPCallTask() + task2.tool_name = "add_numbers" + task2.args = json.dumps({"a": 100, "b": 200}) + + status1 = await worker.run_task(task1) + status2 = await worker.run_task(task2) + assert status1 == TaskStatus.SUCCESS + assert status2 == TaskStatus.SUCCESS + result1 = int(task1.result_str) + result2 = int(task2.result_str) + assert result1 == 30, f"Expected 30, but got {result1}" + assert result2 == 300, f"Expected 300, but got {result2}" + + print("✓ Test passed: Multiple calls to same tool successful") + print(f" - Call 1: add_numbers(10, 20) = {result1}") + print(f" - Call 2: add_numbers(100, 200) = {result2}") + + finally: + worker.shutdown()tests/unittest/scaffolding/test_mcp_worker.py-103-106 (1)
103-106:⚠️ Potential issue | 🟡 MinorCommented-out server readiness check may cause flaky tests.
The
_wait_for_servercall is commented out, which means tests proceed immediately after starting the server process. This creates a race condition where tests may run before the server is ready, potentially causing intermittent failures. Consider enabling the wait or adding an alternative readiness mechanism.Proposed fix
self.proc.start() # Wait for server to be ready - # self._wait_for_server(url=self.sse_url, - # timeout=self.MAX_SERVER_START_WAIT_S) + self._wait_for_server(url=self.sse_url, + timeout=self.MAX_SERVER_START_WAIT_S)tests/unittest/scaffolding/test_mcp_worker.py-253-253 (1)
253-253:⚠️ Potential issue | 🟡 MinorFix inconsistent async call to synchronous method.
MCPWorker.shutdown()is a synchronous method (notasync def). Line 436 intest_scaffolding_with_chat_mcp_controllerincorrectly usesawait mcp_worker.shutdown(). Remove theawaitto match the method signature and be consistent with all other test calls (lines 253, 282, 334, 364).examples/scaffolding/contrib/open_deep_research/WordLlama/pyproject.toml-4-4 (1)
4-4:⚠️ Potential issue | 🟡 MinorUpdate placeholder description before merging.
The description field contains placeholder text that should be replaced with a meaningful description of the package's purpose.
Suggested fix
-description = "Add your description here" +description = "MCP server for local knowledge base search used in Open Deep Research"examples/scaffolding/contrib/open_deep_research/WordLlama/README.md-7-17 (1)
7-17:⚠️ Potential issue | 🟡 MinorFix typo and add language specifier to code block.
- Line 12: "Travily" should be "Tavily"
- The fenced code block should specify a language (e.g.,
bashorshell) for proper syntax highlightingSuggested fix
-``` +```bash python run_deep_research.py --enable_query_collector You will see query_result.json and copy query_result.json to this path. -Kill the Travily serve. +Kill the Tavily server. uv run wordllama_serve.py Then you can continue to run run_deep_research.py. ```tensorrt_llm/scaffolding/contrib/open_deep_research/__init__.py-1-4 (1)
1-4:⚠️ Potential issue | 🟡 MinorAdd NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.
Suggested fix
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + from .researcher import Researcher from .supervisor import Supervisor, create_open_deep_research_scaffolding_llm -__all__ = ["Supervisor", "Researcher", "create_open_deep_research_scaffolding_llm"] +__all__ = ["Researcher", "Supervisor", "create_open_deep_research_scaffolding_llm"]As per coding guidelines: "All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification."
cpp/include/tensorrt_llm/batch_manager/microBatchScheduler.h-44-45 (1)
44-45:⚠️ Potential issue | 🟡 MinorRemove the orphan comment on line 44.
The comment references
AgentTreeConfigfromagentTree.h, but this type is never used inmicroBatchScheduler.hand there is no corresponding include for that header. This comment provides no value and should be removed.examples/scaffolding/contrib/open_deep_research/WordLlama/pyproject.toml-6-6 (1)
6-6:⚠️ Potential issue | 🟡 MinorDocument the Python 3.12 requirement rationale.
The declared dependencies (wordllama >=3.9, mcp >=3.10) do not require Python 3.12. Since the
requires-python = ">=3.12"constraint is intentional but undocumented, add a comment in the pyproject.toml (or README) explaining why Python 3.12 is required—whether for the package's own code, compatibility, or project policy.tensorrt_llm/scaffolding/contrib/open_deep_research/README.md-72-75 (1)
72-75:⚠️ Potential issue | 🟡 MinorTypo in script filename.
The script name appears to be misspelled. Based on the Tavily API context, it should likely be
tavily.pyinstead oftravily.py.📝 Proposed fix
cd examples/scaffolding/contrib/open_deep_research/TavilyMCP export TAVILY_API_KEY=<your_api_key> -uv run travily.py +uv run tavily.pyexamples/scaffolding/contrib/open_deep_research/WordLlama/wordllama_serve.py-1-30 (1)
1-30:⚠️ Potential issue | 🟡 MinorMissing NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
tensorrt_llm/scaffolding/contrib/open_deep_research/utils.py-1-11 (1)
1-11:⚠️ Potential issue | 🟡 MinorMissing NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
📝 Proposed fix to add copyright header
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + from datetime import datetimeexamples/scaffolding/contrib/open_deep_research/WordLlama/main.py-1-6 (1)
1-6:⚠️ Potential issue | 🟡 MinorMissing NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.
📝 Proposed fix to add copyright header
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + def main(): print("Hello from wordllama!")examples/scaffolding/contrib/open_deep_research/run_deep_research.py-1-17 (1)
1-17:⚠️ Potential issue | 🟡 MinorMissing NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h-1696-1699 (1)
1696-1699:⚠️ Potential issue | 🟡 MinorUse
//!style for the new truncateBlocks doc comments.The new Doxygen blocks use
///, but the header rule calls for//!in this codebase.✏️ Proposed fix
- /// `@brief` Drop blocks from the KV cache manager. - /// `@param` targetTokens The target tokens with both prefix to keep and suffix to drop. - /// `@param` numTokensToKeep Number of tokens to keep from the end of the sequence. + //! `@brief` Drop blocks from the KV cache manager. + //! `@param` targetTokens The target tokens with both prefix to keep and suffix to drop. + //! `@param` numTokensToKeep Number of tokens to keep from the end of the sequence. virtual void truncateBlocks(LlmRequest::VecTokens const& targetTokens, SizeType32 numTokensToKeep) = 0;- /// `@brief` Drop blocks from the KV cache manager. - /// `@param` targetTokens The target tokens with both prefix to keep and suffix to drop. - /// `@param` numTokensToKeep Number of tokens to keep from the end of the sequence. + //! `@brief` Drop blocks from the KV cache manager. + //! `@param` targetTokens The target tokens with both prefix to keep and suffix to drop. + //! `@param` numTokensToKeep Number of tokens to keep from the end of the sequence. void truncateBlocks(LlmRequest::VecTokens const& targetTokens, SizeType32 numTokensToKeep) override;As per coding guidelines, "Follow Doxygen rules for documenting new C++ class interfaces and function prototypes. Use
//!for C++-style single-line comments and//!<for class members".Also applies to: 2049-2052
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h-915-915 (1)
915-915:⚠️ Potential issue | 🟡 MinorAdd Doxygen docs for the new public truncateBlocks APIs.
These public declarations are missing Doxygen blocks, which makes the interface harder to consume.
✏️ Proposed doc additions
+ //! \brief Drop blocks from the KV cache for this window. + //! \param targetTokens The target tokens with both prefix to keep and suffix to drop. + //! \param numTokensToKeep Number of tokens to keep from the end of the sequence. void truncateBlocks(LlmRequest::VecTokens const& targetTokens, SizeType32 numTokensToKeep);+ //! \brief Drop blocks from the KV cache for a specific window size. + //! \param targetTokens The target tokens with both prefix to keep and suffix to drop. + //! \param numTokensToKeep Number of tokens to keep from the end of the sequence. void truncateBlocks(LlmRequest::VecTokens const& targetTokens, SizeType32 numTokensToKeep, SizeType32 windowSize);As per coding guidelines, "Follow Doxygen rules for documenting new C++ class interfaces and function prototypes. Use
//!for C++-style single-line comments and//!<for class members".Also applies to: 1421-1421
tensorrt_llm/scaffolding/contrib/mcp/chat_handler.py-12-14 (1)
12-14:⚠️ Potential issue | 🟡 MinorHarden env parsing and use a Google‑style docstring.
int(os.environ.get(...))will raise if the variable is set to non‑numeric values (e.g., "true"). Consider a tolerant parse and a structured docstring.🛠️ Suggested fix
-def is_deterministic_mode(): - """Check if SCAFFOLDING_DETERMINISTIC environment variable is set to enable deterministic inference.""" - return int(os.environ.get("SCAFFOLDING_DETERMINISTIC", 0)) == 1 +def is_deterministic_mode() -> bool: + """Check if deterministic inference is enabled. + + Returns: + bool: True if SCAFFOLDING_DETERMINISTIC is set to a truthy value. + """ + value = os.environ.get("SCAFFOLDING_DETERMINISTIC", "0").strip().lower() + return value in {"1", "true", "yes", "y", "on"}As per coding guidelines: Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.
tests/unittest/scaffolding/test_worker.py-81-88 (1)
81-88:⚠️ Potential issue | 🟡 MinorMissing cleanup on assertion failure.
Unlike
test_trtoai_worker_generation(lines 73-78), this test doesn't wrap the assertion in a try/except block. If the assertion fails,worker.shutdown()won't be called, potentially leaving resources unreleased.🛡️ Proposed fix to ensure cleanup on failure
`@pytest.mark.asyncio`(loop_scope="module") def test_trtoai_worker_chat(default_prompt, model_name, server): worker = create_trtoai_worker(model_name, server.get_async_client()) task = ChatTask.create_from_messages([UserMessage(default_prompt)]) task.max_tokens = 100 status = asyncio.run(worker.run_task(task)) - assert status == TaskStatus.SUCCESS, "Chat Task is not successful with TRTOpenaiWorker" - worker.shutdown() + try: + assert status == TaskStatus.SUCCESS, "Chat Task is not successful with TRTOpenaiWorker" + except AssertionError as e: + worker.shutdown() + server.__exit__(None, None, None) + raise e + worker.shutdown()tests/unittest/scaffolding/test_bench.py-166-166 (1)
166-166:⚠️ Potential issue | 🟡 MinorRemove extraneous f-string prefix.
The f-string
f"before shutdown"has no placeholders. Use a regular string instead.🔧 Proposed fix
- print(f"before shutdown") + print("before shutdown")examples/scaffolding/benchmarks/__main__.py-1-24 (1)
1-24:⚠️ Potential issue | 🟡 MinorMissing NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
examples/scaffolding/benchmarks/chat_benchmark.py-1-14 (1)
1-14:⚠️ Potential issue | 🟡 MinorMissing NVIDIA copyright header.
Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.
tensorrt_llm/scaffolding/contrib/open_deep_research/researcher.py-90-106 (1)
90-106:⚠️ Potential issue | 🟡 MinorValidate non-empty task list before indexing.
research_tasks[0]will throw if an empty list is passed.🛠️ Suggested fix
def process(self, research_tasks: List[ResearchTask], **kwargs): + if not research_tasks: + raise ValueError("Researcher requires at least one ResearchTask") chat_task = ChatTask.create_from_prompt( research_tasks[0].research_topic,tensorrt_llm/scaffolding/contrib/open_deep_research/researcher.py-52-55 (1)
52-55:⚠️ Potential issue | 🟡 MinorAvoid
assertfor runtime validation inCompressor.process.Asserts can be stripped; use an explicit exception for input validation.
🛠️ Suggested fix
- assert len(tasks) == 1 and isinstance(tasks[0], ChatTask), ( - "Compressor only supports one ChatTask" - ) + if len(tasks) != 1 or not isinstance(tasks[0], ChatTask): + raise ValueError("Compressor only supports a single ChatTask")tensorrt_llm/scaffolding/worker.py-394-396 (1)
394-396:⚠️ Potential issue | 🟡 MinorRemove the duplicate
asyncioimport.Line 394 re-imports
asyncio, which is already imported at the top and triggers F811. Clean up the duplicate import.tensorrt_llm/scaffolding/contrib/open_deep_research/prompts.py-151-156 (1)
151-156:⚠️ Potential issue | 🟡 MinorUpdate reflection instruction to match the tool name.
Line 152 still says “tavily_search” even though the tool list now uses
web_search, which can mislead the agent.Suggested fix
-After each tavily_search call, use reflection to analyze the results: +After each web_search call, use reflection to analyze the results:
|
Hi @dongxuy04 @QiJune @Superjomn , could you help to review this PR?
I think what we particularly need to review is the part where the core logic of trtllm has been modified. I'm not sure if there are any areas that introduce significant risks. Thanks! |
|
/bot run |
|
PR_Github #34384 [ run ] triggered by Bot. Commit: |
|
PR_Github #34384 [ run ] completed with state
|
6044c4a to
498712c
Compare
|
/bot run |
|
PR_Github #37355 [ run ] triggered by Bot. Commit: |
|
PR_Github #37355 [ run ] completed with state
|
34280f9 to
17f2c17
Compare
|
/bot run |
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
Signed-off-by: Yi Sun <[email protected]>
|
/bot run |
|
PR_Github #47350 [ run ] triggered by Bot. Commit: |
|
PR_Github #47350 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47399 [ run ] triggered by Bot. Commit: |
|
PR_Github #47399 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47490 [ run ] triggered by Bot. Commit: |
|
PR_Github #47490 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47517 [ run ] triggered by Bot. Commit: |
|
PR_Github #47517 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47536 [ run ] triggered by Bot. Commit: |
|
PR_Github #47536 [ run ] completed with state |
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
✏️ Tip: You can customize this high-level summary in your review settings.
Description
This PR addresses several issues in the Scaffolding-based implementation of Open Deep Research. It adds support for capturing agent application information via Task Collection, enables proactive KV cache eviction to reduce GPU memory pressure from agents, and introduces agent-tree–based performance isolation between agents and chatbots. It also includes benchmarks for evaluating Scaffolding-based agents.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]to print this help message.See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-listparameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip
skip --comment COMMENTSkip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.