[https://nvbugs/5955927][fix] Add warm up before aiperf to fix timeout issue. by dominicshanshan · Pull Request #12178 · NVIDIA/TensorRT-LLM

dominicshanshan · 2026-03-13T03:07:45Z

Summary by CodeRabbit

Release Notes

Tests
- Improved stress testing infrastructure with inference pipeline warmup for consistent benchmark conditions.
- Added configurable per-request timeout support for more flexible performance testing scenarios.
- Enhanced error diagnostics with detailed failure reporting and output capture.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-03-13T03:14:43Z

📝 Walkthrough

This change introduces inference warmup logic to the stress testing framework before benchmarking, adds request timeout handling throughout the aiperf command construction, and enhances process execution with improved stdout/stderr capture and error reporting.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Stress Test Framework Enhancement `tests/integration/defs/stress_test/stress_test.py` | Added new `warmup_inference()` function to warm up the inference pipeline before benchmarking. Extended `create_aiperf_command()` signature to include `request_timeout_seconds` parameter (default 120.0 seconds). Enhanced `run_aiperf_process()` with thread-based stdout/stderr streaming, output aggregation for error reporting, and improved cleanup in a finally block. Reworked output filtering logic to support both filtered and non-filtered modes. Integrated warmup flow into stress test execution with failure logging that continues gracefully. Updated documentation strings and adjusted health-check sequencing. |
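As a rough illustration of the warmup flow summarized above, the sketch below sends a few requests before benchmarking and returns a boolean, so a caller can log a warning and continue. Only the name `warmup_inference` comes from this PR; the `send_request` callable, the parameter names, and the defaults (taken from the "2×300s" mentioned in the review) are assumptions, since the actual implementation in `stress_test.py` is not shown in this thread.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("stress_test_sketch")


def warmup_inference(
    send_request: Callable[[float], bool],  # hypothetical: one request, True on success
    num_warmup_requests: int = 2,           # assumed default
    timeout: float = 300.0,                 # assumed per-request timeout, in seconds
) -> bool:
    """Send a few requests before benchmarking; return True only if all succeed."""
    for i in range(num_warmup_requests):
        start = time.monotonic()
        try:
            ok = send_request(timeout)
        except Exception as exc:  # a failed warmup should not crash the caller
            logger.warning("Warmup request %d raised: %s", i + 1, exc)
            return False
        if not ok:
            logger.warning("Warmup request %d failed", i + 1)
            return False
        logger.info("Warmup request %d took %.2fs", i + 1,
                    time.monotonic() - start)
    return True
```

A caller that wants the "continues gracefully" behaviour would log a warning when the function returns `False` and then proceed to the benchmark.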

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is entirely empty: all key sections (Description, Test Coverage) are blank, and the checklist is incomplete with only a generic checkbox marked. | Provide a clear description of the issue being fixed, explain the warmup solution, document test coverage for the new `warmup_inference` function, and complete the PR checklist items relevant to this change. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly describes the main change: adding a warm-up step before aiperf to resolve a timeout issue, which aligns with the core objective of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71%, which is sufficient. The required threshold is 80.00%. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/stress_test/stress_test.py`:
- Around line 978-989: Replace the unbounded lists stdout_lines and stderr_lines
with bounded deques (e.g., collections.deque(maxlen=50)) so only the last N
aiperf log lines are retained; update the imports to include deque and keep
using stdout_lock/stderr_lock and the _capture_and_print function unchanged
except to append to the deque instead of a list to prevent unbounded memory
growth (also apply the same change to the other buffer instances around the
_capture_and_print usage).
- Around line 352-355: The warmup currently uses default num_warmup_requests and
timeout (2×300s) and only logs a warning on False, so update the call sites (the
warmup_inference invocation in the stress test and the similar call around lines
796-803) to compute an explicit warmup deadline from the current stage
timeout/budget and pass that as the timeout argument to warmup_inference, then
immediately abort/raise/return (fail fast) if warmup_inference returns False;
locate the function warmup_inference and the caller run_aiperf_process (and the
other caller around 796-803) to make the change.
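The two suggestions above can be sketched roughly as follows. This is a minimal illustration, not the code in `stress_test.py`: `run_and_keep_tail` and `warmup_timeout_from_budget` are hypothetical helper names, a single lock stands in for the separate `stdout_lock`/`stderr_lock`, and the even split of the remaining budget is just one possible way to derive a warmup deadline.

```python
import subprocess
import sys
import threading
from collections import deque

MAX_TAIL_LINES = 50  # same example bound as in the review comment


def run_and_keep_tail(cmd, max_lines=MAX_TAIL_LINES):
    """Run cmd, echo its output, and retain only the last max_lines per stream."""
    stdout_tail = deque(maxlen=max_lines)  # old lines are dropped automatically
    stderr_tail = deque(maxlen=max_lines)
    lock = threading.Lock()

    def _capture_and_print(stream, tail, sink):
        for line in stream:
            sink.write(line)
            with lock:
                tail.append(line.rstrip("\n"))
        stream.close()

    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    threads = [
        threading.Thread(target=_capture_and_print,
                         args=(proc.stdout, stdout_tail, sys.stdout)),
        threading.Thread(target=_capture_and_print,
                         args=(proc.stderr, stderr_tail, sys.stderr)),
    ]
    for t in threads:
        t.start()
    returncode = proc.wait()
    for t in threads:
        t.join()  # make sure all output is drained before reading the tails
    return returncode, list(stdout_tail), list(stderr_tail)


def warmup_timeout_from_budget(stage_timeout, elapsed, num_warmup_requests):
    """Split the remaining stage budget evenly across the warmup requests."""
    remaining = stage_timeout - elapsed
    if remaining <= 0:
        raise TimeoutError("no stage budget left for warmup")
    return remaining / num_warmup_requests
```

With helpers like these, a caller would pass the computed value as the `timeout` argument to `warmup_inference` and raise immediately when it returns `False`, instead of only logging a warning.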

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 650625cc-f140-4a3c-bb4f-16a13f4e17d0

📥 Commits

Reviewing files that changed from the base of the PR and between e226930 and 558382f.

📒 Files selected for processing (1)

tests/integration/defs/stress_test/stress_test.py

dominicshanshan · 2026-03-13T03:49:47Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1"

tensorrt-cicd · 2026-03-13T03:56:57Z

PR_Github #38821 [ run ] triggered by Bot. Commit: 558382f Link to invocation

tensorrt-cicd · 2026-03-13T05:42:17Z

PR_Github #38821 [ run ] completed with state SUCCESS. Commit: 558382f
/LLM/main/L0_MergeRequest_PR pipeline #30133 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

dominicshanshan · 2026-03-13T07:47:02Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1"

tensorrt-cicd · 2026-03-13T07:52:41Z

PR_Github #38848 [ run ] triggered by Bot. Commit: 5a090d8 Link to invocation

tensorrt-cicd · 2026-03-13T08:57:18Z

PR_Github #38848 [ run ] completed with state SUCCESS. Commit: 5a090d8
/LLM/main/L0_MergeRequest_PR pipeline #30158 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

dominicshanshan · 2026-03-13T09:35:18Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

tensorrt-cicd · 2026-03-13T09:41:11Z

PR_Github #38859 [ run ] triggered by Bot. Commit: 5a090d8 Link to invocation

tensorrt-cicd · 2026-03-13T10:43:33Z

PR_Github #38859 [ run ] completed with state SUCCESS. Commit: 5a090d8
/LLM/main/L0_MergeRequest_PR pipeline #30169 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

dominicshanshan · 2026-03-13T13:10:15Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

tensorrt-cicd · 2026-03-13T13:17:20Z

PR_Github #38872 [ run ] triggered by Bot. Commit: 8724354 Link to invocation

tensorrt-cicd · 2026-03-13T14:16:11Z

PR_Github #38872 [ run ] completed with state SUCCESS. Commit: 8724354
/LLM/main/L0_MergeRequest_PR pipeline #30182 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

dominicshanshan · 2026-03-14T03:23:19Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

tensorrt-cicd · 2026-03-14T03:29:21Z

PR_Github #38930 [ run ] triggered by Bot. Commit: 0c855f3 Link to invocation

tensorrt-cicd · 2026-03-14T03:29:22Z

PR_Github #38930 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

dominicshanshan · 2026-03-14T14:24:53Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

tensorrt-cicd · 2026-03-14T14:31:44Z

PR_Github #38944 [ run ] triggered by Bot. Commit: 0c855f3 Link to invocation

tensorrt-cicd · 2026-03-14T14:31:45Z

PR_Github #38944 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

dominicshanshan · 2026-03-15T05:24:59Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

tensorrt-cicd · 2026-03-15T05:31:52Z

PR_Github #38964 [ run ] triggered by Bot. Commit: 0c855f3 Link to invocation

tensorrt-cicd · 2026-03-15T06:33:35Z

PR_Github #38964 [ run ] completed with state SUCCESS. Commit: 0c855f3
/LLM/main/L0_MergeRequest_PR pipeline #30246 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

dominicshanshan · 2026-03-15T08:53:05Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

dominicshanshan · 2026-03-15T09:00:36Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

tensorrt-cicd · 2026-03-15T09:06:41Z

PR_Github #38970 [ run ] triggered by Bot. Commit: d6b7916 Link to invocation

tensorrt-cicd · 2026-03-15T11:27:34Z

PR_Github #38975 [ run ] triggered by Bot. Commit: d6b7916 Link to invocation

tensorrt-cicd · 2026-03-15T12:40:54Z

PR_Github #38975 [ run ] completed with state SUCCESS. Commit: d6b7916
/LLM/main/L0_MergeRequest_PR pipeline #30257 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

dominicshanshan · 2026-03-15T15:51:54Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

tensorrt-cicd · 2026-03-15T15:58:11Z

PR_Github #38992 [ run ] triggered by Bot. Commit: 3d1ea05 Link to invocation

tensorrt-cicd · 2026-03-15T17:07:02Z

PR_Github #38992 [ run ] completed with state SUCCESS. Commit: 3d1ea05
/LLM/main/L0_MergeRequest_PR pipeline #30272 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

dominicshanshan · 2026-03-16T02:37:36Z

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

tensorrt-cicd · 2026-03-16T02:45:13Z

PR_Github #39011 [ run ] triggered by Bot. Commit: 30f5d15 Link to invocation

tensorrt-cicd · 2026-03-16T04:03:47Z

PR_Github #39011 [ run ] completed with state SUCCESS. Commit: 30f5d15
/LLM/main/L0_MergeRequest_PR pipeline #30290 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

dominicshanshan · 2026-03-16T04:04:00Z

/bot run

tensorrt-cicd · 2026-03-16T04:09:45Z

PR_Github #39026 [ run ] triggered by Bot. Commit: 30f5d15 Link to invocation

tensorrt-cicd · 2026-03-16T06:47:32Z

PR_Github #39026 [ run ] completed with state SUCCESS. Commit: 30f5d15
/LLM/main/L0_MergeRequest_PR pipeline #30302 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

dominicshanshan · 2026-03-16T07:14:56Z

/bot run

tensorrt-cicd · 2026-03-16T07:21:18Z

PR_Github #39047 [ run ] triggered by Bot. Commit: 30f5d15 Link to invocation

tensorrt-cicd · 2026-03-16T10:34:27Z

PR_Github #39047 [ run ] completed with state SUCCESS. Commit: 30f5d15
/LLM/main/L0_MergeRequest_PR pipeline #30320 completed with status: 'SUCCESS'

CI Report

Link to invocation

…t issue. (NVIDIA#12178) Signed-off-by: Wangshanshan <[email protected]>

dominicshanshan requested a review from a team as a code owner March 13, 2026 03:07

github-actions Bot assigned dominicshanshan Mar 13, 2026

coderabbitai Bot reviewed Mar 13, 2026

Comment thread tests/integration/defs/stress_test/stress_test.py

Comment thread tests/integration/defs/stress_test/stress_test.py

jieli-matrix approved these changes Mar 13, 2026

dominicshanshan force-pushed the user/shanshan/nvbug_5955927_main branch from 9b34130 to d6b7916 Compare March 15, 2026 08:16

dominicshanshan force-pushed the user/shanshan/nvbug_5955927_main branch from 2a74e0f to 3d1ea05 Compare March 15, 2026 14:14

dominicshanshan added 8 commits March 15, 2026 19:04

Add warm up fix.

811a770

Signed-off-by: Wangshanshan <[email protected]>

Remove stress test from waive list.

90f0146

Signed-off-by: Wangshanshan <[email protected]>

Fix aiperf warm up API.

4105f12

Signed-off-by: Wangshanshan <[email protected]>

Remove warm up concurrency from aiperf 0.6 version.

dcba04d

Signed-off-by: Wangshanshan <[email protected]>

Skip aiperf warmup.

305131f

Signed-off-by: Wangshanshan <[email protected]>

Update artifacts_dir from aiperf 4.0.

90162c0

Signed-off-by: Wangshanshan <[email protected]>

Revert "Update artifacts_dir from aiperf 4.0."

29fa1e3

This reverts commit 3d1ea05. Signed-off-by: Wangshanshan <[email protected]>

Revert last file path commit and update.

30f5d15

Signed-off-by: Wangshanshan <[email protected]>

dominicshanshan force-pushed the user/shanshan/nvbug_5955927_main branch from 7e70f12 to 30f5d15 Compare March 16, 2026 02:05

dominicshanshan merged commit e2df4f4 into NVIDIA:main Mar 16, 2026
5 checks passed

thorjohnsen mentioned this pull request Apr 20, 2026

[None][chore] Fix post-merge failures #13228

Closed

1 task
