[https://nvbugs/5955927][fix] Add warm up before aiperf to fix timeout issue. #12178

Merged
dominicshanshan merged 8 commits into NVIDIA:main from dominicshanshan:user/shanshan/nvbug_5955927_main
Mar 16, 2026

Conversation

@dominicshanshan
Collaborator

@dominicshanshan dominicshanshan commented Mar 13, 2026

Summary by CodeRabbit

Release Notes

  • Tests
    • Improved stress testing infrastructure with inference pipeline warmup for consistent benchmark conditions.
    • Added configurable per-request timeout support for more flexible performance testing scenarios.
    • Enhanced error diagnostics with detailed failure reporting and output capture.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 13, 2026

📝 Walkthrough

This change introduces inference warmup logic to the stress testing framework before benchmarking, adds request timeout handling throughout the aiperf command construction, and enhances process execution with improved stdout/stderr capture and error reporting.
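
A minimal sketch of the warmup idea described above, not the code from this PR. The function name `warmup_inference` and the parameters `num_warmup_requests` and `timeout` appear in the review discussion; the `send_request` callable, the logger name, and the exact signature are assumptions made for illustration. The sketch sends a few requests before the benchmark starts and returns `False` instead of raising, so the caller can log the failure and continue.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("stress_test_sketch")  # hypothetical logger name


def warmup_inference(send_request: Callable[[float], None],
                     num_warmup_requests: int = 2,
                     timeout: float = 300.0) -> bool:
    """Send a few requests before benchmarking; return True if all succeed.

    `send_request` is a hypothetical callable that takes a per-request
    timeout in seconds and raises an exception on failure.
    """
    for i in range(num_warmup_requests):
        start = time.monotonic()
        try:
            send_request(timeout)
        except Exception as exc:  # broad on purpose: any failure ends warmup
            logger.warning("Warmup request %d/%d failed: %s",
                           i + 1, num_warmup_requests, exc)
            return False
        logger.info("Warmup request %d/%d finished in %.2fs",
                    i + 1, num_warmup_requests, time.monotonic() - start)
    return True
```

A caller that wants the graceful behaviour would check the return value, log a warning on `False`, and proceed to the benchmark anyway.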

Changes

| Cohort / File(s) | Summary |
|---|---|
| Stress Test Framework Enhancement: `tests/integration/defs/stress_test/stress_test.py` | Added new `warmup_inference()` function to warm up the inference pipeline before benchmarking. Extended `create_aiperf_command()` signature to include `request_timeout_seconds` parameter (default 120.0 seconds). Enhanced `run_aiperf_process()` with thread-based stdout/stderr streaming, output aggregation for error reporting, and improved cleanup in a `finally` block. Reworked output filtering logic to support both filtered and non-filtered modes. Integrated warmup flow into stress test execution with failure logging that continues gracefully. Updated documentation strings and adjusted health-check sequencing. |
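
The thread-based stdout/stderr streaming mentioned in the summary could look roughly like the sketch below. It is an illustration under stated assumptions, not the body of `run_aiperf_process()`: the helper name `run_and_capture`, its signature, and its timeout handling are hypothetical. One reader thread per pipe echoes each line as it arrives and stores it for later error reporting, and the threads are joined in a `finally` block.

```python
import subprocess
import sys
import threading
from typing import List, Tuple


def run_and_capture(cmd: List[str],
                    timeout: float = 60.0) -> Tuple[int, List[str], List[str]]:
    """Run `cmd`, echo its output live, and return (returncode, stdout, stderr)."""
    stdout_lines: List[str] = []
    stderr_lines: List[str] = []

    def _pump(stream, sink, echo):
        # Read until EOF so the child never blocks on a full pipe.
        for line in iter(stream.readline, ""):
            sink.append(line.rstrip("\n"))
            echo.write(line)
        stream.close()

    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    threads = [
        threading.Thread(target=_pump,
                         args=(proc.stdout, stdout_lines, sys.stdout),
                         daemon=True),
        threading.Thread(target=_pump,
                         args=(proc.stderr, stderr_lines, sys.stderr),
                         daemon=True),
    ]
    for t in threads:
        t.start()
    try:
        returncode = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # give up on the child once the budget is spent
        returncode = proc.wait()
    finally:
        for t in threads:
            t.join(timeout=5)  # let the readers drain remaining output
    return returncode, stdout_lines, stderr_lines
```

Each list here is written by exactly one thread, so the sketch does not need locks; code that shares a buffer between readers would need them.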

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is entirely empty: all key sections (Description, Test Coverage) are blank, and the checklist is incomplete with only a generic checkbox marked. | Provide a clear description of the issue being fixed, explain the warmup solution, document test coverage for the new `warmup_inference` function, and complete the PR checklist items relevant to this change. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly describes the main change: adding a warm-up step before aiperf to resolve a timeout issue, which aligns with the core objective of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/stress_test/stress_test.py`:
- Around line 978-989: Replace the unbounded lists stdout_lines and stderr_lines
with bounded deques (e.g., collections.deque(maxlen=50)) so only the last N
aiperf log lines are retained; update the imports to include deque and keep
using stdout_lock/stderr_lock and the _capture_and_print function unchanged
except to append to the deque instead of a list to prevent unbounded memory
growth (also apply the same change to the other buffer instances around the
_capture_and_print usage).
- Around line 352-355: The warmup currently uses default num_warmup_requests and
timeout (2×300s) and only logs a warning on False, so update the call sites (the
warmup_inference invocation in the stress test and the similar call around lines
796-803) to compute an explicit warmup deadline from the current stage
timeout/budget and pass that as the timeout argument to warmup_inference, then
immediately abort/raise/return (fail fast) if warmup_inference returns False;
locate the function warmup_inference and the caller run_aiperf_process (and the
other caller around 796-803) to make the change.
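
Both review suggestions above can be sketched in a few lines. This is an illustration of the reviewer's proposal, not code from the PR: `make_tail_buffer`, `warmup_with_deadline`, the `send_request` callable, and the `clock` parameter are hypothetical names, and the limit of 50 lines is taken from the example in the comment. The first helper keeps only the last N log lines with `collections.deque(maxlen=...)`; the second derives a single deadline from the stage budget and raises as soon as warmup fails or the budget runs out.

```python
import time
from collections import deque
from typing import Callable, Deque


def make_tail_buffer(max_lines: int = 50) -> Deque[str]:
    """Return a buffer that silently drops the oldest line once full."""
    return deque(maxlen=max_lines)


def warmup_with_deadline(send_request: Callable[[float], None],
                         num_warmup_requests: int,
                         stage_timeout: float,
                         clock: Callable[[], float] = time.monotonic) -> None:
    """Run warmup requests inside one shared time budget; raise on failure.

    `send_request` is a hypothetical callable that receives the remaining
    budget in seconds as its timeout and raises if the request fails.
    """
    deadline = clock() + stage_timeout
    for i in range(num_warmup_requests):
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(
                f"warmup budget of {stage_timeout}s exhausted before "
                f"request {i + 1}/{num_warmup_requests}")
        try:
            send_request(remaining)
        except Exception as exc:
            # Fail fast instead of logging a warning and continuing.
            raise RuntimeError(
                f"warmup request {i + 1}/{num_warmup_requests} failed"
            ) from exc
```

Passing the remaining budget to every request means the total warmup time cannot exceed the stage timeout, whatever the number of requests.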

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 650625cc-f140-4a3c-bb4f-16a13f4e17d0

📥 Commits

Reviewing files that changed from the base of the PR and between e226930 and 558382f.

📒 Files selected for processing (1)
  • tests/integration/defs/stress_test/stress_test.py

Comment thread tests/integration/defs/stress_test/stress_test.py
Comment thread tests/integration/defs/stress_test/stress_test.py
@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #38821 [ run ] triggered by Bot. Commit: 558382f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38821 [ run ] completed with state SUCCESS. Commit: 558382f
/LLM/main/L0_MergeRequest_PR pipeline #30133 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1"

@tensorrt-cicd
Collaborator

PR_Github #38848 [ run ] triggered by Bot. Commit: 5a090d8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38848 [ run ] completed with state SUCCESS. Commit: 5a090d8
/LLM/main/L0_MergeRequest_PR pipeline #30158 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #38859 [ run ] triggered by Bot. Commit: 5a090d8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38859 [ run ] completed with state SUCCESS. Commit: 5a090d8
/LLM/main/L0_MergeRequest_PR pipeline #30169 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #38872 [ run ] triggered by Bot. Commit: 8724354 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38872 [ run ] completed with state SUCCESS. Commit: 8724354
/LLM/main/L0_MergeRequest_PR pipeline #30182 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #38930 [ run ] triggered by Bot. Commit: 0c855f3 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38930 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #38944 [ run ] triggered by Bot. Commit: 0c855f3 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38944 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/14.

Link to invocation

@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #38964 [ run ] triggered by Bot. Commit: 0c855f3 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38964 [ run ] completed with state SUCCESS. Commit: 0c855f3
/LLM/main/L0_MergeRequest_PR pipeline #30246 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dominicshanshan dominicshanshan force-pushed the user/shanshan/nvbug_5955927_main branch from 9b34130 to d6b7916 Compare March 15, 2026 08:16
@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

1 similar comment
@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #38970 [ run ] triggered by Bot. Commit: d6b7916 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38975 [ run ] triggered by Bot. Commit: d6b7916 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38975 [ run ] completed with state SUCCESS. Commit: d6b7916
/LLM/main/L0_MergeRequest_PR pipeline #30257 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@dominicshanshan dominicshanshan force-pushed the user/shanshan/nvbug_5955927_main branch from 2a74e0f to 3d1ea05 Compare March 15, 2026 14:14
@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #38992 [ run ] triggered by Bot. Commit: 3d1ea05 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38992 [ run ] completed with state SUCCESS. Commit: 3d1ea05
/LLM/main/L0_MergeRequest_PR pipeline #30272 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dominicshanshan dominicshanshan force-pushed the user/shanshan/nvbug_5955927_main branch from 7e70f12 to 30f5d15 Compare March 16, 2026 02:05
@dominicshanshan
Collaborator Author

/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #39011 [ run ] triggered by Bot. Commit: 30f5d15 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39011 [ run ] completed with state SUCCESS. Commit: 30f5d15
/LLM/main/L0_MergeRequest_PR pipeline #30290 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@dominicshanshan
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39026 [ run ] triggered by Bot. Commit: 30f5d15 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39026 [ run ] completed with state SUCCESS. Commit: 30f5d15
/LLM/main/L0_MergeRequest_PR pipeline #30302 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dominicshanshan
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39047 [ run ] triggered by Bot. Commit: 30f5d15 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39047 [ run ] completed with state SUCCESS. Commit: 30f5d15
/LLM/main/L0_MergeRequest_PR pipeline #30320 completed with status: 'SUCCESS'

CI Report

Link to invocation

@dominicshanshan dominicshanshan merged commit e2df4f4 into NVIDIA:main Mar 16, 2026
5 checks passed
3 participants