[https://nvbugs/5955927][fix] Add warm up before aiperf to fix timeout issue.#12178
Conversation
📝 WalkthroughWalkthroughThis change introduces inference warmup logic to the stress testing framework before benchmarking, adds request timeout handling throughout the aiperf command construction, and enhances process execution with improved stdout/stderr capture and error reporting. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/integration/defs/stress_test/stress_test.py`:
- Around line 978-989: Replace the unbounded lists stdout_lines and stderr_lines
with bounded deques (e.g., collections.deque(maxlen=50)) so only the last N
aiperf log lines are retained; update the imports to include deque and keep
using stdout_lock/stderr_lock and the _capture_and_print function unchanged
except to append to the deque instead of a list to prevent unbounded memory
growth (also apply the same change to the other buffer instances around the
_capture_and_print usage).
- Around line 352-355: The warmup currently uses default num_warmup_requests and
timeout (2×300s) and only logs a warning on False, so update the call sites (the
warmup_inference invocation in the stress test and the similar call around lines
796-803) to compute an explicit warmup deadline from the current stage
timeout/budget and pass that as the timeout argument to warmup_inference, then
immediately abort/raise/return (fail fast) if warmup_inference returns False;
locate the function warmup_inference and the caller run_aiperf_process (and the
other caller around 796-803) to make the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 650625cc-f140-4a3c-bb4f-16a13f4e17d0
📒 Files selected for processing (1)
tests/integration/defs/stress_test/stress_test.py
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" |
|
PR_Github #38821 [ run ] triggered by Bot. Commit: |
|
PR_Github #38821 [ run ] completed with state |
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" |
|
PR_Github #38848 [ run ] triggered by Bot. Commit: |
|
PR_Github #38848 [ run ] completed with state
|
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test |
|
PR_Github #38859 [ run ] triggered by Bot. Commit: |
|
PR_Github #38859 [ run ] completed with state
|
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test |
|
PR_Github #38872 [ run ] triggered by Bot. Commit: |
|
PR_Github #38872 [ run ] completed with state
|
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test |
|
PR_Github #38930 [ run ] triggered by Bot. Commit: |
|
PR_Github #38930 [ run ] completed with state |
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test |
|
PR_Github #38944 [ run ] triggered by Bot. Commit: |
|
PR_Github #38944 [ run ] completed with state |
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test |
|
PR_Github #38964 [ run ] triggered by Bot. Commit: |
|
PR_Github #38964 [ run ] completed with state
|
9b34130 to
d6b7916
Compare
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test |
1 similar comment
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test |
|
PR_Github #38970 [ run ] triggered by Bot. Commit: |
|
PR_Github #38975 [ run ] triggered by Bot. Commit: |
|
PR_Github #38975 [ run ] completed with state |
2a74e0f to
3d1ea05
Compare
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test |
|
PR_Github #38992 [ run ] triggered by Bot. Commit: |
|
PR_Github #38992 [ run ] completed with state
|
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
This reverts commit 3d1ea05. Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
7e70f12 to
30f5d15
Compare
|
/bot run --stage-list "A10-PyTorch-Post-Merge-1" --disable-reuse-test |
|
PR_Github #39011 [ run ] triggered by Bot. Commit: |
|
PR_Github #39011 [ run ] completed with state |
|
/bot run |
|
PR_Github #39026 [ run ] triggered by Bot. Commit: |
|
PR_Github #39026 [ run ] completed with state
|
|
/bot run |
|
PR_Github #39047 [ run ] triggered by Bot. Commit: |
|
PR_Github #39047 [ run ] completed with state |
…t issue. (NVIDIA#12178) Signed-off-by: Wangshanshan <[email protected]>
…t issue. (NVIDIA#12178) Signed-off-by: Wangshanshan <[email protected]>
…t issue. (NVIDIA#12178) Signed-off-by: Wangshanshan <[email protected]>
…t issue. (NVIDIA#12178) Signed-off-by: Wangshanshan <[email protected]>
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.