[https://nvbugs/6143599][fix] DeepSeek-V3 OOM and artifacts path by dominicshanshan · Pull Request #14232 · NVIDIA/TensorRT-LLM

dominicshanshan · 2026-05-18T01:32:21Z

Lower kv_cache_free_gpu_memory_fraction from 0.85 to 0.75 for DeepSeek-V3/R1; the previous fraction left no headroom for the transient DeepGEMM MoE workspace and OOM'd at max_batch_size=2048.
Set PYTORCH_ALLOC_CONF=expandable_segments:True for DeepSeek-V3/R1 to reduce CUDA allocator fragmentation under stress.
Add ARTIFACTS_DIR constant anchored to this file's location; pass it to aiperf via --output-artifact-dir and use it as the default reader path in extract_stress_test_metrics, so writes and reads stay aligned independent of pytest cwd.
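
The first two changes can be sketched as a model-conditional override. This is only an illustration under stated assumptions: the helper names `is_deepseek_v3_or_r1` and `deepseek_stress_overrides`, the directory-name match, and the use of 0.85 as a fallback for other models are hypothetical and are not taken from `stress_test.py`. Only the values 0.75 and `expandable_segments:True` come from the description above.

```python
import os
from typing import Dict, Tuple

# 0.75 is the new DeepSeek value from the PR description.
# 0.85 is the previous DeepSeek value, used here only as an assumed fallback.
DEEPSEEK_KV_FRACTION = 0.75
ASSUMED_DEFAULT_KV_FRACTION = 0.85


def is_deepseek_v3_or_r1(model_dir: str) -> bool:
    """Hypothetical check: match on the last component of the model path."""
    name = os.path.basename(os.path.normpath(model_dir)).lower()
    return "deepseek-v3" in name or "deepseek-r1" in name


def deepseek_stress_overrides(model_dir: str) -> Tuple[float, Dict[str, str]]:
    """Return (kv_cache_free_gpu_memory_fraction, extra environment variables)."""
    if is_deepseek_v3_or_r1(model_dir):
        # A smaller KV cache fraction leaves headroom for transient workspaces;
        # expandable segments aim to reduce allocator fragmentation.
        return DEEPSEEK_KV_FRACTION, {
            "PYTORCH_ALLOC_CONF": "expandable_segments:True"
        }
    return ASSUMED_DEFAULT_KV_FRACTION, {}
```

Returning the environment variables instead of writing to `os.environ` directly keeps the override visible to the caller, which can then decide how long it should stay in effect.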

Summary by CodeRabbit

Tests
- Improved stress test infrastructure with standardized artifact storage and retrieval.
- Enhanced DeepSeek model testing configurations for better memory management and performance optimization.

Description

Test Coverage

- Lower kv_cache_free_gpu_memory_fraction from 0.85 to 0.75 for DeepSeek-V3/R1; the previous fraction left no headroom for the transient DeepGEMM MoE workspace and OOM'd at max_batch_size=2048.
- Set PYTORCH_ALLOC_CONF=expandable_segments:True for DeepSeek-V3/R1 to reduce CUDA allocator fragmentation under stress.
- Add ARTIFACTS_DIR constant anchored to this file's location; pass it to aiperf via --output-artifact-dir and use it as the default reader path in extract_stress_test_metrics, so writes and reads stay aligned independent of pytest cwd.

Signed-off-by: Wangshanshan <[email protected]>

coderabbitai · 2026-05-18T01:37:20Z

Walkthrough

The stress test script consolidates artifact directory handling by introducing a centralized ARTIFACTS_DIR constant and updating both the aiperf command builder and metrics extraction function to reference it consistently, ensuring artifacts are written to and read from the same location.

Changes

Artifact Directory Centralization

| Layer / File(s) | Summary |
|---|---|
| Centralized artifact directory configuration `tests/integration/defs/stress_test/stress_test.py` | `ARTIFACTS_DIR` constant is defined relative to the script location and used as the `--output-artifact-dir` argument for aiperf and as the default `artifacts_dir` for metrics extraction, establishing a single source of truth for artifact paths. |
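
A minimal sketch of what anchoring the artifact directory to the script location could look like. The helper names `artifacts_dir_for` and `build_output_args` are hypothetical. Only the `--output-artifact-dir` flag and the `artifacts` subdirectory next to the script are taken from this thread.

```python
import os
from typing import List


def artifacts_dir_for(script_path: str) -> str:
    """Return the absolute path of an 'artifacts' directory next to the script."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, "artifacts")


def build_output_args(artifacts_dir: str) -> List[str]:
    """Hypothetical fragment of the aiperf argument list that sets the output dir."""
    return ["--output-artifact-dir", artifacts_dir]


# In a real module the constant would typically be evaluated once at import
# time, for example:
#   ARTIFACTS_DIR = artifacts_dir_for(__file__)
```

Because the path is derived from the script file and not from the current working directory, the writer (aiperf) and the reader (metrics extraction) resolve to the same location regardless of where pytest is invoked.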

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#12823: Adjusts DeepSeek KV cache configuration settings to prevent GPU out-of-memory errors in stress test scenarios.

Suggested reviewers

StanleySun639

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | The PR description provides a clear explanation of the issue and solution. However, the Test Coverage section is empty, which is a required template section. | Add details to the Test Coverage section listing the relevant tests that safeguard these changes. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the PR's main changes: addressing DeepSeek-V3 OOM issues and fixing artifact path alignment. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tests/integration/defs/stress_test/stress_test.py (2)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update the NVIDIA copyright year for this modified file.

This file was modified but still shows 2022-2024 in the SPDX header on Line 1.

As per coding guidelines, “Include NVIDIA copyright header on all new files; update year on modified files”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/stress_test/stress_test.py` at line 1, Update the SPDX
header at the top of stress_test.py to reflect the current copyright year by
changing "2022-2024" to "2022-2026" (i.e., update the year range in the existing
SPDX comment); ensure the header remains the same format and spelling as the
other files' NVIDIA SPDX header.
1370-1377: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update outdated docstring path description.

The docstring still describes a parent defs/artifacts default, but Line 1379 now defaults to ARTIFACTS_DIR under stress_test/artifacts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/stress_test/stress_test.py` around lines 1370 - 1377,
The docstring for the function parameter artifacts_dir is outdated (mentions
defs/artifacts parent path) — update the documentation to reflect the current
default path (ARTIFACTS_DIR under stress_test/artifacts) and any local-testing
note; specifically modify the docstring lines that describe the default
artifacts_dir and the commented local testing path so they mention ARTIFACTS_DIR
(and/or stress_test/artifacts) instead of defs/artifacts, referencing the
artifacts_dir parameter and the ARTIFACTS_DIR constant to keep them consistent.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/defs/stress_test/stress_test.py`:
- Around line 966-967: The fixed ARTIFACTS_DIR passed via the
"--output-artifact-dir" flag is causing recursive reads of historical
profile_export_aiperf.json files and contaminating Stage 2 metrics; update the
run setup to either (a) create a unique per-run artifact directory (e.g., append
a timestamp or run-id to ARTIFACTS_DIR) when constructing the
"--output-artifact-dir" value, or (b) explicitly delete or move any existing
profile_export_aiperf.json files under ARTIFACTS_DIR before starting profiling;
ensure changes touch the code that builds the CLI args containing
"--output-artifact-dir" and references ARTIFACTS_DIR, and apply the same
cleanup/unique-dir logic to the other occurrences mentioned (the other block
that passes ARTIFACTS_DIR).
- Around line 579-583: The code sets PYTORCH_ALLOC_CONF unconditionally for
DeepSeek models but never restores it, so capture the previous
os.environ.get("PYTORCH_ALLOC_CONF") before setting it, set
os.environ["PYTORCH_ALLOC_CONF"]="expandable_segments:True" only for the scope
where ServerConfig is created (the block around the if checking config.model_dir
and the subsequent test_server_config = ServerConfig(...)), and ensure you
restore the original value in a finally/cleanup step (reassign the previous
value or delete the env var if it was None) so later parametrized tests don't
inherit the override.
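
The two inline suggestions above can be sketched as follows. This is an illustration only, not code from the PR: `scoped_env` and `per_run_artifacts_dir` are hypothetical helper names, and the timestamp-plus-random-suffix naming scheme is an assumption.

```python
import contextlib
import os
import time
import uuid


@contextlib.contextmanager
def scoped_env(name: str, value: str):
    """Set an environment variable for the with-block, then restore the old state."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if previous is None:
            # The variable was unset before, so remove it again.
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


def per_run_artifacts_dir(base_dir: str) -> str:
    """Create and return a fresh subdirectory of base_dir for a single run."""
    run_id = f"{time.strftime('%Y%m%d-%H%M%S')}-{uuid.uuid4().hex[:8]}"
    run_dir = os.path.join(base_dir, run_id)
    # exist_ok=False makes an accidental directory reuse fail loudly.
    os.makedirs(run_dir, exist_ok=False)
    return run_dir
```

A caller might wrap the server setup in `with scoped_env("PYTORCH_ALLOC_CONF", "expandable_segments:True"):` so that later parametrized tests do not inherit the override, and pass the result of `per_run_artifacts_dir(ARTIFACTS_DIR)` to both the writer and the reader so that older `profile_export_aiperf.json` files are never picked up.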

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1c594b23-ffb2-4946-8b94-4de99297b36b

📥 Commits

Reviewing files that changed from the base of the PR and between 1109e1c and 17770ee.

📒 Files selected for processing (1)

tests/integration/defs/stress_test/stress_test.py

dominicshanshan · 2026-05-18T02:03:00Z

/bot run

tensorrt-cicd · 2026-05-18T02:08:20Z

PR_Github #48805 [ run ] triggered by Bot. Commit: 17770ee Link to invocation

tensorrt-cicd · 2026-05-18T04:24:27Z

PR_Github #48805 [ run ] completed with state SUCCESS. Commit: 17770ee
/LLM/main/L0_MergeRequest_PR pipeline #38568 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

dominicshanshan · 2026-05-18T07:39:45Z

/bot run

tensorrt-cicd · 2026-05-18T07:45:07Z

PR_Github #48860 [ run ] triggered by Bot. Commit: 17770ee Link to invocation

tensorrt-cicd · 2026-05-18T08:00:01Z

PR_Github #48860 [ run ] completed with state FAILURE. Commit: 17770ee
/LLM/main/L0_MergeRequest_PR pipeline #38614 completed with status: 'FAILURE'

dominicshanshan · 2026-05-18T10:34:42Z

/bot run

tensorrt-cicd · 2026-05-18T10:40:11Z

PR_Github #48895 [ run ] triggered by Bot. Commit: 17770ee Link to invocation

tensorrt-cicd · 2026-05-18T14:14:57Z

PR_Github #48895 [ run ] completed with state SUCCESS. Commit: 17770ee
/LLM/main/L0_MergeRequest_PR pipeline #38645 completed with status: 'FAILURE'

dominicshanshan · 2026-05-19T03:00:05Z

/bot run

tensorrt-cicd · 2026-05-19T03:06:02Z

PR_Github #49057 [ run ] triggered by Bot. Commit: 17770ee Link to invocation

tensorrt-cicd · 2026-05-19T09:39:39Z

PR_Github #49057 [ run ] completed with state SUCCESS. Commit: 17770ee
/LLM/main/L0_MergeRequest_PR pipeline #38788 completed with status: 'FAILURE'

dominicshanshan · 2026-05-19T10:23:43Z

/bot run

tensorrt-cicd · 2026-05-19T10:29:59Z

PR_Github #49174 [ run ] triggered by Bot. Commit: 17770ee Link to invocation

tensorrt-cicd · 2026-05-19T15:14:46Z

PR_Github #49174 [ run ] completed with state SUCCESS. Commit: 17770ee
/LLM/main/L0_MergeRequest_PR pipeline #38853 completed with status: 'FAILURE'

dominicshanshan · 2026-05-20T01:44:35Z

/bot run

tensorrt-cicd · 2026-05-20T01:50:13Z

PR_Github #49304 [ run ] triggered by Bot. Commit: 17770ee Link to invocation

tensorrt-cicd · 2026-05-20T03:02:25Z

PR_Github #49304 [ run ] completed with state SUCCESS. Commit: 17770ee
/LLM/main/L0_MergeRequest_PR pipeline #38966 completed with status: 'SUCCESS'

CI Report

Link to invocation

…DIA#14232) Signed-off-by: Wangshanshan <[email protected]>

dominicshanshan requested a review from a team as a code owner May 18, 2026 01:32

github-actions Bot assigned dominicshanshan May 18, 2026

coderabbitai Bot reviewed May 18, 2026

Comment thread tests/integration/defs/stress_test/stress_test.py

Comment thread tests/integration/defs/stress_test/stress_test.py

jieli-matrix approved these changes May 18, 2026

dominicshanshan mentioned this pull request May 18, 2026

[https://nvbugs/6143599][fix] Re-apply proven fix from commit 295615d8bf (not present in HEAD): subtract 2× pr #13915

Closed

2 tasks

dominicshanshan merged commit 3024ec6 into NVIDIA:main May 21, 2026
13 of 14 checks passed

3 participants
