[None][fix] Pass CUDA_VISIBLE_DEVICES as script arg instead of srun --export by qiaoxj07 · Pull Request #12370 · NVIDIA/TensorRT-LLM

qiaoxj07 · 2026-03-20T02:08:03Z

Summary

srun --export cannot reliably pass comma-separated values (like CUDA_VISIBLE_DEVICES=0,1) inside shared pyxis containers. The shared container retains the environment from its initial creation, ignoring subsequent --export overrides.
This caused all disagg benchmark workers to see all GPUs (e.g., 0,1,2,3) instead of their assigned subset, leading to GPU OOM during model load when multiple servers share the same node.
Fix: pass CUDA_VISIBLE_DEVICES as a command-line argument to start_worker.sh and export it inside the script, bypassing srun --export entirely for this variable.

Test plan

Run disagg benchmark with multi-GPU EP config (e.g., DEP2 gen + 2x CTX) on a single node and verify each worker sees only its assigned GPUs
Verify model loads successfully without GPU OOM
Verify single-GPU CTX workers still get correct CUDA_VISIBLE_DEVICES

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

Refactor
- Adjusted GPU device visibility configuration in SLURM benchmark workflow; worker startup script now accepts explicit device specification argument for enhanced environment setup consistency.

…-export srun --export cannot reliably pass comma-separated values (like CUDA_VISIBLE_DEVICES=0,1) inside shared pyxis containers. The shared container retains the environment from its initial creation, ignoring subsequent --export overrides. This caused all workers to see all GPUs instead of their assigned subset, leading to GPU OOM during model load. Fix: pass CUDA_VISIBLE_DEVICES as a command-line argument to start_worker.sh and export it inside the script, bypassing srun --export entirely for this variable. Signed-off-by: Xianjie <[email protected]>

qiaoxj07 · 2026-03-20T02:10:45Z

/bot skip --comment "slurm scripts are not protected by CI"

coderabbitai · 2026-03-20T02:12:04Z

📝 Walkthrough

Walkthrough

The changes relocate GPU device visibility configuration from the Python submission script to the shell worker startup script, with submit.py now computing cuda_devices and passing it as an argument to start_worker.sh, which sets CUDA_VISIBLE_DEVICES accordingly.

Changes

Cohort / File(s)	Summary
SLURM Benchmark GPU Configuration `examples/disaggregated/slurm/benchmark/start_worker.sh`, `examples/disaggregated/slurm/benchmark/submit.py`	Shifted `CUDA_VISIBLE_DEVICES` setup from Python script to shell script. `start_worker.sh` now accepts `cuda_devices` as 9th argument and exports it. `submit.py` removes `gpu_ids` parameter from `build_worker_environment()`, computes `cuda_devices` string per worker, and passes it to `start_worker.sh` invocation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: moving CUDA_VISIBLE_DEVICES from srun --export to a script argument, which directly addresses the core fix.
Description check	✅ Passed	The PR description provides clear context (problem, impact, fix) and test coverage, though it lacks explicit sections matching the template's structure, the content covers the essential information adequately.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

CodeRabbit can enforce grammar and style rules using `languagetool`.

Configure the reviews.tools.languagetool setting to enable/disable rules and categories. Refer to the LanguageTool Community to learn more.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/disaggregated/slurm/benchmark/submit.py`:
- Line 520: The current code builds a single cuda_devices string from gpu_ids of
the first node (variable gpu_ids -> cuda_devices) which can misconfigure
multi-node, heterogeneous allocations; change the logic to compute
CUDA_VISIBLE_DEVICES per node using each node's local GPU id list (iterate over
the allocation's nodes or use the existing structure that contains per-node GPU
ids), and ensure wherever cuda_devices is injected into the job environment or
srun/ssh launch it uses the node-specific string (e.g., construct cuda_devices
inside the per-node launch loop or map keyed by node hostname) instead of
reusing the first-node value.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 998bd6fd-4329-4462-b881-069677676d08

📥 Commits

Reviewing files that changed from the base of the PR and between 297fa20 and b968c8c.

📒 Files selected for processing (2)

examples/disaggregated/slurm/benchmark/start_worker.sh
examples/disaggregated/slurm/benchmark/submit.py

tensorrt-cicd · 2026-03-20T02:16:40Z

PR_Github #39667 [ skip ] triggered by Bot. Commit: b968c8c Link to invocation

tensorrt-cicd · 2026-03-20T02:25:46Z

PR_Github #39667 [ skip ] completed with state SUCCESS. Commit: b968c8c
Skipping testing for commit b968c8c

Link to invocation

…-export (NVIDIA#12370) Signed-off-by: Xianjie <[email protected]>

qiaoxj07 requested review from a team as code owners March 20, 2026 02:08

qiaoxj07 requested review from QiJune, Tabrizian, laikhtewari and schetlur-nv March 20, 2026 02:08

github-actions Bot assigned qiaoxj07 Mar 20, 2026

qiaoxj07 requested a review from dc3671 March 20, 2026 02:09

coderabbitai Bot reviewed Mar 20, 2026

View reviewed changes

Comment thread examples/disaggregated/slurm/benchmark/submit.py

dc3671 approved these changes Mar 20, 2026

View reviewed changes

Shixiaowei02 approved these changes Mar 20, 2026

View reviewed changes

qiaoxj07 merged commit 14544c0 into NVIDIA:main Mar 20, 2026
8 checks passed

longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026

[None][fix] Pass CUDA_VISIBLE_DEVICES as script arg instead of srun -…

682e584

…-export (NVIDIA#12370) Signed-off-by: Xianjie <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[None][fix] Pass CUDA_VISIBLE_DEVICES as script arg instead of srun --export#12370

[None][fix] Pass CUDA_VISIBLE_DEVICES as script arg instead of srun --export#12370
qiaoxj07 merged 1 commit into
NVIDIA:mainfrom
qiaoxj07:fix/disagg-cuda-visible-devices

qiaoxj07 commented Mar 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

qiaoxj07 commented Mar 20, 2026

Uh oh!

coderabbitai Bot commented Mar 20, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

tensorrt-cicd commented Mar 20, 2026

Uh oh!

tensorrt-cicd commented Mar 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

qiaoxj07 commented Mar 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

qiaoxj07 commented Mar 20, 2026

Uh oh!

coderabbitai Bot commented Mar 20, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tensorrt-cicd commented Mar 20, 2026

Uh oh!

tensorrt-cicd commented Mar 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

qiaoxj07 commented Mar 20, 2026 •

edited by coderabbitai Bot

Loading