Skip to content

[None][fix] Pass CUDA_VISIBLE_DEVICES as script arg instead of srun --export#12370

Merged
qiaoxj07 merged 1 commit into
NVIDIA:mainfrom
qiaoxj07:fix/disagg-cuda-visible-devices
Mar 20, 2026
Merged

[None][fix] Pass CUDA_VISIBLE_DEVICES as script arg instead of srun --export#12370
qiaoxj07 merged 1 commit into
NVIDIA:mainfrom
qiaoxj07:fix/disagg-cuda-visible-devices

Conversation

@qiaoxj07
Copy link
Copy Markdown
Collaborator

@qiaoxj07 qiaoxj07 commented Mar 20, 2026

Summary

  • srun --export cannot reliably pass comma-separated values (like CUDA_VISIBLE_DEVICES=0,1) inside shared pyxis containers. The shared container retains the environment from its initial creation, ignoring subsequent --export overrides.
  • This caused all disagg benchmark workers to see all GPUs (e.g., 0,1,2,3) instead of their assigned subset, leading to GPU OOM during model load when multiple servers share the same node.
  • Fix: pass CUDA_VISIBLE_DEVICES as a command-line argument to start_worker.sh and export it inside the script, bypassing srun --export entirely for this variable.

Test plan

  • Run disagg benchmark with multi-GPU EP config (e.g., DEP2 gen + 2x CTX) on a single node and verify each worker sees only its assigned GPUs
  • Verify model loads successfully without GPU OOM
  • Verify single-GPU CTX workers still get correct CUDA_VISIBLE_DEVICES

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • Refactor
    • Adjusted GPU device visibility configuration in SLURM benchmark workflow; worker startup script now accepts explicit device specification argument for enhanced environment setup consistency.

…-export

srun --export cannot reliably pass comma-separated values (like
CUDA_VISIBLE_DEVICES=0,1) inside shared pyxis containers. The shared
container retains the environment from its initial creation, ignoring
subsequent --export overrides. This caused all workers to see all GPUs
instead of their assigned subset, leading to GPU OOM during model load.

Fix: pass CUDA_VISIBLE_DEVICES as a command-line argument to
start_worker.sh and export it inside the script, bypassing srun
--export entirely for this variable.

Signed-off-by: Xianjie <[email protected]>
@qiaoxj07
Copy link
Copy Markdown
Collaborator Author

/bot skip --comment "slurm scripts are not protected by CI"

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

📝 Walkthrough

Walkthrough

The changes relocate GPU device visibility configuration from the Python submission script to the shell worker startup script, with submit.py now computing cuda_devices and passing it as an argument to start_worker.sh, which sets CUDA_VISIBLE_DEVICES accordingly.

Changes

Cohort / File(s) Summary
SLURM Benchmark GPU Configuration
examples/disaggregated/slurm/benchmark/start_worker.sh, examples/disaggregated/slurm/benchmark/submit.py
Shifted CUDA_VISIBLE_DEVICES setup from Python script to shell script. start_worker.sh now accepts cuda_devices as 9th argument and exports it. submit.py removes gpu_ids parameter from build_worker_environment(), computes cuda_devices string per worker, and passes it to start_worker.sh invocation.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: moving CUDA_VISIBLE_DEVICES from srun --export to a script argument, which directly addresses the core fix.
Description check ✅ Passed The PR description provides clear context (problem, impact, fix) and test coverage, though it lacks explicit sections matching the template's structure, the content covers the essential information adequately.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can enforce grammar and style rules using `languagetool`.

Configure the reviews.tools.languagetool setting to enable/disable rules and categories. Refer to the LanguageTool Community to learn more.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/disaggregated/slurm/benchmark/submit.py`:
- Line 520: The current code builds a single cuda_devices string from gpu_ids of
the first node (variable gpu_ids -> cuda_devices) which can misconfigure
multi-node, heterogeneous allocations; change the logic to compute
CUDA_VISIBLE_DEVICES per node using each node's local GPU id list (iterate over
the allocation's nodes or use the existing structure that contains per-node GPU
ids), and ensure wherever cuda_devices is injected into the job environment or
srun/ssh launch it uses the node-specific string (e.g., construct cuda_devices
inside the per-node launch loop or map keyed by node hostname) instead of
reusing the first-node value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 998bd6fd-4329-4462-b881-069677676d08

📥 Commits

Reviewing files that changed from the base of the PR and between 297fa20 and b968c8c.

📒 Files selected for processing (2)
  • examples/disaggregated/slurm/benchmark/start_worker.sh
  • examples/disaggregated/slurm/benchmark/submit.py

Comment thread examples/disaggregated/slurm/benchmark/submit.py
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39667 [ skip ] triggered by Bot. Commit: b968c8c Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39667 [ skip ] completed with state SUCCESS. Commit: b968c8c
Skipping testing for commit b968c8c

Link to invocation

@qiaoxj07 qiaoxj07 merged commit 14544c0 into NVIDIA:main Mar 20, 2026
8 checks passed
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants