[None][fix] Pass CUDA_VISIBLE_DEVICES as script arg instead of srun --export#12370
Conversation
…-export srun --export cannot reliably pass comma-separated values (like CUDA_VISIBLE_DEVICES=0,1) inside shared pyxis containers. The shared container retains the environment from its initial creation, ignoring subsequent --export overrides. This caused all workers to see all GPUs instead of their assigned subset, leading to GPU OOM during model load. Fix: pass CUDA_VISIBLE_DEVICES as a command-line argument to start_worker.sh and export it inside the script, bypassing srun --export entirely for this variable. Signed-off-by: Xianjie <[email protected]>
|
/bot skip --comment "slurm scripts are not protected by CI" |
📝 WalkthroughWalkthroughThe changes relocate GPU device visibility configuration from the Python submission script to the shell worker startup script, with Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment Tip CodeRabbit can enforce grammar and style rules using `languagetool`.Configure the |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/disaggregated/slurm/benchmark/submit.py`:
- Line 520: The current code builds a single cuda_devices string from gpu_ids of
the first node (variable gpu_ids -> cuda_devices) which can misconfigure
multi-node, heterogeneous allocations; change the logic to compute
CUDA_VISIBLE_DEVICES per node using each node's local GPU id list (iterate over
the allocation's nodes or use the existing structure that contains per-node GPU
ids), and ensure wherever cuda_devices is injected into the job environment or
srun/ssh launch it uses the node-specific string (e.g., construct cuda_devices
inside the per-node launch loop or map keyed by node hostname) instead of
reusing the first-node value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 998bd6fd-4329-4462-b881-069677676d08
📒 Files selected for processing (2)
examples/disaggregated/slurm/benchmark/start_worker.shexamples/disaggregated/slurm/benchmark/submit.py
|
PR_Github #39667 [ skip ] triggered by Bot. Commit: |
|
PR_Github #39667 [ skip ] completed with state |
…-export (NVIDIA#12370) Signed-off-by: Xianjie <[email protected]>
Summary
--exportcannot reliably pass comma-separated values (likeCUDA_VISIBLE_DEVICES=0,1) inside shared pyxis containers. The shared container retains the environment from its initial creation, ignoring subsequent--exportoverrides.0,1,2,3) instead of their assigned subset, leading to GPU OOM during model load when multiple servers share the same node.CUDA_VISIBLE_DEVICESas a command-line argument tostart_worker.shandexportit inside the script, bypassing srun--exportentirely for this variable.Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes