
Conversation

stas00 (Collaborator) commented Apr 30, 2025

This PR overcomes the following issue when using any `torch.distributed` calls with DeepSpeed:

[W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 
to perform barrier as devices used by this process are currently unknown. This can
 potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in
 barrier() to force use of a particular device, or call init_process_group() with a device_id.

by setting `device_id` to the correct device corresponding to the `LOCAL_RANK` env var.
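For illustration, a minimal standalone sketch of the idea (not the PR's actual code; it assumes a launcher that sets the usual `RANK`/`WORLD_SIZE`/`LOCAL_RANK`/`MASTER_ADDR`/`MASTER_PORT` env vars):

```
import os

import torch
import torch.distributed as dist

# Hypothetical standalone sketch: bind the process group to this rank's GPU
# up front so NCCL no longer has to guess the rank -> device mapping.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
    # Passing device_id silences the warning quoted above
    # (the argument requires a torch version that supports it).
    device_id=torch.device("cuda", local_rank),
)
```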


Update: discovered `torch.dist` deadlocks with torch>=2.7.0 when using the `device_id` arg - switching to draft for now, as we can't commit this until we know how to work around it.

@stas00 stas00 requested a review from GuanhuaWang as a code owner April 30, 2025 19:09
@stas00 stas00 requested review from loadams and removed request for GuanhuaWang April 30, 2025 19:09
stas00 (Collaborator, Author) commented May 6, 2025

@loadams?

loadams (Collaborator) commented May 7, 2025

> @loadams?

Sorry @stas00, I missed this and will review today.

  rank=rank,
- world_size=world_size)
+ world_size=world_size,
+ device_id=torch.device('cuda', local_rank))
Collaborator

@stas00 - the cuda here will cause failures on non-cuda backends like HPU (not sure why the tests didn't run, but ran manually here: https://github.com/deepspeedai/DeepSpeed/actions/runs/14886572284/job/41807642413)

stas00 (Collaborator, Author) commented May 7, 2025

aha, thank you so much for seeing the big picture, @loadams

so we need something like:

`device = torch.device(f'cuda:{local_rank}' if torch.cuda.is_available() else 'cpu')`

or should I just add `device_id` only if `torch.cuda.is_available()` and do nothing otherwise? I mean, I don't know what device to use in the case of HPU, if it's not `cpu`.
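A rough sketch of that second option, for reference (hypothetical, not code from the PR; it simply leaves `device_id` unset on non-CUDA backends):

```
import os

import torch

# Hypothetical sketch: only CUDA gets an explicit device_id;
# HPU/CPU backends simply keep the old behavior (no device_id).
init_kwargs = {}
if torch.cuda.is_available():
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    init_kwargs["device_id"] = torch.device("cuda", local_rank)

# torch.distributed.init_process_group(backend=..., **init_kwargs)
```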

Collaborator

Could we use get_accelerator() here?

Collaborator Author

whatever works - could you please show what you have in mind specifically for filling out:

`device_id=torch.device('cuda', local_rank)`?

Collaborator

I think what is needed is `get_accelerator().device(local_rank)`.

For cuda this maps to `torch.cuda.device(device_index)`

@stas00, does that work?
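A hedged sketch of how a backend-agnostic `device_id` could be built via the accelerator abstraction (assuming `get_accelerator().device_name()` returns the backend name such as 'cuda' or 'hpu'; this is not necessarily the code that was ultimately merged):

```
import os

import torch
from deepspeed.accelerator import get_accelerator

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Use the accelerator abstraction instead of hard-coding 'cuda' so that
# HPU/XPU/etc. builds get a matching torch.device for device_id.
device_id = torch.device(get_accelerator().device_name(), local_rank)

# device_id would then be passed through to torch.distributed.init_process_group(...)
```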

Collaborator Author

Nope, still randomly deadlocks:

Thread 474480 (idle): "MainThread"
    broadcast (torch/distributed/distributed_c10d.py:2772)
    wrapper (torch/distributed/c10d_logger.py:81)
    broadcast (deepspeed/comm/torch.py:216)
    broadcast (deepspeed/comm/comm.py:224)
    log_wrapper (deepspeed/comm/comm.py:117)
    _zero_init_param (deepspeed/runtime/zero/partition_parameters.py:1054)
    _post_init_method (deepspeed/runtime/zero/partition_parameters.py:1099)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:521)
    __init__ (transformers/models/llama/modeling_llama.py:166)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    __init__ (transformers/models/llama/modeling_llama.py:297)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    <listcomp> (transformers/models/llama/modeling_llama.py:477)
    __init__ (transformers/models/llama/modeling_llama.py:477)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    __init__ (transformers/models/llama/modeling_llama.py:740)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    from_pretrained (transformers/modeling_utils.py:4340)
    _wrapper (transformers/modeling_utils.py:279)
    from_pretrained (transformers/models/auto/auto_factory.py:571)
    from_pretrained (liger_kernel/transformers/auto_model.py:38)
    create_model (arctic_training/model/liger_factory.py:45)
    wrapper (arctic_training/callback/mixin.py:45)
    __call__ (arctic_training/model/factory.py:68)
    __init__ (arctic_training/trainer/trainer.py:228)
    wrapper (arctic_training/callback/mixin.py:45)
    run_script (arctic_training/cli.py:108)
    <module> (arctic_training_run:8)
Thread 476034 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Process 474481: /usr/bin/python -u /home/yak/.local/bin/arctic_training_run --local_rank=6 --mode train --config run-dp1-sp8.yml
Python v3.10.12 (/usr/bin/python3.10)

Thread 474481 (active): "MainThread"
    wrapped_fn (deepspeed/runtime/zero/partition_parameters.py:240)
    _compute_default_rope_parameters (transformers/modeling_rope_utils.py:130)
    __init__ (transformers/models/llama/modeling_llama.py:106)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    __init__ (transformers/models/llama/modeling_llama.py:480)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    __init__ (transformers/models/llama/modeling_llama.py:740)
    wrapper (deepspeed/runtime/zero/partition_parameters.py:511)
    from_pretrained (transformers/modeling_utils.py:4340)
    _wrapper (transformers/modeling_utils.py:279)
    from_pretrained (transformers/models/auto/auto_factory.py:571)
    from_pretrained (liger_kernel/transformers/auto_model.py:38)
    create_model (arctic_training/model/liger_factory.py:45)
    wrapper (arctic_training/callback/mixin.py:45)
    __call__ (arctic_training/model/factory.py:68)
    __init__ (arctic_training/trainer/trainer.py:228)
    wrapper (arctic_training/callback/mixin.py:45)
    run_script (arctic_training/cli.py:108)
    <module> (arctic_training_run:8)
Thread 476031 (idle): "Thread-1"
    wait (threading.py:324)
    wait (threading.py:607)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Process 474482: /usr/bin/python -u /home/yak/.local/bin/arctic_training_run --local_rank=7 --mode train --config run-dp1-sp8.yml
Python v3.10.12 (/usr/bin/python3.10)

Collaborator Author

so definitely let's not merge this - asking pytorch folks for help.

Collaborator

Good point. Yes, let's align with HF Transformers. Thanks.

@loadams can you help with this?

@sfc-gh-truwase @stas00 - yes, I think we should do something like this. Min 2.1 would be good. Agreed 2.3 might be a bit rushed, but let me check what cuda/GPU versions that implies as well.

Collaborator Author

Further investigation shows the deadlocks start at torch>=2.7.0. It's difficult to debug since the deadlocks aren't always reproducible, but they usually appear within 3-6 re-runs.

Collaborator Author

So switching to draft for now as we can't commit this until we know how to work around this. I'm actively pursuing this with pytorch devs.

A seemingly related issue is: "modded-nanogpt flaky NCCL hang starting 3/30 nightly".

stas00 added 2 commits May 15, 2025 11:10
Signed-off-by: Stas Bekman <[email protected]>
@stas00 stas00 marked this pull request as draft May 15, 2025 22:15
@stas00 stas00 marked this pull request as ready for review July 16, 2025 00:37
stas00 (Collaborator, Author) commented Jul 16, 2025

OK, so now we know setting `device_id` leads to hanging in 2.6.0 < torch < 2.7.1: pytorch/pytorch#153960

so the PR was adapted to that.
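For illustration, a hedged sketch of that kind of version gate (an assumption about the approach, not necessarily the exact check in the merged code; `packaging` is assumed to be available):

```
import os

import torch
from packaging import version


def dist_init_extra_kwargs():
    """Only pass device_id outside the torch range known to deadlock with it
    (2.6.0 < torch < 2.7.1, see pytorch/pytorch#153960)."""
    kwargs = {}
    torch_ver = version.parse(torch.__version__.split("+")[0])
    affected = version.parse("2.6.0") < torch_ver < version.parse("2.7.1")
    if torch.cuda.is_available() and not affected:
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        kwargs["device_id"] = torch.device("cuda", local_rank)
    return kwargs


# torch.distributed.init_process_group(backend="nccl", **dist_init_extra_kwargs())
```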

@loadams loadams merged commit ee286e5 into master Jul 16, 2025
10 of 11 checks passed
@loadams loadams deleted the stas00-dist-init-device-id branch July 16, 2025 15:32
lpnpcs pushed a commit to lpnpcs/DeepSpeed that referenced this pull request Jul 30, 2025
sfc-gh-truwase pushed a commit that referenced this pull request Aug 19, 2025
[PR #7266](#7266) enforces that devices have explicit device indices (e.g. 'hpu:0', 'cuda:0', etc.).

This PR aligns HPU devices with that requirement.

Signed-off-by: Max Kovalenko <[email protected]>
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025

LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025