Conversation

@yao-matrix (Contributor) commented Aug 14, 2025

Reproduce

w/ PyTorch 2.8

```
$ git clone https://github.com/huggingface/trl.git
$ cd ./trl
$ accelerate launch \
    --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft_gpt_oss.py \
    --torch_dtype bfloat16 \
    --model_name_or_path openai/gpt-oss-20b \
    --packing true --packing_strategy wrapped \
    --run_name 20b-full-eager \
    --attn_implementation sdpa \
    --dataset_num_proc 6 \
    --dataset_name HuggingFaceH4/Multilingual-Thinking \
    --gradient_checkpointing \
    --max_length 4096 \
    --per_device_train_batch_size 1 \
    --num_train_epochs 1 \
    --logging_steps 1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine_with_min_lr \
    --lr_scheduler_kwargs '{"min_lr_rate": 0.1}' \
    --output_dir gpt-oss-20b-multilingual-reasoner \
    --report_to trackio \
    --seed 42
```

Issue

```
  File "/workspace/accelerate/src/accelerate/state.py", line 216, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 854, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 164, in init_process_group
    torch.distributed.init_process_group(backend, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1685, in init_process_group
    if device_id is not None and device_id.type != "cpu":
AttributeError: 'device' object has no attribute 'type'
```

Root Cause

`torch.xpu.device` in PyTorch is a context manager for selecting the current XPU rather than a device class, so it has no `type` attribute.

Fix

Switch to `torch.device`.
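The distinction can be shown in a short sketch (assuming a recent PyTorch; `torch.device` is a plain descriptor and can be constructed even without XPU hardware present):

```python
import torch

# torch.xpu.device(0) is a context manager for switching the current XPU,
# not a device descriptor: it exposes no `.type` or `.index` attributes.
# torch.device carries both, which is what
# torch.distributed.init_process_group inspects via its `device_id` argument.
dev = torch.device("xpu", 0)
print(dev.type, dev.index)  # xpu 0
```

Returning a `torch.device` from the accelerator's `device()` method therefore satisfies the attribute checks inside `init_process_group`.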

@yao-matrix (Contributor, Author) commented:

@delock, please help review, thanks!

@sfc-gh-truwase sfc-gh-truwase requested a review from delock August 14, 2025 21:50
@delock (Collaborator) left a comment:

LGTM.

@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) August 15, 2025 16:21
@sfc-gh-truwase sfc-gh-truwase merged commit 33cd945 into deepspeedai:master Aug 15, 2025
13 checks passed
@yao-matrix yao-matrix deleted the xpu-fix branch August 15, 2025 18:11
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
delock added a commit that referenced this pull request Sep 17, 2025
## Environment
```
torch        2.7.1
torch_npu    2.7.1rc1
deepspeed    0.17.3
```
## Issue
An `AttributeError` is raised when calling `init_process_group` on an NPU device since deepspeed v0.17.3.
The issue is similar to #7488.

Trace:
```
Traceback (most recent call last):
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/cli/sft.py", line 10, in <module>
    sft_main()
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/train/sft.py", line 331, in sft_main
    return SwiftSft(args).main()
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/train/sft.py", line 27, in __init__
    super().__init__(args)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/base.py", line 19, in __init__
    self.args = self._parse_args(args)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/base.py", line 31, in _parse_args
    args, remaining_argv = parse_args(self.args_class, args)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/utils/utils.py", line 152, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 358, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 325, in __init__
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/argument/train_args.py", line 175, in __post_init__
    self.training_args = TrainerFactory.get_training_args(self)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/trainers/trainer_factory.py", line 70, in get_training_args
    return training_args_cls(**args_dict)
  File "<string>", line 167, in __init__
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/trainers/arguments.py", line 152, in __post_init__
    super().__post_init__()
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/trainers/arguments.py", line 133, in __post_init__
    super().__post_init__()
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/training_args.py", line 1803, in __post_init__
    self.device
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/training_args.py", line 2332, in device
    return self._setup_devices
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/utils/generic.py", line 74, in __get__
    cached = self.fget(obj)
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/training_args.py", line 2259, in _setup_devices
    self.distributed_state = PartialState(**accelerator_state_kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/accelerate/state.py", line 216, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 854, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/welsper/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/welsper/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 163, in init_process_group
    torch.distributed.init_process_group(backend, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1717, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1831, in _new_process_group_helper
    if device_id is not None and (device_id.index is None or device_id.type == "cpu"):
AttributeError: 'device' object has no attribute 'index'
```

## Fix
Switch `torch.npu.device(device_index)` to `torch.device('npu', device_index)`.

Now:

https://github.com/deepspeedai/DeepSpeed/blob/d40a0f5de84cf825b4e59dec041a50a1b3106989/accelerator/npu_accelerator.py#L47-L48

After fix:
```python
def device(self, device_index=None):
    return torch.device('npu', device_index)
```
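For illustration, the failing guard from `_new_process_group_helper` can be reproduced without NPU hardware using a stand-in object (the class below is hypothetical and only mimics `torch.npu.device(i)`, which, being a context manager, exposes neither `.type` nor `.index`):

```python
class FakeNpuContextManager:
    """Hypothetical stand-in for torch.npu.device(i): a context manager
    with no `.type` or `.index` attributes, unlike torch.device."""
    def __init__(self, idx):
        self.idx = idx

device_id = FakeNpuContextManager(0)
try:
    # Paraphrase of the check in torch/distributed/distributed_c10d.py
    # shown in the traceback above.
    if device_id is not None and (device_id.index is None or device_id.type == "cpu"):
        pass
except AttributeError as exc:
    print(exc)  # 'FakeNpuContextManager' object has no attribute 'index'
```

With the fix, `device_id` is a real `torch.device('npu', i)`, so both attribute lookups succeed.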

Signed-off-by: welsper <[email protected]>
Co-authored-by: welsper <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Ma, Guokai <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025