fix: Propagate strip_tensor_paddings
#7426
Merged
Conversation
Signed-off-by: Sam Foreman <[email protected]>
8170f4a to d60dbe9
sfc-gh-truwase (Collaborator) approved these changes Jul 13, 2025
lpnpcs pushed a commit to lpnpcs/DeepSpeed that referenced this pull request Jul 30, 2025
Trying to use `DeepSpeed/deepspeed/checkpoint/ds_to_universal.py`, I encountered:
```python
Traceback (most recent call last):
File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 114, in extract_zero_shards
sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index, tp_index=tp_index, dp_index=dp_index)
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 124, in get_zero_checkpoint_state
return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 62, in get_state_for_rank
self._strip_tensor_paddings(sd)
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 110, in _strip_tensor_paddings
group_state[state_name] = torch.narrow(state_value, 0, 0, raw_length).clone()
RuntimeError: narrow(): length must be non-negative.
```
(see full traceback[^traceback] below)
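For context, `torch.narrow(input, dim, start, length)` rejects any negative `length` outright, so when the `raw_length` computed from the checkpoint's padding metadata underflows, the call fails exactly as shown above. A torch-free sketch of that contract (this `narrow` is a hypothetical 1-D stand-in for illustration, not the real PyTorch implementation):

```python
def narrow(values, dim, start, length):
    # Hypothetical 1-D stand-in for torch.narrow (`dim` is ignored here):
    # return `length` elements starting at `start`. Like the real op, a
    # negative length is an error, not an "until the end" shorthand.
    if length < 0:
        raise RuntimeError("narrow(): length must be non-negative.")
    return values[start:start + length]

print(narrow([1, 2, 3, 4, 5], 0, 0, 3))   # → [1, 2, 3]

try:
    narrow([1, 2, 3, 4, 5], 0, 0, -1)      # a negative raw_length
except RuntimeError as err:
    print(err)                              # → narrow(): length must be non-negative.
```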
The issue is that there is no way to propagate the `strip_tensor_paddings`
argument from the
[`DeepSpeedCheckpoint.get_zero_checkpoint_state(...)`](https://github.com/deepspeedai/DeepSpeed/blob/affee605e47c9befd21c4c1445e11fd29d295201/deepspeed/checkpoint/deepspeed_checkpoint.py#L123)
method through to the [`ZeroCheckpoint.get_state_for_rank(...)`
method](https://github.com/deepspeedai/DeepSpeed/blob/affee605e47c9befd21c4c1445e11fd29d295201/deepspeed/checkpoint/zero_checkpoint.py#L53)
(which accepts it as an argument), because `get_zero_checkpoint_state` does not accept it.
This PR adds a `strip_tensor_paddings` argument (default `True`) to the
`DeepSpeedCheckpoint.get_zero_checkpoint_state` method and passes it through to
`self.zero_checkpoint.get_state_for_rank(..., strip_tensor_paddings=strip_tensor_paddings)`,
as shown below:
```diff
- def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
+ def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True) -> dict:
return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
tp_index=tp_index,
dp_index=dp_index,
- keys_to_ignore=[PARAM_SHAPES])
+ keys_to_ignore=[PARAM_SHAPES],
+ strip_tensor_paddings=strip_tensor_paddings)
```
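The forwarding pattern in the diff can be sketched with stand-in classes (hypothetical simplifications of DeepSpeed's `DeepSpeedCheckpoint` and `ZeroCheckpoint`, not the real implementations) to show how the keyword now reaches the inner call instead of being silently dropped:

```python
class ZeroCheckpoint:
    # Stand-in for deepspeed.checkpoint.zero_checkpoint.ZeroCheckpoint:
    # get_state_for_rank already accepted strip_tensor_paddings.
    def get_state_for_rank(self, pp_index, tp_index, dp_index,
                           keys_to_ignore=(), strip_tensor_paddings=True):
        # A real implementation would load the rank's state dict and only
        # call _strip_tensor_paddings() when the flag is True.
        return {"strip_tensor_paddings": strip_tensor_paddings}


class DeepSpeedCheckpoint:
    # Stand-in wrapper: with the fix, the keyword is forwarded rather
    # than being fixed to the inner default.
    def __init__(self):
        self.zero_checkpoint = ZeroCheckpoint()

    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index,
                                  strip_tensor_paddings: bool = True) -> dict:
        return self.zero_checkpoint.get_state_for_rank(
            pp_index=pp_index,
            tp_index=tp_index,
            dp_index=dp_index,
            keys_to_ignore=["param_shapes"],
            strip_tensor_paddings=strip_tensor_paddings)


ckpt = DeepSpeedCheckpoint()
# Callers that hit the narrow() failure can now opt out of stripping:
print(ckpt.get_zero_checkpoint_state(0, 0, 0, strip_tensor_paddings=False))
```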
[^traceback]: Full traceback:
<details closed><summary>[Full Traceback]:</summary>
```bash
#[🐍 aurora_nre_models_frameworks-2025.0.0](👻
aurora_nre_models_frameworks-2025.0.0)
#[/f/A/C/f/p/a/Megatron-DeepSpeed][🌱 saforem2/fix-formatting][✓]
#[07/12/25 @ 16:07:12][x4209c2s4b0n0]
;
ckpt_dir=checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash
; gs=$(cat "${ckpt_dir}/latest_checkpointed_iteration.txt") && echo
"global step: ${gs}" && python3
deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py
--input_folder "${ckpt_dir}/global_step${gs}" --output_folder
"${ckpt_dir}/global_step${gs}_universal" --keep_temp_folder
global step: 158945
[W712 16:07:17.966425018 OperatorEntry.cpp:155] Warning: Warning only
once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the
same dispatch key
operator: aten::_cummax_helper(Tensor self, Tensor(a!) values,
Tensor(b!) indices, int dim) -> ()
registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at
/build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
new kernel: registered at
/build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971
(function operator())
/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/intel_extension_for_pytorch/nn/utils/_weight_prepack.py:6:
UserWarning: pkg_resources is deprecated as an API. See
https://setuptools.pypa.io/en/latest/pkg_resources.html. The
pkg_resources package is slated for removal as early as 2025-11-30.
Refrain from using this package or pin to Setuptools<81.
import pkg_resources
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
[2025-07-12 16:07:27,740] [INFO]
[real_accelerator.py:254:get_accelerator] Setting ds_accelerator to xpu
(auto detect)
[2025-07-12 16:07:29,078] [INFO] [logging.py:107:log_dist] [Rank -1]
[TorchCheckpointEngine] Initialized with serialization = False
args =
Namespace(input_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945',
output_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal',
num_extract_workers=4, num_merge_workers=2, keep_temp_folder=True,
strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in
checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945
to Universal checkpoint in
checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal
/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:290:
FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated.
Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(
/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:334:
FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated.
Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
[2025-07-12 16:07:39,134079][I][ezpz/__init__:264:ezpz] Setting logging
level to 'INFO' on 'RANK == 0'
[2025-07-12 16:07:39,136376][I][ezpz/__init__:265:ezpz] Setting logging
level to 'CRITICAL' on all others 'RANK != 0'
*** 1. Extracting ZeRO fragments
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋|
767/768 [01:29<00:00, 8.53it/s]
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File
"/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py",
line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 114, in extract_zero_shards
sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index,
tp_index=tp_index, dp_index=dp_index)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py",
line 124, in get_zero_checkpoint_state
return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py",
line 62, in get_state_for_rank
self._strip_tensor_paddings(sd)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py",
line 110, in _strip_tensor_paddings
group_state[state_name] = torch.narrow(state_value, 0, 0,
raw_length).clone()
RuntimeError: narrow(): length must be non-negative.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 549, in <module>
main(args)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 499, in main
_extract_zero_shard_files(args, ds_checkpoint, temp_dir)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 370, in _extract_zero_shard_files
_do_parallel_work(do_work, _3d_range_list, args.num_extract_workers)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 354, in _do_parallel_work
results.append(f.result())
File
"/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py",
line 451, in result
return self.__get_result()
File
"/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py",
line 403, in __get_result
raise self._exception
RuntimeError: narrow(): length must be non-negative.
[1] 144664 exit 1 python3
deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py --input_folder
took: 0h:02m:08s
```
</details>
Signed-off-by: Sam Foreman <[email protected]>
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
Signed-off-by: lym <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025