
Conversation

@saforem2 (Collaborator)

Trying to use `DeepSpeed/deepspeed/checkpoint/ds_to_universal.py`, I encountered:

```python
Traceback (most recent call last):
  File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 114, in extract_zero_shards
    sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index, tp_index=tp_index, dp_index=dp_index)
  File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 124, in get_zero_checkpoint_state
    return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
  File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 62, in get_state_for_rank
    self._strip_tensor_paddings(sd)
  File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 110, in _strip_tensor_paddings
    group_state[state_name] = torch.narrow(state_value, 0, 0, raw_length).clone()
RuntimeError: narrow(): length must be non-negative.
```

(see the full traceback in the footnote below)
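
For context, the failing call is `torch.narrow(state_value, 0, 0, raw_length)` inside `ZeroCheckpoint._strip_tensor_paddings`, which errors whenever the computed `raw_length` comes out negative. Here is a minimal sketch of the failure mode; the tensor and the negative length below are hypothetical, just to reproduce the error:

```python
import torch

state_value = torch.zeros(8)  # stand-in for a padded optimizer-state tensor

torch.narrow(state_value, 0, 0, 4)   # OK: view of the first 4 elements
torch.narrow(state_value, 0, 0, -4)  # RuntimeError: narrow(): length must be non-negative.
```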

The issue is that there is no way to propagate the `strip_tensor_paddings` argument from the [`DeepSpeedCheckpoint.get_zero_checkpoint_state(...)`](https://github.com/deepspeedai/DeepSpeed/blob/affee605e47c9befd21c4c1445e11fd29d295201/deepspeed/checkpoint/deepspeed_checkpoint.py#L123) method through to the [`ZeroCheckpoint.get_state_for_rank(...)` method](https://github.com/deepspeedai/DeepSpeed/blob/affee605e47c9befd21c4c1445e11fd29d295201/deepspeed/checkpoint/zero_checkpoint.py#L53): the latter accepts it as an argument, but the former does not, so callers like `ds_to_universal.py` can never control the padding stripping.

This PR adds a `strip_tensor_paddings` argument (default `True`) to the `DeepSpeedCheckpoint.get_zero_checkpoint_state` method and passes it through to `self.zero_checkpoint.get_state_for_rank(..., strip_tensor_paddings=strip_tensor_paddings)`, as shown below:

```diff
-    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
+    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True) -> dict:
        return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
                                                       tp_index=tp_index,
                                                       dp_index=dp_index,
-                                                       keys_to_ignore=[PARAM_SHAPES])
+                                                       keys_to_ignore=[PARAM_SHAPES],
+                                                       strip_tensor_paddings=strip_tensor_paddings)
```
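
With this change, a caller can forward (or disable) the padding stripping explicitly. A hypothetical call site, mirroring the one in `extract_zero_shards` from the traceback above (`ds_checkpoint` and the index variables are assumed to already be in scope):

```python
# Default (strip_tensor_paddings=True) matches the previous behavior exactly;
# passing False now skips ZeroCheckpoint._strip_tensor_paddings entirely.
sd = ds_checkpoint.get_zero_checkpoint_state(
    pp_index=pp_index,
    tp_index=tp_index,
    dp_index=dp_index,
    strip_tensor_paddings=False,
)
```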

Footnotes

  1. Full traceback:

    ```bash
    #[🐍 aurora_nre_models_frameworks-2025.0.0](👻 aurora_nre_models_frameworks-2025.0.0)
    #[/f/A/C/f/p/a/Megatron-DeepSpeed][🌱 saforem2/fix-formatting][✓]
    #[07/12/25 @ 16:07:12][x4209c2s4b0n0]
    ; ckpt_dir=checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash ; gs=$(cat "${ckpt_dir}/latest_checkpointed_iteration.txt") && echo "global step: ${gs}" && python3 deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py --input_folder "${ckpt_dir}/global_step${gs}" --output_folder "${ckpt_dir}/global_step${gs}_universal" --keep_temp_folder
    global step: 158945
    [W712 16:07:17.966425018 OperatorEntry.cpp:155] Warning: Warning only once for all operators,  other operators may also be overridden.
      Overriding a previously registered kernel for the same operator and the same dispatch key
      operator: aten::_cummax_helper(Tensor self, Tensor(a!) values, Tensor(b!) indices, int dim) -> ()
        registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
      dispatch key: XPU
      previous kernel: registered at /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
           new kernel: registered at /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971 (function operator())
    /opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/intel_extension_for_pytorch/nn/utils/_weight_prepack.py:6: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
      import pkg_resources
    AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    [2025-07-12 16:07:27,740] [INFO] [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to xpu (auto detect)
    [2025-07-12 16:07:29,078] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
    args = Namespace(input_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945', output_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=True, strict=True, inject_missing_state=False)
    Convert DeepSpeed Checkpoint to Universal Checkpoint
    Converting DeepSpeed checkpoint in checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945 to Universal checkpoint in checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal
    /lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:290: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
      def forward(
    /lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:334: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
      def backward(ctx, grad_output):
    [2025-07-12 16:07:39,134079][I][ezpz/__init__:264:ezpz] Setting logging level to 'INFO' on 'RANK == 0'
    [2025-07-12 16:07:39,136376][I][ezpz/__init__:265:ezpz] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
    *** 1. Extracting ZeRO fragments
    100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋| 767/768 [01:29<00:00,  8.53it/s]
    concurrent.futures.process._RemoteTraceback:
    """
    Traceback (most recent call last):
      File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
        r = call_item.fn(*call_item.args, **call_item.kwargs)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 114, in extract_zero_shards
        sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index, tp_index=tp_index, dp_index=dp_index)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 124, in get_zero_checkpoint_state
        return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 62, in get_state_for_rank
        self._strip_tensor_paddings(sd)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 110, in _strip_tensor_paddings
        group_state[state_name] = torch.narrow(state_value, 0, 0, raw_length).clone()
    RuntimeError: narrow(): length must be non-negative.
    """
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
        main(args)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 499, in main
        _extract_zero_shard_files(args, ds_checkpoint, temp_dir)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 370, in _extract_zero_shard_files
        _do_parallel_work(do_work, _3d_range_list, args.num_extract_workers)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 354, in _do_parallel_work
        results.append(f.result())
      File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py", line 451, in result
        return self.__get_result()
      File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
        raise self._exception
    RuntimeError: narrow(): length must be non-negative.
    [1]    144664 exit 1     python3 deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py --input_folder
    took: 0h:02m:08s
    ```
    

@saforem2 requested a review from @tjruwase as a code owner on July 12, 2025 22:13
@saforem2 force-pushed the saforem2/strip-tensor-paddings branch from 8170f4a to d60dbe9 on July 12, 2025 22:44
@sfc-gh-truwase (Collaborator)

Thanks @saforem2!

@xylian86, FYI. It seems this was missing in your earlier PR.

@sfc-gh-truwase merged commit a687d32 into master on Jul 13, 2025 (9 checks passed)
@sfc-gh-truwase deleted the saforem2/strip-tensor-paddings branch on July 13, 2025 01:22
lpnpcs pushed a commit to lpnpcs/DeepSpeed that referenced this pull request Jul 30, 2025
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025