Comparing changes
base repository: deepspeedai/DeepSpeed
base: v0.17.2
head repository: deepspeedai/DeepSpeed
compare: v0.17.3
- 17 commits
- 25 files changed
- 11 contributors
Commits on Jul 7, 2025
- [TiledMLP]: fix for bs>1 (#7412)
  It looks like my TiledMLP was working correctly only for batch_size=1; this fixes it to work with any bs. Thanks to @winglian for detecting the problem and sending me an easy repro. Signed-off-by: Stas Bekman <[email protected]>
  Commit: 2790220
- Commit: d6fe70e
- Enable torch version dependent compilation of record_module and iter_params (#7362)
  Dynamo breaks graphs because compile is currently disabled for a number of functions such as `iter_params` and `record_module`. These functions compile successfully on at least PyTorch 2.7.0. We enable the compilation based on the user's PyTorch version using a new `compiler.enable(min_version=None)` decorator. This should avoid the corresponding graph breaks and improve performance. Signed-off-by: Max Kovalenko <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]> Co-authored-by: Logan Adams <[email protected]>
  Commit: 8ace4da
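  As a rough illustration of the version gate described above (a hedged sketch; `enable` here is a standalone stand-in, not the actual `deepspeed.runtime.compiler` implementation):

  ```python
  # Minimal sketch of a version-gated compile decorator. The fallback to the
  # uncompiled function and the direct torch.compile call are assumptions.
  from packaging import version
  import torch

  def enable(min_version=None):
      def decorator(fn):
          if min_version is not None and version.parse(torch.__version__) < version.parse(min_version):
              return fn  # PyTorch too old: leave the function uncompiled
          return torch.compile(fn)
      return decorator

  @enable(min_version="2.7.0")
  def iter_params(module):
      return list(module.parameters())
  ```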
Commits on Jul 8, 2025
- [BUGFIX] Reset `bucket.elements` after reduction in ZeRO Stage 3 (#7418)
  Closes #7415. Resets `bucket.elements` after reduction in ZeRO Stage 3. Without this, the bucket grows indefinitely, reducing only one param at a time. Added `bucket.elements = 0` after `params_in_bucket.clear()`. Co-authored-by: a <a>
  Commit: 2722846
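  The bucket behavior can be pictured with a simplified, hypothetical reduce-bucket loop (not the actual ZeRO Stage 3 code):

  ```python
  class GradBucket:
      """Hypothetical bucket: queues grads until capacity, then reduces them all at once."""
      def __init__(self, capacity):
          self.capacity = capacity
          self.elements = 0
          self.params_in_bucket = []

      def add(self, param, reduce_fn):
          self.params_in_bucket.append(param)
          self.elements += param.numel()
          if self.elements >= self.capacity:
              reduce_fn(self.params_in_bucket)   # all-reduce the whole bucket
              self.params_in_bucket.clear()
              self.elements = 0                  # the fix: reset so the bucket refills instead of growing forever
  ```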
- Align missing argument in AllReduceCoalescedHandle (#7414)
  After a new argument (`handle_dependency`) was added to the corresponding `wait()` methods, `AllReduceCoalescedHandle` has to be aligned as well. Signed-off-by: Max Kovalenko <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]>
  Commit: ac16035
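  A hedged sketch of what the alignment means in practice (the class body is illustrative, not the real implementation):

  ```python
  class AllReduceCoalescedHandle:
      """Illustrative handle whose wait() accepts the same keyword as the other handles."""
      def __init__(self, work):
          self.work = work

      def wait(self, handle_dependency: bool = True):
          # Accepting handle_dependency keeps call sites uniform across handle types,
          # even if this particular handle has no dependency to manage.
          self.work.wait()
  ```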
Commits on Jul 9, 2025
- Improvements to Communication Logger (#7404)
  These changes add:
  - ability to return communication logs as dictionaries, rather than only printing to stdout
  - convenience helper functions for getting information about current logs
  - ability to clear existing log operations
  - additional documentation for logging operations
  These address points made in #7403. Signed-off-by: Alex Kiefer <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Hongwei Chen <[email protected]> Co-authored-by: Logan Adams <[email protected]>
  Commit: f485e13
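  The dictionary-based logging pattern could look roughly like this (a hypothetical sketch; the helper names below are not the actual DeepSpeed comms-logger API):

  ```python
  from collections import defaultdict

  class CommsLogSketch:
      def __init__(self):
          self._ops = defaultdict(list)

      def append(self, op_name, msg_size, latency_s):
          self._ops[op_name].append({"msg_size": msg_size, "latency_s": latency_s})

      def get_log_dict(self):
          # Return logs as plain dictionaries instead of printing to stdout.
          return dict(self._ops)

      def clear(self):
          # Clear existing log operations.
          self._ops.clear()
  ```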
Commits on Jul 11, 2025
- trying to fix nv-accelerate-v100.yml CI job (#7424)
  Trying a day-old accelerate, from the day before huggingface/accelerate@1ac8643. Signed-off-by: Stas Bekman <[email protected]>
  Commit: affee60
Commits on Jul 13, 2025
- fix: Propagate `strip_tensor_paddings` (#7426)
  Trying to use `DeepSpeed/deepspeed/checkpoint/ds_to_universal.py`, I encountered:

  ```python
  Traceback (most recent call last):
    File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
      r = call_item.fn(*call_item.args, **call_item.kwargs)
    File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 114, in extract_zero_shards
      sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index, tp_index=tp_index, dp_index=dp_index)
    File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 124, in get_zero_checkpoint_state
      return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
    File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 62, in get_state_for_rank
      self._strip_tensor_paddings(sd)
    File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 110, in _strip_tensor_paddings
      group_state[state_name] = torch.narrow(state_value, 0, 0, raw_length).clone()
  RuntimeError: narrow(): length must be non-negative.
  ```

  (see full traceback[^traceback] below)

  The issue is that there is no way to propagate the `strip_tensor_paddings` argument from the [`DeepSpeedCheckpoint.get_zero_checkpoint_state(...)`](https://github.com/deepspeedai/DeepSpeed/blob/affee605e47c9befd21c4c1445e11fd29d295201/deepspeed/checkpoint/deepspeed_checkpoint.py#L123) method through to the [`ZeroCheckpoint.get_state_for_rank(...)` method](https://github.com/deepspeedai/DeepSpeed/blob/affee605e47c9befd21c4c1445e11fd29d295201/deepspeed/checkpoint/zero_checkpoint.py#L53), which accepts it as an argument, because `get_zero_checkpoint_state` itself does not accept it.

  This PR adds the additional `strip_tensor_paddings` argument (default `True`) to the `DeepSpeedCheckpoint.get_zero_checkpoint_state` method and passes it through to `self.zero_checkpoint.get_state_for_rank(..., strip_tensor_paddings=strip_tensor_paddings)`, as shown below:

  ```diff
  - def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
  + def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True) -> dict:
        return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
                                                        tp_index=tp_index,
                                                        dp_index=dp_index,
  -                                                     keys_to_ignore=[PARAM_SHAPES])
  +                                                     keys_to_ignore=[PARAM_SHAPES],
  +                                                     strip_tensor_paddings=strip_tensor_paddings)
  ```

  [^traceback]: Full traceback: <details closed><summary>[Full Traceback]:</summary>

      ```bash
      ; ckpt_dir=checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash ; gs=$(cat "${ckpt_dir}/latest_checkpointed_iteration.txt") && echo "global step: ${gs}" && python3 deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py --input_folder "${ckpt_dir}/global_step${gs}" --output_folder "${ckpt_dir}/global_step${gs}_universal" --keep_temp_folder
      global step: 158945
      [... startup warnings and INFO messages elided ...]
      *** 1. Extracting ZeRO fragments
      100%|████████████████████████████████████████████████▋| 767/768 [01:29<00:00,  8.53it/s]
      concurrent.futures.process._RemoteTraceback:
      """
      [... worker traceback identical to the excerpt above ...]
      RuntimeError: narrow(): length must be non-negative.
      """
      The above exception was the direct cause of the following exception:
      Traceback (most recent call last):
        File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
          main(args)
        File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 499, in main
          _extract_zero_shard_files(args, ds_checkpoint, temp_dir)
        File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 370, in _extract_zero_shard_files
          _do_parallel_work(do_work, _3d_range_list, args.num_extract_workers)
        File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 354, in _do_parallel_work
          results.append(f.result())
        File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py", line 451, in result
          return self.__get_result()
        File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
          raise self._exception
      RuntimeError: narrow(): length must be non-negative.
      [1] 144664 exit 1
      python3 deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py --input_folder  took: 0h:02m:08s
      ```

      </details>

  Signed-off-by: Sam Foreman <[email protected]>
  Commit: a687d32
Commits on Jul 14, 2025
- Use past_key_value when provided (#7428)
  The KV cache can be passed via the `layer_past` or `past_key_value` arguments. Previously, `past_key_value` was ignored, causing workload incompatibilities. This PR fixes the issue while preserving the original logic. Signed-off-by: Max Kovalenko <[email protected]>
  Commit: 88ba24a
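  A minimal sketch of the intended precedence (hypothetical helper, assuming only the two argument names mentioned above):

  ```python
  def resolve_kv_cache(layer_past=None, past_key_value=None):
      # Prefer past_key_value when callers provide it; otherwise fall back to layer_past.
      return past_key_value if past_key_value is not None else layer_past
  ```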
Commits on Jul 16, 2025
- set `device_id` in torch's `init_process_group` (#7266)
  This PR overcomes the following issue when using any `torch.distributed` calls with DeepSpeed:

  ```
  [W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
  ```

  It does so by setting `device_id` to the correct device corresponding to the `LOCAL_RANK` env var.
  Update: discovered that `torch.dist` deadlocks with `torch>=2.7.0` when using the `device_id` arg; switching to draft for now as we can't commit this until we know how to work around it.
  Signed-off-by: Stas Bekman <[email protected]> Signed-off-by: Stas Bekman <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Stas Bekman <[email protected]>
  Commit: ee286e5
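  A minimal sketch of the idea, assuming a CUDA build and a `LOCAL_RANK` env var set by the launcher (note the update above: at the time, `device_id` interacted badly with torch>=2.7.0):

  ```python
  import os
  import torch
  import torch.distributed as dist

  local_rank = int(os.environ.get("LOCAL_RANK", "0"))
  torch.cuda.set_device(local_rank)

  # Passing device_id tells torch.distributed which GPU this rank owns, so
  # barrier() and other collectives no longer have to guess the device.
  dist.init_process_group(
      backend="nccl",
      device_id=torch.device("cuda", local_rank),
  )
  ```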
- [Ulysses-ALST] add FA3 support (#7430)
  FA3 is needed for 500K+ seqlen on llama-8b. Signed-off-by: Stas Bekman <[email protected]> Co-authored-by: Stas Bekman <[email protected]>
  Commit: d33b562
- TiledMLP + SequenceTiledCompute: improve the bs>1 use-case (#7422)
  Improved TiledMLP and SequenceTiledCompute for bs>1. This PR:
  - extends the testing utils to add `CaptureStd*`, `CaptureLogger` context managers
  - extends the test to run both bs=1 and bs=2
  - uses an uneven seqlen to test varlen shards
  - flattens the bs+seqlen dim to avoid problems with grad tensor strides when bs>1; the MLP doesn't care about the bs dimension, so a pretend `bs*seqlen` seqlen is used instead and the shape is restored at the end for the grad (see the sketch below)
  Signed-off-by: Stas Bekman <[email protected]> Co-authored-by: Logan Adams <[email protected]>
  Commit: c2bb53f
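  The flattening trick from the last bullet can be sketched as follows (illustrative only, not the actual TiledMLP code):

  ```python
  import torch

  def tiled_mlp_forward(mlp, x, num_shards):
      bs, seqlen, hidden = x.shape
      x_flat = x.reshape(bs * seqlen, hidden)                  # MLP is position-agnostic, so merge bs and seqlen
      shards = torch.tensor_split(x_flat, num_shards, dim=0)   # shards may be uneven, which is fine
      out_flat = torch.cat([mlp(shard) for shard in shards], dim=0)
      return out_flat.reshape(bs, seqlen, -1)                  # restore the original shape for the grad path
  ```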
Commits on Jul 22, 2025
- Commit: 3bf5345
Commits on Jul 23, 2025
- [ALST] fix typo in the url (#7444)
  Fixing the misspelled url. Signed-off-by: Stas Bekman <[email protected]>
  Commit: 1d10d48
- [ALST] fix typo in the url part2 (#7446)
  Oops, forgot to rename the file itself :( Continuation of #7444. Signed-off-by: Stas Bekman <[email protected]>
  Commit: 70caefe
Commits on Jul 24, 2025
- Commit: 43f00ba
Commits on Jul 26, 2025
- Fix: Adapt Llama injection policy for newer transformers versions (#7443)
  This PR fixes an `AttributeError` that occurs during `deepspeed.init_inference` when using kernel injection (`replace_with_kernel_inject=True`) with Llama models from recent versions of `transformers`.
  **The Bug:** In newer `transformers` versions (e.g., `4.53.3`), configurations like `num_heads` and `rope_theta` were moved from direct attributes of the `LlamaAttention` module into a nested `config` object. The current DeepSpeed injection policy tries to access these attributes from their old, direct location, causing the initialization to fail with an `AttributeError: 'LlamaAttention' object has no attribute 'num_heads'`.
  **The Solution:** This change updates the Llama injection logic to be more robust:
  1. It first tries to read attributes like `num_heads` from the new `config` object location.
  2. If that fails, it falls back to the legacy direct attribute path.
  Signed-off-by: huanyuqu <[email protected]>
  Commit: 092625c
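  The fallback logic can be sketched like this (illustrative helper and attribute names, not the exact DeepSpeed injection code):

  ```python
  def read_attn_attr(attn_module, config_name, legacy_name):
      # 1) try the nested config object used by newer transformers releases
      config = getattr(attn_module, "config", None)
      if config is not None and hasattr(config, config_name):
          return getattr(config, config_name)
      # 2) fall back to the legacy direct attribute on the attention module
      return getattr(attn_module, legacy_name)

  # e.g. num_heads = read_attn_attr(attention, "num_attention_heads", "num_heads")
  #      rope_theta = read_attn_attr(attention, "rope_theta", "rope_theta")
  ```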
You can view the full comparison locally by running: `git diff v0.17.2...v0.17.3`