Comparing changes

deepspeedai/DeepSpeed: base v0.17.2 ... compare v0.17.3
  • 17 commits
  • 25 files changed
  • 11 contributors

Commits on Jul 7, 2025

  1. [TiledMLP]: fix for bs>1 (#7412)

    It looks like my TiledMLP was working correctly only for batch_size=1.
    
    This fixes it to work with any bs.
    
    Thanks to @winglian for detecting the problem and sending me an easy
    repro.
    
    ---------
    
    Signed-off-by: Stas Bekman <[email protected]>
    stas00 authored Jul 7, 2025
    Commit: 2790220
  2. Commit: d6fe70e
  3. Enable torch version dependent compilation of record_module and iter_params (#7362)
    
    Dynamo breaks graphs because compilation is currently disabled for a
    number of functions such as `iter_params` and `record_module`.
    
    These functions compile successfully on at least PyTorch version 2.7.0.
    
    We enable their compilation based on the user's PyTorch version using a new
    `compiler.enable(min_version=None)` decorator.
    This should avoid the corresponding graph breaks and improve
    performance.
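
    A minimal sketch of how such a version-gated decorator could work. This is an
    illustration, not the actual DeepSpeed `compiler.enable` implementation; the
    version check and the `packaging` dependency are assumptions:
    
    ```python
    # Hedged sketch of a torch-version-gated compile decorator; the real
    # deepspeed implementation may differ in detail.
    import torch
    from packaging import version
    
    
    def enable(min_version=None):
        """Return a decorator that torch.compile's a function only when the
        installed PyTorch is at least `min_version` (otherwise leaves it eager)."""
    
        def decorator(fn):
            new_enough = (min_version is None
                          or version.parse(torch.__version__) >= version.parse(min_version))
            if new_enough and hasattr(torch, "compile"):
                return torch.compile(fn)
            return fn  # too old: stay eager rather than forcing a graph break
    
        return decorator
    
    
    @enable(min_version="2.7.0")
    def iter_params(module):
        # Example target: compiled on torch>=2.7.0, runs eagerly otherwise.
        return list(module.parameters())
    ```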
    
    ---------
    
    Signed-off-by: Max Kovalenko <[email protected]>
    Co-authored-by: Masahiro Tanaka <[email protected]>
    Co-authored-by: Logan Adams <[email protected]>
    3 people authored Jul 7, 2025
    Commit: 8ace4da

Commits on Jul 8, 2025

  1. [BUGFIX] Reset bucket.elements after reduction in ZeRO Stage 3 (#7418)

    closes #7415 
    
    # Description
    Resets `bucket.elements` after reduction in ZeRO Stage 3.
    Without this, the bucket's element count grows indefinitely, so only one
    param ends up being reduced at a time.
    Added `bucket.elements = 0` after `params_in_bucket.clear()`.
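
    For illustration, a simplified sketch of the reduction path this touches.
    The names `bucket`, `params_in_bucket`, and `reduce_bucket` are stand-ins,
    not the actual ZeRO Stage 3 internals:
    
    ```python
    # Simplified stand-in for the ZeRO Stage 3 bucketing logic; names are illustrative.
    def reduce_and_reset_bucket(bucket, params_in_bucket, reduce_bucket):
        reduce_bucket(params_in_bucket)  # reduce the gradients gathered in the bucket
        params_in_bucket.clear()
        bucket.elements = 0  # the fix: reset the element count so the bucket can
                             # refill normally instead of "overflowing" on every
                             # subsequent parameter
    ```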
    
    Co-authored-by: a <a>
    rahul713rk authored Jul 8, 2025
    Commit: 2722846
  2. Align missing argument in AllReduceCoalescedHandle (#7414)

    After a new argument (`handle_dependency`) was added to the corresponding
    `wait()` methods, `AllReduceCoalescedHandle` has to be aligned too.
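
    A minimal sketch of the signature alignment. The class name and argument
    follow the description above; the body is illustrative only:
    
    ```python
    # Illustrative only: shows the aligned wait() signature, not the actual implementation.
    class AllReduceCoalescedHandle:
        def __init__(self, handles):
            self.handles = handles
    
        def wait(self, handle_dependency: bool = True) -> None:
            # Accept handle_dependency like the other wait() implementations so
            # callers can pass it uniformly to every handle type.
            for handle in self.handles:
                handle.wait()
    ```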
    
    Signed-off-by: Max Kovalenko <[email protected]>
    Co-authored-by: Logan Adams <[email protected]>
    Co-authored-by: Masahiro Tanaka <[email protected]>
    3 people authored Jul 8, 2025
    Commit: ac16035

Commits on Jul 9, 2025

  1. Improvements to Communication Logger (#7404)

    These changes add
    - ability to return communication logs as dictionaries, rather than only
    printing to stdout
    - convenience helper functions for getting information about current
    logs
    - ability to clear existing log operations
    - additional documentation for logging operations
    
    These address points made in #7403
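
    A hedged usage sketch of the dictionary-style logging described above. The
    `comms_logger` attribute path and the helper names `get_all_logs()` and
    `clear_logs()` are hypothetical placeholders; the PR's actual function names
    may differ:
    
    ```python
    # Hypothetical usage sketch; helper names are placeholders, not the PR's actual API.
    import deepspeed.comm as dist
    
    # ... run training with comms logging enabled in the DeepSpeed config ...
    
    logs = dist.comms_logger.get_all_logs()   # hypothetical: logs returned as a dict
    for op_name, records in logs.items():     # instead of only printed to stdout
        print(op_name, records)
    
    dist.comms_logger.clear_logs()            # hypothetical: drop recorded operations
    ```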
    
    ---------
    
    Signed-off-by: Alex Kiefer <[email protected]>
    Co-authored-by: Masahiro Tanaka <[email protected]>
    Co-authored-by: Olatunji Ruwase <[email protected]>
    Co-authored-by: Hongwei Chen <[email protected]>
    Co-authored-by: Logan Adams <[email protected]>
    5 people authored Jul 9, 2025
    Commit: f485e13

Commits on Jul 11, 2025

  1. trying to fix nv-accelerate-v100.yml CI job (#7424)

    Trying an accelerate commit from the day before:
    huggingface/accelerate@1ac8643
    
    ---------
    
    Signed-off-by: Stas Bekman <[email protected]>
    stas00 authored Jul 11, 2025
    Commit: affee60

Commits on Jul 13, 2025

  1. fix: Propagate strip_tensor_paddings (#7426)

    While trying to use `DeepSpeed/deepspeed/checkpoint/ds_to_universal.py`,
    I encountered:
    
    
    ```python
    Traceback (most recent call last):
      File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
        r = call_item.fn(*call_item.args, **call_item.kwargs)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 114, in extract_zero_shards
        sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index, tp_index=tp_index, dp_index=dp_index)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 124, in get_zero_checkpoint_state
        return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 62, in get_state_for_rank
        self._strip_tensor_paddings(sd)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 110, in _strip_tensor_paddings
        group_state[state_name] = torch.narrow(state_value, 0, 0, raw_length).clone()
    RuntimeError: narrow(): length must be non-negative.
    ```
    
    (see full traceback[^traceback] below)
    
    
    The issue is that there's no way to propagate the `strip_tensor_paddings`
    argument from the
    [`DeepSpeedCheckpoint.get_zero_checkpoint_state(...)`](https://github.com/deepspeedai/DeepSpeed/blob/affee605e47c9befd21c4c1445e11fd29d295201/deepspeed/checkpoint/deepspeed_checkpoint.py#L123)
    method through to the [`ZeroCheckpoint.get_state_for_rank(...)`
    method](https://github.com/deepspeedai/DeepSpeed/blob/affee605e47c9befd21c4c1445e11fd29d295201/deepspeed/checkpoint/zero_checkpoint.py#L53),
    which does accept it, because `get_zero_checkpoint_state` itself does not.
    
    This PR adds a `strip_tensor_paddings` argument (default `True`) to the
    `DeepSpeedCheckpoint.get_zero_checkpoint_state` method and passes it
    through to `self.zero_checkpoint.get_state_for_rank(...,
    strip_tensor_paddings=strip_tensor_paddings)`, as shown below:
    
    ```diff
    -    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
    +    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True) -> dict:
            return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
                                                           tp_index=tp_index,
                                                           dp_index=dp_index,
    -                                                       keys_to_ignore=[PARAM_SHAPES])
    +                                                       keys_to_ignore=[PARAM_SHAPES],
    +                                                       strip_tensor_paddings=strip_tensor_paddings)
    ```
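
    With this change, a caller (such as `extract_zero_shards` in the traceback
    above, assuming a `ds_checkpoint` instance as shown there) can opt out of
    padding stripping, for example:
    
    ```python
    # Illustrative call using the new argument added by this PR.
    sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index,
                                                 tp_index=tp_index,
                                                 dp_index=dp_index,
                                                 strip_tensor_paddings=False)
    ```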
    
    [^traceback]: Full traceback:
    
    	<details closed><summary>[Full Traceback]:</summary>
    	
    ```bash
    #[🐍 aurora_nre_models_frameworks-2025.0.0](👻 aurora_nre_models_frameworks-2025.0.0)
    #[/f/A/C/f/p/a/Megatron-DeepSpeed][🌱 saforem2/fix-formatting][✓]
    #[07/12/25 @ 16:07:12][x4209c2s4b0n0]
    ; ckpt_dir=checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash
    ; gs=$(cat "${ckpt_dir}/latest_checkpointed_iteration.txt") && echo "global step: ${gs}" && \
        python3 deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py \
        --input_folder "${ckpt_dir}/global_step${gs}" \
        --output_folder "${ckpt_dir}/global_step${gs}_universal" \
        --keep_temp_folder
    global step: 158945
    [W712 16:07:17.966425018 OperatorEntry.cpp:155] Warning: Warning only
    once for all operators, other operators may also be overridden.
    Overriding a previously registered kernel for the same operator and the
    same dispatch key
    operator: aten::_cummax_helper(Tensor self, Tensor(a!) values,
    Tensor(b!) indices, int dim) -> ()
    registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
    	  dispatch key: XPU
    previous kernel: registered at
    /build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
    new kernel: registered at
    /build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971
    (function operator())
    
    /opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/intel_extension_for_pytorch/nn/utils/_weight_prepack.py:6:
    UserWarning: pkg_resources is deprecated as an API. See
    https://setuptools.pypa.io/en/latest/pkg_resources.html. The
    pkg_resources package is slated for removal as early as 2025-11-30.
    Refrain from using this package or pin to Setuptools<81.
    	  import pkg_resources
    	AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    	AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    	AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    	AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    	AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
    [2025-07-12 16:07:27,740] [INFO]
    [real_accelerator.py:254:get_accelerator] Setting ds_accelerator to xpu
    (auto detect)
    [2025-07-12 16:07:29,078] [INFO] [logging.py:107:log_dist] [Rank -1]
    [TorchCheckpointEngine] Initialized with serialization = False
    args =
    Namespace(input_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945',
    output_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal',
    num_extract_workers=4, num_merge_workers=2, keep_temp_folder=True,
    strict=True, inject_missing_state=False)
    	Convert DeepSpeed Checkpoint to Universal Checkpoint
    Converting DeepSpeed checkpoint in
    checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945
    to Universal checkpoint in
    checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal
    
    /lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:290:
    FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated.
    Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
    	  def forward(
    
    /lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:334:
    FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated.
    Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
    	  def backward(ctx, grad_output):
    [2025-07-12 16:07:39,134079][I][ezpz/__init__:264:ezpz] Setting logging
    level to 'INFO' on 'RANK == 0'
    [2025-07-12 16:07:39,136376][I][ezpz/__init__:265:ezpz] Setting logging
    level to 'CRITICAL' on all others 'RANK != 0'
    	*** 1. Extracting ZeRO fragments
    
    100%|██████████████████████████████████████▋| 767/768 [01:29<00:00, 8.53it/s]
    concurrent.futures.process._RemoteTraceback:
    """
    Traceback (most recent call last):
      File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
        r = call_item.fn(*call_item.args, **call_item.kwargs)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 114, in extract_zero_shards
        sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index, tp_index=tp_index, dp_index=dp_index)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 124, in get_zero_checkpoint_state
        return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 62, in get_state_for_rank
        self._strip_tensor_paddings(sd)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 110, in _strip_tensor_paddings
        group_state[state_name] = torch.narrow(state_value, 0, 0, raw_length).clone()
    RuntimeError: narrow(): length must be non-negative.
    """
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
        main(args)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 499, in main
        _extract_zero_shard_files(args, ds_checkpoint, temp_dir)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 370, in _extract_zero_shard_files
        _do_parallel_work(do_work, _3d_range_list, args.num_extract_workers)
      File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 354, in _do_parallel_work
        results.append(f.result())
      File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py", line 451, in result
        return self.__get_result()
      File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
        raise self._exception
    RuntimeError: narrow(): length must be non-negative.
    [1] 144664 exit 1 python3 deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py --input_folder
    	took: 0h:02m:08s
    	```
    	
    	</details>
    
    Signed-off-by: Sam Foreman <[email protected]>
    saforem2 authored Jul 13, 2025
    Commit: a687d32

Commits on Jul 14, 2025

  1. Use past_key_value when provided (#7428)

    The KV cache can be passed via `layer_past` or `past_key_value`
    arguments. Previously, `past_key_value` was ignored, causing workload
    incompatibilities.
    
    This PR fixes the issue while preserving the original logic.
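
    A simplified sketch of the fallback (not the exact DeepSpeed inference code):
    
    ```python
    # Simplified sketch: prefer layer_past, but no longer ignore past_key_value.
    def _resolve_kv_cache(layer_past=None, past_key_value=None):
        if layer_past is not None:
            return layer_past     # original behavior preserved
        return past_key_value     # previously dropped; now used when provided
    ```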
    
    ---------
    
    Signed-off-by: Max Kovalenko <[email protected]>
    deepcharm authored Jul 14, 2025
    Commit: 88ba24a

Commits on Jul 16, 2025

  1. set device_id in torch's init_process_group (#7266)

    This PR overcomes the following issue when using any `torch.distributed`
    calls with DeepSpeed:
    ```
    [W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 
    to perform barrier as devices used by this process are currently unknown. This can
     potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in
     barrier() to force use of a particular device, or call init_process_group() with a device_id.
    ```
    by setting `device_id` to the device corresponding to the `LOCAL_RANK`
    env var.
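
    For illustration, the plain `torch.distributed` equivalent of the idea;
    DeepSpeed wires this up inside its own init path, so this standalone snippet
    is not the PR's code:
    
    ```python
    # Standalone illustration: pass a device_id derived from LOCAL_RANK so torch
    # knows the rank-to-GPU mapping up front and barrier() stops guessing.
    # (Run under torchrun so RANK/WORLD_SIZE/LOCAL_RANK env vars are set.)
    import os
    import torch
    import torch.distributed as dist
    
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl",
                            device_id=torch.device(f"cuda:{local_rank}"))
    ```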
    
    -------------------
    
    Update: discovered `torch.dist` deadlocks with `torch>=2.7.0` when using
    the `device_id` arg - switching to draft for now as we can't commit this
    until we know how to work around this.
    
    ---------
    
    Signed-off-by: Stas Bekman <[email protected]>
    Signed-off-by: Stas Bekman <[email protected]>
    Co-authored-by: Olatunji Ruwase <[email protected]>
    Co-authored-by: Logan Adams <[email protected]>
    Co-authored-by: Stas Bekman <[email protected]>
    4 people authored Jul 16, 2025
    Commit: ee286e5
  2. [Ulysses-ALST] add FA3 support (#7430)

    FA3 is needed for 500K+ seqlen on llama-8b.
    
    Signed-off-by: Stas Bekman <[email protected]>
    Co-authored-by: Stas Bekman <[email protected]>
    stas00 and sfc-gh-sbekman authored Jul 16, 2025
    Commit: d33b562
  3. TiledMLP + SequenceTiledCompute: improve the bs>1 use-case (#7422)

    Improved TiledMLP and SequenceTiledCompute for bs>1
    
    This PR:
    - extends the testing utils to add `CaptureStd*` and `CaptureLogger`
    context managers
    - extends the test to run with both bs=1 and bs=2
    - uses an uneven seqlen to test varlen shards
    - flattens the bs and seqlen dims to avoid problems with grad tensor strides
    when bs>1 - the MLP doesn't care about the bs dimension, so a pretend
    seqlen of `bs*seqlen` is used instead and the grad's shape is restored at
    the end (see the sketch below)
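
    A shape-only sketch of the flattening idea (illustrative, not the actual
    TiledMLP code):
    
    ```python
    # Shape-only illustration: fold the batch dim into the sequence dim so tiling
    # sees a single "pretend" seqlen of bs*seqlen, then restore the shape.
    import torch
    
    def tiled_mlp_forward(x: torch.Tensor, mlp, num_shards: int) -> torch.Tensor:
        bs, seqlen, hidden = x.shape
        x_flat = x.reshape(1, bs * seqlen, hidden)       # pretend seqlen = bs*seqlen
        shards = torch.chunk(x_flat, num_shards, dim=1)  # shards may be uneven
        out = torch.cat([mlp(shard) for shard in shards], dim=1)
        return out.reshape(bs, seqlen, hidden)           # restore the original shape
    ```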
    
    ---------
    
    Signed-off-by: Stas Bekman <[email protected]>
    Co-authored-by: Logan Adams <[email protected]>
    stas00 and loadams authored Jul 16, 2025
    Commit: c2bb53f

Commits on Jul 22, 2025

  1. Commit: 3bf5345

Commits on Jul 23, 2025

  1. [ALST] fix typo in the url (#7444)

    fixing the misspelled url
    
    ---------
    
    Signed-off-by: Stas Bekman <[email protected]>
    stas00 authored Jul 23, 2025
    Commit: 1d10d48
  2. [ALST] fix typo in the url part2 (#7446)

    oops, forgot to rename the file itself :( continuation of
    #7444
    
    ---------
    
    Signed-off-by: Stas Bekman <[email protected]>
    stas00 authored Jul 23, 2025
    Commit: 70caefe

Commits on Jul 24, 2025

  1. Commit: 43f00ba

Commits on Jul 26, 2025

  1. Fix: Adapt Llama injection policy for newer transformers versions (#7443)
    
    This PR fixes an `AttributeError` that occurs during
    `deepspeed.init_inference` when using kernel injection
    (`replace_with_kernel_inject=True`) with Llama models from recent
    versions of `transformers`.
    
    **The Bug:**
    
    In newer `transformers` versions (e.g., `4.53.3`), configurations like
    `num_heads` and `rope_theta` were moved from direct attributes of the
    `LlamaAttention` module into a nested `config` object.
    
    The current DeepSpeed injection policy tries to access these attributes
    from their old, direct location, causing the initialization to fail with
    an `AttributeError: 'LlamaAttention' object has no attribute
    'num_heads'`.
    
    **The Solution:**
    
    This change updates the Llama injection logic to be more robust:
    1. It first tries to read attributes like `num_heads` from the new
    `config` object location.
    2. If that fails, it falls back to the legacy direct attribute path.
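
    A minimal sketch of that fallback (attribute names per the description
    above; the actual policy code differs):
    
    ```python
    # Illustrative fallback: try the nested config location used by newer
    # transformers first, then fall back to the legacy direct attribute.
    def _get_num_heads(attn_module):
        config = getattr(attn_module, "config", None)
        if config is not None and hasattr(config, "num_attention_heads"):
            return config.num_attention_heads
        return attn_module.num_heads  # legacy transformers layout
    ```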
    
    ---------
    
    Signed-off-by: huanyuqu <[email protected]>
    huanyuqu authored Jul 26, 2025
    Commit: 092625c