Conversation

@LYMDLUT
Contributor

@LYMDLUT LYMDLUT commented Jul 8, 2025

Resolves: #7251

@LYMDLUT LYMDLUT requested review from tjruwase and tohtana as code owners July 8, 2025 16:15
@tohtana
Collaborator

tohtana commented Jul 9, 2025

Hi @LYMDLUT, thank you for the great work!

Does this code actually work for ZeRO1 as well? If it doesn't, can we keep the assertions and just change the condition?
We have unit tests for this feature. Can you add stage 2 (and stage 1, if it works) to the test arguments?

@LYMDLUT LYMDLUT closed this Jul 9, 2025
@LYMDLUT LYMDLUT reopened this Jul 9, 2025
@LYMDLUT LYMDLUT requested a review from loadams as a code owner July 9, 2025 20:25
@LYMDLUT
Contributor Author

LYMDLUT commented Jul 9, 2025

Hi @LYMDLUT, thank you for the great work!

Does this code actually work for ZeRO1 as well? If it doesn't, can we keep the assertions and just change the condition? We have unit tests for this feature. Can you add stage 2 (and stage 1, if it works) to the test arguments?

Thank you for your review. I have added unit tests for ZeRO1 and ZeRO2. However, I am not very familiar with DeepSpeed internals, so could you please review the code carefully and offer further guidance? I will do my best to cooperate with you.

@LYMDLUT
Contributor Author

LYMDLUT commented Jul 11, 2025

Looking forward to your guidance.

@tohtana
Collaborator

tohtana commented Jul 17, 2025

Hi @LYMDLUT,
Sorry for my late response and thank you for adding the test!

  1. This actually works for ZeRO1 when I tried. How about adding a parameter like this?

```python
@pytest.mark.parametrize("zero_stage", [1, 2])
...
    def test_offload_states_zero2(self, included_state, pin_memory, non_blocking, zero_stage):
        hidden_dim = 1024
        config_dict = {
            "train_micro_batch_size_per_gpu": 1,
            "optimizer": {"type": "Adam", "params": {"lr": 1e-6}},
            "zero_optimization": {"stage": zero_stage},
            "bf16": {"enabled": True}
        }
```

  2. As the behavior of memory release is often tricky, can you check whether the memory usage is actually reduced? You can find an example here.

  3. Can we consolidate the new test into test_offload_states.py? The initialization and validation of values are a bit different from ZeRO3, but I think we can just add another function in TestOffloadStates. If you can even consolidate functions (e.g. run_model in test_offload_states and run_model_zero2), that would be great. We then don't need to duplicate the memory-usage checking logic in both run_model and run_model_zero2.
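The suggested memory check could be sketched roughly like this (a minimal illustration with hypothetical tensor shapes, not DeepSpeed's actual test; it falls back to CPU when no GPU is present):

```python
import torch

def allocated_bytes():
    # torch.cuda.memory_allocated() reports the allocator's live bytes;
    # fall back to 0 on CPU-only machines so the sketch stays runnable.
    return torch.cuda.memory_allocated() if torch.cuda.is_available() else 0

device = "cuda" if torch.cuda.is_available() else "cpu"
states = [torch.randn(256, 256, device=device) for _ in range(4)]
before = allocated_bytes()

# "Offload": copy to CPU and drop the device references so the
# allocator can actually free the old buffers.
states = [t.to("cpu") for t in states]
after = allocated_bytes()

assert after <= before  # usage must not grow; on CUDA it should shrink
assert all(t.device.type == "cpu" for t in states)
```

The key point is that the offloaded CUDA tensors must become unreferenced, otherwise `memory_allocated()` will not drop.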

@LYMDLUT LYMDLUT changed the title Try to support deepspeed offload states with ZeRO2 Try to support deepspeed offload states with ZeRO1 and ZeRO2 Jul 18, 2025
@LYMDLUT
Contributor Author

LYMDLUT commented Jul 20, 2025

@tohtana , I really need your guidance and help!

  1. As the behavior of memory release is often tricky, can you check if the memory usage is actually reduced? You can find the example here.

When performing the second step, I noticed that the allocated memory (as reported by memory_allocated()) for hp_param, lp_param, and lp_grads remained unchanged; only the memory for optim_states decreased. However, the corresponding tensors had indeed been transferred from CUDA to the CPU.

Despite debugging, I couldn't figure out the reason, and I will continue to investigate this issue in depth. If you know how to solve it, your help would be greatly appreciated. Thank you very much.

@tohtana
Collaborator

tohtana commented Jul 25, 2025

Hi @LYMDLUT,

Thank you for enhancing the test!

I think we still have a reference to the buffer somewhere. PyTorch's memory allocator frees memory only when there are no remaining references to the buffer. Some operators, like view and narrow, create views of the buffer, which hold references to it. We need to make sure none of these views survive.

Can you first check whether we create such views of the buffer? We need to free the views when offloading and reconstruct them when reloading.
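As a CPU-side illustration of how a lingering view keeps the base buffer alive (on CUDA the same effect shows up as memory_allocated() staying elevated; `untyped_storage` assumes a reasonably recent PyTorch):

```python
import torch

base = torch.arange(8, dtype=torch.float32)
view = base.narrow(0, 0, 4)  # shares storage with `base`
storage_ptr = base.untyped_storage().data_ptr()

# The view still references the storage, so deleting `base` does NOT
# free the buffer.
del base
assert view.untyped_storage().data_ptr() == storage_ptr

# Only after the view is dropped (or replaced by an owning copy) can
# the allocator reclaim the original buffer.
view = view.clone()
assert view.untyped_storage().data_ptr() != storage_ptr
```

This is why offloading has to release the views and rebuild them on reload, as suggested above.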

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 6, 2025

Hi @tohtana,
We may have resolved most of the issues. Before I delete tests/unit/runtime/zero/test_offload_states_zero2.py, I hope you can review the updated changes to ensure the correctness of the code. I look forward to your further guidance.

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 12, 2025

The contiguous_grad_buffer should not need to be implemented for ZeRO1 and ZeRO2. I believe most of the work has been completed, and I hope to receive further guidance. Meanwhile, note that for lp_param, some tensors that do not require gradients (such as inv_freq_expanded in RoPE) will still remain on the GPU rather than being offloaded to the CPU.
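The inv_freq behavior is consistent with how PyTorch separates parameters from registered buffers; a small sketch (the module and names here are illustrative only, not DeepSpeed or model code):

```python
import torch

class ToyRope(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.zeros(4))
        # Registered buffers (like RoPE's inv_freq) require no gradient
        # and are not returned by parameters(), so a parameter-only
        # offload pass never moves them.
        self.register_buffer("inv_freq", torch.ones(4))

m = ToyRope()
param_names = {name for name, _ in m.named_parameters()}
buffer_names = {name for name, _ in m.named_buffers()}
assert param_names == {"weight"}
assert buffer_names == {"inv_freq"}
assert not m.inv_freq.requires_grad
```

An offload routine that iterates only over optimizer-tracked parameters would therefore leave such buffers on the GPU, matching the observation above.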

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 16, 2025

[Three screenshots: GPU memory-usage curves during offload/reload]

Based on OpenRLHF version 0.8.10 and DeepSpeed, the operations offload, reload_optim_states, contiguous_grad_buffer, hp_params, and lp_grads were exercised on ZeRO2 and ZeRO3. From the result curves, ZeRO2 appears to work properly.

@sfc-gh-truwase
Collaborator

@LYMDLUT thanks for sharing those curves. That looks great. Can you please address the formatting issues via
https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

@sfc-gh-truwase
Collaborator

Hmm, that is strange since the CI still fails:

https://github.com/deepspeedai/DeepSpeed/actions/runs/17046165288/job/48322449749?pr=7421.

Perhaps your versions differ from the expected ones: pre-commit==4.3.0 and clang-format==18.1.3.

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 19, 2025

I tested OpenRLHF on zero1, zero2, zero3, and autotp with zero1 and ds_tensor=2. From the curves, the behavior looks normal.
[Four screenshots: memory-usage curves for the tested configurations]

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 19, 2025

tests/unit/runtime/zero/test_offload_states_zero2.py

Please remember to delete this intermediate test file before merging.

@LYMDLUT LYMDLUT changed the title Try to support deepspeed offload states with ZeRO1 and ZeRO2 Support deepspeed offload and reload states with ZeRO1 and ZeRO2 Aug 19, 2025
@LYMDLUT LYMDLUT changed the title Support deepspeed offload and reload states with ZeRO1 and ZeRO2 Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2. Aug 19, 2025
@LYMDLUT LYMDLUT changed the title Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2. Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2 Aug 19, 2025
@sfc-gh-truwase
Collaborator

tests/unit/runtime/zero/test_offload_states_zero2.py

Please remember to delete this intermediate test file before merging.

@LYMDLUT since you are the author, could you delete this file now?

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 20, 2025

tests/unit/runtime/zero/test_offload_states_zero2.py
Please remember to delete this intermediate test file before merging.

@LYMDLUT since you are the author, could you delete this file now?

done!

@sfc-gh-truwase
Collaborator

@LYMDLUT thanks for the quick delete. Can you check formatting, as it seems to have broken again?

LYMDLUT and others added 5 commits August 20, 2025 21:47
After a new argument (handle_dependency) was added to the corresponding
wait() methods, AllReduceCoalescedHandle has to be aligned, too.

Signed-off-by: Max Kovalenko <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Signed-off-by: lym <[email protected]>
stas00 and others added 16 commits August 20, 2025 21:48
Currently, passing `deepspeed ... --venv_script foo.sh` ends up with a pdsh command like:

```
pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh export NCCL_NET_PLUGIN=blah; ...
```
As you can see, the `;` is missing before the exports start, so the first export is
ignored.

It should be:

```
pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh; export NCCL_NET_PLUGIN=blah; ...
```

This PR fixes it.
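The separator bug can be sketched with a hypothetical helper (illustrative only, not the actual DeepSpeed launcher code):

```python
# Build the remote command string for pdsh: the sourced venv script must
# be terminated with ';', or the first export gets swallowed into the
# `source` command's arguments.
def build_remote_cmd(venv_script, exports):
    prefix = f"source {venv_script}; " if venv_script else ""
    return prefix + "; ".join(exports) + ";"

cmd = build_remote_cmd("foo.sh", ["export NCCL_NET_PLUGIN=blah"])
assert cmd == "source foo.sh; export NCCL_NET_PLUGIN=blah;"
```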

Signed-off-by: lym <[email protected]>
This is an initial effort to migrate CI onto Modal infra. This PR
creates two new workflows that run on Modal
1. modal-torch-latest: a subset of nv-torch-latest-v100 that includes
`tests/unit/runtime/zero/test_zero.py`.
2. modal-accelerate: a full copy of nv-accelerate-v100.

Follow up PRs will selectively migrate relevant workflows onto Modal.

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Tunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: lym <[email protected]>
Fixes snowflakedb/ArcticTraining#254 - to
support multi-epoch training with `UlyssesSPDataLoaderAdapter`.

Thanks to @yanrui27 for the fix

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Signed-off-by: lym <[email protected]>
Adding inference support for `TiledFusedLogitsLoss` by skipping
`backward` inside `forward` if the incoming tensor doesn't require grad.

xref: snowflakedb/ArcticTraining#259

---------

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
+ Fix pre-compile on cpu-only machines

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
Enable forked PRs

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
# Reproduce
w/ PyTorch 2.8
```
$ git clone https://github.com/huggingface/trl.git
$ cd ./trl
$ accelerate launch \
    --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft_gpt_oss.py \
    --torch_dtype bfloat16 \
    --model_name_or_path openai/gpt-oss-20b \
    --packing true packing_strategy wrapped \
    --run_name 20b-full-eager \
    --attn_implementation sdpa \
    --dataset_num_proc 6 \
    --dataset_name HuggingFaceH4/Multilingual-Thinking \
    --gradient_checkpointing \
    --max_length 4096 \
    --per_device_train_batch_size 1 \
    --num_train_epochs 1 \
    --logging_steps 1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine_with_min_lr \
    --lr_scheduler_kwargs '{"min_lr_rate": 0.1}' \
    --output_dir gpt-oss-20b-multilingual-reasoner \
    --report_to trackio \
    --seed 42
```

# Issue

```
File "/workspace/accelerate/src/accelerate/state.py", line 216, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 854, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 164, in init_process_group
    torch.distributed.init_process_group(backend, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1685, in init_process_group
    if device_id is not None and device_id.type != "cpu":
AttributeError: 'device' object has no attribute 'type'
```

# Root Cause
`torch.xpu.device` is a context manager in PyTorch rather than a device class, so it doesn't have a `type` attribute.

# Fix
switch to use `torch.device`
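The distinction can be checked directly; constructing `torch.device` objects requires no accelerator hardware:

```python
import torch

# torch.device is the device class: it carries `.type` and `.index`.
dev = torch.device("cuda", 0)
assert dev.type == "cuda"
assert dev.index == 0

# By contrast, torch.xpu.device (like torch.cuda.device) is a context
# manager for selecting the current device, not a device object, so
# passing it where a torch.device is expected fails the `.type` check
# seen in the traceback above.
```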

Signed-off-by: Yao, Matrix <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
This PR adds ZenFlow, an importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.

Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included

Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.

---------

Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Guokai Ma <[email protected]>
Signed-off-by: lym <[email protected]>
Fix invalid f-strings detected by ruff.

---------

Signed-off-by: cyy <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Signed-off-by: lym <[email protected]>
This PR updates the kernel generation function arguments in Inductor to
ensure DeepCompile is compatible with PyTorch v2.8.
It also fixes the logging output of DeepCompile.

Signed-off-by: lym <[email protected]>

For some accelerators (such as HPU) running in non-compile scenarios,
the `compiler.enable` decorator can cause significant performance drops
of up to 8-12%.

We can easily avoid the performance hit in non-compile scenarios, by
detecting the ongoing compilation and returning immediately.

Signed-off-by: Max Kovalenko <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
The [PR deepspeedai#7266](deepspeedai#7266)
enforces that devices have explicit device indices (i.e. 'hpu:0',
'cuda:0', etc.).

This PR aligns HPU devices to the requirement.

Signed-off-by: Max Kovalenko <[email protected]>
Signed-off-by: lym <[email protected]>
Signed-off-by: Tunji Ruwase <[email protected]>
@sfc-gh-truwase
Collaborator

I don't know what kind of format error it is. Could you offer some help?

Similar to earlier discussion, can you run pre-commit run --all-files using pre-commit==4.3.0 and clang-format==18.1.3?

@sfc-gh-truwase
Collaborator

@LYMDLUT actually I just made the formatting changes locally and pushed into your branch. Let's see if that works.

@loadams loadams enabled auto-merge (squash) August 20, 2025 21:42
@loadams loadams merged commit bc8c0db into deepspeedai:master Aug 20, 2025
10 of 12 checks passed
@tohtana
Collaborator

tohtana commented Aug 21, 2025

Hi @LYMDLUT,
I am glad this PR was merged and really appreciate your great effort! Sorry that I wasn't able to help more actively.

Can I ask about the details for future improvement? I see some similarity between offload_optimizer_states and offload_adam_states (and likewise between reload_optimizer_states and reload_adam_states). Do you think it would be challenging to refactor them into consolidated functions?
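As a purely illustrative sketch of what such a consolidation might look like (the names here are hypothetical, not DeepSpeed's actual API), the stage-specific offload and reload paths could share one helper parameterized by the tensor group and target device:

```python
import torch

# Hypothetical shared helper: both the offload_* and reload_* variants
# reduce to moving a named group of tensors to a target device.
def move_state_group(group, device, non_blocking=False):
    return [t.to(device, non_blocking=non_blocking) for t in group]

optim_states = [torch.ones(2), torch.zeros(2)]
# Offload would pass torch.device("cpu"); reload would pass the
# accelerator device (kept as CPU here so the sketch runs anywhere).
offloaded = move_state_group(optim_states, torch.device("cpu"))
assert all(t.device.type == "cpu" for t in offloaded)
```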

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
…pspeedai#7421)

Please refer to deepspeedai#7251

---------

Signed-off-by: lym <[email protected]>
Signed-off-by: Max Kovalenko <[email protected]>
Signed-off-by: Alex Kiefer <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Sam Foreman <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: huanyuqu <[email protected]>
Signed-off-by: weeknan <[email protected]>
Signed-off-by: WoosungMyung <[email protected]>
Signed-off-by: Nir Sonnenschein <[email protected]>
Signed-off-by: Junjie Mao <[email protected]>
Signed-off-by: vinceliu <[email protected]>
Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Tunji Ruwase <[email protected]>
Signed-off-by: Yao, Matrix <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Signed-off-by: cyy <[email protected]>
Co-authored-by: Max Kovalenko <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Alexander Kiefer <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Sam Foreman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: huanyuqu <[email protected]>
Co-authored-by: weeknan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Zhipeng Wang <[email protected]>
Co-authored-by: WoosungMyung <[email protected]>
Co-authored-by: Nir Sonnenschein <[email protected]>
Co-authored-by: Junjie Mao <[email protected]>
Co-authored-by: Junjie Mao <[email protected]>
Co-authored-by: lpnpcs <[email protected]>
Co-authored-by: Ma, Guokai <[email protected]>
Co-authored-by: Tingfeng Lan <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Co-authored-by: Feng Yunlong <[email protected]>
Co-authored-by: Yao Matrix <[email protected]>
Co-authored-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
Co-authored-by: Yuanyuan Chen <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

Development

Successfully merging this pull request may close these issues.

[BUG] deepspeed offload states is not compatible with ZeRO2