Conversation

@LYMDLUT
Contributor

@LYMDLUT LYMDLUT commented Jul 8, 2025

Resolves: #7251

@LYMDLUT LYMDLUT requested review from tjruwase and tohtana as code owners July 8, 2025 16:15
@tohtana
Collaborator

tohtana commented Jul 9, 2025

Hi @LYMDLUT, thank you for the great work!

Does this code actually work for ZeRO1 as well? If it doesn't, can we keep the assertions and just change the condition?
We have unit tests for this feature. Can you add stage 2 (and stage 1, if it works) to the test arguments?

@LYMDLUT LYMDLUT closed this Jul 9, 2025
@LYMDLUT LYMDLUT reopened this Jul 9, 2025
@LYMDLUT LYMDLUT requested a review from loadams as a code owner July 9, 2025 20:25
@LYMDLUT
Contributor Author

LYMDLUT commented Jul 9, 2025

Hi @LYMDLUT, thank you for the great work!

Does this code actually work for ZeRO1 as well? If it doesn't, can we keep the assertions and just change the condition? We have unit tests for this feature. Can you add stage 2 (and stage 1, if it works) to the test arguments?

Thank you for your review. I have added unit tests for ZeRO1 and ZeRO2. However, I am not very familiar with DeepSpeed internals, so could you please review the code carefully and offer further guidance? I will do my best to cooperate with you.

@LYMDLUT
Contributor Author

LYMDLUT commented Jul 11, 2025

Looking forward to your guidance.

@tohtana
Collaborator

tohtana commented Jul 17, 2025

Hi @LYMDLUT,
Sorry for my late response and thank you for adding the test!

  1. This actually works for ZeRO1 when I tried. How about adding a parameter like this?

```python
@pytest.mark.parametrize("zero_stage", [1, 2])
...
    def test_offload_states_zero2(self, included_state, pin_memory, non_blocking, zero_stage):
        hidden_dim = 1024
        config_dict = {
            "train_micro_batch_size_per_gpu": 1,
            "optimizer": {"type": "Adam", "params": {"lr": 1e-6}},
            "zero_optimization": {"stage": zero_stage},
            "bf16": {"enabled": True}
        }
```

  2. As the behavior of memory release is often tricky, can you check whether the memory usage is actually reduced? You can find an example here.

  3. Can we consolidate the new test into test_offload_states.py? The initialization and validation of values are a bit different from ZeRO3, but I think we can just add another function in TestOffloadStates. If you can even consolidate functions (e.g. run_model in test_offload_states and run_model_zero2), that would be great. We then don't need to duplicate the memory-usage checking logic in both run_model and run_model_zero2.
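The suggested memory check could be sketched roughly like this (a minimal illustration with hypothetical tensor shapes, not DeepSpeed's actual test; it falls back to CPU when no GPU is present):

```python
import torch

def allocated_bytes():
    # torch.cuda.memory_allocated() reports the allocator's live bytes;
    # fall back to 0 on CPU-only machines so the sketch stays runnable.
    return torch.cuda.memory_allocated() if torch.cuda.is_available() else 0

device = "cuda" if torch.cuda.is_available() else "cpu"
states = [torch.randn(256, 256, device=device) for _ in range(4)]
before = allocated_bytes()

# "Offload": copy to CPU and drop the device references so the
# allocator can actually free the old buffers.
states = [t.to("cpu") for t in states]
after = allocated_bytes()

assert after <= before  # usage must not grow; on CUDA it should shrink
assert all(t.device.type == "cpu" for t in states)
```

The key point is that the offloaded CUDA tensors must become unreferenced, otherwise `memory_allocated()` will not drop.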

@LYMDLUT LYMDLUT changed the title Try to support deepspeed offload states with ZeRO2 Try to support deepspeed offload states with ZeRO1 and ZeRO2 Jul 18, 2025
@LYMDLUT
Contributor Author

LYMDLUT commented Jul 20, 2025

@tohtana , I really need your guidance and help!

  1. As the behavior of memory release is often tricky, can you check if the memory usage is actually reduced? You can find the example here.

When performing the second step, I noticed that the allocated memory (as reported by memory_allocated()) for hp_param, lp_param, and lp_grads remained unchanged; only the memory for optim_states decreased. However, the corresponding tensors had indeed been transferred from CUDA to the CPU.

Despite debugging, I couldn't figure out the reason, and I will continue to investigate this issue in depth. If you know how to solve it, your help would be greatly appreciated. Thank you very much.

@tohtana
Collaborator

tohtana commented Jul 25, 2025

Hi @LYMDLUT,

Thank you for enhancing the test!

I think we still have a reference to the buffer somewhere. PyTorch's memory allocator frees memory only when there are no remaining references to the buffer. Some operators, like view and narrow, create views of the buffer, which hold references to it. We need to make sure none of these views survive.

Can you first check whether we create such views of the buffer? We need to free the views when offloading and reconstruct them when reloading.
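As a CPU-side illustration of how a lingering view keeps the base buffer alive (on CUDA the same effect shows up as memory_allocated() staying elevated; `untyped_storage` assumes a reasonably recent PyTorch):

```python
import torch

base = torch.arange(8, dtype=torch.float32)
view = base.narrow(0, 0, 4)  # shares storage with `base`
storage_ptr = base.untyped_storage().data_ptr()

# The view still references the storage, so deleting `base` does NOT
# free the buffer.
del base
assert view.untyped_storage().data_ptr() == storage_ptr

# Only after the view is dropped (or replaced by an owning copy) can
# the allocator reclaim the original buffer.
view = view.clone()
assert view.untyped_storage().data_ptr() != storage_ptr
```

This is why offloading has to release the views and rebuild them on reload, as suggested above.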

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 6, 2025

Hi @tohtana,
We may have resolved most of the issues. Before I delete tests/unit/runtime/zero/test_offload_states_zero2.py, I hope you can review the updated changes to ensure the correctness of the code. I look forward to your further guidance.

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 12, 2025

The contiguous_grad_buffer should not need to be implemented for ZeRO1 and ZeRO2. I believe most of the work has been completed, and I hope to receive further guidance. Meanwhile, note that for lp_param, some tensors that do not require gradients (such as inv_freq_expanded in RoPE) will still remain on the GPU rather than being offloaded to the CPU.
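The inv_freq behavior is consistent with how PyTorch separates parameters from registered buffers; a small sketch (the module and names here are illustrative only, not DeepSpeed or model code):

```python
import torch

class ToyRope(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.zeros(4))
        # Registered buffers (like RoPE's inv_freq) require no gradient
        # and are not returned by parameters(), so a parameter-only
        # offload pass never moves them.
        self.register_buffer("inv_freq", torch.ones(4))

m = ToyRope()
param_names = {name for name, _ in m.named_parameters()}
buffer_names = {name for name, _ in m.named_buffers()}
assert param_names == {"weight"}
assert buffer_names == {"inv_freq"}
assert not m.inv_freq.requires_grad
```

An offload routine that iterates only over optimizer-tracked parameters would therefore leave such buffers on the GPU, matching the observation above.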

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 16, 2025

[Three screenshots: GPU memory-usage curves during offload/reload]

Based on OpenRLHF version 0.8.10 and DeepSpeed, the operations offload, reload_optim_states, contiguous_grad_buffer, hp_params, and lp_grads were exercised on ZeRO2 and ZeRO3. From the result curves, ZeRO2 appears to work properly.

@sfc-gh-truwase
Collaborator

@LYMDLUT thanks for sharing those curves. That looks great. Can you please address the formatting issues via
https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

@sfc-gh-truwase
Collaborator

Hmm, that is strange since the CI still fails:

https://github.com/deepspeedai/DeepSpeed/actions/runs/17046165288/job/48322449749?pr=7421.

Perhaps your versions differ from the expected ones: pre-commit==4.3.0 and clang-format==18.1.3.

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 19, 2025

I tested OpenRLHF on zero1, zero2, zero3, and autotp with zero1 and ds_tensor=2. From the curves, the behavior looks normal.
[Four screenshots: memory-usage curves for the tested configurations]

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 19, 2025

tests/unit/runtime/zero/test_offload_states_zero2.py

Please remember to delete this intermediate test file before merging.

@LYMDLUT LYMDLUT changed the title Try to support deepspeed offload states with ZeRO1 and ZeRO2 Support deepspeed offload and reload states with ZeRO1 and ZeRO2 Aug 19, 2025
@LYMDLUT LYMDLUT changed the title Support deepspeed offload and reload states with ZeRO1 and ZeRO2 Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2. Aug 19, 2025
@LYMDLUT LYMDLUT changed the title Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2. Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2 Aug 19, 2025
@sfc-gh-truwase
Collaborator

tests/unit/runtime/zero/test_offload_states_zero2.py

Please remember to delete this intermediate test file before merging.

@LYMDLUT since you are the author, could you delete this file now?

@LYMDLUT
Contributor Author

LYMDLUT commented Aug 20, 2025

tests/unit/runtime/zero/test_offload_states_zero2.py
Please remember to delete this intermediate test file before merging.

@LYMDLUT since you are the author, could you delete this file now?

done!

@sfc-gh-truwase
Collaborator

@LYMDLUT thanks for the quick delete. Can you check formatting, as it seems to have broken again?

LYMDLUT and others added 5 commits August 20, 2025 21:47
After a new argument (handle_dependency) was added to the corresponding
wait() methods, AllReduceCoalescedHandle has to be aligned, too.

Signed-off-by: Max Kovalenko <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Signed-off-by: lym <[email protected]>
stas00 and others added 16 commits August 20, 2025 21:48
Currently, passing `deepspeed ... --venv_script foo.sh` ends up with a pdsh command like:

```
pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh export NCCL_NET_PLUGIN=blah; ...
```
As you can see, the `;` is missing before the exports start, so the first export is
ignored.

It should be:

```
pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh; export NCCL_NET_PLUGIN=blah; ...
```

This PR fixes it.
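The separator bug can be sketched with a hypothetical helper (illustrative only, not the actual DeepSpeed launcher code):

```python
# Build the remote command string for pdsh: the sourced venv script must
# be terminated with ';', or the first export gets swallowed into the
# `source` command's arguments.
def build_remote_cmd(venv_script, exports):
    prefix = f"source {venv_script}; " if venv_script else ""
    return prefix + "; ".join(exports) + ";"

cmd = build_remote_cmd("foo.sh", ["export NCCL_NET_PLUGIN=blah"])
assert cmd == "source foo.sh; export NCCL_NET_PLUGIN=blah;"
```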

Signed-off-by: lym <[email protected]>
This is an initial effort to migrate CI onto Modal infra. This PR
creates two new workflows that run on Modal
1. modal-torch-latest: a subset of nv-torch-latest-v100 that includes
`tests/unit/runtime/zero/test_zero.py`.
2. modal-accelerate: a full copy of nv-accelerate-v100.

Follow up PRs will selectively migrate relevant workflows onto Modal.

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Tunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: lym <[email protected]>
Fixes snowflakedb/ArcticTraining#254 - to
support multi-epoch training with `UlyssesSPDataLoaderAdapter`.

Thanks to @yanrui27 for the fix

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Signed-off-by: lym <[email protected]>
Adding inference support for `TiledFusedLogitsLoss` by skipping
`backward` inside `forward` if the incoming tensor doesn't require grad.

xref: snowflakedb/ArcticTraining#259

---------

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
+ Fix pre-compile on cpu-only machines

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
Enable forked PRs

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
# Reproduce
w/ PyTorch 2.8
```
$ git clone https://github.com/huggingface/trl.git
$ cd ./trl
$ accelerate launch \
    --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft_gpt_oss.py \
    --torch_dtype bfloat16 \
    --model_name_or_path openai/gpt-oss-20b \
    --packing true packing_strategy wrapped \
    --run_name 20b-full-eager \
    --attn_implementation sdpa \
    --dataset_num_proc 6 \
    --dataset_name HuggingFaceH4/Multilingual-Thinking \
    --gradient_checkpointing \
    --max_length 4096 \
    --per_device_train_batch_size 1 \
    --num_train_epochs 1 \
    --logging_steps 1 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine_with_min_lr \
    --lr_scheduler_kwargs '{"min_lr_rate": 0.1}' \
    --output_dir gpt-oss-20b-multilingual-reasoner \
    --report_to trackio \
    --seed 42
```

# Issue

```
File "/workspace/accelerate/src/accelerate/state.py", line 216, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 854, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 164, in init_process_group
    torch.distributed.init_process_group(backend, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1685, in init_process_group
    if device_id is not None and device_id.type != "cpu":
AttributeError: 'device' object has no attribute 'type'
```

# Root Cause
`torch.xpu.device` is a context manager in PyTorch rather than a device class, so it doesn't have a `type` attribute.

# Fix
switch to use `torch.device`
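The distinction can be checked directly; constructing `torch.device` objects requires no accelerator hardware:

```python
import torch

# torch.device is the device class: it carries `.type` and `.index`.
dev = torch.device("cuda", 0)
assert dev.type == "cuda"
assert dev.index == 0

# By contrast, torch.xpu.device (like torch.cuda.device) is a context
# manager for selecting the current device, not a device object, so
# passing it where a torch.device is expected fails the `.type` check
# seen in the traceback above.
```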

Signed-off-by: Yao, Matrix <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
This PR adds ZenFlow, an importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.

Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included

Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.

---------

Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Guokai Ma <[email protected]>
Signed-off-by: lym <[email protected]>
Fix invalid f-strings detected by ruff.

---------

Signed-off-by: cyy <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>
Signed-off-by: lym <[email protected]>
This PR updates the kernel generation function arguments in Inductor to
ensure DeepCompile is compatible with PyTorch v2.8.
It also fixes the logging output of DeepCompile.

Signed-off-by: lym <[email protected]>

For some accelerators (such as HPU) running in non-compile scenarios,
the `compiler.enable` decorator can cause significant performance drops
of up to 8-12%.

We can easily avoid the performance hit in non-compile scenarios, by
detecting the ongoing compilation and returning immediately.

Signed-off-by: Max Kovalenko <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
The [PR deepspeedai#7266](deepspeedai#7266)
enforces that devices have explicit device indices (i.e. 'hpu:0',
'cuda:0', etc.).

This PR aligns HPU devices to the requirement.

Signed-off-by: Max Kovalenko <[email protected]>
Signed-off-by: lym <[email protected]>
Signed-off-by: Tunji Ruwase <[email protected]>
@sfc-gh-truwase
Collaborator

I don't know what kind of format error it is. Could you offer some help?

Similar to earlier discussion, can you run pre-commit run --all-files using pre-commit==4.3.0 and clang-format==18.1.3?

@sfc-gh-truwase
Collaborator

@LYMDLUT actually I just made the formatting changes locally and pushed into your branch. Let's see if that works.

@loadams loadams enabled auto-merge (squash) August 20, 2025 21:42
@loadams loadams merged commit bc8c0db into deepspeedai:master Aug 20, 2025
10 of 12 checks passed
@tohtana
Collaborator

tohtana commented Aug 21, 2025

Hi @LYMDLUT,
I am glad this PR was merged and really appreciate your great effort! Sorry that I wasn't able to help more actively.

Can I ask about the details for future improvement? I see some similarity between offload_optimizer_states and offload_adam_states (and likewise between reload_optimizer_states and reload_adam_states). Do you think it would be challenging to refactor them into consolidated functions?
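As a purely illustrative sketch of what such a consolidation might look like (the names here are hypothetical, not DeepSpeed's actual API), the stage-specific offload and reload paths could share one helper parameterized by the tensor group and target device:

```python
import torch

# Hypothetical shared helper: both the offload_* and reload_* variants
# reduce to moving a named group of tensors to a target device.
def move_state_group(group, device, non_blocking=False):
    return [t.to(device, non_blocking=non_blocking) for t in group]

optim_states = [torch.ones(2), torch.zeros(2)]
# Offload would pass torch.device("cpu"); reload would pass the
# accelerator device (kept as CPU here so the sketch runs anywhere).
offloaded = move_state_group(optim_states, torch.device("cpu"))
assert all(t.device.type == "cpu" for t in offloaded)
```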

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
…pspeedai#7421)

Please refer to deepspeedai#7251

---------

Signed-off-by: lym <[email protected]>
Signed-off-by: Max Kovalenko <[email protected]>
Signed-off-by: Alex Kiefer <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Sam Foreman <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: huanyuqu <[email protected]>
Signed-off-by: weeknan <[email protected]>
Signed-off-by: WoosungMyung <[email protected]>
Signed-off-by: Nir Sonnenschein <[email protected]>
Signed-off-by: Junjie Mao <[email protected]>
Signed-off-by: vinceliu <[email protected]>
Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Tunji Ruwase <[email protected]>
Signed-off-by: Yao, Matrix <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Signed-off-by: cyy <[email protected]>
Co-authored-by: Max Kovalenko <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Alexander Kiefer <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Sam Foreman <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: huanyuqu <[email protected]>
Co-authored-by: weeknan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Zhipeng Wang <[email protected]>
Co-authored-by: WoosungMyung <[email protected]>
Co-authored-by: Nir Sonnenschein <[email protected]>
Co-authored-by: Junjie Mao <[email protected]>
Co-authored-by: Junjie Mao <[email protected]>
Co-authored-by: lpnpcs <[email protected]>
Co-authored-by: Ma, Guokai <[email protected]>
Co-authored-by: Tingfeng Lan <[email protected]>
Co-authored-by: Rui Yan <[email protected]>
Co-authored-by: Feng Yunlong <[email protected]>
Co-authored-by: Yao Matrix <[email protected]>
Co-authored-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
Co-authored-by: Yuanyuan Chen <[email protected]>
Co-authored-by: Michael Wyatt <[email protected]>

Development

Successfully merging this pull request may close these issues.

[BUG] deepspeed offload states is not compatible with ZeRO2