Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2 #7421
Conversation
Hi @LYMDLUT, thank you for the great work! Does this code actually work for ZeRO1 as well? If it doesn't, can we keep the assertions and just change the condition?

Thank you for your review. I have added a unit test for ZeRO1 and ZeRO2. However, I am not very familiar with DeepSpeed. Could you please carefully check the code and provide further guidance? I will do my best to cooperate with you.

Looking forward to your guidance.
Hi @LYMDLUT, the test can be parametrized over the ZeRO stage so that it covers both ZeRO1 and ZeRO2, for example:
```python
@pytest.mark.parametrize("zero_stage", [1, 2])
...
def test_offload_states_zero2(self, included_state, pin_memory, non_blocking, zero_stage):
    hidden_dim = 1024
    config_dict = {
        "train_micro_batch_size_per_gpu": 1,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-6}},
        "zero_optimization": {"stage": zero_stage},
        "bf16": {"enabled": True}
    }
```
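The rest of the round trip might look roughly like the sketch below. This is a hedged sketch, not the actual test: it assumes the engine-level `offload_states`/`reload_states` API and the `OffloadStateTypeEnum`/`OffloadDeviceEnum` enums from the existing ZeRO-3 offload support, and `SimpleModel` stands in for whatever hypothetical toy module the test uses.

```python
# Hedged sketch only: assumes the offload_states/reload_states engine API and the
# OffloadStateTypeEnum/OffloadDeviceEnum enums from the existing ZeRO-3 support.
import torch
import deepspeed
from deepspeed.accelerator import get_accelerator
from deepspeed.runtime.zero.offload_config import OffloadDeviceEnum, OffloadStateTypeEnum

model = SimpleModel(hidden_dim)  # hypothetical toy module
model, _, _, _ = deepspeed.initialize(model=model,
                                      model_parameters=model.parameters(),
                                      config=config_dict)

# Run one step so optimizer states and gradients actually exist on the device.
x = torch.randn(1, hidden_dim, dtype=torch.bfloat16, device=model.device)
loss = model(x)  # the hypothetical module returns a scalar loss
model.backward(loss)
model.step()

alloc_before = get_accelerator().memory_allocated()

# Offload the selected states to CPU, then check that device memory shrinks.
model.offload_states(include=None if included_state is None else [included_state],
                     device=OffloadDeviceEnum.cpu,
                     pin_memory=pin_memory,
                     non_blocking=non_blocking)
assert get_accelerator().memory_allocated() < alloc_before

# Reload and make sure training can continue afterwards.
model.reload_states()
```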
When I was performing the second step, I noticed that the memory allocated (as reported by memory_allocated()) for hp_param, lp_param, and lp_grads remained unchanged; only the memory for optim_states decreased. However, the corresponding tensors had indeed been transferred from CUDA to the CPU. Despite debugging, I couldn't figure out the reason. I will continue to investigate this issue, but if you know how to solve it, that would be a great help. Thank you very much.
Hi @LYMDLUT, thank you for enhancing the test! I think we still have a reference to the buffer somewhere. PyTorch's memory allocator frees the memory only when there is no reference to the buffer, and some operators return views that keep such a reference alive. Can you first check whether we create such views of the buffer? We need to free the views when offloading and reconstruct them when reloading.
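As a side note, the allocator behaviour described above can be reproduced in isolation. The snippet below is a minimal illustration (independent of DeepSpeed internals): as long as any view of a CUDA tensor survives, memory_allocated() will not drop, even after the original tensor has been copied to CPU and dropped.

```python
# Minimal illustration: a surviving view keeps the underlying CUDA storage alive,
# so the caching allocator cannot release it and memory_allocated() stays flat.
import torch

buf = torch.empty(1024 * 1024, device="cuda")   # ~4 MB buffer
view = buf.narrow(0, 0, 1024)                    # a view sharing buf's storage

cpu_copy = buf.to("cpu")                         # copy the contents off the device
del buf                                          # drop the original reference
torch.cuda.synchronize()

# Still ~4 MB allocated: `view` references the same storage.
print(torch.cuda.memory_allocated())

del view                                         # drop the last reference to the storage
print(torch.cuda.memory_allocated())             # now the memory can actually be freed
```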
Hi @tohtana,
The contiguous_grad_buffer state should not need to be implemented for ZeRO1 and ZeRO2. I believe most of the work has already been completed, and I hope to receive further guidance. Meanwhile, note that for lp_param offloading, some tensors that do not require gradients (such as inv_freq_expanded in RoPE) will still remain on the GPU instead of being offloaded to the CPU.
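For reference, one way to spot such tensors is to walk the module's registered buffers. This is an illustrative snippet only, with `model` standing for the wrapped `nn.Module`:

```python
# Illustrative only: list registered buffers (e.g. RoPE frequency tables) that do not
# require gradients and are therefore not covered by lp_param offloading.
for name, buf in model.named_buffers():
    if buf.is_cuda and not buf.requires_grad:
        print(f"{name}: device={buf.device}, shape={tuple(buf.shape)}, dtype={buf.dtype}")
```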
@LYMDLUT thanks for sharing those curves. That looks great. Can you please address the formatting issues?
Hmm, that is strange, since the CI still fails. Perhaps your versions are different from the expected ones.
tests/unit/runtime/zero/test_offload_states_zero2.py: Please remember to delete this intermediate test file before merging.
@LYMDLUT since you are the author, can you delete this file now?
done! |
@LYMDLUT thanks for the quick delete. Can you check the formatting, as it seems to have broken again?
Signed-off-by: lym <[email protected]>
Signed-off-by: lym <[email protected]>
Signed-off-by: lym <[email protected]>
Signed-off-by: lym <[email protected]>
After a new argument (handle_dependency) was added to the corresponding wait() methods, AllReduceCoalescedHandle has to be aligned, too. Signed-off-by: Max Kovalenko <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]> Signed-off-by: lym <[email protected]>
currently passing `deepspeed ... --venv_script foo.sh` ends up with a pdsh cmd like: ``` pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh export NCCL_NET_PLUGIN=blah; ... ``` you can see, `;` is missing before exports start, so the first export is ignored. It should be: ``` pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh; export NCCL_NET_PLUGIN=blah; ... ``` This PR is fixing it. Signed-off-by: lym <[email protected]>
This is an initial effort to migrate CI onto Modal infra. This PR creates two new workflows that run on Modal:
1. modal-torch-latest: a subset of nv-torch-latest-v100 that includes `tests/unit/runtime/zero/test_zero.py`.
2. modal-accelerate: a full copy of nv-accelerate-v100.
Follow-up PRs will selectively migrate relevant workflows onto Modal. --------- Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Tunji Ruwase <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Signed-off-by: lym <[email protected]>
Fixes snowflakedb/ArcticTraining#254 - to support multi-epoch training with `UlyssesSPDataLoaderAdapter`. Thanks to @yanrui27 for the fix Signed-off-by: Stas Bekman <[email protected]> Co-authored-by: Rui Yan <[email protected]> Signed-off-by: lym <[email protected]>
Adding inference support for `TiledFusedLogitsLoss` by skipping `backward` inside `forward` if the incoming tensor doesn't require grad. xref: snowflakedb/ArcticTraining#259 --------- Signed-off-by: Stas Bekman <[email protected]> Co-authored-by: Rui Yan <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Signed-off-by: lym <[email protected]>
Signed-off-by: lym <[email protected]>
Signed-off-by: lym <[email protected]>
Signed-off-by: lym <[email protected]>
+ Fix pre-compile on cpu-only machines --------- Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Signed-off-by: lym <[email protected]>
Enable forked PRs --------- Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: lym <[email protected]>
# Reproduce w/ PyTorch 2.8
```
$ git clone https://github.com/huggingface/trl.git
$ cd ./trl
$ accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml examples/scripts/sft_gpt_oss.py --torch_dtype bfloat16 --model_name_or_path openai/gpt-oss-20b --packing true --packing_strategy wrapped --run_name 20b-full-eager --attn_implementation sdpa --dataset_num_proc 6 --dataset_name HuggingFaceH4/Multilingual-Thinking --gradient_checkpointing --max_length 4096 --per_device_train_batch_size 1 --num_train_epochs 1 --logging_steps 1 --warmup_ratio 0.03 --lr_scheduler_type cosine_with_min_lr --lr_scheduler_kwargs '{"min_lr_rate": 0.1}' --output_dir gpt-oss-20b-multilingual-reasoner --report_to trackio --seed 42
```
# Issue
> File "/workspace/accelerate/src/accelerate/state.py", line 216, in __init__
>   dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
> File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 854, in init_distributed
>   cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
> File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 120, in __init__
>   self.init_process_group(backend, timeout, init_method, rank, world_size)
> File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 164, in init_process_group
>   torch.distributed.init_process_group(backend, **kwargs)
> File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
>   return func(*args, **kwargs)
> File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
>   func_return = func(*args, **kwargs)
> File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1685, in init_process_group
>   if device_id is not None and device_id.type != "cpu":
> AttributeError: 'device' object has no attribute 'type'
# Root Cause
`torch.xpu.device` is a context manager in PyTorch rather than a device class, so it doesn't have a `type` attribute.
# Fix
Switch to `torch.device`.
Signed-off-by: Yao, Matrix <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Signed-off-by: lym <[email protected]>
This PR adds ZenFlow, an importance-aware offloaded training framework for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between computation and communication during offloaded training, improving GPU utilization and reducing stalls.
Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included
Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will be introduced in a follow-up PR. --------- Signed-off-by: Tingfeng Lan <[email protected]> Signed-off-by: Yusen Wu <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]> Co-authored-by: Yusen Wu <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Guokai Ma <[email protected]> Signed-off-by: lym <[email protected]>
Fix invalid f-strings detected by ruff. --------- Signed-off-by: cyy <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Michael Wyatt <[email protected]> Signed-off-by: lym <[email protected]>
This PR updates the kernel generation function arguments in Inductor to ensure DeepCompile is compatible with PyTorch v2.8. It also fixes the logging output of DeepCompile. Signed-off-by: lym <[email protected]>
For some accelerators (such as HPU) running in non-compile scenarios, the `compiler.enable` decorator can cause significant performance drops of up to 8-12%. We can easily avoid the performance hit in non-compile scenarios by detecting the ongoing compilation and returning immediately. Signed-off-by: Max Kovalenko <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Signed-off-by: lym <[email protected]>
The [PR deepspeedai#7266](deepspeedai#7266) requires devices to have explicit device indices (i.e. 'hpu:0', 'cuda:0', etc.). This PR aligns HPU devices with that requirement. Signed-off-by: Max Kovalenko <[email protected]> Signed-off-by: lym <[email protected]>
Signed-off-by: lym <[email protected]>
Signed-off-by: Tunji Ruwase <[email protected]>
Similar to the earlier discussion, can you run the formatting checks again?
@LYMDLUT actually I just made the formatting changes locally and pushed them to your branch. Let's see if that works.
Hi @LYMDLUT, can I ask about the details of the future improvement? I see some similarity between …
…pspeedai#7421) Please refer to deepspeedai#7251 --------- Signed-off-by: lym <[email protected]> Signed-off-by: Max Kovalenko <[email protected]> Signed-off-by: Alex Kiefer <[email protected]> Signed-off-by: Stas Bekman <[email protected]> Signed-off-by: Sam Foreman <[email protected]> Signed-off-by: Stas Bekman <[email protected]> Signed-off-by: huanyuqu <[email protected]> Signed-off-by: weeknan <[email protected]> Signed-off-by: WoosungMyung <[email protected]> Signed-off-by: Nir Sonnenschein <[email protected]> Signed-off-by: Junjie Mao <[email protected]> Signed-off-by: vinceliu <[email protected]> Signed-off-by: Tingfeng Lan <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Tunji Ruwase <[email protected]> Signed-off-by: Yao, Matrix <[email protected]> Signed-off-by: Yusen Wu <[email protected]> Signed-off-by: cyy <[email protected]> Co-authored-by: Max Kovalenko <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]> Co-authored-by: Alexander Kiefer <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Hongwei Chen <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Sam Foreman <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: huanyuqu <[email protected]> Co-authored-by: weeknan <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Zhipeng Wang <[email protected]> Co-authored-by: WoosungMyung <[email protected]> Co-authored-by: Nir Sonnenschein <[email protected]> Co-authored-by: Junjie Mao <[email protected]> Co-authored-by: Junjie Mao <[email protected]> Co-authored-by: lpnpcs <[email protected]> Co-authored-by: Ma, Guokai <[email protected]> Co-authored-by: Tingfeng Lan <[email protected]> Co-authored-by: Rui Yan <[email protected]> Co-authored-by: Feng Yunlong <[email protected]> Co-authored-by: Yao Matrix <[email protected]> Co-authored-by: Tingfeng Lan <[email protected]> Co-authored-by: Yusen Wu <[email protected]> Co-authored-by: Yuanyuan Chen <[email protected]> Co-authored-by: Michael Wyatt <[email protected]>
Resolves: #7251