@delock commented Aug 8, 2025

In ZeRO offload, a significant share of each step is spent in CPUAdam, which runs on the CPU. Passing `--bind_cores_to_rank` to the `deepspeed` launch command therefore helps improve ZeRO offload performance. This PR adds the flag to the ZeRO offload tutorial to increase user awareness.
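To make the idea of binding cores to ranks concrete, here is a minimal sketch of one way a host's cores could be divided among local ranks. It is an illustration under stated assumptions, not DeepSpeed's actual implementation: the helper `partition_cores` is hypothetical, and it assumes contiguous, equally sized slices with any leftover cores left unassigned. In practice the flag is simply added to the launch command, e.g. `deepspeed --bind_cores_to_rank train.py` (the script name is a placeholder).

```python
def partition_cores(total_cores: int, num_ranks: int) -> list[range]:
    """Hypothetical helper: split core IDs 0..total_cores-1 into
    contiguous, equally sized slices, one slice per local rank."""
    if num_ranks <= 0 or total_cores < num_ranks:
        raise ValueError("need at least one core per rank")
    # Assumption: leftover cores (if total_cores is not divisible) stay unassigned.
    per_rank = total_cores // num_ranks
    return [range(r * per_rank, (r + 1) * per_rank) for r in range(num_ranks)]


# The setup described in this PR: 128 CPU cores shared by 2 ranks (one per GPU).
slices = partition_cores(128, 2)
for rank, cores in enumerate(slices):
    print(f"rank {rank}: cores {cores.start}-{cores.stop - 1} ({len(cores)} cores)")
```

With each rank confined to its own slice, the CPU optimizer threads of one rank do not compete with those of the other for the same cores, which is the plausible source of the speedup reported here.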

For Qwen2.5-3B fine-tuning on 2 A100-40B cards, running on a host with 128 CPU cores, the average step time is as follows, a nearly 1.3x performance improvement:
without `--bind_cores_to_rank`: 3084.44 ms per step
with `--bind_cores_to_rank`: 2383.16 ms per step
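The "near 1.3x" figure can be checked directly from the two step times above. The short calculation below uses only those two numbers; the variable names are illustrative.

```python
# Average step times reported in this PR, in milliseconds.
step_ms_without = 3084.44  # without --bind_cores_to_rank
step_ms_with = 2383.16     # with --bind_cores_to_rank

speedup = step_ms_without / step_ms_with
time_saved = 1 - step_ms_with / step_ms_without

print(f"speedup: {speedup:.2f}x")                # speedup: 1.29x
print(f"step time reduced by {time_saved:.1%}")  # step time reduced by 22.7%
```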

@delock requested review from loadams and tjruwase as code owners August 8, 2025 06:58
@hwchen2017 merged commit f03d416 into master Aug 8, 2025
2 checks passed
@hwchen2017 deleted the gma/zero_offload_doc branch August 8, 2025 17:34
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: lym <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
Co-authored-by: Olatunji Ruwase <[email protected]>