
Conversation

@lpnpcs (Contributor) commented Jun 19, 2025

I found that when using DeepSpeed ZeRO-2 for my training task, the loss becomes 0 at the third step with a grad_norm of 1.414. This issue doesn't occur with ZeRO-3. The same issue is reported in #7188. After a series of experiments, I identified the cause: there is a synchronization problem in the double ipg_buffer swapping. The issue was resolved after making the modifications in this PR.

before
![image](https://github.com/user-attachments/assets/981d0829-e15f-4899-ae2c-4eca16ef138d)

after
![image](https://github.com/user-attachments/assets/8b6b8403-d5df-4aa8-b573-195b9ee1fdfb)
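
For readers following along, here is a minimal sketch of the race and of the device-level synchronization the fix introduces. The names `ipg_buffer`, `ipg_index`, and the helper function are illustrative stand-ins, not the exact ZeRO-2 code in this PR:

```python
from deepspeed.accelerator import get_accelerator

# Illustrative only: with double buffering, the next bucket starts filling
# ipg_buffer[1 - ipg_index] while the asynchronous reduction of
# ipg_buffer[ipg_index] may still be in flight on the reduction stream.
# Without a synchronization point, the new writes can race with the pending
# reduce and corrupt the gradients being averaged.

def swap_ipg_buffer_with_device_sync(ipg_index: int) -> int:
    # Coarse, device-level fix: wait for all outstanding device work
    # (including the in-flight reduction) before reusing the other buffer.
    get_accelerator().synchronize()
    return 1 - ipg_index
```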

@tjruwase (Contributor) commented
@lpnpcs, thanks for contributing this fix. I am a bit concerned about the perf impact of synchronizing the device. Are you able to measure the perf before/after the fix? This will help guide whether to pursue finer-grained synchronization on streams instead of the device.

@lpnpcs (Contributor, Author) commented Jun 20, 2025

> @lpnpcs, thanks for contributing this fix. I am a bit concerned about the perf impact of synchronizing the device. Are you able to measure the perf before/after the fix? This will help guide whether to pursue finer-grained synchronization on streams instead of the device.

Thank you for your review. I conducted the following experiments to illustrate the impact on performance.

I trained the Qwen2.5-VL-7B model on 8 A100 GPUs with 1,000 samples for 3 epochs. Below is the performance in each case.

1. Original code: (screenshot)
2. Device-level synchronization: (screenshot)
3. Stream-level synchronization: (screenshot)

Overall, adding synchronization makes the code slightly slower than the original, but it avoids the bug. Stream-level synchronization shows some improvement over device-level synchronization. Since stream-level synchronization is more precise and also resolves the issue, I updated the code to use it.
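
For reference, the difference between cases 2 and 3 boils down to the following (a sketch, assuming a CUDA device; the stream here is a stand-in for the ZeRO reduction stream):

```python
import torch

# Stand-in for the ZeRO reduction stream; the name is illustrative.
reduction_stream = torch.cuda.Stream()

# Case 2: device-level synchronization. Blocks the host until every stream on
# the device has drained - safe, but it also stalls work unrelated to the
# gradient reduction.
torch.cuda.synchronize()

# Case 3: stream-level synchronization. Only makes the current (default)
# stream wait for the work already enqueued on the reduction stream; the host
# is not blocked and other streams keep running, so less overlap is lost.
torch.cuda.current_stream().wait_stream(reduction_stream)
```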

@hwchen2017 (Contributor) commented
Hi @lpnpcs, can you share your full repo - including source code, dataset, and launch script? I’d be happy to help investigate further if I can reproduce the issue.

@lpnpcs (Contributor, Author) commented Jun 23, 2025

> Hi @lpnpcs, can you share your full repo - including source code, dataset, and launch script? I’d be happy to help investigate further if I can reproduce the issue.

Sorry, our dataset is somewhat sensitive. Everything except the dataset is public; we used LLaMA-Factory to fine-tune Qwen2.5-VL-7B. One characteristic of my dataset is that it contains a few pieces of dirty data, which can make the grad_norm extremely large. At that point the gradient becomes NaN, and DeepSpeed then processes it as -1. However, using ZeRO-3, or disabling overlap_comm or contiguous_gradients, resolves the issue.
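
As an aside, a small check like the sketch below (the helper is purely illustrative, not a DeepSpeed API) can help confirm whether dirty samples are producing non-finite gradients before DeepSpeed reports the grad_norm as -1:

```python
import torch

def has_nonfinite_grads(model: torch.nn.Module) -> bool:
    """Return True if any parameter gradient contains NaN or Inf."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradient in {name}")
            return True
    return False

# Example: call this right after the backward pass to flag the batches whose
# dirty samples blow up the gradient norm.
```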

@hwchen2017 (Contributor) commented
Can you enable overlap_comm and run your code with the CUDA sanitizer, and share the output if any?
TORCH_CUDA_SANITIZER=1 python your_code.py

FYI: https://pytorch.org/docs/stable/cuda._sanitizer.html
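
If setting the environment variable through a launcher is inconvenient, the same docs page describes a programmatic switch; a minimal sketch:

```python
# Same effect as TORCH_CUDA_SANITIZER=1, but enabled from inside the script.
# Must be called before any CUDA work is issued.
import torch.cuda._sanitizer as csan

csan.enable_cuda_sanitizer()
```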

@sfc-gh-truwase (Collaborator) commented
@lpnpcs, please fix conflict. Thanks!

@jhwei commented Jul 18, 2025

I found a similar issue recently while training Qwen2.5-VL-7B too. My solution is pretty similar.

I think the key issue was identified in issues like #5545 and #5606: the default stream should wait for the reduction stream.

PR #5606 claimed to have fixed this issue, but it only does so in some cases. I think the default stream should wait for the reduction stream at the end of reduce_ipg_grads, because the later code in reduce_independent_p_g_buckets_and_remove_grads modifies the buffer at the same time.

This PR fixes the issue as well, since it synchronizes after reduce_ipg_grads. I think the reduction stream does not need to wait for the default stream (the current stream) at this point, because the default stream does not modify the buffer at the same time.
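
To make the ordering concrete, here is a minimal sketch of the dependency described above; the function name is a stand-in, not the actual ZeRO-2 implementation:

```python
import torch

def reduce_ipg_grads_sketch(reduction_stream: torch.cuda.Stream) -> None:
    # ... the asynchronous reduce of the current ipg buffer has been enqueued
    # on reduction_stream above this point ...

    # Proposed ordering: before returning, make the default (current) stream
    # wait for the reduction stream, because the caller
    # (reduce_independent_p_g_buckets_and_remove_grads) immediately starts
    # refilling the buffer on the default stream.
    torch.cuda.current_stream().wait_stream(reduction_stream)

    # The reverse edge (reduction stream waiting on the default stream) is not
    # needed here, since the default stream does not touch the buffer while
    # the reduction is in flight.
```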

@loadams (Collaborator) commented Jul 28, 2025

@lpnpcs - would you be able to resolve the conflicts on this and we can get it merged?

@lpnpcs (Contributor, Author) commented Jul 29, 2025

> @lpnpcs - would you be able to resolve the conflicts on this and we can get it merged?

Done!

@loadams (Collaborator) commented Jul 29, 2025

> @lpnpcs - would you be able to resolve the conflicts on this and we can get it merged?
>
> Done!

Thanks, @lpnpcs - could you resolve the formatting fixes as well then we can merge?

@lpnpcs (Contributor, Author) commented Jul 30, 2025

> @lpnpcs - would you be able to resolve the conflicts on this and we can get it merged?
>
> Done!
>
> Thanks, @lpnpcs - could you resolve the formatting fixes as well then we can merge?

Done.

loadams enabled auto-merge (squash) August 4, 2025 18:20
loadams merged commit f897b67 into deepspeedai:master on Aug 4, 2025
9 checks passed
@GuCarpenter commented
In this way, we have a synchronization point both before and after reduction, which means overlap_comm no longer works. Is this the proper solution?
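
For context, overlap_comm is the ZeRO knob at issue here; a typical stage-2 config that hits this code path looks roughly like the following (values are only an example):

```python
# Example DeepSpeed config dict (values illustrative). With overlap_comm
# enabled, gradient reduction runs on a separate stream and overlaps with the
# backward pass - exactly the overlap that extra synchronization points
# before/after reduction can eat into.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```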

LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
@sfc-gh-truwase (Collaborator) commented
@anyinlover thanks for raising concerns about overlap_comm perf. Do you have any numbers that show degradation?

@lpnpcs and @jhwei I wonder if you have any thoughts on this? Did you measure with overlap_comm in your experiments?

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025