fix #7188 #7371
Conversation
@lpnpcs, thanks for contributing this fix. I am a bit concerned about the perf impact of synchronizing the device. Are you able to measure the perf before/after the fix? This will help guide whether to pursue finer-grained synchronization on streams instead of the device.
@lpnpcs, please use the following to fix the formatting issues.
Thank you for your review. I ran the following experiments to illustrate the impact on performance: I trained the Qwen2.5-VL-7B model on 8 A100 GPUs with 1,000 samples for 3 epochs. Below are the results for several cases. Overall, adding synchronization makes the code slightly slower than the original, but it avoids the bug. Stream-level synchronization shows some improvement over device-level synchronization; it should also be more precise while still resolving the issue, so I updated the code accordingly.
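For reference, the difference between the two approaches can be sketched as follows. This is only a minimal illustration with placeholder stream names, not the actual PR diff:

```python
import torch

default_stream = torch.cuda.default_stream()   # stream running backward / gradient copies
reduction_stream = torch.cuda.Stream()         # side stream running the async reduction

# Device-level synchronization: blocks the host until *all* work on the device
# has finished. Safe, but it also stalls streams unrelated to the reduction.
torch.cuda.synchronize()

# Stream-level synchronization: the default stream only waits for the work
# already queued on the reduction stream; the host is not blocked and other
# streams keep running, which is why it measured slightly faster above.
default_stream.wait_stream(reduction_stream)
```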
Hi @lpnpcs, can you share your full repo, including source code, dataset, and launch script? I'd be happy to help investigate further if I can reproduce the issue.
Sorry, our dataset is somewhat sensitive. Everything except the dataset is public; we used LLaMA-Factory to fine-tune Qwen2.5-VL-7B. One characteristic of my dataset is that it contains a few pieces of dirty data, which can cause the grad_norm to become extremely large. At that point the gradient turns into NaN, and DeepSpeed then processes it as -1. However, using ZeRO-3, or disabling overlap_comm or contiguous_gradients, resolves the issue.
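For anyone hitting the same symptom, the workaround above corresponds roughly to a ZeRO-2 config like this. Only the zero_optimization keys matter here; the other values are illustrative placeholders:

```python
# Illustrative DeepSpeed config (can be passed as a dict or written to a JSON file).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder value
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": False,             # disabling this avoided the NaN / zero-loss issue
        "contiguous_gradients": False,     # likewise
    },
}
```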
Can you enable overlap_comm and run your code with the CUDA sanitizer, and share the output, if any?
@lpnpcs, please fix the conflict. Thanks!
I ran into a similar issue recently while training Qwen2.5-VL-7B, and my solution is pretty similar. I think the key issue was already identified in issues like #5545 and #5606: the default stream should wait for the reduction stream. PR #5606 claimed to have fixed this, and it does in some cases, but I think the default stream should wait for the reduction stream at the end of reduce_ipg_grads, because the later code in reduce_independent_p_g_buckets_and_remove_grads modifies the buffer at the same time. This PR fixes the issue as well, since it synchronizes after the reduction.
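To make the ordering problem concrete, here is a minimal sketch of the pattern described above. It is simplified and not DeepSpeed's actual code; the buffer and the in-place scaling stand in for the gradient bucket and the allreduce:

```python
import torch

default_stream = torch.cuda.default_stream()
reduction_stream = torch.cuda.Stream()

ipg_buffer = torch.randn(1 << 20, device="cuda")   # stand-in for the gradient bucket

# The side stream must not start reading the buffer before the default stream
# has finished filling it.
reduction_stream.wait_stream(default_stream)
with torch.cuda.stream(reduction_stream):
    ipg_buffer.mul_(1.0 / 8)   # stand-in for all_reduce(ipg_buffer)

# The point made above: the default stream has to wait for the reduction stream
# before later code (reduce_independent_p_g_buckets_and_remove_grads) reuses or
# overwrites the buffer; otherwise the write below races with the reduction.
default_stream.wait_stream(reduction_stream)
ipg_buffer.zero_()             # safe only after the wait above
```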
@lpnpcs - would you be able to resolve the conflicts on this so we can get it merged?
Done! |
Signed-off-by: vinceliu <[email protected]>
In this way we have a synchronization point both before and after the reduction, which means overlap_comm effectively no longer works. Is this the proper solution?



I found that when using DeepSpeed ZeRO-2 for my training task, the loss becomes 0 at the third step with a grad_norm of 1.414. This issue doesn't occur when using ZeRO-3. I found the same issue reported in #7188. After conducting a series of experiments, I identified the cause: there is a synchronization problem with the double ipg_buffer swapping. The issue was resolved after these modifications.
before: (screenshot)
after: (screenshot)
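A rough sketch of where the double-buffer race appears and where the added synchronization goes. This is illustrative only; the loop, buffer sizes, and the scaling op are placeholders for DeepSpeed's actual bucketing and allreduce:

```python
import torch

default_stream = torch.cuda.default_stream()
reduction_stream = torch.cuda.Stream()

# Two alternating ipg buffers, as in ZeRO-2's double buffering.
ipg_buffers = [torch.empty(1 << 20, device="cuda") for _ in range(2)]
index = 0

for step in range(4):
    buf = ipg_buffers[index]

    # Without this wait, refilling `buf` below can race with a reduction that is
    # still reading the same buffer on the side stream -- the bug addressed here.
    default_stream.wait_stream(reduction_stream)

    buf.normal_()                        # stand-in for copying grads into the bucket
    reduction_stream.wait_stream(default_stream)
    with torch.cuda.stream(reduction_stream):
        buf.mul_(1.0 / 8)                # stand-in for the async all_reduce

    index = (index + 1) % 2              # swap to the other ipg buffer
```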