@zhupengyang
Collaborator

  • On XPU, paddle.distributed.barrier(self.parallel_config.tp_group) causes random errors in other all_reduce calls.
  • As a temporary workaround, replace paddle.distributed.barrier with threading.Barrier.
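
For readers unfamiliar with the substitution, here is a minimal sketch of how a threading.Barrier synchronizes workers. It is not the code from this PR: it assumes the tensor-parallel workers run as threads inside a single process (threading.Barrier cannot synchronize across processes), and the names run_workers and worker are hypothetical.

```python
import threading


def run_workers(num_workers, timeout=10.0):
    """Start num_workers threads that all meet at one barrier.

    Hypothetical illustration only: assumes the workers are threads in the
    same process, which is the only case threading.Barrier can cover.
    """
    barrier = threading.Barrier(num_workers)
    lock = threading.Lock()
    arrivals = []        # ranks that have reached the barrier
    seen_after = []      # arrival count each worker observes after the barrier

    def worker(rank):
        with lock:
            arrivals.append(rank)
        # Blocks until all num_workers threads have called wait().
        # Raises threading.BrokenBarrierError if the timeout expires.
        barrier.wait(timeout=timeout)
        with lock:
            seen_after.append(len(arrivals))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_after


if __name__ == "__main__":
    # Every worker passes the barrier only after all workers have arrived.
    print(run_workers(4))
```

Because no thread gets past barrier.wait() until every thread has registered its arrival, each worker observes the full arrival count afterwards. The timeout is an illustrative safeguard so that a missing worker raises an error instead of hanging.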

@paddle-bot

paddle-bot bot commented Sep 19, 2025

Thanks for your contribution!

hong19860320 previously approved these changes Sep 22, 2025
Collaborator

@hong19860320 hong19860320 left a comment

LGTM. Let's go with this workaround for now and switch back once the XPU paddle.distributed.barrier issue is fixed.

Collaborator

@hong19860320 hong19860320 left a comment

LGTM

@zhupengyang zhupengyang merged commit 9082f62 into PaddlePaddle:develop Sep 23, 2025
25 of 28 checks passed
@zhupengyang zhupengyang deleted the xpu_barrier branch September 23, 2025 04:19
2 participants