
Conversation

@pritamdamania87 (Contributor) commented Sep 23, 2020

Stack from ghstack:

`init_process_group` and `new_group` update a number of global
variables after initializing the actual process group. As a result, there is a
race: after initializing the process group on, say, rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might get
an error since rank 1 hasn't yet updated its `_default_pg` variable.

To resolve this issue, I've added a barrier() at the end of both of these calls.
This ensures that once these calls return, initialization is guaranteed to have
completed correctly on all ranks.

Since these calls are typically made only during initialization, the added
overhead of a barrier() here should be acceptable.

Closes: #40434, #40378

Differential Revision: [D23858025](https://our.internmc.facebook.com/intern/diff/D23858025/)
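
To make the race concrete, here is a minimal, self-contained sketch of the fix's shape, using threads as stand-ins for ranks and a `threading.Barrier` as a stand-in for the collective barrier() (the names and structure are illustrative, not the actual PyTorch source):

```python
import threading

WORLD_SIZE = 2
_default_pg = {}  # stand-in for the per-process global updated after init
_post_init_barrier = threading.Barrier(WORLD_SIZE)  # stand-in for barrier()

def init_process_group(rank):
    # Stand-in for the real backend rendezvous/initialization.
    _default_pg[rank] = f"pg(rank={rank})"
    # The fix: no rank returns from init until every rank has updated its
    # globals, so a peer that observes a returned init call can safely
    # read the default process group.
    _post_init_barrier.wait()

threads = [threading.Thread(target=init_process_group, args=(r,))
           for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(_default_pg) == WORLD_SIZE
```

The key property is the final wait: once any rank's call returns, every rank has already stored its process group, so a peer checking the default group (e.g., via RPC) cannot observe it unset.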

pritamdamania87 pushed a commit that referenced this pull request Sep 23, 2020

ghstack-source-id: 112656257
Pull Request resolved: #45181
@mrshenli (Contributor) left a comment:

Stamp to unblock.

# barrier at the end to ensure that once we return from this method, all
# process groups including global variables are updated correctly on all
# ranks.
barrier()
Contributor:

We might need to guard this with a try-except or if-else, as it's possible that third-party c10d extensions do not support barrier. See #45126.
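
One possible shape for such a guard, sketched under the assumption that the module-level `barrier` from `torch.distributed` is in scope (this is not the merged code):

```python
import warnings

from torch.distributed import barrier

try:
    barrier()
except RuntimeError:
    # Hypothetical fallback: some third-party c10d backends may raise an
    # unsupported-operation error from barrier().
    warnings.warn("backend does not support barrier(); "
                  "skipping post-init synchronization")
```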

@pritamdamania87 (author):

I'm wondering how 3rd party extensions would work without supporting barrier? Currently ProcessGroup.hpp clearly defines the barrier method as pure virtual, so we should probably require extensions to support it. Do you know if there are any existing 3rd party extensions that do not support it? It looks like oneCCL and torch-ucc do support barrier().

Contributor:

IIRC, we have one test c10d extension whose barrier() implementation just throws an unsupported error.
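
For illustration, such a test extension's barrier might look like this (an assumed sketch, not the actual test code):

```python
class DummyProcessGroup:
    """Hypothetical stand-in for a test c10d extension."""

    def barrier(self):
        raise RuntimeError("DummyProcessGroup does not support barrier")
```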

@dr-ci bot commented Sep 23, 2020

💊 CI failures summary and remediations

As of commit 408ede9 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

@lw (Contributor) commented Sep 23, 2020

How did you check that this fixes those failures? The OSX job on CircleCI is failing (for another reason) and from its logs it seems it's not running the DdpUnderDistAutogradTest suite.

@pritamdamania87 (author) replied:

> How did you check that this fixes those failures? The OSX job on CircleCI is failing (for another reason) and from its logs it seems it's not running the DdpUnderDistAutogradTest suite.

Shen pointed me to this: https://circleci.com/docs/2.0/ssh-access-jobs/. Basically, you re-run an existing CircleCI job with SSH enabled. So I SSHed into the VM and ran the test 100 times; some runs failed. Then I applied the fix from this PR locally and ran the test 100 times again; all runs passed.
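
A sketch of that kind of stress loop (the test file path here is illustrative, not the exact command used):

```python
import subprocess

failures = 0
for _ in range(100):
    # Hypothetical test target; substitute the actual distributed test file.
    result = subprocess.run(
        ["python", "test/distributed/test_ddp_under_dist_autograd.py"],
        capture_output=True,
    )
    if result.returncode != 0:
        failures += 1
print(f"{failures}/100 runs failed")
```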

pritamdamania87 pushed a commit that referenced this pull request Sep 23, 2020
Pull Request resolved: #45181

ghstack-source-id: 112735141
# barrier at the end to ensure that once we return from this method, all
# process groups including global variables are updated correctly on all
# ranks.
barrier()
Contributor:

cc @agolynski FYI: after this change, c10d extensions cannot throw in barrier(); otherwise they won't be able to initialize default process groups.

pritamdamania87 pushed a commit that referenced this pull request Sep 24, 2020
Pull Request resolved: #45181

ghstack-source-id: 112851804
@codecov bot commented Sep 25, 2020

Codecov Report

Merging #45181 into gh/pritamdamania87/164/base will decrease coverage by 0.24%.
The diff coverage is 16.66%.


@@                       Coverage Diff                       @@
##           gh/pritamdamania87/164/base   #45181      +/-   ##
===============================================================
- Coverage                        68.08%   67.84%   -0.25%     
===============================================================
  Files                              393      384       -9     
  Lines                            50960    50051     -909     
===============================================================
- Hits                             34697    33955     -742     
+ Misses                           16263    16096     -167     
| Impacted Files | Coverage Δ |
| --- | --- |
| .../testing/_internal/distributed/distributed_test.py | 30.82% <0.00%> (-0.04%) ⬇️ |
| torch/distributed/distributed_c10d.py | 28.34% <50.00%> (+1.01%) ⬆️ |
| torch/nn/modules/distance.py | 64.00% <0.00%> (-20.00%) ⬇️ |
| torch/utils/_benchmark/utils/common.py | 77.68% <0.00%> (-13.23%) ⬇️ |
| torch/testing/_internal/common_cuda.py | 54.21% <0.00%> (-9.83%) ⬇️ |
| torch/backends/cuda/__init__.py | 62.50% <0.00%> (-8.34%) ⬇️ |
| torch/nn/modules/loss.py | 93.97% <0.00%> (-3.78%) ⬇️ |
| torch/quantization/__init__.py | 86.66% <0.00%> (-0.84%) ⬇️ |
| torch/quantization/quantize.py | 90.27% <0.00%> (-0.82%) ⬇️ |
| torch/quantization/fx/quantize.py | 96.79% <0.00%> (-0.46%) ⬇️ |

... and 43 more


pritamdamania87 pushed a commit that referenced this pull request Sep 25, 2020
Pull Request resolved: #45181

ghstack-source-id: 112880857
pritamdamania87 pushed a commit that referenced this pull request Sep 25, 2020
Pull Request resolved: #45181

ghstack-source-id: 112923112
@facebook-github-bot (Contributor) commented:

This pull request has been merged in a2b4177.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/164/head branch September 29, 2020 14:23
pritamdamania87 pushed a commit that referenced this pull request Oct 1, 2020
Prior to #45181, initializing a NCCL process group would work even if no GPUs
were present. However, now that init_process_group calls `barrier()`, this fails.

More generally, the problem was that we could initialize ProcessGroupNCCL without
GPUs, and then calling a method like `barrier()` would crash the process, since
we compute `% numGPUs` and hit a division by zero.

Differential Revision: [D24038839](https://our.internmc.facebook.com/intern/diff/D24038839/)
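
A sketch of the hazard and the obvious guard, in Python for illustration (the actual code lives in ProcessGroupNCCL's C++ implementation; the function name here is hypothetical):

```python
import torch

def device_for_rank(rank: int) -> int:
    num_gpus = torch.cuda.device_count()
    # Without this guard, `rank % num_gpus` raises ZeroDivisionError on a
    # GPU-less host, which is the crash described above.
    if num_gpus == 0:
        raise RuntimeError("ProcessGroupNCCL requires at least one GPU")
    return rank % num_gpus
```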

pritamdamania87 pushed a commit that referenced this pull request Oct 1, 2020
ghstack-source-id: 113302257
Pull Request resolved: #45642
pritamdamania87 pushed a commit that referenced this pull request Oct 2, 2020
pritamdamania87 pushed a commit that referenced this pull request Oct 2, 2020
Pull Request resolved: #45642

ghstack-source-id: 113490343

facebook-github-bot pushed a commit that referenced this pull request Oct 5, 2020
Summary:
Pull Request resolved: #45642


Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
pritamdamania87 pushed a commit that referenced this pull request Oct 9, 2020

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

malfet pushed a commit that referenced this pull request Oct 12, 2020
(#46073)


Co-authored-by: Pritam Damania <[email protected]>
jaglinux added a commit to jaglinux/pytorch that referenced this pull request Nov 13, 2020
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issue referenced in pytorch#45435 and pytorch#47629.

For world_size = 3 and 8 GPUs, the rank-to-GPU mapping used to be 0, 2, 4.
Due to the introduction of the barrier (refer pytorch#45181),
the tensors in the barrier are mapped to cuda:0, 1, 2, while the tensors in the
actual test cases are mapped to cuda:0, 2, 4, resulting in different streams and
leading to a timeout. This issue is specific to the default process group.
It is not observed in a new process group, since the streams are created again
after the initial barrier call.

This patch maps each rank to the corresponding GPU when world_size is
less than or equal to the number of GPUs, in this case 0, 1, 2.

Note: the barrier function in distributed_c10d.py should take a new parameter
to specify the tensor or rank-to-GPU mapping. In that case, this patch would be
redundant but harmless, since the tests could specify tensors with the appropriate
GPU rankings.
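
A sketch of the mapping change being described (illustrative names, not the exact test-harness code):

```python
def first_gpu_for_rank(rank: int, world_size: int, num_gpus: int) -> int:
    if world_size <= num_gpus:
        # Patched behavior: rank i maps directly to cuda:i, matching the
        # devices the initial barrier() uses (cuda:0, 1, 2 for world_size=3).
        return rank
    # Otherwise keep partitioning the visible GPUs across ranks.
    return rank * (num_gpus // world_size)

# Pre-patch, 3 ranks on 8 GPUs used first GPUs 0, 2, 4 (rank * (8 // 3)),
# while barrier() used cuda:0, 1, 2: different devices, hence the timeout.
```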
facebook-github-bot pushed a commit that referenced this pull request Nov 14, 2020

Fixes #47629

Pull Request resolved: #47898

Reviewed By: smessmer

Differential Revision: D24956021

Pulled By: rohan-varma

fbshipit-source-id: a88257f22a7991ba36566329766c106d3360bb4e