Skip to content

Conversation

@pritamdamania87
Copy link
Contributor

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

---- Original Commit Description Follows ---

Pull Request resolved: #45642

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls barrier() this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like barrier() the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc

)

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

---- Original Commit Description Follows ---

Pull Request resolved: #45642

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
@codecov
Copy link

codecov bot commented Oct 9, 2020

Codecov Report

Merging #46073 into release/1.7 will decrease coverage by 0.00%.
The diff coverage is 62.50%.

Impacted file tree graph

@@               Coverage Diff               @@
##           release/1.7   #46073      +/-   ##
===============================================
- Coverage        68.46%   68.46%   -0.01%     
===============================================
  Files              406      406              
  Lines            52267    52275       +8     
===============================================
+ Hits             35785    35790       +5     
- Misses           16482    16485       +3     
Impacted Files Coverage Δ
torch/testing/_internal/common_distributed.py 67.36% <62.50%> (-0.17%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 653d766...9af59ca. Read the comment docs.

@malfet malfet merged commit 305515d into release/1.7 Oct 12, 2020
@facebook-github-bot facebook-github-bot deleted the nccl_gpu_1.7_cherry_pick branch January 27, 2021 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants