Skip to content

Conversation

@pritamdamania87
Copy link
Contributor

@pritamdamania87 pritamdamania87 commented Oct 1, 2020

Stack from ghstack:

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls barrier() this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like barrier() the process would crash
since we do % numGPUs resulting in division by zero.

Differential Revision: D24038839

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.

Differential Revision: [D24038839](https://our.internmc.facebook.com/intern/diff/D24038839/)

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Oct 1, 2020
pritamdamania87 pushed a commit that referenced this pull request Oct 1, 2020
Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.

Differential Revision: [D24038839](https://our.internmc.facebook.com/intern/diff/D24038839/)

ghstack-source-id: 113302257
Pull Request resolved: #45642
@codecov
Copy link

codecov bot commented Oct 1, 2020

Codecov Report

Merging #45642 into gh/pritamdamania87/167/base will decrease coverage by 0.00%.
The diff coverage is n/a.

Impacted file tree graph

@@                       Coverage Diff                       @@
##           gh/pritamdamania87/167/base   #45642      +/-   ##
===============================================================
- Coverage                        68.66%   68.66%   -0.01%     
===============================================================
  Files                              406      406              
  Lines                            52223    52223              
===============================================================
- Hits                             35858    35857       -1     
- Misses                           16365    16366       +1     
Impacted Files Coverage Δ
torch/testing/_internal/expecttest.py 77.55% <0.00%> (-1.03%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 181afd5...bd16074. Read the comment docs.

@osalpekar osalpekar self-requested a review October 2, 2020 18:56
Copy link
Member

@osalpekar osalpekar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be cherry-picked into 1.7, since #45181 made it in?

@pritamdamania87
Copy link
Contributor Author

Will this be cherry-picked into 1.7, since #45181 made it in?

Yes, that is the plan.

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.

Differential Revision: [D24038839](https://our.internmc.facebook.com/intern/diff/D24038839/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Oct 2, 2020
Pull Request resolved: #45642

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Differential Revision: [D24038839](https://our.internmc.facebook.com/intern/diff/D24038839/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in b5a2f04.

pritamdamania87 pushed a commit that referenced this pull request Oct 9, 2020
)

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

---- Original Commit Description Follows ---

Pull Request resolved: #45642

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/167/head branch October 9, 2020 14:18
malfet pushed a commit that referenced this pull request Oct 12, 2020
) (#46073)

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

---- Original Commit Description Follows ---

Pull Request resolved: #45642

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc

Co-authored-by: Pritam Damania <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants