
Conversation

@jaglinux
Contributor

If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issue referenced in #45435 and #47629.

For world_size = 3 and 8 GPUs, the rank-to-GPU mapping used by the tests
is 0,2,4. Since the introduction of the barrier call (refer to PR #45181),
the tensors used by the barrier are placed on cuda:0,1,2 while the tensors
in the actual test cases are placed on cuda:0,2,4, resulting in different
streams and leading to a timeout. This issue is specific to the default
process group; it is not observed in a new process group, since the streams
are created again after the initial barrier call.
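
To make the mismatch concrete, here is a minimal illustrative sketch (not the actual test-suite code; the variable names are made up) of the two mappings described above for world_size = 3 on an 8 GPU machine:

```python
# Illustrative only; names do not come from the test suite.
world_size = 3
num_gpus = 8

# Spreading the visible GPUs evenly across the ranks puts the test tensors
# on cuda:0, cuda:2 and cuda:4.
gpus_per_rank = num_gpus // world_size                       # 2
test_gpu = {rank: rank * gpus_per_rank for rank in range(world_size)}
print(test_gpu)      # {0: 0, 1: 2, 2: 4}

# The barrier introduced in PR #45181 places its tensor on cuda:<rank>,
# i.e. cuda:0, cuda:1 and cuda:2, so ranks 1 and 2 end up touching a
# different device (and hence a different stream) than their test tensors.
barrier_gpu = {rank: rank for rank in range(world_size)}
print(barrier_gpu)   # {0: 0, 1: 1, 2: 2}
```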

This patch maps each rank to the corresponding GPU when world_size is
less than or equal to the number of GPUs, in this case 0,1,2.
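
A minimal sketch of the mapping rule just described, assuming a hypothetical helper name (this is not the exact code changed by the patch):

```python
# Hypothetical helper illustrating the mapping rule described above;
# not the exact code changed by this patch.
def rank_to_gpu(rank: int, world_size: int, num_gpus: int) -> int:
    assert world_size <= num_gpus, "the rule below only covers this case"
    # Rank i now uses cuda:i, the same device the initial barrier uses,
    # so the barrier and the test tensors share one stream per rank.
    return rank

# For world_size = 3 on an 8 GPU machine this yields 0, 1, 2 instead of
# the previous 0, 2, 4.
print([rank_to_gpu(r, 3, 8) for r in range(3)])   # [0, 1, 2]
```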

Note: the barrier function in distributed_c10d.py should accept a new parameter
to specify the tensor or rank-to-GPU mapping. In that case this patch will be
redundant but harmless, since the tests can then specify tensors on the
appropriate GPUs.
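
For illustration only, a hedged sketch of the kind of helper that note envisions; barrier_on below is a made-up function, not part of distributed_c10d.py or of this patch:

```python
import torch
import torch.distributed as dist

def barrier_on(device: torch.device) -> None:
    """Hypothetical helper (not part of distributed_c10d.py) sketching the
    note above: synchronize on a tensor the caller pins to an explicit GPU,
    so the barrier and the test tensors share the same device and stream."""
    t = torch.ones(1, device=device)
    # all_reduce returns only after every rank has contributed, so it
    # behaves like a barrier on the chosen device.
    dist.all_reduce(t)

# e.g. inside a test, with the process group already initialized:
# barrier_on(torch.device("cuda", rank))
```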

Fixes #47629

@facebook-github-bot
Contributor

Hi @jaglinux!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@facebook-github-bot added the oncall: distributed label on Nov 13, 2020
@jaglinux
Contributor Author

@jeffdaily @pruthvistony please review

@dr-ci

dr-ci bot commented Nov 13, 2020

💊 CI failures summary and remediations

As of commit 35d02df (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_coverage_build (1/1)

Step: "Build"

Nov 13 18:41:58 sccache: error: couldn't connect to server
Nov 13 18:41:58 +++ eval 'extract_trap_cmd ' 
Nov 13 18:41:58 ++++ extract_trap_cmd 
Nov 13 18:41:58 ++++ printf '%s\n' '' 
Nov 13 18:41:58 +++ printf '%s\n' cleanup 
Nov 13 18:41:58 ++ trap -- ' 
Nov 13 18:41:58 cleanup' EXIT 
Nov 13 18:41:58 ++ [[ pytorch-linux-bionic-py3.8-gcc9-coverage-build != *pytorch-win-* ]] 
Nov 13 18:41:58 ++ which sccache 
Nov 13 18:41:58 ++ sccache --stop-server 
Nov 13 18:41:58 Stopping sccache server... 
Nov 13 18:41:58 sccache: error: couldn't connect to server 
Nov 13 18:41:58 sccache: caused by: Connection refused (os error 111) 
Nov 13 18:41:58 ++ true 
Nov 13 18:41:58 ++ rm /var/lib/jenkins/sccache_error.log 
Nov 13 18:41:58 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Nov 13 18:41:58 ++ true 
Nov 13 18:41:58 ++ [[ pytorch-linux-bionic-py3.8-gcc9-coverage-build == *rocm* ]] 
Nov 13 18:41:58 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Nov 13 18:41:58 ++ SCCACHE_IDLE_TIMEOUT=1200 
Nov 13 18:41:58 ++ RUST_LOG=sccache::server=error 
Nov 13 18:41:58 ++ sccache --start-server 

Extra GitHub checks: 1 failed



@codecov

codecov bot commented Nov 13, 2020

Codecov Report

Merging #47898 (237ab81) into master (8ff0b6f) will decrease coverage by 0.15%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #47898      +/-   ##
==========================================
- Coverage   81.25%   81.10%   -0.16%     
==========================================
  Files        1839     1839              
  Lines      198269   198271       +2     
==========================================
- Hits       161107   160805     -302     
- Misses      37162    37466     +304     

Contributor

@rohan-varma left a comment

Thank you for this fix! @jeffdaily Would you be able to verify if this fixes the issue on your end? @jaglinux Would you mind signing the CLA when you get a chance? Thank you!

@jaglinux
Contributor Author

@rohan-varma thanks for the review. I signed the CLA.
I have tested the patch on an 8 GPU machine with world_size = 3.
All the test cases, including the global test cases on the default process group, passed.

Collaborator

@jeffdaily left a comment

@rohan-varma I completely trust @jaglinux here. He's part of our ROCm PyTorch team and was leading the triage effort for these related issues. He's done all the work of fixing and testing and we've had many discussions about it. Besides the suggested comment change, we're all good!

Contributor

@facebook-github-bot left a comment

@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@rohan-varma
Contributor

rohan-varma commented Nov 14, 2020

Did some local testing and I believe this PR should also fix #47676, #47675, and #47674

@facebook-github-bot
Contributor

@rohan-varma merged this pull request in 1606899.


jaglinux added a commit to jaglinux/pytorch that referenced this pull request Nov 16, 2020
The PR pytorch#47898 fixes the global tests. Hence enabling the tests.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
facebook-github-bot pushed a commit that referenced this pull request Dec 5, 2020
Summary:
The PR #47898 fixes the global tests. Hence enabling the tests.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>

Pull Request resolved: #48023

Reviewed By: malfet, H-Huang

Differential Revision: D25347289

Pulled By: rohan-varma

fbshipit-source-id: 2b519a3046eae1cf1bfba98a125c09b4a6b01fde

Labels

cla signed, Merged, oncall: distributed, open source


Development

Successfully merging this pull request may close these issues.

DDP mismatch in rank to GPU selection

5 participants