distributed_test: Map rank to GPU accordingly #47898
Conversation
If world_size is less than or equal to the number of GPUs available, then each rank can be mapped directly to the corresponding GPU. This fixes the issue referenced in pytorch#45435 and pytorch#47629. For world_size = 3 and 8 GPUs, the rank-to-GPU mapping ends up as 0, 2, 4. This is due to the introduction of barrier (refer pytorch#45181): the tensors in the barrier call are mapped to cuda:0, 1, 2, while the tensors in the actual test cases are mapped to cuda:0, 2, 4, resulting in different streams and leading to a timeout. The issue is specific to the default process group; it is not observed in a new process group, since the streams are created again after the initial barrier call. This patch maps each rank to the corresponding GPU when world_size is less than or equal to the number of GPUs, in this case 0, 1, 2. Note: the barrier function in distributed_c10d.py should include a new parameter to specify the tensor or the rank-to-GPU mapping. In that case, this patch will be redundant but harmless, since the tests can specify tensors with the appropriate GPU rankings.
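As a rough illustration of the mapping change described above, here is a minimal sketch (not the actual distributed_test.py helper; the function name and the fallback for world_size > num_gpus are assumptions made for illustration):

```python
import torch

def rank_to_gpu(rank, world_size, num_gpus=None):
    """Hypothetical helper sketching the rank-to-GPU mapping described in this PR."""
    if num_gpus is None:
        num_gpus = torch.cuda.device_count()
    if world_size <= num_gpus:
        # Fix: map rank i directly to cuda:i, so for world_size = 3 the test
        # tensors land on cuda:0, cuda:1, cuda:2 -- the same devices the
        # initial barrier used.
        return rank
    # When there are more ranks than GPUs, share devices across ranks
    # (assumption for illustration; not part of this PR's change).
    return rank % num_gpus

# Old strided behavior for world_size = 3 and 8 GPUs:
#   stride = 8 // 3 == 2  ->  ranks 0, 1, 2 map to cuda:0, cuda:2, cuda:4,
# which differs from the devices touched by the barrier call and leads to
# the timeout described above.
```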
Hi @jaglinux! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@jeffdaily @pruthvistony please review
💊 CI failures summary and remediations
As of commit 35d02df (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
Codecov Report
@@            Coverage Diff             @@
##           master   #47898      +/-   ##
==========================================
- Coverage   81.25%   81.10%   -0.16%
==========================================
  Files        1839     1839
  Lines      198269   198271       +2
==========================================
- Hits       161107   160805     -302
- Misses      37162    37466     +304
rohan-varma
left a comment
Thank you for this fix! @jeffdaily Would you be able to verify if this fixes the issue on your end? @jaglinux Would you mind signing the CLA when you get a chance? Thank you!
@rohan-varma thanks for the review. I have signed the CLA.
jeffdaily
left a comment
@rohan-varma I completely trust @jaglinux here. He's part of our ROCm PyTorch team and was leading the triage effort for these related issues. He's done all the work of fixing and testing and we've had many discussions about it. Besides the suggested comment change, we're all good!
Co-authored-by: Jeff Daily <[email protected]>
facebook-github-bot
left a comment
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@rohan-varma merged this pull request in 1606899.
Summary: The PR #47898 fixes the global tests, hence enabling the tests.
Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
Pull Request resolved: #48023
Reviewed By: malfet, H-Huang
Differential Revision: D25347289
Pulled By: rohan-varma
fbshipit-source-id: 2b519a3046eae1cf1bfba98a125c09b4a6b01fde
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issue referenced in #45435 and #47629.
For world_size = 3 and 8 GPUs, the rank-to-GPU mapping
ends up as 0, 2, 4. This is due to the introduction of barrier
(refer PR #45181):
the tensors in the barrier call are mapped to cuda:0, 1, 2, while the tensors in the
actual test cases are mapped to cuda:0, 2, 4, resulting in different streams and
leading to a timeout. The issue is specific to the default process group.
It is not observed in a new process group, since the streams are created again
after the initial barrier call.
This patch maps each rank to the corresponding GPU when world_size is
less than or equal to the number of GPUs, in this case 0, 1, 2.
Note: the barrier function in distributed_c10d.py should include a new parameter
to specify the tensor or the rank-to-GPU mapping. In that case, this patch will be
redundant but harmless, since the tests can specify tensors with the appropriate
GPU rankings.
Fixes #47629
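Until barrier exposes such a parameter, one way a test can keep the barrier tensor and the test tensors on the same device is to pin each process to its GPU before initializing the process group. The sketch below is a minimal illustration under assumed settings (NCCL backend, env:// rendezvous with MASTER_ADDR/MASTER_PORT set); it is not the PR's code:

```python
import torch
import torch.distributed as dist

def init_worker(rank, world_size):
    # Pin this process to one GPU so every collective -- including the
    # implicit barrier -- allocates its tensors and streams on cuda:rank.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",   # assumes MASTER_ADDR/MASTER_PORT are set
        world_size=world_size,
        rank=rank,
    )
    dist.barrier()              # barrier tensor now lives on cuda:rank
    # Subsequent test tensors created with device=rank stay on the same
    # device/stream, avoiding the mismatch described above.
    t = torch.ones(2, 2, device=rank)
    dist.all_reduce(t)
```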