distributed_test: Map rank to GPU accordingly #47898
Conversation
If world_size is less than or equal to the number of GPUs available, then each rank can be mapped directly to the corresponding GPU. This fixes the issue referenced in pytorch#45435 and pytorch#47629. For world_size = 3 and 8 GPUs, the rank-to-GPU mapping ends up as 0, 2, 4. This is due to the introduction of barrier (refer pytorch#45181): the tensors in the barrier call are mapped to cuda:0, 1, 2, while the tensors in the actual test cases are mapped to cuda:0, 2, 4, resulting in different streams and leading to a timeout. The issue is specific to the default process group; it is not observed in a new process group, since the streams are created again after the initial barrier call. This patch maps each rank to the corresponding GPU when world_size is less than or equal to the number of GPUs, in this case 0, 1, 2. Note: the barrier function in distributed_c10d.py should include a new parameter to specify the tensor or the rank-to-GPU mapping. In that case, this patch will be redundant but harmless, since the tests can specify tensors with the appropriate GPU rankings.
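As a rough illustration of the mapping change described above, here is a minimal sketch (not the actual distributed_test.py helper; the function name and the fallback for world_size > num_gpus are assumptions made for illustration):

```python
import torch

def rank_to_gpu(rank, world_size, num_gpus=None):
    """Hypothetical helper sketching the rank-to-GPU mapping described in this PR."""
    if num_gpus is None:
        num_gpus = torch.cuda.device_count()
    if world_size <= num_gpus:
        # Fix: map rank i directly to cuda:i, so for world_size = 3 the test
        # tensors land on cuda:0, cuda:1, cuda:2 -- the same devices the
        # initial barrier used.
        return rank
    # When there are more ranks than GPUs, share devices across ranks
    # (assumption for illustration; not part of this PR's change).
    return rank % num_gpus

# Old strided behavior for world_size = 3 and 8 GPUs:
#   stride = 8 // 3 == 2  ->  ranks 0, 1, 2 map to cuda:0, cuda:2, cuda:4,
# which differs from the devices touched by the barrier call and leads to
# the timeout described above.
```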
Hi @jaglinux! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@jeffdaily @pruthvistony please review
💊 CI failures summary and remediations
As of commit 35d02df (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
Codecov Report
@@            Coverage Diff             @@
##           master   #47898      +/-   ##
==========================================
- Coverage   81.25%   81.10%   -0.16%
==========================================
  Files        1839     1839
  Lines      198269   198271       +2
==========================================
- Hits       161107   160805     -302
- Misses      37162    37466     +304
rohan-varma
left a comment
Thank you for this fix! @jeffdaily Would you be able to verify if this fixes the issue on your end? @jaglinux Would you mind signing the CLA when you get a chance? Thank you!
@rohan-varma thanks for the review. I have signed the CLA.
jeffdaily
left a comment
@rohan-varma I completely trust @jaglinux here. He's part of our ROCm PyTorch team and was leading the triage effort for these related issues. He's done all the work of fixing and testing and we've had many discussions about it. Besides the suggested comment change, we're all good!
Co-authored-by: Jeff Daily <[email protected]>
facebook-github-bot
left a comment
@rohan-varma has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@rohan-varma merged this pull request in 1606899.
Summary: The PR #47898 fixes the global tests, hence enabling the tests.
Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
Pull Request resolved: #48023
Reviewed By: malfet, H-Huang
Differential Revision: D25347289
Pulled By: rohan-varma
fbshipit-source-id: 2b519a3046eae1cf1bfba98a125c09b4a6b01fde
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issue referenced in #45435 and #47629.
For world_size = 3 and 8 GPUs, the rank-to-GPU mapping
ends up as 0, 2, 4. This is due to the introduction of barrier
(refer PR #45181):
the tensors in the barrier call are mapped to cuda:0, 1, 2, while the tensors in the
actual test cases are mapped to cuda:0, 2, 4, resulting in different streams and
leading to a timeout. The issue is specific to the default process group.
It is not observed in a new process group, since the streams are created again
after the initial barrier call.
This patch maps each rank to the corresponding GPU when world_size is
less than or equal to the number of GPUs, in this case 0, 1, 2.
Note: the barrier function in distributed_c10d.py should include a new parameter
to specify the tensor or the rank-to-GPU mapping. In that case, this patch will be
redundant but harmless, since the tests can specify tensors with the appropriate
GPU rankings.
Fixes #47629
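Until barrier exposes such a parameter, one way a test can keep the barrier tensor and the test tensors on the same device is to pin each process to its GPU before initializing the process group. The sketch below is a minimal illustration under assumed settings (NCCL backend, env:// rendezvous with MASTER_ADDR/MASTER_PORT set); it is not the PR's code:

```python
import torch
import torch.distributed as dist

def init_worker(rank, world_size):
    # Pin this process to one GPU so every collective -- including the
    # implicit barrier -- allocates its tensors and streams on cuda:rank.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",   # assumes MASTER_ADDR/MASTER_PORT are set
        world_size=world_size,
        rank=rank,
    )
    dist.barrier()              # barrier tensor now lives on cuda:rank
    # Subsequent test tensors created with device=rank stay on the same
    # device/stream, avoiding the mismatch described above.
    t = torch.ones(2, 2, device=rank)
    dist.all_reduce(t)
```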