Tests Generalization for multiple accelerator devices #139749
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139749
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit bbd0afc with merge base 73278e6. UNSTABLE: the following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorchbot label "topic: not user facing"

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Successfully rebased 5a25c57 to 468c0a4

hi @kwen2501, can you please review and approve the changes?
kwen2501 left a comment
Thanks for the effort. Overall looks good to me. I left a couple of comments. I think the goal is to use `device_type` as much as possible and reduce the dependency on `cuda` or `hpu` strings. You did that well in most places; just need to cover the few that are left. Thanks!
For my education, why is there a numerical difference?
corrected.
Can you please restore the original format?
Is this `device_type` definition used somewhere, given that there is a

```python
@property
def device_type(self) -> str:
```

below?

nit: can you restore the original two-line formatting?
Can you make `device_id` a property of the class?

```python
@property
def device_id(self) -> torch.device:
    device_count = torch.get_device_module(self.device_type).device_count()
    return torch.device(self.device_type, self.rank % device_count)
```

Thanks!
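For illustration, the rank-to-device round-robin behind the suggested property can be sketched without torch; this is a minimal stand-in where the device count is injected rather than obtained from `torch.get_device_module(...)`, and the class name is hypothetical:

```python
class DistTestBase:
    """Torch-free sketch of the suggested device_id property.

    The real version would derive the count from
    torch.get_device_module(self.device_type).device_count(); here it is
    injected so the example is self-contained.
    """

    def __init__(self, rank: int, device_type: str, device_count: int):
        self.rank = rank
        self.device_type = device_type
        self._device_count = device_count

    @property
    def device_id(self) -> str:
        # round-robin: rank N lands on device N % device_count
        return f"{self.device_type}:{self.rank % self._device_count}"
```

With 4 devices, rank 5 wraps around to device 1 regardless of whether the device type is `cuda`, `hpu`, or anything else.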
Same
I wonder if we could truly generalize this line rather than adding another device string here?
I wonder if we could first define `DEVICE_TYPE` and then `DEVICE_COUNT`, so that we can call `_get_device_module(DEVICE_TYPE)` and then `device_count()`?
I wonder if we could build a map instead? Thanks!
How about reusing the `PG_BACKEND` map above?
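A minimal sketch of the map-based approach; the actual `PG_BACKEND` contents in the PR may differ, and the `hccl`/`xccl` entries are assumptions for out-of-tree backends:

```python
# hypothetical device-type -> process-group-backend map
PG_BACKEND = {"cpu": "gloo", "cuda": "nccl", "hpu": "hccl", "xpu": "xccl"}


def backend_for(device_type: str) -> str:
    # a dict lookup with a default replaces the if...else chain;
    # unknown device types fall back to gloo
    return PG_BACKEND.get(device_type, "gloo")
```

Adding a new accelerator then means adding one map entry rather than another branch in each test.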
If we still keep those `if...else` statements, then a mapping list must be maintained and changed for each new backend added, even if it's out-of-tree.

Then, is it possible to use an `undefined` backend which will find the corresponding device backend by device type in process group init? e.g. `nccl` for cuda, `gloo` for cpu, `hccl` for hpu and `xccl` for xpu, if `hccl` and `xccl` are registered out-of-tree.

In code logic:
- Init process group with the `undefined` backend.
  https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L1569
- By backend name, query a `backend_config` which has a `device_backend_map`.
  https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L1816
  The `undefined` backend name will use the default device backend map, which is `gloo` for cpu and `nccl` for cuda. Out-of-tree backends will also be added to this default map by https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L254
- Register the available backends in `backend_config.get_device_backend_map()` to the process group.
  https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L1818
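The resolution logic described in this comment can be sketched as a small standalone function; the function name and signature are hypothetical, and only the cpu/cuda defaults are taken from the comment:

```python
# default device->backend map, as described in the comment above
DEFAULT_DEVICE_BACKEND_MAP = {"cpu": "gloo", "cuda": "nccl"}


def resolve_device_backend_map(backend, devices, out_of_tree=None):
    """When backend is 'undefined', resolve one backend per device from the
    default map extended by out-of-tree registrations (e.g. hpu -> hccl)."""
    mapping = {**DEFAULT_DEVICE_BACKEND_MAP, **(out_of_tree or {})}
    if backend == "undefined":
        # keep only devices that have a registered backend
        return {d: mapping[d] for d in devices if d in mapping}
    # an explicitly named backend applies to every requested device
    return {d: backend for d in devices}
```

This mirrors the three-step flow above: an `undefined` backend defers to the device-to-backend map, while a named backend is used as-is.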
> How about reusing the `PG_BACKEND` map above?

hi @kwen2501, modified. Can you review please?
This can be a straight one-to-one mapping for the default backends: #140536
hi @kwen2501 please review.
hi @kwen2501 CI ran fine, please approve after review.
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Successfully rebased 5b02e11 to 5a18451

hi @kwen2501 please review.
@kwen2501: Could you please help with the approval? It would be good if we can push this before the code freeze for v2.6.0. Thank you.

@kwen2501 Gentle reminder!

@kwen2501: Gentle reminder, could you please help with the approval? Thank you.
kwen2501 left a comment
LGTM. Sorry about the delay.
@pytorchmergebot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Rebase failed due to Command

@pytorchmergebot rebase

Successfully rebased d1acd0c to a72cbf2
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 3 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud

@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).

Merge failed. Reason: 3 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud

"Try to land this since the failure is unrelated."

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Motivation: Generalize unit tests so that they can be executed on both CUDA and non-CUDA devices.
Changes: There are general changes in the common_dtensor module for device type generalization so that tests can also be executed on non-CUDA devices.
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o