Refactor core DistributedDataParallel tests #20235
Conversation
Differential Revision: D15245429 Differential Version: 81320055
mrshenli left a comment:
If device_ids is indeed unnecessary, shall we consolidate?
```diff
-    copy.deepcopy(model).cuda(gpus[0]),
-    device_ids=gpus,
+    copy.deepcopy(model).to(devices[0]),
+    device_ids=device_ids,
```
device_ids in DDP should be able to directly handle devices, as we wrote:

> device_ids (list of int or torch.device): CUDA devices. This should ...
The existing tests exercised both, and I didn't want to change that. We should test both the list-of-integers and list-of-`torch.device` cases in the suite (this could be broken out as a separate lightweight test instead of performing full end-to-end tests).
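For reference, such a lightweight test could look roughly like this. This is only a sketch, not code from the PR: `check_device_ids_forms` is a hypothetical name, and it assumes a process group has already been initialized (e.g. via `torch.distributed.init_process_group`).

```python
import copy

import torch
from torch.nn.parallel import DistributedDataParallel

# Hypothetical lightweight check (illustrative, not the PR's code):
# wrap the same single-device CUDA module with device_ids given as
# ints and as torch.device objects. Assumes the default process
# group is already initialized.
def check_device_ids_forms(model, device_index=0):
    model = copy.deepcopy(model).cuda(device_index)
    ddp_from_ints = DistributedDataParallel(
        copy.deepcopy(model), device_ids=[device_index]
    )
    ddp_from_devices = DistributedDataParallel(
        copy.deepcopy(model), device_ids=[torch.device("cuda", device_index)]
    )
    return ddp_from_ints, ddp_from_devices
```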
Differential Revision: D15245429 Differential Version: 81326667
Differential Revision: D15245429 Differential Version: 81330801
Differential Revision: D15245429 Differential Version: 81461821
@pytorchbot retest this please

This pull request has been merged in f32c9bd.
Stack:
:white_circle: #20236 Make DistributedDataParallel usable with CPU models 💚
:black_circle: #20235 Refactor core DistributedDataParallel tests 💚
:white_circle: #20234 Add c10d::broadcast_coalesced and tests 💚
The tests were expected to only run for CUDA models. In a future commit we
need to update them to work for CPU models as well, so we can no longer
rely on only integers being passed as device identifiers. With this change
we pass both the materialized list of devices to use (as `torch.device`
objects) and an optional list of integers. The latter is specified to
exercise the code in the DistributedDataParallel constructor that turns a
list of integers into CUDA devices, iff it is used to wrap a single-device
CUDA module.
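A sketch of the refactored helper shape (the names here are illustrative, not the exact ones in the test suite):

```python
import copy

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# Sketch only: the materialized devices are always torch.device
# objects; device_ids is only non-None when we want to exercise the
# int-to-CUDA-device conversion path for a single-device CUDA module.
def run_ddp_case(process_group, devices, device_ids=None):
    model = nn.Linear(2, 4)  # stand-in for the actual test network
    ddp_model = DistributedDataParallel(
        copy.deepcopy(model).to(devices[0]),
        device_ids=device_ids,
        process_group=process_group,
    )
    # One forward/backward pass to make sure the wrapper works.
    input = torch.randn(8, 2, device=devices[0])
    ddp_model(input).sum().backward()
```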
This commit also groups together the 'str' and non-'str' tests. These used
to test passing the list of devices as integers or as `torch.device`
instances; they are now executed from the same test.

Differential Revision: D15245429
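Concretely, the consolidated test might loop over both representations in one body (again only a sketch, reusing the hypothetical `run_ddp_case` from above):

```python
import torch

# Illustrative consolidation of the former 'str' and non-'str' tests:
# one test body iterates over both device_ids representations instead
# of duplicating the test per representation.
def device_ids_variants(num_gpus):
    int_devices = list(range(num_gpus))
    torch_devices = [torch.device("cuda", i) for i in int_devices]
    return [int_devices, torch_devices]

# e.g. inside the test:
#   for device_ids in device_ids_variants(torch.cuda.device_count()):
#       run_ddp_case(process_group, devices, device_ids)
```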